Create an XML tree using Perl

vanitham · April 27, 2011, 7:13am

Hi,

I have an XML file that looks like this:

<Nodes>
<Node>
	<Nodename>Student</Nodename>
	<Filename>1.txt</Filename>
<Node>
	<Nodename>Dummy</Nodename>
	<Filename>22.txt</Filename>
</Node>
</Node>
</Nodes>

The text files will have data like this:
#1.txt
Studentid,studentname,mothertongue:language
(Similarly for 2.txt.)

The output should look like this:

<Nodes>
<Student>
<Studentid>123</Studentid>
<studentname>name1 </studentname>
<mothertongue>english (or japan) </mothertongue>

The data in each text file has to be read for the corresponding node, and the XML tree then built from it.

The XML files are very large.

How can I build the XML tree quickly and with good performance?

Which parser should I use?

Any help would be much appreciated.
Regards
Vanitha

DGPickett · April 27, 2011, 1:27pm

I like Java for parsing XML, but writing XML is trivial in any language. For well-structured input files (from a single source), you can usually parse them with sed or awk, or even plain shell.

You have one node inside the prior one. Is this typical, or a typo?

You do not show a source for any of the output values.

fpmurphy · April 27, 2011, 2:27pm

Neither your example text file nor the output document makes any sense. Please provide correct examples of both.

vanitham · April 28, 2011, 5:56am

Here is the sample text file.

#student.txt
Student id,student name, student language,student address
#student.csv
122,Ashwini,English,Bangalore
123,Amith,Kannada,Hubli
.....
....

The xml file will be like this:

<Nodes>
<Node>
<Nodename>Student Details</Nodename>
<Filename>student.txt</Filename>
<DataFile>student.csv</DataFile>
</Node>
</Nodes>

The XML tree should be built from the file name and the data file listed for that node.

For example, for the node named Student Details, the file is student.txt and the data file is student.csv.

The output should be like this:

<Student Details>
<Student id>122</Student id>
<student name>Ashwini</student name>
 <student language> english </student language>
 <student address> bangalore </student address>
</Student Details>

Any idea?

palanisvr · April 28, 2011, 8:58am

Sample file 1 :

$ cat student.txt
Student id,student name, student language,student address

Sample file 2 :

$ cat student.csv
122,Ashwini,English,Bangalore
123,Amith,Kannada,Hubli

Script :

#!/bin/ksh

# Extract the node name, header file and data file from the config XML.
main_tag=$(sed -n 's/.*<Nodename>\(.*\)<\/Nodename>.*/\1/p' xml_file)
file_name=$(sed -n 's/.*<Filename>\(.*\)<\/Filename>.*/\1/p' xml_file)
data_file=$(sed -n 's/.*<DataFile>\(.*\)<\/DataFile>.*/\1/p' xml_file)

# Column names come from the first line of the header file.
# They are used verbatim, so a space after a comma ends up inside the tag.
IFS=, read -r col_std_id col_std_nm col_std_lang col_std_add < "$file_name"

# Emit one block per data row. Reading line by line (instead of
# "for i in `cat file`") keeps values that contain spaces intact.
# The single ">" redirection replaces any old xml.txt.
while IFS=, read -r val_std_id val_std_nm val_std_lang val_std_add
do
    echo "<$main_tag>"
    echo "<$col_std_id>$val_std_id</$col_std_id>"
    echo "<$col_std_nm>$val_std_nm</$col_std_nm>"
    echo "<$col_std_lang>$val_std_lang</$col_std_lang>"
    echo "<$col_std_add>$val_std_add</$col_std_add>"
    echo "</$main_tag>"
done < "$data_file" > xml.txt

echo "final output of the file "
cat xml.txt

Output :

$ cat xml.txt
<Student Details>
<Student id>122</Student id>
<student name>Ashwini</student name>
< student language>English</ student language>
<student address>Bangalore</student address>
</Student Details>
<Student Details>
<Student id>123</Student id>
<student name>Amith</student name>
< student language>Kannada</ student language>
<student address>Hubli</student address>
</Student Details>

vanitham · April 29, 2011, 12:04am

Hi,

Thank you very much.

But I would much prefer the code in Perl.

How can I do this in Perl?

pravin27 · April 29, 2011, 12:32am

Perl code:

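#!/usr/bin/perl
use strict;
use warnings;

# Read the node name, header file and data file from the config XML.
my ($nodename, $filename, $datafile);
open(my $fh, "<", "student.xml") or die "Failure- $!\n";
while (<$fh>) {
    chomp;
    if (/<Nodename>(.+?)<\/Nodename>/) { $nodename = $1; }
    if (/<Filename>(.+?)<\/Filename>/) { $filename = $1; }
    if (/<DataFile>(.+?)<\/DataFile>/) { $datafile = $1; }
}
close($fh);
die "student.xml must define Nodename, Filename and DataFile\n"
    unless defined $nodename && defined $filename && defined $datafile;

open(my $st, "<", $filename) or die "Failure- $!\n";
open(my $dt, "<", $datafile) or die "Failure- $!\n";

# The first line of the header file holds the element names.
my $header = <$st>;
chomp($header);
my @flds = split(/,/, $header);
close($st);

# Print one element block per data row.
while (<$dt>) {
    chomp;
    print "<$nodename>\n";
    my @data = split(/,/);
    for my $i (0 .. $#data) {
        print "<$flds[$i]>$data[$i]</$flds[$i]>\n";
    }
    print "</$nodename>\n";
}
close($dt);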
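
Note that the loop above prints field names and values exactly as they appear in the files, so a label such as "Student id" yields a tag that an XML parser will reject, because XML names cannot contain spaces. Here is a minimal sketch of one way around that. It assumes (this is not from the thread) that spaces in labels may be replaced by underscores and that values should be escaped. xml_name, xml_text and record_xml are hypothetical helper names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a header label such as " student language" into a legal XML name.
# Assumed policy: trim, then replace runs of other characters with "_".
sub xml_name {
    my ($label) = @_;
    $label =~ s/^\s+|\s+$//g;
    $label =~ s/[^A-Za-z0-9_.-]+/_/g;
    $label = "_$label" unless $label =~ /^[A-Za-z_]/;
    return $label;
}

# Escape the characters that are special in XML text content.
sub xml_text {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    return $value;
}

# Build one record element from parallel lists of labels and values.
sub record_xml {
    my ($nodename, $labels, $values) = @_;
    my $tag = xml_name($nodename);
    my $xml = "<$tag>\n";
    for my $i (0 .. $#$labels) {
        my $name = xml_name($labels->[$i]);
        $xml .= "  <$name>" . xml_text($values->[$i] // '') . "</$name>\n";
    }
    return $xml . "</$tag>\n";
}

print record_xml(
    'Student Details',
    ['Student id', 'student name', ' student language', 'student address'],
    ['122', 'Ashwini', 'English', 'Bangalore'],
);
```

With the sample header and first data row this gives a <Student_Details> element whose children are <Student_id>, <student_name> and so on. Whether underscores are acceptable depends on whoever consumes the XML.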

vanitham · April 29, 2011, 3:51am

Hi,

Thanks for the quick reply.

But I have one question: what if the XML has a dummy node without any real node name, something like this:

<Node>
<Nodename>dummy</Nodename>
<filenames>file1.txt</filenames>
<datafiles>1.csc </datafiles>

How will the above code handle that case? There may also be attributes, and more levels of node nesting.

For example, for attributes:

student fees:code

In XML it would be:
<student fees="USD">23</student fees>

How should I handle this? Would a SAX parser help?

Regards
vanitha

DGPickett · April 29, 2011, 11:55am

If you are up for Perl, use a real SAX parser: search for "Perl XML SAX parser".

SAX parses serially, which suits batch processing of potentially huge files. (The alternative, DOM, loads the entire XML file into memory as an object tree, which is impractical for bulk data and for serial transmissions in real time.) The parser calls your code as it traverses the elements, giving you direct access to the attributes in each start tag and to the content that follows it. It is up to you to keep state variables that track where you are in the nesting.
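
As a rough illustration of that callback style, here is a minimal sketch using the XML::SAX module from CPAN (it is not part of core Perl, so it has to be installed). The NodeCollector package and the parse_nodes sub are names made up for this example. The stack of open <Node> records is the state variable that tracks nesting depth. The sample input follows the nested layout from the first post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Handler that collects every <Node> with its child fields and depth.
package NodeCollector;
use base qw(XML::SAX::Base);

sub start_document {
    my ($self) = @_;
    $self->{open}  = [];    # stack of <Node> records still open
    $self->{nodes} = [];    # all records, in document order
    $self->{text}  = '';
}

sub start_element {
    my ($self, $el) = @_;
    $self->{text} = '';
    if ($el->{Name} eq 'Node') {
        # Depth is the number of <Node> elements currently open.
        my $node = { depth => scalar @{ $self->{open} } };
        push @{ $self->{nodes} }, $node;
        push @{ $self->{open} },  $node;
    }
}

sub characters {
    my ($self, $chars) = @_;
    $self->{text} .= $chars->{Data};    # text may arrive in pieces
}

sub end_element {
    my ($self, $el) = @_;
    if ($el->{Name} eq 'Node') {
        pop @{ $self->{open} };
    }
    elsif (@{ $self->{open} }) {
        # A field such as <Nodename> belongs to the innermost open <Node>.
        (my $value = $self->{text}) =~ s/^\s+|\s+$//g;
        $self->{open}[-1]{ $el->{Name} } = $value;
    }
    $self->{text} = '';
}

package main;
use XML::SAX::ParserFactory;

# Parse a config string and return the list of node records.
sub parse_nodes {
    my ($xml) = @_;
    my $handler = NodeCollector->new;
    XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);
    return @{ $handler->{nodes} };
}

my $xml = <<'XML';
<Nodes>
<Node>
  <Nodename>Student</Nodename>
  <Filename>1.txt</Filename>
  <Node>
    <Nodename>Dummy</Nodename>
    <Filename>22.txt</Filename>
  </Node>
</Node>
</Nodes>
XML

# Indent each node name by its nesting depth.
printf "%s%s (%s)\n", '  ' x $_->{depth}, $_->{Nodename}, $_->{Filename}
    for parse_nodes($xml);
```

Because only the currently open nodes are held on the stack, memory use stays small however long the file is. A real program would process each record as it closes rather than collecting them all in a list.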

In Java, I created a reusable object tree that mirrored the XML syntax tree, with an abstract class implementing an interface so that I could build a class for each type of element. Perl can probably do something similar. The Perl XML SAX library probably has dictionary correctness checking as well, although I turned that off for speed and robustness and did my own validation.
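
A loose Perl analogue of the per-element idea might be a dispatch table that maps element names to callbacks. The sketch below again assumes the XML::SAX module from CPAN. ElementDispatcher, set_actions and collect_fees are hypothetical names, and the input uses student_fees with a code attribute as one guess at a well-formed rendering of the fees example earlier in the thread. It also shows how attributes reach the handler.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Handler that calls a user-supplied callback per element name.
package ElementDispatcher;
use base qw(XML::SAX::Base);

# Register callbacks; each one receives (attributes hashref, text).
sub set_actions {
    my ($self, %actions) = @_;
    $self->{actions} = \%actions;
    return $self;
}

sub start_element {
    my ($self, $el) = @_;
    # Flatten the SAX attribute records into a plain name => value hash.
    my %attrs = map { $_->{Name} => $_->{Value} }
                values %{ $el->{Attributes} };
    push @{ $self->{stack} }, { attrs => \%attrs, text => '' };
}

sub characters {
    my ($self, $chars) = @_;
    # Text goes to the innermost open element.
    $self->{stack}[-1]{text} .= $chars->{Data}
        if $self->{stack} && @{ $self->{stack} };
}

sub end_element {
    my ($self, $el) = @_;
    my $frame  = pop @{ $self->{stack} };
    my $action = $self->{actions}{ $el->{Name} } or return;
    $action->($frame->{attrs}, $frame->{text});
}

package main;
use XML::SAX::ParserFactory;

# Return "amount currency" strings for every <student_fees> element.
sub collect_fees {
    my ($xml) = @_;
    my @fees;
    my $handler = ElementDispatcher->new->set_actions(
        student_fees => sub {
            my ($attrs, $text) = @_;
            push @fees, "$text $attrs->{code}";
        },
    );
    XML::SAX::ParserFactory->parser(Handler => $handler)->parse_string($xml);
    return @fees;
}

print "$_\n" for collect_fees(
    '<Student_Details><student_fees code="USD">23</student_fees></Student_Details>'
);
```

Adding support for another element type is then a matter of adding one more entry to the table, which is roughly what a class per element type gives you in Java.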