How to parse a file in XML format using awk/nawk

natalie23 · September 10, 2009, 10:54pm

Hi All,
I have an XML file in the format below.

<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>

The output needed is:

111,222,333,123,234
111,222,333,456,789

nawk 'BEGIN{FS="<|>"}
        
       {print a,b,c,e,f
                   a=""
                   b=""
                   e=""
                   f=""
          }         
        
         {for(i=1;i<=NF;i++) {if($i=="a"){a=$(i+1);continue}}}
         {for(i=1;i<=NF;i++) {if($i=="b"){b=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="c"){d=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="e"){d=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="f"){d=$(i+1); continue}}}
       END {print a,b,c,e,f}' file

However, the output I get is:

111,222,333,456,789

Does anyone have any idea?

clx · September 11, 2009, 4:58am

Lots of threads are available on this topic.
Please use the search.

matrixmadhan · September 11, 2009, 5:57am

The trick is to use the right tool for the job.

There are modules already available on CPAN for parsing and generating XML. Try them instead!

fpmurphy · September 11, 2009, 10:21am

Unfortunately, if you look closely at the string, it is not valid XML because it is not well-formed: the <c> and <d> elements are never closed, and there is no single root element. A conforming XML parser will reject this string.

---------- Post updated at 10:21 AM ---------- Previous update was at 09:06 AM ----------

One way of doing it would be to use a mix of sed and awk to parse and process the line:

sed 's/<d>/|/g' file | sed 's/<\/.>/ /g' | sed 's/<.>//g' | sed 's/ \(.\)/,\1/g' | \
sed 's/,|/|/g' |  awk -F'|' '{ printf "%s,%s\n", $1, $2; printf "%s,%s\n", $1, $3 }'

This outputs

111,222,333,123,234
111,222,333,456,789

Not elegant but it works!
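
For comparison, here is a rough awk-only sketch in the spirit of the original attempt. It splits on "<" and ">" and prints a row each time an <f> value turns up. It rests on assumptions that go beyond what was posted: every row ends with <f>, tags carry no attributes, and the temporary file exists only to make the example self-contained. Treat it as a starting point, not a general XML parser.

```shell
# Sample line from the question, written to a temporary file (path is illustrative).
sample=$(mktemp)
printf '%s\n' '<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>' > "$sample"

out=$(awk -F'[<>]' -v OFS=',' '{
    # With "<" and ">" as separators, even-numbered fields hold tag names
    # and the field right after each one holds the text that follows the tag.
    for (i = 2; i < NF; i += 2) {
        if ($(i + 1) == "") continue      # closing tags, <d>, and the stray second <c>
        val[$i] = $(i + 1)                # remember the latest value seen for this tag
        if ($i == "f")                    # assumption: <f> ends each output row
            print val["a"], val["b"], val["c"], val["e"], val["f"]
    }
}' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because a, b and c are remembered rather than reset, they repeat on every row, while e and f are overwritten by each new <d> group. On the sample line this should print the two rows asked for.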

summer_cherry · September 13, 2009, 11:38pm

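# Hard-coded sample line; collect every run of digits in document order.
$_='<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>';
my @tmp = /[0-9]+/g;
my @a1 = @tmp[0..4];        # a, b, c plus the first e and f
my @a2 = @tmp[0..2,5,6];    # a, b, c plus the second e and f
print join(",", @a1), "\n";
print join(",", @a2), "\n";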

natalie23 · September 14, 2009, 6:01am

Hi anchal_khare, matrixmadhan, fpmurphy, summer_cherry,

Thank you very much for your help!!

---------- Post updated at 05:01 AM ---------- Previous update was at 05:00 AM ----------

Hi summer_cherry,

Is this a Perl script?

Thanks.

malcomex999 · September 14, 2009, 7:08am

Another way with awk...

 
awk -F"<d>" '{print $1","$2; print $1","$3}' f1 | tr -d '<[a-z]>' | tr '/' ',' | sed 's/,$//'

ripat · September 14, 2009, 12:00pm

To the OP: are you sure about the XML data? If you look carefully, some closing tags are missing.

<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>
<a>111</a><b>222</b><c>333</c><d><e>123</e><f>234</f></d><e>456</e><f>789</f>
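
A quick way to see the difference between those two lines is a small tag-balance check in awk. This is only a sketch: check_tags is a made-up helper name, and it assumes tags without attributes and no self-closing tags. It tests nesting only, so passing it does not make the input valid XML.

```shell
# Hypothetical helper: reads lines on stdin and reports whether the tags on
# each line are properly nested. It checks tag balance only, not full XML rules.
check_tags() {
    awk -F'[<>]' '{
        depth = 0; bad = 0
        # Even-numbered fields are tag names when splitting on "<" and ">".
        for (i = 2; i < NF; i += 2) {
            if ($i ~ /^\//) {
                # Closing tag: it must match the most recently opened tag.
                if (depth == 0 || stack[depth] != substr($i, 2)) { bad = 1; break }
                depth--
            } else {
                stack[++depth] = $i       # opening tag: push its name
            }
        }
        print ((bad || depth > 0) ? "not balanced" : "balanced")
    }'
}

original='<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>'
corrected='<a>111</a><b>222</b><c>333</c><d><e>123</e><f>234</f></d><e>456</e><f>789</f>'

r1=$(printf '%s\n' "$original" | check_tags)
r2=$(printf '%s\n' "$corrected" | check_tags)
printf '%s\n%s\n' "$r1" "$r2"
```

The first line is reported as not balanced because <c> and <d> are opened repeatedly and never closed. The second line passes the check.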

If I recall correctly, some clever awk fans have developed XML parser modules. I will have a look and post the link if I find it.

---------- Post updated at 06:00 PM ---------- Previous update was at 05:53 PM ----------

Here you go:
awk.info