How to parse a file in XML format using awk/nawk

Hi All,
I have an XML file in the format below.

<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>

The output needed is

111,222,333,123,234
111,222,333,456,789
nawk 'BEGIN{FS="<|>"}
        
       {print a,b,c,e,f
                   a=""
                   b=""
                   e=""
                   f=""
          }         
        
         {for(i=1;i<=NF;i++) {if($i=="a"){a=$(i+1);continue}}}
         {for(i=1;i<=NF;i++) {if($i=="b"){b=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="c"){d=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="e"){d=$(i+1); continue}}}
         {for(i=1;i<=NF;i++) {if($i=="f"){d=$(i+1); continue}}}
       END {print a,b,c,e,f}' file

However, the output that I get is

111,222,333,456,789

Does anyone have any idea?

Lots of threads are available on this topic.
Please use the search.

Trick is in using the right tool for the right job.

There are modules already available on CPAN for parsing and creating XML. Try them instead! :)

Unfortunately, if you look closely at the string, it is not valid XML as it is not well-formed. No XML parser is going to handle this string.
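
To see which tags are left open, here is a rough awk sketch (not a real XML validator). It assumes single-line input with no attributes and no self-closing tags; the function name check_tags is just a placeholder.

```shell
# Hypothetical helper: report tags that are opened but never closed.
# Splitting on < and > puts tag names in the even-numbered fields.
check_tags() {
    awk -F'[<>]' '
    {
        depth = 0
        for (i = 2; i <= NF; i += 2) {
            if ($i == "") continue
            if (substr($i, 1, 1) == "/") {
                # closing tag: it must match the most recently opened tag
                name = substr($i, 2)
                if (depth > 0 && stack[depth] == name) depth--
                else printf "unexpected </%s>\n", name
            } else {
                # opening tag: remember it until its close shows up
                stack[++depth] = $i
            }
        }
        # anything still on the stack was never closed
        while (depth > 0) {
            printf "unclosed <%s>\n", stack[depth]
            depth--
        }
    }' "$@"
}

printf '%s\n' '<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>' | check_tags
```

On the posted string this reports two unclosed <d> and two unclosed <c> tags, since the second <c> and <d> are read as new opening tags.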

One way of doing it would be to use a mix of sed and awk to parse and process the line

sed 's/<d>/|/g' file | sed 's/<\/.>/ /g' | sed 's/<.>//g' | sed 's/ \(.\)/,\1/g' | \
sed 's/,|/|/g' |  awk -F'|' '{ printf "%s,%s\n", $1, $2; printf "%s,%s\n", $1, $3 }'

This outputs

111,222,333,123,234
111,222,333,456,789

Not elegant but it works!
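
The same result can probably be had in a single awk pass, closer to the original attempt. This is only a sketch: it assumes each record sits on one line, that the tags are the single letters a to f, and that every <f> value ends an output row. The function name parse_records is made up for illustration.

```shell
# Hypothetical helper: print one CSV row per <e>/<f> pair, prefixed with a, b, c.
# With FS set to < or >, even-numbered fields are tag names and the
# field that follows each one is the text after that tag.
parse_records() {
    awk -F'[<>]' -v OFS=',' '
    {
        for (i = 2; i <= NF; i += 2) {
            tag = $i
            text = $(i + 1)
            # skip closing tags, container tags and tags with no text
            if (tag !~ /^[abcef]$/ || text == "") continue
            val[tag] = text
            # an <f> value completes a row; a, b and c carry over
            if (tag == "f") print val["a"], val["b"], val["c"], val["e"], val["f"]
        }
    }' "$@"
}

printf '%s\n' '<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>' | parse_records
# prints:
# 111,222,333,123,234
# 111,222,333,456,789
```

Printing inside the loop, at the moment the <f> value is seen, is what yields one row per <d> group. Printing once per input line can only ever show the last e and f values.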

$_='<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>';
my @tmp=$_=~/[0-9]+/g;
my @a1=@tmp[0..4];
my @a2=@tmp[0..2,5,6];
print join ",", @a1;
print "\n";
print join ",",@a2;

Hi anchal_khare, matrixmadhan, fpmurphy, summer_cherry,

Thank you very much for your help!!

Hi summer_cherry,

Is this a Perl script?

Thanks.

Another way with awk...

 
awk -F"<d>" '{print $1 "," $2 "\n" $1 "," $3}' f1 | tr -d '<>a-z' | tr '/' ',' | sed 's/,$//'

To the OP: are you sure about the XML data? If you look carefully, some closing tags are missing.

<a>111</a><b>222</b><c>333<c><d><e>123</e><f>234</f><d><e>456</e><f>789</f>
<a>111</a><b>222</b><c>333</c><d><e>123</e><f>234</f></d><e>456</e><f>789</f>

If I recall correctly, some clever awk fans have developed XML parser modules. I will have a look and post the link if I find it again.

Here you go:
awk.info