Extract multiple XML tag values into CSV format

Hi All,

I need your assistance with another XML tag related issue. I have an XML file as below:

<INVOICES>
<INVOICE>
<BILL>
<BILL_NO>1234</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>ABC</NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>5678</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>BCA</NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>1256</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME></NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>345</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME/>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>8934</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>PKL</NAME>
</NAMEINFO>
</INVOICE>
</INVOICES>

I need the CSV file in the following format


1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

The XML tag for NAME is not consistent. Is this achievable? Your help is highly appreciated.

Thanks
Angshuman

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{print NF==3?",NA":","$3}'

Hi Yinyuemi,

I tried your code, but I am not sure where I need to put the file name. Also, I am running it on HP Unix.

Thanks
Angshuman

please try:

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{print NF==3?",NA":","$3}' urfile

Hi Yinyuemi,

I tried and got the following error:

syntax error The source line is 1.
 The error context is
                /BILL_NO/{printf $3}/NAME\>/{print >>>  NF== <<<
 awk: The statement cannot be correctly parsed.
 The source line is 1.

how about this?

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if(NF==3) {print ",NA"} else{print ","$3}}' file

or:

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if($2~/\//) {print ",NA"} else{print ","$3}}' file

Hi Yinyuemi,

Both modified versions run without any error, except that the text NA does not appear for two of the bills.

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if(NF==3) {print ",NA"} else{print ","$3}}' myfile

The output is:

1234,ABC
5678,BCA
1256,
3458934,PKL

The expected was:

1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

If you notice the xml file, you will see that for bill number 1256 the name tag is "<NAME></NAME>" whereas for bill number 345 it is "<NAME/>"
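To see why the two forms behave differently, it may help to print how -F'>|<' splits each kind of NAME line. This is only a small diagnostic sketch; the helper name show_fields is made up for illustration.

```shell
# Print the field count and the 2nd/3rd fields for each input line,
# using the same field separator as the one-liners in this thread.
# show_fields is a hypothetical helper name, not part of any tool.
show_fields() {
    awk -F'>|<' '{ printf "NF=%d $2=[%s] $3=[%s]\n", NF, $2, $3 }'
}

printf '%s\n' '<NAME>ABC</NAME>' '<NAME></NAME>' '<NAME/>' | show_fields
```

The first two lines split into five fields with $2 equal to NAME, while the self-closing line splits into only three fields with $2 equal to NAME/. So a test against the exact string NAME, or against the text NAME>, will not see the <NAME/> line.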

hope this works

awk -F'>|<' '/BILL_NO/{printf $3}$2=="NAME"{if($3) {print ","$3}else {print ",NA"}}'

Hi yinyuemi,

Sorry to bother again.

Yes this worked except the last two rows. The output is now as below

1234,ABC
5678,BCA
1256,NA
3458934,PKL

The last two rows are coming together. The output should be as below:

1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

As I mentioned before, the tag for the second-to-last row differs from the previous one where the name is blank: it is <NAME/>, whereas the previous one is <NAME></NAME>. If I change <NAME/> to "<NAME></NAME>", it works fine. But there is no guarantee that the tag will always be a complete open/close pair; I might receive only the self-closing tag (<NAME/>). I was also curious to know what will happen if no tag is sent at all.

Thanks
Angshuman

awk -F'>|<' '/BILL_NO/{printf $3}/NAME[^I]/{if($3) {print ","$3}else {print ",NA"}}' file

Hi yinyuemi,

Thanks for your support and help. It is now working perfectly irrespective of the tag format. Will it work in case the tag is not passed at all ? What is the purpose of ^I ?

If it is meant to handle a missing tag, it did not give the desired output.
I tried removing the NAME tag for one of the bill numbers, and the output is as below:

1234,ABC
5678,BCA
1256345,NA
8934,PKL

Thanks
Angsuman
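For the case where the NAME tag is missing altogether, one option is to keep the bill number and name in variables and print only when the closing </INVOICE> line is reached. The following is a minimal sketch and has not been tried on HP-UX. It assumes one tag per line as in the sample file and that every invoice ends with </INVOICE>; the function name invoices_to_csv is just a placeholder.

```shell
# Print "bill,name" once per invoice, substituting NA when the name is
# empty, self-closing (<NAME/>), or absent.
# invoices_to_csv is a hypothetical name; pass the XML file as an argument
# or feed it on standard input.
invoices_to_csv() {
    awk -F'>|<' '
        # remember the bill number and forget any name from the previous invoice
        $2 == "BILL_NO"               { bill = $3; name = "" }
        # <NAME>x</NAME> gives $2 == "NAME"; <NAME/> gives $2 == "NAME/"
        $2 == "NAME" || $2 == "NAME/" { name = $3 }
        # end of the invoice: emit one CSV row
        $2 == "/INVOICE"              { if (name == "") name = "NA"; print bill "," name }
    ' "$@"
}

# Example call (file names are placeholders):
#   invoices_to_csv myfile > bills.csv
```

Because the row is written at </INVOICE> and not at the NAME line, an invoice with no NAME tag still gets its own row with NA.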

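/NAME[^I]/ matches NAME followed by any one character other than "I", so it matches "<NAME>" and "<NAME/>" but not "NAMEINFO". [^I] is a bracket expression meaning "any single character except I".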
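A quick way to check this is to run the pattern on its own against a few tag lines. This is a small sketch; the helper name match_name_tag is made up for illustration.

```shell
# Keep only the lines where NAME is followed by a character other than I.
# match_name_tag is a hypothetical helper name.
match_name_tag() {
    awk '/NAME[^I]/'
}

printf '%s\n' '<NAMEINFO>' '<NAME>PKL</NAME>' '<NAME/>' '</NAMEINFO>' | match_name_tag
```

Only the <NAME>PKL</NAME> and <NAME/> lines are printed; both NAMEINFO lines are skipped.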

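use strict;
use warnings;

my $flag = 0;    # true while a bill number is still waiting for its name
while (<DATA>) {
    if (/<BILL_NO>(\d+)<\/BILL_NO>/) {
        print "NA\n" if $flag;    # previous bill had no NAME tag at all
        print $1, ",";
        $flag = 1;
    }
    # accept both <NAME>...</NAME> and the self-closing <NAME/>
    elsif (/<NAME(?:\s*\/>|>(.*?)<\/NAME>)/) {
        $flag = 0;
        my $tmp = (defined $1 && $1 ne "") ? $1 : "NA";
        print $tmp, "\n";
    }
}
print "NA\n" if $flag;    # last bill had no NAME tag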
__DATA__