Extract multiple XML tag values into CSV format

angshuman · April 1, 2011, 1:23am

Hi All,

I need your assistance with another XML tag issue. I have an XML file as below:

<INVOICES>
<INVOICE>
<BILL>
<BILL_NO>1234</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>ABC</NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>5678</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>BCA</NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>1256</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME></NAME>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>345</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME/>
</NAMEINFO>
</INVOICE>
<INVOICE>
<BILL>
<BILL_NO>8934</BILL_NO>
<BILL_DATE>01 JAN 2011</BILL_DATE>
</BILL>
<NAMEINFO>
<NAME>PKL</NAME>
</NAMEINFO>
</INVOICE>
</INVOICES>

I need the CSV file in the following format


1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

The XML tag for NAME is not consistent. Is this achievable? Your help is highly appreciated.

Thanks
Angshuman

yinyuemi · April 1, 2011, 1:27am

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{print NF==3?",NA":","$3}'

angshuman · April 1, 2011, 1:39am

Hi Yinyuemi,

I tried your code, but I am not sure where to put the file name. Also, I am running it on HP-UX.

Thanks
Angshuman

yinyuemi · April 1, 2011, 1:42am

please try:

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{print NF==3?",NA":","$3}' urfile

angshuman · April 1, 2011, 1:46am

Hi Yinyuemi,

I tried and got the following error:

syntax error The source line is 1.
 The error context is
                /BILL_NO/{printf $3}/NAME\>/{print >>>  NF== <<<
 awk: The statement cannot be correctly parsed.
 The source line is 1.
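
A possible reading of this error (an assumption, not confirmed in the thread): some stricter awk implementations do not accept an unparenthesized comparison such as `NF==3 ? ... : ...` directly inside `print`. The sketch below only shows the conditional wrapped in parentheses; it has not been tried on HP-UX, and `name_or_na` is a made-up helper name for illustration.

```shell
# Hypothetical helper: print the tag value, or NA for a three-field line.
# The whole ?: expression is parenthesized so the parser cannot misread it.
name_or_na() {
    awk -F'[<>]' '{ print (NF == 3 ? "NA" : $3) }' "$@"
}

printf '%s\n' '<NAME>ABC</NAME>' '<NAME/>' | name_or_na
# prints:
# ABC
# NA
```

This addresses only the parse error; it does nothing about how the different NAME tag forms are split into fields.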

yinyuemi · April 1, 2011, 1:51am

how about this?

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if(NF==3) {print ",NA"} else{print ","$3}}' file

or:

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if($2~/\//) {print ",NA"} else{print ","$3}}' file

angshuman · April 1, 2011, 2:09am

Hi Yinyuemi,

Both modified commands run without any error, but the text NA does not appear for two of the bills.

awk -F'>|<' '/BILL_NO/{printf $3}/NAME\>/{if(NF==3) {print ",NA"} else{print ","$3}}' myfile

The output is:

1234,ABC
5678,BCA
1256,
3458934,PKL

The expected output was:

1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

If you look at the XML file, you will see that for bill number 1256 the name tag is "<NAME></NAME>", whereas for bill number 345 it is "<NAME/>".
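
To see why the two empty forms behave differently, it can help to print how awk splits each variant with the same field separator used in the one-liners. This is a small diagnostic sketch; `show_fields` is a hypothetical helper name, not something from the thread.

```shell
# Hypothetical helper: show the field count and third field for each line,
# using the same separator as the one-liners in this thread.
show_fields() {
    awk -F'>|<' '{ printf "%s -> NF=%d, $3=[%s]\n", $0, NF, $3 }' "$@"
}

printf '%s\n' '<NAME>ABC</NAME>' '<NAME></NAME>' '<NAME/>' | show_fields
# prints:
# <NAME>ABC</NAME> -> NF=5, $3=[ABC]
# <NAME></NAME> -> NF=5, $3=[]
# <NAME/> -> NF=3, $3=[]
```

So "<NAME></NAME>" has five fields with an empty third field, which is why the NF==3 test prints a bare comma for bill 1256. "<NAME/>" does have three fields, but the text "NAME>" never appears in it, so if the pattern is read as a literal "NAME>" that line is skipped and bill 345 runs into the next one.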

yinyuemi · April 1, 2011, 2:20am

hope this works

awk -F'>|<' '/BILL_NO/{printf $3}$2=="NAME"{if($3) {print ","$3}else {print ",NA"}}'

angshuman · April 1, 2011, 2:54am

Hi yinyuemi,

Sorry to bother again.

Yes, this worked except for the last two rows. The output is now as below:

1234,ABC
5678,BCA
1256,NA
3458934,PKL

The last two rows are run together. The output should be as below:

1234,ABC
5678,BCA
1256,NA
345,NA
8934,PKL

As I mentioned before, the tag for the second-to-last row differs from the previous one where the name is blank: it is <NAME/>, whereas the previous one is <NAME></NAME>. If I change <NAME/> to "<NAME></NAME>", it works fine. But there is no guarantee that I will always receive the full pair of tags; I might receive only the self-closing tag (<NAME/>). I was also curious to know what will happen if no tag is sent at all.

Thanks
Angshuman

yinyuemi · April 1, 2011, 3:08am

awk -F'>|<' '/BILL_NO/{printf "%s", $3}/NAME[^I]/{if($3 != "") {print ","$3}else {print ",NA"}}' file

angshuman · April 1, 2011, 3:33am

Hi yinyuemi,

Thanks for your support and help. It is now working perfectly irrespective of the tag format. Will it work if the tag is not passed at all? What is the purpose of ^I?

If it is meant to handle a missing tag, it did not give the desired output.
I tried removing the NAME tag for one of the bill numbers, and the output is as below:

1234,ABC
5678,BCA
1256345,NA
8934,PKL

Thanks
Angshuman

yinyuemi · April 1, 2011, 1:06pm

/NAME[^I]/ matches "NAME" followed by any single character other than "I". So it matches <NAME> and <NAME/>, but not a word like "NAMEINFO".
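
A quick way to check what /NAME[^I]/ selects is to run it as a bare pattern over the tag variants from the sample file. This is only an illustration; `match_name_tag` is a hypothetical helper name.

```shell
# Hypothetical helper: print only the lines that match /NAME[^I]/.
match_name_tag() {
    awk '/NAME[^I]/' "$@"
}

printf '%s\n' '<NAMEINFO>' '<NAME>ABC</NAME>' '<NAME/>' '</NAMEINFO>' | match_name_tag
# prints:
# <NAME>ABC</NAME>
# <NAME/>
```

Note that [^I] must consume one character, so a line that ends immediately after "NAME" would not match.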

summer_cherry · April 7, 2011, 4:29am

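use strict;
use warnings;

my $flag = 0;    # true while a bill number is still waiting for its name
while(<DATA>){
    if(/<BILL_NO>(\d+)<\/BILL_NO>/){
        print "NA\n" if $flag;    # previous bill had no NAME tag at all
        print $1,",";
        $flag = 1;
    }
    # match both <NAME>...</NAME> and the self-closing <NAME/>
    elsif(/<NAME>(.*?)<\/NAME>|<NAME\s*\/>/){
        $flag = 0;
        my $tmp = (defined $1 && $1 ne "") ? $1 : "NA";
        print $tmp,"\n";
    }
}
print "NA\n" if $flag;    # last bill had no NAME tag
__DATA__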
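
The same flag idea can be sketched in awk, which also covers the earlier question about a NAME tag that is missing altogether. This is a minimal sketch, assuming one tag per line with no attributes, as in the posted sample; `bills_to_csv` is a hypothetical helper name, and the script has not been tried on HP-UX awk.

```shell
# Hypothetical helper: pass the XML file name(s) as arguments, or pipe on stdin.
bills_to_csv() {
    awk -F'[<>]' '
        $2 == "BILL_NO" {
            if (pending) print bill ",NA"   # previous bill never got a NAME
            bill = $3
            pending = 1
        }
        $2 == "NAME" || $2 == "NAME/" {
            if (!pending) next              # NAME without a bill: skip it
            name = $3
            if (name == "") name = "NA"     # covers <NAME></NAME> and <NAME/>
            print bill "," name
            pending = 0
        }
        END { if (pending) print bill ",NA" }   # last bill had no NAME
    ' "$@"
}
```

Usage would be `bills_to_csv myfile > out.csv`. The bill number is held back until its name is known, so a bill with no NAME line still gets its own row ending in NA.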