Hi All,
I have a large XML file of invoices. The file looks like this:
<INVOICES>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>
I need to extract every <INVOICE>...</INVOICE> block whose INVOICE_NO is 2345 or 5678.
I searched the forum and found how to extract values between XML tags, but this is a different scenario.
Your help is highly appreciated.
Thanks
Angshuman
kurumi
January 15, 2011, 4:58am
ruby -ne 'BEGIN{$/="</INVOICE>"}; print "#{$_}\n" if /2345|5678/' file
Hi Kurumi,
Thank you for your reply. Is there an awk or sed command that can achieve this?
Thanks
Angshuman
kamaraj@kamaraj-laptop:~/Desktop$ for i in `cat xml_input`; do grep -B2 $i test | sed '$d'; grep -A1 $i test; done
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
kamaraj@kamaraj-laptop:~/Desktop$ cat xml_input
2345
5678
Hi Kamaraj,
Thank you for your reply. I tried your command but got the following:
grep: illegal option -- B
grep: illegal option -- 2
grep: illegal option -- A
grep: illegal option -- 1
Are these valid options of the grep command? Please let me know.
Thanks
Angshuman
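For what it's worth, -A and -B are not part of POSIX grep, which would explain the "illegal option" errors on a non-GNU grep. Below is a minimal awk sketch of the same idea, assuming a POSIX awk (one that accepts -v) and one tag per line as in the sample file. The function name extract_invoices and the comma-separated list format are made up for illustration; this is not a general XML parser.

```shell
#!/bin/sh
# Hypothetical helper: print every <INVOICE> block whose INVOICE_NO
# is in a comma-separated list. Assumes one tag per line.
extract_invoices() {
    # $1 = comma-separated invoice numbers, $2 = XML file
    awk -v ids="$1" '
        BEGIN {
            # Turn "2345,5678" into a lookup table.
            n = split(ids, list, ",")
            for (i = 1; i <= n; i++) want[list[i]] = 1
        }
        # Start buffering at an opening <INVOICE> tag.
        /<INVOICE>/ { block = ""; keep = 0; inside = 1 }
        inside {
            block = block $0 "\n"
            if ($0 ~ /<INVOICE_NO>/) {
                # Strip the tags and compare the bare number exactly.
                no = $0
                sub(/.*<INVOICE_NO>/, "", no)
                sub(/<\/INVOICE_NO>.*/, "", no)
                if (no in want) keep = 1
            }
        }
        # At the closing tag, print the buffered block if it matched.
        /<\/INVOICE>/ {
            if (inside && keep) printf "%s", block
            inside = 0
        }
    ' "$2"
}
```

Usage would be along the lines of: extract_invoices "2345,5678" file. Because only the text inside INVOICE_NO is compared, a number that happens to appear in NAME should not cause a match.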
angshuman:
...
<INVOICES>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>
I need to extract all the <INVOICE>...........</INVOICE> provided the value of INVOICE_NO = 2345 and 5678.
...
Maybe something like this?
$
$ # display the contents of the xml file
$ cat f1.xml
<INVOICES>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>
$
$ # Perl one-liner to extract the information
$ perl -lne 'BEGIN{undef $/} while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {$x=$1; print $x if $2 =~ /2345|5678/}' f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$
tyler_durden
Which grep version are you using?
Which operating system is that?
I am using the version below:
kamaraj@kamaraj-laptop:~$ grep -V
GNU grep 2.5.4
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
The Perl solution by durden_tyler is excellent, provided that the search term is not present in an unrelated element. For example, given:
<INVOICES>
<INVOICE>
<NAME>Customer A 2345</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>
The Perl example will incorrectly output:
<INVOICE>
<NAME>Customer A 2345</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
A more precise solution is to use XSLT. If xsltproc is available to you (it is available on most GNU/Linux distributions), the following XSL stylesheet will give a precise answer:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<!-- With xsltproc, pass each value as a quoted string parameter: invno1 "'value'" invno2 "'value'" -->
<xsl:param name="invno1">XXXX</xsl:param>
<xsl:param name="invno2">XXXX</xsl:param>
<xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="INVOICES"/>
</xsl:template>
<xsl:template match="INVOICES">
<xsl:apply-templates select="INVOICE"/>
</xsl:template>
<xsl:template match="INVOICE">
<xsl:if test="./INVOICE_NO = $invno1 or ./INVOICE_NO = $invno2">
<xsl:copy-of select="." />
</xsl:if>
</xsl:template>
</xsl:stylesheet>
For example:
$ xsltproc --param invno1 "'1234'" --param invno2 "'3456'" example.xsl example.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE><INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
Hi Kamaraj,
I am using HP-UX S29BF226 B.11.23 U ia64 4081221980 unlimited-user license
Thanks
Angshuman
---------- Post updated at 09:55 PM ---------- Previous update was at 09:22 PM ----------
Hi fpmurphy,
First, I would like to thank all of you for taking the time to reply to my question.
xsltproc is not available. I tried the solution provided by durden_tyler and it works fine except for the scenario that you highlighted. Although the chance of an invoice number appearing in any other tag is remote, I should still take care of that.
Is there any other way I can achieve this? I would also like to raise another concern. In my question, I mentioned that I need to print <INVOICE>.....</INVOICE> provided it contains <INVOICE_NO>2345</INVOICE_NO>. When the value is passed through a variable, the following code does not return anything. I modified the solution of durden_tyler as below
You could make your regex more precise to come up with accurate results -
$
$ perl -lne 'BEGIN{undef $/} while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {$x=$1; print $x if $2 =~ /<INVOICE_NO>(2345|5678)<\/INVOICE_NO>/}' f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$
The one-liner will have to change if you want to pass a shell variable to it -
$
$
$ export MY_INVOICE_NO="2345"
$
$
$ perl -lne "BEGIN{undef $/}
while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {\$x=\$1; print \$x if \$2=~/<INVOICE_NO>$MY_INVOICE_NO<\/INVOICE_NO>/}" f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
$
$
You could be more creative and pass multiple Invoice Numbers thusly -
$
$
$ export MY_INVOICE_NOS="2345|5678"
$
$
$ perl -lne "BEGIN{undef $/}
while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {\$x=\$1; print \$x if \$2=~/<INVOICE_NO>($MY_INVOICE_NOS)<\/INVOICE_NO>/}" f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$
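If the invoice numbers live in a file, one per line (like the xml_input file used earlier in the thread), the alternation can be built rather than typed by hand. The following is only a sketch under that assumption: join_invoice_nos and extract_by_list are hypothetical names, the numbers are assumed to contain no regex metacharacters, and perl is assumed to be installed.

```shell
#!/bin/sh
# Join the invoice numbers in a file (one per line) into an
# alternation such as 2345|5678. Hypothetical helper name.
join_invoice_nos() {
    # Drop blank lines, then join the remaining lines with "|".
    grep -v '^[[:space:]]*$' "$1" | paste -s -d '|' -
}

# Print the <INVOICE> blocks whose INVOICE_NO is listed in the file.
extract_by_list() {
    # $1 = file of invoice numbers, $2 = XML file
    nos=$(join_invoice_nos "$1")
    # -0777 slurps the whole file; the parentheses around $nos keep the
    # alternation between the opening and closing INVOICE_NO tags.
    perl -0777 -ne "while (/(<INVOICE>.*?<\/INVOICE>)/sg) { \$x = \$1; print \"\$x\n\" if \$x =~ /<INVOICE_NO>($nos)<\/INVOICE_NO>/ }" "$2"
}
```

It would be called as: extract_by_list xml_input f1.xml. Grouping the alternation matters because an ungrouped 2345|5678 between the tags would be read as "opening tag followed by 2345" or "5678 followed by closing tag".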
However, if you want to do serious XML work then XSLT is the way to go, as suggested by fpmurphy.
tyler_durden