extract xml tag based on condition

Hi All,

I have a large XML file of invoices. The file looks like this:

<INVOICES>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>

I need to extract every <INVOICE>...</INVOICE> block whose INVOICE_NO is 2345 or 5678.

I searched the forum and found how to extract the value between XML tags, but this is a different scenario.

Your help is highly appreciated.

Thanks
Angshuman

ruby -ne 'BEGIN{$/="</INVOICE>"}; print "#{$_}\n" if $_ =~ /2345|5678/' file

Hi Kurumi,

Thank you for your reply. Is there an awk or sed command that can do this?

Thanks
Angshuman

kamaraj@kamaraj-laptop:~/Desktop$ for i in `cat xml_input`; do grep -B2 $i test | sed '$d'; grep -A1 $i test; done
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>

kamaraj@kamaraj-laptop:~/Desktop$ cat xml_input 
2345 
5678

Hi Kamaraj,

Thank you for your reply. I tried your command but got the following:

grep: illegal option -- B
grep: illegal option -- 2

grep: illegal option -- A
grep: illegal option -- 1

Are -B and -A options of the grep command? Please let me know.

Thanks
Angshuman
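
The -B and -A context options are GNU grep extensions, so a vendor grep may reject them. Below is a minimal awk sketch of the same idea that does not depend on them. It assumes a POSIX awk, a non-empty list file such as xml_input with one invoice number per line, and one XML element per line as in the sample. The function name extract_invoices is made up for illustration.

```shell
#!/bin/sh
# Sketch: print every <INVOICE> block whose INVOICE_NO is listed in a
# file of wanted numbers, using only POSIX awk (no grep -A/-B needed).
# extract_invoices is a hypothetical helper name.
extract_invoices() {
    # $1 = list file (one invoice number per line, assumed non-empty)
    # $2 = XML file (one element per line, as in the sample)
    awk '
        # First file: remember each wanted number, whitespace stripped.
        NR == FNR { gsub(/[ \t\r]/, ""); if ($0 != "") want[$0] = 1; next }

        # Start of a record: reset the buffer and the match flag.
        /<INVOICE>/ { buf = ""; hit = 0; inside = 1 }

        inside {
            buf = buf $0 "\n"
            # Compare the whole element content, not a substring.
            if ($0 ~ /<INVOICE_NO>[^<]*<\/INVOICE_NO>/) {
                no = $0
                sub(/.*<INVOICE_NO>/, "", no)
                sub(/<\/INVOICE_NO>.*/, "", no)
                if (no in want) hit = 1
            }
        }

        # End of a record: emit the buffered block if it matched.
        /<\/INVOICE>/ { if (inside && hit) printf "%s", buf; inside = 0 }
    ' "$1" "$2"
}
```

It would be called as: extract_invoices xml_input test

Because the comparison is against the content of INVOICE_NO only, a number that happens to appear inside NAME should not trigger a match.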

Maybe something like this?

$
$ # display the contents of the xml file
$ cat f1.xml
<INVOICES>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>
$
$ # Perl one-liner to extract the information
$ perl -lne 'BEGIN{undef $/} while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {$x=$1; print $x if $2 =~ /2345|5678/}' f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$

tyler_durden

Which version of grep are you using?

Which operating system is that?

I am using the version below:

kamaraj@kamaraj-laptop:~$ grep -V
GNU grep 2.5.4

Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

The Perl solution by durden_tyler works well provided that the search term does not also appear in an unrelated element, e.g.

<INVOICES>
<INVOICE>
<NAME>Customer A 2345</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
</INVOICES>

The Perl example will incorrectly output:

<INVOICE>
<NAME>Customer A 2345</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>

A more precise solution is to use XSLT. If xsltproc is available to you (it ships with libxslt and is packaged for most GNU/Linux distributions), the following XSL stylesheet matches on the INVOICE_NO element only:

<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- xsltproc: pass each value as a quoted string via the param option, e.g. invno1 "'value'" -->
   <xsl:param name="invno1">XXXX</xsl:param>
   <xsl:param name="invno2">XXXX</xsl:param>

   <xsl:output method="xml" indent="yes" omit-xml-declaration="yes"/>

   <xsl:template match="/">
      <xsl:apply-templates select="INVOICES"/>
   </xsl:template>

   <xsl:template match="INVOICES">
      <xsl:apply-templates select="INVOICE"/>
   </xsl:template>

   <xsl:template match="INVOICE">
      <xsl:if test="./INVOICE_NO = $invno1 or ./INVOICE_NO = $invno2">
         <xsl:copy-of select="." />
      </xsl:if>
   </xsl:template>

</xsl:stylesheet>

For example:

$ xsltproc --param invno1 "'1234'" --param invno2 "'3456'" example.xsl example.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>1234</INVOICE_NO>
</INVOICE><INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>3456</INVOICE_NO>
</INVOICE>

Hi Kamaraj,

I am using HP-UX S29BF226 B.11.23 U ia64 4081221980 unlimited-user license

Thanks
Angshuman

Hi fpmurphy,

First, I would like to thank all of you for taking the time to reply to my question.

xsltproc is not available. I tried the solution provided by durden_tyler and it works fine except in the scenario that you have highlighted. Although the chance of an invoice number appearing in any other tag is remote, I should still take care of that.

Is there any other way I can achieve this? I would also like to raise another concern. In my question, I mentioned that it is required to print <INVOICE>.....</INVOICE> provided <INVOICE_NO>2345</INVOICE_NO>. When the value is passed through a variable, the following code does not return anything. I modified durden_tyler's solution as below

You could make your regex more precise to come up with accurate results -

$
$ perl -lne 'BEGIN{undef $/} while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {$x=$1; print $x if $2 =~ /<INVOICE_NO>(2345|5678)<\/INVOICE_NO>/}' f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$

The one-liner will have to change if you want to pass a shell variable to it -

$
$
$ export MY_INVOICE_NO="2345"
$
$
$ perl -lne "BEGIN{undef $/}
             while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {\$x=\$1; print \$x if \$2=~/<INVOICE_NO>$MY_INVOICE_NO<\/INVOICE_NO>/}" f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
$
$

You could be more creative and pass multiple Invoice Numbers thusly -

$
$
$ export MY_INVOICE_NOS="2345|5678"
$
$
$ perl -lne "BEGIN{undef $/}
             while(/(<INVOICE>(.*?)<\/INVOICE>)/sg) {\$x=\$1; print \$x if \$2=~/<INVOICE_NO>($MY_INVOICE_NOS)<\/INVOICE_NO>/}" f1.xml
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>2345</INVOICE_NO>
</INVOICE>
<INVOICE>
<NAME>Customer A</NAME>
<INVOICE_NO>5678</INVOICE_NO>
</INVOICE>
$
$
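
For the awk route asked about earlier, the shell variable can be handed over with -v, so the awk program stays in single quotes and needs no escaped dollar signs. This is only a sketch: it assumes a POSIX awk and one element per line as in the sample, and filter_invoices is a made-up name. The parentheses around the alternation keep each number anchored between the opening and closing INVOICE_NO tags.

```shell
#!/bin/sh
# Sketch: pass the shell variable to awk with -v instead of letting the
# shell interpolate it into the program text.
MY_INVOICE_NOS="2345|5678"   # same alternation as in the Perl example

# filter_invoices is a hypothetical helper; $1 is the XML file.
filter_invoices() {
    awk -v nos="$MY_INVOICE_NOS" '
        # Build the pattern once; the group keeps the alternation anchored.
        BEGIN { pat = "<INVOICE_NO>(" nos ")</INVOICE_NO>" }

        # Start of a record: reset the buffer and the match flag.
        /<INVOICE>/   { buf = ""; hit = 0; inside = 1 }

        # Collect the record and test each line against the pattern.
        inside        { buf = buf $0 "\n"; if ($0 ~ pat) hit = 1 }

        # End of a record: print it only if INVOICE_NO matched.
        /<\/INVOICE>/ { if (inside && hit) printf "%s", buf; inside = 0 }
    ' "$1"
}
```

It would be called as: filter_invoices f1.xml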

However, if you want to do serious XML work then XSLT is the way to go, as suggested by fpmurphy.

tyler_durden