Extracting data between tags based on search string from unix file

gaya · November 18, 2009, 10:33am

The input file is on a Linux box, and all of its data is on a single line of 1699741696 characters.

Sample Input:

<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document><document coll="uspatfull" version="0"><CMSdoc>yyy</CMSdoc><tag1>a</tag1></document><document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc><tag1>aaa</tag1></document>
</xxx>

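Output should be:
Every document element that contains the search string (for example "virus") anywhere between its opening and closing document tags should appear in the output; all other document elements should be dropped.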

<xxx><document coll="uspatfull" version="0"><CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document><document coll="uspatfull" version="0"><CMSdoc>likeavirusesxxx</CMSdoc><tag1>aaa</tag1></document></xxx>

Thanks!

fpmurphy · November 18, 2009, 11:01am

Your input is not valid XML

<xml>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus<tag1>1</tag1></CMSdoc>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>yyy<tag1>a</tag1></CMSdoc>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx<tag1>aaa</tag1></CMSdoc>
</document>
</xml>

First, <xml> is not an acceptable element name: the XML specification reserves names beginning with xml, in any letter case. Second, you should avoid embedding another element (tag1) within an element's text content, as occurs in the CMSdoc element.

If you can modify your file to be a valid XML document, what you want to do will be much easier to achieve.
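
One quick way to confirm that a file parses before trying anything else is to hand it to an XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is made up for illustration, and the sketch reads the whole text into memory, so it is meant for small samples rather than the full-size file.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised for problems such as mismatched or unclosed tags.
        return False
    return True


if __name__ == "__main__":
    # Properly nested tags.
    print(is_well_formed("<xxx><document><CMSdoc>yyy</CMSdoc></document></xxx>"))
    # CMSdoc is never closed, so the parser rejects this one.
    print(is_well_formed("<xxx><document><CMSdoc>yyy</document></xxx>"))
```

Note that this only checks well-formedness; it does not validate against a schema or DTD.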

gaya · November 18, 2009, 3:24pm

Please consider the following valid XML as input.
I have replaced the reserved name xml used as a tag and moved the tags out from under CMSdoc.

Modified input:

<xxx>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>yyy</CMSdoc>
<tag1>a</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>

Expected output:

<xxx>
<document coll="uspatfull" version="0">
   <CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
<document coll="uspatfull" version="0">
   <CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>

Can you please look into this: extracting data between tags based on a search string from a Unix file?

fpmurphy · November 19, 2009, 10:29am

The best way to handle a requirement like this is to use an XSLT processor.

Here is a stylesheet that transforms the supplied document into the required output.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

   <!-- pass in searchterm as -param searchterm "'virus'"  -->
   <xsl:param name="searchterm" />

   <xsl:output method="xml" indent="yes"/>

   <xsl:template match="//document">
      <xsl:if test=".//text()[contains(., $searchterm)]">
         <xsl:copy-of select="." />
      </xsl:if>
   </xsl:template>

   <xsl:template match="/">
      <xsl:element name="xxx">
         <xsl:apply-templates select="//document" />
      </xsl:element>
   </xsl:template>

</xsl:stylesheet>

Here is the output from xsltproc, the processor that comes with libxslt:

$ xsltproc -param searchterm "'virus'" file.xsl file.xml
<?xml version="1.0"?>
<xxx>
  <document coll="uspatfull" version="0">
<CMSdoc>xxxantivirus</CMSdoc>
<tag1>1</tag1>
</document>
  <document coll="uspatfull" version="0">
<CMSdoc>likeavirusesxxx</CMSdoc>
<tag1>aaa</tag1>
</document>
</xxx>
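
For readers without an XSLT processor at hand, the same filtering idea can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not a port of the stylesheet: the function name filter_documents is hypothetical, the document elements are assumed to be direct children of the root (the stylesheet matches them at any depth), and the whole document is parsed into memory, which may be impractical for an input as large as the one described in the question.

```python
import xml.etree.ElementTree as ET


def filter_documents(xml_text: str, searchterm: str) -> str:
    """Keep only the <document> children whose text contains searchterm."""
    root = ET.fromstring(xml_text)
    # Copy the list first because elements are removed while looping.
    for doc in list(root.findall("document")):
        # itertext() yields each piece of text nested inside the element,
        # similar in spirit to the .//text() test in the stylesheet.
        if not any(searchterm in text for text in doc.itertext()):
            root.remove(doc)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<xxx>'
        '<document coll="uspatfull" version="0">'
        '<CMSdoc>xxxantivirus</CMSdoc><tag1>1</tag1></document>'
        '<document coll="uspatfull" version="0">'
        '<CMSdoc>yyy</CMSdoc><tag1>a</tag1></document>'
        '<document coll="uspatfull" version="0">'
        '<CMSdoc>likeavirusesxxx</CMSdoc><tag1>aaa</tag1></document>'
        '</xxx>'
    )
    print(filter_documents(sample, "virus"))
```

As in the stylesheet, the search term is compared against element text only, so a term that appears solely in a tag or attribute name does not cause a match.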

ghostdog74 · November 19, 2009, 11:50am

gawk '/virus/{print $0RT}' RS="</document>" file

rdcwayx · November 22, 2009, 4:03am

Good solution using gawk.
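
The record-separator idea behind the gawk one-liner (treat each closing document tag as the end of a record and keep the records that contain the search string) can also be sketched in Python for a file that is too large to load at once. This is only a sketch under assumptions: the names records and filter_stream are invented for illustration, the match is a plain substring test on the raw text (so it would also match inside tag or attribute names), and every element whose name starts with document is assumed to be one of the records. Unlike the one-liner, it copies the text around the records, such as the opening and closing xxx tags, to the output unchanged.

```python
import io
from typing import Iterator, TextIO

END_TAG = "</document>"


def records(stream: TextIO, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield pieces of the stream, each ending with END_TAG (except the tail)."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        parts = buffer.split(END_TAG)
        # The last part has no terminator yet, so keep it for the next chunk.
        buffer = parts.pop()
        for part in parts:
            yield part + END_TAG
    if buffer:
        yield buffer


def filter_stream(src: TextIO, dst: TextIO, term: str,
                  chunk_size: int = 1 << 20) -> None:
    """Copy src to dst, dropping <document> records that lack term."""
    for record in records(src, chunk_size):
        start = record.find("<document")
        if start == -1 or not record.endswith(END_TAG):
            # Wrapper text such as the trailing "</xxx>": copy it unchanged.
            dst.write(record)
            continue
        # Text before the record (e.g. "<xxx>" or a newline) is always kept.
        dst.write(record[:start])
        if term in record[start:]:
            dst.write(record[start:])


if __name__ == "__main__":
    src = io.StringIO(
        "<xxx><document><CMSdoc>xxxantivirus</CMSdoc></document>"
        "<document><CMSdoc>yyy</CMSdoc></document>\n</xxx>\n"
    )
    out = io.StringIO()
    filter_stream(src, out, "virus")
    print(out.getvalue(), end="")
```

For a real file, src and dst would be handles returned by open(); only about one chunk plus one record is held in memory at a time, so the single-line layout of the input is not a problem.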