regex/shell script to Parse through XML Records

Jerrad · June 11, 2009, 11:59am

Hi All,

I have been working on something that doesn't seem to have a clear regex solution and I just wanted to run it by everyone to see if I could get some insight into the method of solving this problem.

I have a flat text file that contains billing records for users. The records are stored as XML, with each record starting at <record> and ending at </record>.

What I am trying to do is search for a user's ID and extract the complete record for that user.

Sample Data

<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>janedoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>

What I would like to be able to do is search for jondoe and have it spit out all records for jondoe.

The output would be the following. There could be multiple records in the file for this user, so each matching record would need to be written to a text file or to standard output as it is found.

<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record>

I started with a regex that tries to match <record>, then jondoe, then </record>:

<record>(\s|\S)+jondoe(\s|\S)+</record>

However, this selects all of the records, since the match runs from the first <record> to the last </record>. Even if I could extract just the portion I want, I am not sure how to have it remember where it left off and keep chewing through the file without creating duplicates.
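
For illustration, the over-matching comes from the greedy (\s|\S)+, which happily runs across </record> boundaries. A minimal Python sketch of one way around it is below. It assumes the whole file fits in memory and that records are never nested; find_records is a made-up helper name, not something from this thread. The negative lookahead keeps the match from crossing a </record>, and finditer resumes after each match, so no record is returned twice.

```python
import re

def find_records(text, userid):
    """Return every <record>...</record> block whose <userid> equals userid."""
    pattern = re.compile(
        r"<record>"
        r"(?:(?!</record>).)*?"      # any text, but never step over a </record>
        r"<userid>" + re.escape(userid) + r"</userid>"
        r".*?</record>",             # lazily continue to this record's own end
        re.DOTALL,                   # let "." match newlines as well
    )
    # finditer picks up where the previous match ended, so there are no duplicates.
    return [m.group(0) for m in pattern.finditer(text)]
```

Calling find_records(open("billing.txt").read(), "jondoe") would return a list of record strings; billing.txt is a placeholder file name.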

Since this is being performed on Solaris 10, I wasn't able to use some of the more advanced grep features like grep -B(x) -A(x).

Thanks in advance for any help you can provide

edgarvm · June 11, 2009, 1:43pm

Maybe you should try XPath. You can find a Perl module for XML processing on cpan.org.
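
As a rough sketch of the parse-then-select idea, here is the same thing with Python's standard xml.etree.ElementTree instead of a Perl module (a swapped-in tool, not the one suggested above). It assumes the flat file has no XML declaration and that every record is complete. The <records> wrapper element and the records_for_user name are made up for the example; the wrapper is needed because the flat file has no single root element.

```python
import xml.etree.ElementTree as ET

def records_for_user(flat_text, userid):
    """Return the serialized <record> elements whose <userid> equals userid."""
    # Wrap the records in a hypothetical <records> root so the text parses
    # as one well-formed document.
    root = ET.fromstring("<records>" + flat_text + "</records>")
    # "record" is a simple path expression selecting the direct children.
    matches = [r for r in root.findall("record")
               if r.findtext("userid") == userid]
    # short_empty_elements=False keeps <billingID></billingID> in its long form;
    # strip() drops the whitespace that follows each record in the input.
    return [ET.tostring(r, encoding="unicode", short_empty_elements=False).strip()
            for r in matches]
```

Because the comparison is done on the parsed element text, a search for jondoe will not pick up a user such as jondoe2.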

ghostdog74 · June 11, 2009, 7:45pm

Does "</record><record>" always appear together like this, or on separate lines?

casman46 · June 12, 2009, 8:47am

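Using the sample data, I obtained the requested output with this script:

#!/usr/bin/ksh
# Usage: pass the userid as the first argument; reads file3, writes awk.out

gawk -v name="$1" '
BEGIN {
   # Split the input on the closing tag so each awk record is one <record>.
   RS = "</record>"; ORS = "</record>\n"
}

# Print only records whose <userid> element exactly matches the requested name.
index($0, "<userid>" name "</userid>") > 0 {
   sub(/^\n/, "")      # drop the newline left over when </record> ends a line
   print
}
' file3 > awk.out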
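
If gawk is not available, the same record-at-a-time idea can be sketched in Python. This is only an illustration under the assumption that records are delimited exactly as in the sample data; iter_records and matching_records are made-up names. It reads line by line and keeps only the current record in memory, so it also copes with "</record><record>" sharing a line.

```python
def iter_records(lines):
    """Yield one complete <record>...</record> string at a time."""
    buf = ""
    for line in lines:
        buf += line
        # A single line may close one record and open the next.
        while "</record>" in buf:
            rec, buf = buf.split("</record>", 1)
            start = rec.find("<record>")
            if start != -1:
                yield rec[start:] + "</record>"

def matching_records(lines, userid):
    """Return the records whose <userid> element is exactly userid."""
    needle = "<userid>" + userid + "</userid>"
    return [rec for rec in iter_records(lines) if needle in rec]
```

Any iterable of lines works as input, for example an open file object: matching_records(open("file3"), "jondoe").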

fpmurphy · June 12, 2009, 9:59am

An XSL stylesheet is the easiest way to process your records. Consider the following sample set of records:

<records>
   <record>
       <recId>1</recId>
       <created>Wed Dec 10 06:00:16 2008</created>
       <userid>joebloggs</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>3</recId>
       <created>Wed Jan 19 06:00:16 2008</created>
       <userid>jjhollis</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>

which is a well-formed XML document containing 4 records.

Using the following XSL stylesheet with xsltproc:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- pass in userid as -param userid "'joedoe'"  -->
<xsl:param name="userid" />

<xsl:output method="xml" indent="yes" />

<xsl:template match="records">
<records>
   <xsl:apply-templates select="record" />
</records>
</xsl:template>

<xsl:template match="record">
   <xsl:if test="userid=$userid">
       <xsl:copy-of select="." />
   </xsl:if>
</xsl:template>

</xsl:stylesheet>

you can output all the records for "jondoe" to stdout as follows:

$ xsltproc --param userid "'jondoe'" file42.xsl file42.xml
<?xml version="1.0"?>
<records>
  <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
  <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>
$

Jerrad · June 12, 2009, 5:59pm

Thanks for all the replies, guys. I will try some of the suggestions you made and see what I can come up with.