Extracting a part of XML File

Hi Guys,

I have a very large XML feed (2.7 MB) which crashes the server when it is parsed. To reduce the load on the server I have a cron job running every 5 minutes; it fetches the file from the feed host and saves it on the local machine.
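The cron entry is roughly like this (the host name and paths here are just placeholders, and it assumes wget is available on the box):

*/5 * * * * wget -q -O /path/to/local/xmldump http://feedhost.example.com/news.xml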

This does not solve the problem on its own, as the complete file still gets loaded on the server. The file looks something like this:

<?xml version="1.0" standalone="no"?>
<IRXML CorpMasterID="">
<NewsReleases PubDate="20081104" PubTime="16:48:03">
<NewsCategory Category="">
<NewsRelease ReleaseID="" DLU="20081104 16:47:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="16:33:00">11/4/2008 4:33:00 PM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>
<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

I want to write a shell script which extracts only the part from
<NewsRelease> to </NewsRelease>
Something like:

<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

There is one more problem: when the file is downloaded on UNIX there are no line breaks, so the complete file appears to be on a single line :(.

Any help would be appreciated. Thanks,
Shridhar

sed -n '/<NewsRelease R/,/<\/NewsRelease>/p' xmldump >outputfile

Regarding the end-of-line problem: what format is the file currently in, i.e. does it have LF, CR/LF or CR as its end-of-line marker? Which tool to use depends on the format.
To go from DOS to UNIX, use dos2unix, or open the file in vim and :set fileformat=unix
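If dos2unix is not installed, tr can strip the carriage returns as well (filenames here are only examples):

tr -d '\r' < xmldump > xmldump.unix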

Thanks for the reply.

There seems to be some problem with the command. It executes, but when I look at the output file, it is a complete copy of the XML feed.
I don't think there is a problem with the file format, because I do not see ^M in the file.
I think the problem could be with the multiple occurrences of "NewsRelease" in the file.

Also, my requirement is that I need to copy the first 5 occurrences of <NewsRelease> ... </NewsRelease> from the XML feed to another file, as I need to parse those first 5 news releases to HTML using XSL.

Please let me know if this is possible.

Thanks again.
Shridhar

Hope this helps some.

It will print only the first five parts enclosed between <NewsRelease and /NewsRelease>.

awk '/<NewsRelease/,/\/NewsRelease/{
	# only print while fewer than 5 releases have been seen
	if (n < 5)
		print
	# count a release once its closing tag goes by
	if (index($0, "/NewsRelease") != 0)
		n++
}' filename
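To feed the result into your XSL step, just redirect the output to a file; the same thing as a one-liner (filenames are only examples):

awk '/<NewsRelease/,/\/NewsRelease/{if(n<5)print;if(index($0,"/NewsRelease")!=0)n++}' xmldump > first5.xml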

Thanks for the reply, it worked :smiley: ... I have to add a few more things to make it work completely.

Warm Regards,
Shridhar

Why not extract the first 5 releases using XSLT i.e.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//NewsReleases">
      <xsl:with-param name="mycount" select="5"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="NewsReleases">
    <xsl:param name="mycount"/>
      <xsl:element name="NewsReleases">
      <xsl:attribute name="PubDate">
         <xsl:value-of select="@PubDate"/>
      </xsl:attribute>
      <xsl:attribute name="PubTime">
         <xsl:value-of select="@PubTime"/>
      </xsl:attribute>
      <xsl:text>
</xsl:text>
      <xsl:for-each select="//NewsRelease[position() &lt;= $mycount]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:text>
</xsl:text>
      </xsl:element>
  </xsl:template>

</xsl:stylesheet>

This assumes that your IRXML document is well-formed XML - which is not the case for the sample document you supplied.
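A quick way to check that is xmllint, which comes with libxml2, e.g. (the filename is just an example):

xmllint --noout feed.xml

It prints nothing if the document is well formed and reports the errors otherwise.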

Hi Murphy,

Thanks for the XSL.

I am already using an XSL stylesheet to convert the XML to HTML in Sun Portal Server (using the XML Provider). But I am facing a problem: whenever someone hits the server, it loads the complete XML file (around 2.5 MB), which puts a heavy load on the server. There are 4 servers, and they go down one by one because of the load.

I thought that if I could trim the file down to a smaller one at the UNIX level, that might solve the problem (I have set up a crontab job which fetches the file from the XML host server and writes it to the file system; then I trim the file in UNIX, and then I will parse the resulting XML using an XSL stylesheet).

Is there a UNIX-level processor that can convert XML to HTML using XSL?

If you want, I can share the XSL stylesheet that I am using for the conversion.

Thanks and Regards,
Shridhar

There are a number of free XSLT processors available for UNIX platforms. The most common is probably xsltproc, which comes with libxslt.
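Typical usage is simply (filenames are just examples):

xsltproc first5.xsl feed.xml > releases.html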

BTW, if your input document is that large and is causing the problems you describe, I suggest you use a SAX or StAX processor instead of a DOM-based one. If you have access to IEEE Computer Society publications, there was an article in the September 2008 issue of Computer by Lam, Ding, and Liu on XML document parsing performance characteristics which gives more information and benchmarks.

Thanks Murphy :b:, xsltproc resolved the issue. There were two problems I faced.
One was that there was an XML declaration at the beginning of the HTML output, and the second was that the html and body tags were missing.

To remove the XML declaration I used:
sed '1d' input_with_xml_tag.html > output_without_xml_tag.html

and I was not too bothered about the missing html and body tags, as the Portal takes care of those.

Thanks and Regards,
Shridhar