Parse a string in XML file using shell script

Hi! I'm new here and don't know much about shell scripting. I'd like help creating a shell script that parses the value of the status element in an XML file. Please see the sample XML file below. Can you help me create a simple script to get the value of status? It would be even better if I could get the value of each parameter in the XML file. I really need it as soon as possible. Hope someone can help me. Thanks!

<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009
</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</call
ing><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</
freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>

If you have the whole XML on ONE line (a very simplified, non-bulletproof approach):
nawk -f ay.awk ay.xml
ay.awk:

BEGIN {
   FS="[><]"
}
{
  for(i=8; i<=NF; i+=2) {
          if ( $i ~ /^[/]/ ) continue
          printf("name->[%s] value->[%s]\n", $i, $(i+1))
  }
}

If your XML is in a different format, please post a sample using vB-code tags.

I don't fully understand your requirement, but I hope this helps:

sed 's/\(<Pattern>\)\(.*\)\(</Pattern>\)/\2/' <input_xml-file>

Regards,
aajan

Thanks for the reply. I tried sed but got this error: sed: command garbled: s/\(<status>\)\(.*\)\(</status>\)/\2/. I just need to get the values of all the status tags from the XML file, like <status>201</status>, because the file contains different values. The XML file is huge, approximately 5 MB. Please see the sample portion of the file below, which is simply repeated with different values. I'm having trouble getting the status values because the file is huge and the format is not organized; sometimes a single line contains several occurrences of status. I can't change the format since it's a CDR. It would be even better if I could also get the values of other parameters like appid, threadid, date, chdate, etc. I hope this makes sense. Thanks again!

<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009
</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</call
ing><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</
freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>
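
A note on the "command garbled" error: the s command uses / as its delimiter, so the unescaped / inside </status> ends the pattern early and sed cannot parse what follows. Escaping it as <\/status> or choosing a different delimiter avoids this. The lines below are a minimal sketch using | as the delimiter; the sample line stored in the variable is made up for illustration. Because .* is greedy, this prints only the last status found on a line, so it is not a complete answer for lines that hold several records.

```shell
# Hypothetical one-record sample; in practice the CDR file would be fed to sed.
sample='<cdr><status>201</status><type>103</type></cdr>'

# '|' is the s-command delimiter here, so the '/' in </status> stays literal.
status=$(printf '%s\n' "$sample" | sed 's|.*<status>\(.*\)</status>.*|\1|')

echo "$status"
```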

Will the file have only one <status> tag?

There are many status tags in the file with different values like 201, 153, 28 etc.

Try this out, though I don't know how well it will work. :confused:

awk '/<status>/,/<\/status>/' filename | sed 's/\(.*\)\(<status>\)\(.*\)\(<\/status>\)\(.*\)/\3/'

Regards,
aajan

Hi, thanks for taking the time to reply to my post. I tried your suggestion but I get the error below. Is there another way? Also, if you have time, can you please explain the command? Thanks a lot! Sorry, I really don't know shell scripting. :) Have a nice day!

awk: record `<?xml version="1.0"?...' too long

Oh, that really sounds bad!

Actually, I tried it with your sample input and it works fine.
The error occurs because the lines (records) in your input file are too long for awk to handle.

Anyway, I will try to come up with another solution.

Regards,
aajan
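
One possible way around the "record too long" limit is to break the input into short lines before any line-oriented tool sees it. The sketch below turns every < into a newline with tr, so each line holds at most one tag and its text, and then lets sed pick the wanted tag. The function name extract_tag is hypothetical, and joining the wrapped lines first assumes that the line breaks in the file carry no meaning. It is not a real XML parser: attributes, entities, and nested elements with the same name are ignored.

```shell
# extract_tag NAME FILE
# Print the text of every <NAME>...</NAME> element in FILE, one per line.
extract_tag() {
    # 1. Join all lines (tags may be wrapped) and end with a single newline.
    # 2. Put each tag on its own line: "<status>201" becomes "status>201".
    # 3. Print the text that follows the opening tag NAME.
    { tr -d '\n' < "$2"; echo; } | tr '<' '\n' | sed -n "s/^$1>//p"
}
```

For example, extract_tag status cdr.xml would list every status value, where cdr.xml stands for the actual CDR file.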

Try something like...

$ cat data.xml
<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</calling><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>
$ cat transform.xsl
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select=".//message/cdr">
      appid = <xsl:value-of select=".//appid"/>
      status = <xsl:value-of select=".//status"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
$ xsltproc transform.xsl data.xml
<?xml version="1.0"?>

      appid = testbed
      status = 201
$

See XSL Transformations - Wikipedia, the free encyclopedia

Hi Ygor,

Thanks! I tried the solution you mentioned but I got the following error. Maybe I can't use it. Do you know of any other solution? Thanks again for the help!

bash: xsltproc: command not found

Did you check whether xsltproc exists at /usr/bin/xsltproc?

Hi Matrixmadhan,

Yes, I checked, and it does not exist. Do you know another way to do it without using xsltproc? I'm not familiar with it. I really need to know how to parse strings from an XML file that is around 7 MB in size. I need to get the values between <date> </date>, <time> </time> and <status> </status>. Thanks in advance!

inputfile

>cat a
<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</calling><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>

script

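#! /opt/third-party/bin/perl

use strict;
use warnings;

# Read the input file "a" line by line.
open(FILE, "<", "a") or die "Cannot open a: $!";

while(<FILE>) {
  chomp;
  # Split on "><" so that each piece looks like "name>value</name".
  my @arr = split(/></);
  foreach (@arr) {
    # Keep only pieces that contain both an opening and a closing tag.
    if( />/ && /</ ) {
      s/(.*)>(.*)<.*$/$1|$2/;
      print "$_\n";
    }
  }
}

close(FILE);

exit 0;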

output

appid|testbed
threadid|6
origin|node1
date|20071009
time|12:45:36
chdate|20071009
chtime|12:45:43
status|201
type|103
calling|644
cparty|xxxxxxx
accnum|xxxxxx
debirate1|0.0
cos|-1
strtbal|0.0
freesms|0
tuc|0

Hi Matrixmadhan,

It's working! You're really great! I've been looking for a script to do this for a long time, and the solution you provided works. Thanks so much! If it's not too much to ask, can you please help me modify the script so that the output looks like the one below? Thanks in advance!

expected output:

date time chdate chtime status calling cparty
20071009 12:45:36 20071009 12:45:43 201 644 xxxxxxx
20071010 03:09:13 20071010 03:10:07 29 644 xxxxxxx

based on your input,

input

>cat a
<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</calling><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>
<?xml version="1.0"?><message><cdr version="1.0"><appid>testbed</appid><threadid>6</threadid><origin>node1</origin><date>20071009</date><time>12:45:36</time><chdate>20071009</chdate><chtime>12:45:43</chtime><status>201</status><type>103</type><calling>644</calling><cparty>xxxxxxx</cparty><accnum>xxxxxx</accnum><debirate1>0.0</debirate1><cos>-1</cos><strtbal>0.0</strtbal><freesms>0</freesms><tuc>0</tuc><fandftype></fandftype></cdr></message>

script

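#! /opt/third-party/bin/perl

use strict;
use warnings;

open(FILE, "<", "a") or die "Cannot open a: $!";

my $data = "";   # values of the first record, printed after the header

while(<FILE>) {
  chomp;
  my @arr = split(/></);
  foreach (@arr) {
    if( />/ && /</ ) {
      if( $. == 1 ) {
        # First line: print the tag name now and save the value for later.
        s/(.*)>(.*)<.*$/$1|$2/;
        my($tmp1, $tmp2) = split(/\|/);
        $data .= (" " . $tmp2);
        printf "%s ", $tmp1;
      }
      else {
        # Later lines: print only the value.
        s/(.*)>(.*)<.*$/$2/;
        printf "%s ", $_;
      }
    }
  }
  print "\n";
  print "$data\n" if( $. == 1 );
}

close(FILE);

exit 0;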

output

appid threadid origin date time chdate chtime status type calling cparty accnum debirate1 cos strtbal freesms tuc
 testbed 6 node1 20071009 12:45:36 20071009 12:45:43 201 103 644 xxxxxxx xxxxxx 0.0 -1 0.0 0 0
testbed 6 node1 20071009 12:45:36 20071009 12:45:43 201 103 644 xxxxxxx xxxxxx 0.0 -1 0.0 0 0

Hi Matrixmadhan,

Thanks for taking the time to help me with my problem. :slight_smile: I tried the solution you provided but the result is different. Can we have just one heading, as in the expected output below? Also, could you explain what the script does? Thanks a lot! I really appreciate all your help!

expected output:
date time chdate chtime status calling cparty
20071009 12:45:36 20071009 12:45:43 201 644 xxxxxxx
20071010 03:09:13 20071010 03:10:07 29 644 xxxxxxx

output of the script you've provided:
date time chdate chtime status calling cparty date time chdate chtime status calling cparty 20071009 12:45:36 20071009 12:45:43 201 644 xxxxxxx 20071010 03:09:13 20071010 03:10:07 29 644 xxxxxxx
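
One way to get a single heading followed by one row per record, regardless of how the records are spread across lines, is to split the input at every < and let awk collect the values until it reaches a closing </cdr>. This is only a sketch under assumptions: the function name cdr_table is hypothetical, the column list is taken from the expected output above, line breaks in the file are assumed to carry no meaning, and a field missing from a record is simply left empty. On Solaris, nawk may be needed in place of awk.

```shell
# cdr_table FILE
# Print one header line, then one space-separated row per <cdr> record.
cdr_table() {
    fields="date time chdate chtime status calling cparty"
    echo "$fields"
    # Join wrapped lines, then place each tag on its own line ("status>201").
    { tr -d '\n' < "$1"; echo; } | tr '<' '\n' |
    awk -v fields="$fields" '
        BEGIN { n = split(fields, want, " ") }
        /^\/cdr>/ {
            # End of a record: print the wanted values in column order.
            row = val[want[1]]
            for (i = 2; i <= n; i++) row = row " " val[want[i]]
            print row
            for (k in val) delete val[k]
            next
        }
        /^[A-Za-z0-9_]+>/ {
            # Opening tag followed by its text: remember it by tag name.
            p = index($0, ">")
            val[substr($0, 1, p - 1)] = substr($0, p + 1)
        }
    '
}
```

Calling cdr_table with the CDR file as its argument writes the table to standard output, which can then be redirected to a file.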

I checked it again and it's working as expected.

Maybe the input format we each used is slightly different, or there is a bug in the script? :wink:

Could you please post the input file that you used (the one with 2 records)?

I can take another look.

Hi Matrixmadhan,

For the sample that you provided, it's working. But my actual input is a file of more than 5 MB. When I run the script on it, its output is different, as I mentioned in my previous post. I have attached only a portion of the file, since the full file is more than 5 MB and I can't send it. Thanks again! :slight_smile:

Before I explain the script: it was written on the fly, so it's definitely not optimized. :slight_smile:

open(FILE, "<", "a");

Open the file named "a" for reading; this is as simple as the code looks.

while(<FILE>) {
  chomp;
  my @arr = split(/></);

Read the file line by line, split each input record on the delimiter '><', and store the pieces in the array '@arr'.

foreach (@arr) {
    if( />/ && /</ ) {

Iterate through the array and process a piece only when it contains both '>' and '<', because only those pieces hold a tag name together with its value.

if( $. == 1 ) {
        s/(.*)>(.*)<.*$/$1|$2/;
        my($tmp1, $tmp2) = split(/\|/);
        $data .= (" " . $tmp2);
        printf "%s ", $tmp1;
      }

If it's the first line, the header has to be printed as well; it is not printed for subsequent XML records. The regex captures the tag name and the value as groups 1 and 2.
The tag name is printed right away as part of the header, and the value is appended to the variable $data.

 else {
        s/(.*)>(.*)<.*$/$2/;
        printf "%s ", $_;
      }
    }
  }

If it's not the first line, print only the data and not the header.

 print "\n";

This newline is needed so that the header and the data do not run together on one line.

 print "$data\n" if( $. == 1 );

Now, if it's the first line, print the values saved in $data so that they appear below the header.

}

close(FILE);

Close the file.

Hope this explains the logic! :slight_smile: