XML Parsing using awk

Tons · November 21, 2012, 3:30am

Hi All,

I have a problem to resolve. For the following XML file, I need to extract the values based on the tag name, and I would prefer to do this with awk. So far I have used a sed command to strip the tags (s/<SeqNo>//).

New tags can be introduced into this file, so the parsing needs to be driven by the tag name. Any suggestions for an awk command?

<Target>
    <SeqNo>43156489079</SeqNo>
    <AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
    <redcode>SKNEQGGEVHW</redcode>
    <GenError>Upload-Success</GenError>
</Target>
<Target>
    <SeqNo>43156489079</SeqNo>
    <AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
    <redcode>SKNEQGGEVHW</redcode>
    <GenError>Upload-Success</GenError>
</Target>

elixir_sinari · November 21, 2012, 3:33am

What's the expected output?
And please wrap your code and data samples with code tags to preserve formatting.

itkamaraj · November 21, 2012, 3:39am

Something like this?

 
$ nawk -F"[<>]" -v pat="SeqNo" '$0~pat{print $3}' a.txt
43156489079
43156489079
$ nawk -F"[<>]" -v pat="redcode" '$0~pat{print $3}' a.txt
SKNEQGGEVHW
SKNEQGGEVHW
$ nawk -F"[<>]" -v pat="AuthenticationToken" '$0~pat{print $4}' a.txt
![CDATA[nY+sHZ2PrBmdj6wVnY]]
![CDATA[nY+sHZ2PrBmdj6wVnY]]
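
As a follow-up to the field-number approach above, here is a rough sketch that looks a value up purely by tag name and also unwraps the CDATA marker, so no field position has to be known in advance. The function name get_tag and the temporary sample file are made up for this illustration, and it assumes the opening and closing tag sit on the same line, as in the posted data.

```shell
# Sample data copied from the first post, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Target>
    <SeqNo>43156489079</SeqNo>
    <AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
    <redcode>SKNEQGGEVHW</redcode>
    <GenError>Upload-Success</GenError>
</Target>
EOF

# get_tag TAG FILE (hypothetical helper): print the text between <TAG> and
# </TAG> for every line that holds both, stripping a CDATA wrapper if present.
get_tag() {
    awk -v tag="$1" '
    BEGIN { otag = "<" tag ">"; ctag = "</" tag ">" }
    {
        s = index($0, otag)              # position of the opening tag
        e = index($0, ctag)              # position of the closing tag
        if (s == 0 || e <= s) next       # tag pair not on this line
        val = substr($0, s + length(otag), e - s - length(otag))
        # Unwrap <![CDATA[ ... ]]> (9 leading and 3 trailing characters)
        if (substr(val, 1, 9) == "<![CDATA[" && substr(val, length(val) - 2) == "]]>")
            val = substr(val, 10, length(val) - 12)
        print val
    }' "$2"
}

seq=$(get_tag SeqNo "$sample")
tok=$(get_tag AuthenticationToken "$sample")
printf '%s\n%s\n' "$seq" "$tok"
rm -f "$sample"
```

index() is used instead of a regex match so that characters in the tag name are never treated as regex metacharacters.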

Jotne · November 21, 2012, 3:41am

Not quite sure what output you want. Like this?

awk -F"[<>]" '{print $5,$9,$13}' RS="</Target>\n" file
43156489079 SKNEQGGEVHW Upload-Success
43156489079 SKNEQGGEVHW Upload-Success

elixir_sinari · November 21, 2012, 3:43am

A regexp RS will not work with all awk implementations.
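
One way around the regexp RS, sketched under a few assumptions: keep the default line-based records, remember each value under its tag name, and print a row when the closing </Target> line arrives. The cols variable and the temporary sample file are invented for this example; the tag names are taken from the data posted later in the thread, and each element is assumed to sit on one line.

```shell
# Sample data in the shape posted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Target>
<SeqNo>43156489079</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>
<Target>
<SeqNo>43156489070</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>
EOF

# cols (hypothetical variable) lists the wanted tags in output order.
out=$(awk -F'[<>]' -v cols="SeqNo RedCode IncentiveGenError" '
BEGIN { n = split(cols, want, " ") }

# End of one <Target> block: print the wanted values, then forget them.
/<\/Target>/ {
    line = ""
    for (i = 1; i <= n; i++)
        line = line (i > 1 ? " " : "") val[want[i]]
    print line
    split("", val)          # portable way to clear an array
    next
}

# <Tag>value</Tag> on one line: $2 is the tag name, $3 the value.
NF >= 4 { val[$2] = $3 }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Because values are stored by tag name, the order of the tags inside a block does not matter, and unknown tags are simply ignored. Only a single-character FS bracket expression is used, which every POSIX awk accepts.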

Tons · November 21, 2012, 3:01pm

Hi, I want to parse this file and write it out in a delimited file format.
Source file:

<Target>
<SeqNo>43156489079</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>
<Target>
<SeqNo>43156489070</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>

Answer:

43156489079 SKNEQGGEVHW Upload-Success
43156489070 SKNEQGGEVHW Upload-Success

The order of the tags can change, and new tags can be introduced, so I want to parse this based on the tag name.

---------- Post updated at 03:01 PM ---------- Previous update was at 01:12 PM ----------

Thanks for your input. I used the following script:


nawk 'BEGIN{FS="[<|>]"}
/<SeqNo>/{SeqNo=$3}
/<RedCode>/{Redcd=$3}
{printf(" %s,%s\n",SeqNo,Redcd)}' newack.xml

The only problem I found is that it's duplicating the results. Any idea why?

Thanks,
Tons
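
On the duplication: the last action in the script above has no pattern in front of it, so awk runs it for every input line and prints the current SeqNo and Redcd each time. A minimal sketch of one possible fix is to print only on the closing </Target> line and reset the variables afterwards. The temporary sample file stands in for newack.xml, and the format string is written without the leading space of the original.

```shell
# Stand-in for newack.xml, using the data posted above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<Target>
<SeqNo>43156489079</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>
<Target>
<SeqNo>43156489070</SeqNo>
<AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
<RedCode>SKNEQGGEVHW</RedCode>
<IncentiveGenError>Upload-Success</IncentiveGenError>
</Target>
EOF

out=$(awk 'BEGIN { FS = "[<>]" }
/<SeqNo>/    { SeqNo = $3 }      # remember the values as they go by
/<RedCode>/  { Redcd = $3 }
# Print once per block, on the closing tag, then reset so that a block
# with a missing tag does not inherit the previous value.
/<\/Target>/ { printf("%s,%s\n", SeqNo, Redcd); SeqNo = Redcd = "" }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

Since each value is picked up by its own tag pattern, the order of the tags inside a block does not matter here either.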

birei · November 21, 2012, 4:15pm

Not awk, but here is one solution using the XML::Twig parser in Perl:

$ cat xmlfile 
<root>
  <Target>
    <SeqNo>43156489079</SeqNo>
    <AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
    <RedCode>SKNEQGGEVHW</RedCode>
    <IncentiveGenError>Upload-Success</IncentiveGenError>
  </Target>
  <Target>
    <SeqNo>43156489070</SeqNo>
    <AuthenticationToken><![CDATA[nY+sHZ2PrBmdj6wVnY]]></AuthenticationToken>
    <RedCode>SKNEQGGEVHW</RedCode>
    <IncentiveGenError>Upload-Success</IncentiveGenError>
  </Target>
</root>
$ cat script.pl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

{
        my $twig = XML::Twig->new(
                twig_handlers => {
                        'Target' => sub {
                                printf qq|%s\n|, 
                                        join q| |, 
                                        map { $_->trimmed_text } 
                                        grep { ! $_->is_cdata && $_->is_text } 
                                        $_->descendants
                        }
                },
        )->parsefile( shift );
}
$ perl-5.14.2 script.pl xmlfile 
43156489079 SKNEQGGEVHW Upload-Success
43156489070 SKNEQGGEVHW Upload-Success

Corona688 · November 21, 2012, 4:25pm

Parsing XML is not trivial.

However, because requests for XML-to-flatfile conversion come up so often, I have a script that works in some common situations.

$ cat xmlh.awk

BEGIN { RS="<";         FS=">";
        # Uncomment to make windows-readable text files
        # ORS="\r\n";

        # Change this to alter how many close-tags in a row are needed
        # before a row of data is printed.
        if(!DEP) DEP=1
        SEP="\t"
        }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        if("/"STACK[N] == $1)   D=(N-1);
        POP++;

        if(POP == DEP)
        {
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("%s%s %s", SEP, Z[2], Z[3]);
                        }

                        printf("\n");
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("%s%s", SEP, DATA[ARG[N]]);
                printf("\n");
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);
        sub(/\/$/, "", $(NF-1));

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        if((!MAX) || (D>MAX)) MAX=D;    # Save max depth

        # Handle parameters
        Q=split(A[2], B, " ");
        for(N=1; N<=Q; N++)
        {
                split(B[N], C, "=");
                gsub(/['"]/,"", C[2]);

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)          # First time this column is seen
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2)
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)          # First time this column is seen
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ awk -f xmlh.awk DEP=2 data3.xml

SeqNo CDATA     redcode CDATA   GenError CDATA
43156489079     SKNEQGGEVHW     Upload-Success
43156489079     SKNEQGGEVHW     Upload-Success

$

Output is tab-separated. DEP is how many close-tags in a row it looks for before printing a row of data.
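
To make the DEP idea concrete, here is a stripped-down sketch of just the counting step. It is not the full xmlh.awk, only an illustration of the rule: count close-tags in a row, reset the count on any open tag, and report a row boundary when the count reaches DEP. The sample file with a <root> wrapper and the message text are made up for this example.

```shell
# Small sample with a wrapper element around two records.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<root>
  <Target>
    <SeqNo>1</SeqNo>
    <RedCode>A</RedCode>
  </Target>
  <Target>
    <SeqNo>2</SeqNo>
    <RedCode>B</RedCode>
  </Target>
</root>
EOF

out=$(awk -v DEP=2 'BEGIN { RS = "<"; FS = ">" }
# Close tag: one more close in a row; a row ends when the count hits DEP.
/^\//           { if (++pop == DEP) print "row ends at <" $1 ">"; next }
# Skip the empty first record and any <?...?> or <!...> records.
/^$/ || /^[?!]/ { next }
# Any open tag breaks the run of close tags.
                { pop = 0 }
' "$sample")

printf '%s\n' "$out"
rm -f "$sample"
```

With DEP=2 the boundary falls on each </Target>, because it directly follows the close tag of the last child. The final </root> is the third close in a row, so it does not trigger another row.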

Tons · November 21, 2012, 4:25pm

Thanks! I am looking for a solution that uses awk.

Corona688 · November 21, 2012, 4:28pm

I think we crossposted. Does my solution above work for you? It's a generic xml-to-flatfile converter in awk which groups columns by itself.

It has some limitations. Spaces inside tag values are a problem. But it works for the data you gave as shown above.