Reading block by block in XML

kmajumder · July 25, 2012, 3:41pm

Hi,

Can you please help me with the requirement below?
There is only one big line in the file. I need to parse it block by block (one block per tag; 'Val' in the case below) to extract different parameters.

Example:-
Portion of the Input string:-
<?xml version="1.1" encoding="UTF-8"?> <Data><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="home" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds&dsdsds=dsds&ss?" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val> ..../>

Output:-
If the value of the Db parameter in a <Val> block/tag is not empty, I need to show both Db and the corresponding qry parameter value.

This should be the output for the sample above:-

encyclopedia -> sdsds?q=dsds
home -> sdsds?q=dsds&dsdsds=dsds&ss
....
....
Thanks in advance.

KM

Chirel · July 25, 2012, 6:48pm

Hi,

You should check out xsltproc; it's built to solve exactly that.

agama · July 25, 2012, 8:09pm

You could write a simple awk programme to extract the bits you need.

awk '
    /^Val.*Db="[^"]+"/ {
        gsub( "^Val ", "" );
        gsub( "=\"", "<" );
        gsub( "\" *", ">" );
        la = split( $0, a, ">" );
        for( i = 1; i <= la; i++ )
        {
            split( a[i], b, "<" );
            h[b[1]] = b[2];
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );
        delete h;
    }' RS="[<>]"   input-file >output-file

It makes a few assumptions about your data (and assumes you have GNU awk, since it relies on a regular expression in RS), which might be wrong, but it works on the small sample you posted and thus might work across all of your input.

kmajumder · July 26, 2012, 1:54pm

Thanks a lot, agama. It's working as I expected.
Could you please explain the code? I am a newbie to Linux, so it would be very helpful if you could explain it.

Thanks again.

KM

agama · July 27, 2012, 12:42am

First, the very last line sets the record separator variable (RS) to be either the greater-than or less-than symbol. That splits all of the input file into records based on either of those rather than a newline. An important thing to note is that awk removes those symbols from the input as it uses them to split the input into records.
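
A quick way to see the record splitting for yourself is to print every record with its number. This is only a sketch: the sample string is made up (it is not your real data), and it sets RS with -v, which has the same effect here as putting the assignment after the programme.

```shell
# Hypothetical one-line sample; not the poster's real data.
sample='<?xml version="1.0"?> <Data><Val Db="home" qry="a?b=c"></Val></Data>'

# Print each record in brackets, prefixed by its record number.
# A regular expression in RS needs gawk (mawk also accepts it);
# strict POSIX awk only honours the first character of RS.
records=$(printf '%s' "$sample" |
    awk -v RS='[<>]' '{ printf( "%d: [%s]\n", NR, $0 ) }')

printf '%s\n' "$records"
```

The angle brackets themselves never show up inside the brackets, and the Val record arrives with all of its attributes intact, which is what the pattern in the script relies on.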

Awk processes records, and the programme is applied to each record. For more details about awk and the general syntax of an awk programme, it is best to have a peek at this:
Awk - A Tutorial and Introduction - by Bruce Barnett

Comments in-line below should explain things more...

awk '
    /^Val.*Db="[^"]+"/ {   # execute this block of code for all records that start with "Val" and also contain a Db field that is not empty
        gsub( "^Val ", "" );  # replace the Val and trailing space with nothing
        gsub( "=\"", "<" );   # replace all =" with a less-than symbol
        gsub( "\" *", ">" );   # replace each quote, plus any spaces that follow it, with a greater-than sym
        la = split( $0, a, ">" );  # split the record into array a based on greater-than sym
        for( i = 1; i <= la; i++ ) # for each token in a (something like Db<foo)
        {
            split( a[i], b, "<" );   # split it into two components (name and value)
            h[b[1]] = b[2];       # save the pair in a hash keyed on the name
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );  # print the two values that are interesting
        delete h;   # reset the hash
    }' RS="[<>]"

So, for the first bits of your input (<?xml version="1.1" encoding="UTF-8"?> <Data>), awk sees several records:


?xml version="1.1" encoding="UTF-8"?
 
Data

(Notice that the blanks between greater-than and less-than symbols end up being blank records; not important, but interesting.) None of these records matches the pattern, so they are skipped.

The first record that matches looks initially like:

Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0"    qry="sdsds?q=dsds" ab="dsds" Dc="4" Te="" Ca="xxx" Sc="320.240" Us="" Cd="X"

After substitutions it becomes:

Ti<1342750845538>Du<0>De<blackberry8520_ver1RIM>Db<encyclopedia>Pdb<>Uq<0>Dq<0>qry<sdsds?q=dsds>ab<dsds>Dc<4>Te<>Ca<xxx>Sc<320.240>Us<>Cd<X>
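
If you want to watch the three substitutions on their own, something like the following should do. It is a minimal sketch on a shortened, made-up record (three attributes rather than your full list), using the same gsub calls as the script.

```shell
# Hypothetical shortened record, as awk would see it after RS has
# stripped the surrounding angle brackets.
record='Val Db="home" Pdb="" qry="a?b=c"'

bracketed=$(printf '%s\n' "$record" | awk '{
    gsub( "^Val ", "" );    # drop the tag name and the space after it
    gsub( "=\"", "<" );     # name="value  becomes  name<value
    gsub( "\" *", ">" );    # closing quote (and following spaces) becomes >
    print;
}')

printf '%s\n' "$bracketed"    # prints Db<home>Pdb<>qry<a?b=c>
```

Because the angle brackets were already consumed as record separators, they cannot occur inside the record, which makes them safe to reuse as markers.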

The split into 'a' using the greater than symbol as the separator yields these tokens in the array:

a[1]= Ti<1342750845538
a[2]= Du<0
a[3]= De<blackberry8520_ver1RIM
a[4]= Db<encyclopedia
a[5]= Pdb<
a[6]= Uq<0
a[7]= Dq<0
a[8]= qry<sdsds?q=dsds
a[9]= ab<dsds
a[10]= Dc<4
a[11]= Te<
a[12]= Ca<xxx
a[13]= Sc<320.240
a[14]= Us<
a[15]= Cd<X

While your sample data didn't contain any spaces between the double quotes (e.g. Db="foo bar") the bracketing and splitting would have preserved them.

The tokens in the array 'a' can then be split, and placed into the hash 'h'. So a[8] is split into 'qry' and 'sdsds?q=dsds' and then can be referenced by name (e.g. h["qry"]).
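
The same two-level split can be tried in isolation. This is just a sketch that starts from an already substituted record, cut down to four of the attributes from the walkthrough above:

```shell
# Shortened version of the substituted record shown earlier.
record='Db<encyclopedia>Pdb<>Uq<0>qry<sdsds?q=dsds>'

result=$(printf '%s\n' "$record" | awk '{
    n = split( $0, a, ">" );        # tokens such as Db<encyclopedia
    for( i = 1; i <= n; i++ )
    {
        split( a[i], b, "<" );      # b[1] is the name, b[2] the value
        h[b[1]] = b[2];             # store the value under its name
    }
    printf( "%s -> %s\n", h["Db"], h["qry"] );
}')

printf '%s\n' "$result"    # prints encyclopedia -> sdsds?q=dsds
```

The record ends with a greater-than symbol, so the last token is empty; it only adds an empty entry to the hash and does no harm.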

Hope this helps you understand a bit more.

I also noticed this odd bit in your sample data: Te=" Ca="xxx". I'm not an XML expert, but this looks like invalid syntax (the closing quote for Te seems to be missing). In the walkthrough above I treated it as Te="".
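
For what it's worth, here is a small sketch of what the script's substitutions do when that quote really is missing. The record is made up and much shorter than yours; it only keeps the odd Te/Ca pair plus Db and qry.

```shell
# Hypothetical record with the unbalanced quote after Te=.
record='Val Dc="4" Te=" Ca="xxx" Db="home" qry="q"'

report=$(printf '%s\n' "$record" | awk '{
    gsub( "^Val ", "" );
    gsub( "=\"", "<" );
    gsub( "\" *", ">" );            # record is now Dc<4>Te< Ca<xxx>Db<home>qry<q>
    n = split( $0, a, ">" );
    for( i = 1; i <= n; i++ )
    {
        split( a[i], b, "<" );      # the token "Te< Ca<xxx" splits into three parts
        h[b[1]] = b[2];
    }
    printf( "Db=%s qry=%s Te=[%s] Ca=[%s]\n", h["Db"], h["qry"], h["Te"], h["Ca"] );
}')

printf '%s\n' "$report"    # prints Db=home qry=q Te=[ Ca] Ca=[]
```

So Te ends up holding " Ca" and Ca is lost, but Db and qry are unaffected, which is why the output you asked for still comes out right.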