First, the very last line sets the record separator variable (RS) to a pattern that matches either the less-than or the greater-than symbol. That splits the whole input file into records at either of those symbols rather than at newlines. An important thing to note is that awk removes those symbols from the input as it uses them to split the input into records. (Treating RS as a pattern is an extension found in gawk and mawk; POSIX only defines the behaviour for a single-character RS.)
Awk processes the input one record at a time, and the programme is applied to each record. For more details about awk and the general syntax of an awk programme, it is best to have a peek at this:
Awk - A Tutorial and Introduction - by Bruce Barnett
Comments in-line below should explain things more...
awk '
/^Val.*Db="[^"]+"/ {                # execute this block of code for all records that start with "Val" and also contain a Db field that is not empty
    gsub( "^Val ", "" );            # replace the Val and trailing space with nothing
    gsub( "=\"", "<" );             # replace all =" with a less-than symbol
    gsub( "\" *", ">" );            # replace all quotes trailed by zero or more spaces with a greater-than sym
    la = split( $0, a, ">" );       # split the record into array a based on greater-than sym
    for( i = 1; i <= la; i++ )      # for each token in a (something like Db<foo)
    {
        split( a[i], b, "<" );      # split it into two components (name and value)
        h[b[1]] = b[2];             # save the pair in a hash keyed on the name
    }
    printf( "%s -> %s\n", h["Db"], h["qry"] );   # print the two values that are interesting
    delete h;                       # reset the hash
}' RS="[<>]"
So, awk treats the first bits of your input ( <?xml version="1.1" encoding="UTF-8"?> <Data> ) as several records:
?xml version="1.1" encoding="UTF-8"?
Data
(Notice that the blanks between a greater-than symbol and the next less-than symbol end up as blank records; not important, but interesting.) None of these records match our desired record, and they are discarded.
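If you want to see the record splitting for yourself, something like the following should do it. This is only a sketch: it assumes an awk that accepts a pattern in RS (gawk or mawk, for example), and it uses NF to skip the blank records so that only the two interesting ones are printed, each wrapped in square brackets.

```shell
# Feed awk the first two lines of the sample input and print every
# non-blank record in brackets. NF is zero for a record that is empty
# or holds only whitespace, so those records are skipped.
out=$(printf '%s\n' '<?xml version="1.1" encoding="UTF-8"?>' '<Data>' |
    awk 'NF { printf( "[%s]\n", $0 ) }' RS='[<>]')
printf '%s\n' "$out"
# prints:
# [?xml version="1.1" encoding="UTF-8"?]
# [Data]
```

The angle brackets are gone from the output because awk consumed them as separators.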
The first record that matches looks initially like:
Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds" ab="dsds" Dc="4" Te="" Ca="xxx" Sc="320.240" Us="" Cd="X"
After substitutions it becomes:
Ti<1342750845538>Du<0>De<blackberry8520_ver1RIM>Db<encyclopedia>Pdb<>Uq<0>Dq<0>qry<sdsds?q=dsds>ab<dsds>Dc<4>Te<>Ca<xxx>Sc<320.240>Us<>Cd<X>
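To watch the three substitutions on their own, you could try them on a shorter record. The record below is a made-up three-attribute version of the real one (just to keep the output readable), and it is fed in as an ordinary line, so RS is left at its default here.

```shell
# Hypothetical record, shortened from the real one for illustration.
rec='Val Db="encyclopedia" Pdb="" qry="sdsds?q=dsds"'
out=$(printf '%s\n' "$rec" | awk '{
    gsub( "^Val ", "" );   # drop the leading Val and its trailing space
    gsub( "=\"", "<" );    # every =" becomes a less-than symbol
    gsub( "\" *", ">" );   # every quote plus any trailing spaces becomes a greater-than symbol
    print;
}')
printf '%s\n' "$out"
# prints: Db<encyclopedia>Pdb<>qry<sdsds?q=dsds>
```

Note that the = inside the qry value survives, because it is not followed by a double quote.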
The split into 'a' using the greater than symbol as the separator yields these tokens in the array:
a[1]= Ti<1342750845538
a[2]= Du<0
a[3]= De<blackberry8520_ver1RIM
a[4]= Db<encyclopedia
a[5]= Pdb<
a[6]= Uq<0
a[7]= Dq<0
a[8]= qry<sdsds?q=dsds
a[9]= ab<dsds
a[10]= Dc<4
a[11]= Te<
a[12]= Ca<xxx
a[13]= Sc<320.240
a[14]= Us<
a[15]= Cd<X
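A small sketch of that split, using a shortened (hypothetical) bracketed string rather than the full record. Each token is printed in square brackets so that empty ones are visible.

```shell
# Split a bracketed string on ">" and list the resulting tokens.
out=$(printf '%s\n' 'Db<encyclopedia>Pdb<>qry<sdsds?q=dsds>' | awk '{
    la = split( $0, a, ">" );             # la is the number of tokens
    for( i = 1; i <= la; i++ )
        printf( "a[%d]=[%s]\n", i, a[i] );
}')
printf '%s\n' "$out"
# prints:
# a[1]=[Db<encyclopedia]
# a[2]=[Pdb<]
# a[3]=[qry<sdsds?q=dsds]
# a[4]=[]
```

Because the string ends with a greater-than symbol, split also returns one empty token at the end. It is harmless in the programme above: it just stores an empty value under an empty name.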
While your sample data didn't contain any spaces between the double quotes (e.g. Db="foo bar"), the bracketing and splitting would have preserved them.
The tokens in the array 'a' can then be split, and placed into the hash 'h'. So a[8] is split into 'qry' and 'sdsds?q=dsds' and then can be referenced by name (e.g. h["qry"]).
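Putting it all together, here is a rough end-to-end run on a single made-up element whose values contain spaces. The input line is an assumption for the sake of the example (it is not from your data), and as before this needs an awk that accepts a pattern in RS.

```shell
# One hypothetical element with spaces inside the quoted values.
out=$(printf '%s\n' '<Val Db="foo bar" qry="x y">' | awk '
/^Val.*Db="[^"]+"/ {
    gsub( "^Val ", "" );
    gsub( "=\"", "<" );
    gsub( "\" *", ">" );
    la = split( $0, a, ">" );
    for( i = 1; i <= la; i++ )
    {
        split( a[i], b, "<" );   # b[1] is the name, b[2] is the value
        h[b[1]] = b[2];          # look the value up later by name
    }
    printf( "%s -> %s\n", h["Db"], h["qry"] );
    delete h;
}' RS='[<>]')
printf '%s\n' "$out"
# prints: foo bar -> x y
```

The spaces inside the values come through untouched, since only the spaces that follow a closing quote are swallowed by the third gsub.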
Hope this helps you understand a bit more.
I also noticed this odd bit in your sample data: Te=" Ca="xxx"
I'm not an XML expert, but this seems to be illegal syntax. I treated it as Te="".