Parse XML For Values

rahulmittal87 · October 23, 2014, 10:09am

Hi All,
I want to parse XML to extract the values of the tags for further processing. The XML looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<allinput>
<input A="2389906" B="install">
<C>111</C>
<D>222</D>
<E>333</E>
<F></F>
<G>444</G>
<H></H>
<I></I>
<J></J>
<K>C,D,E,G</K>
<L>C,D,E,G</L>
<M>555</M>
</input>
<input A="4732435" B="delete">
<C>999</C>
<D>792</D>
<E></E>
<F></F>
<G>990</G>
<H>942</H>
<I>992</I>
<J></J>
<K>C,D,G,H,I</K>
<L>C,D,G,H,I</L>
<M>804</M>
</input>
</allinput>

I want to extract the values of tags A to M for each group and process them based on those values. There may be only one group, or there may be hundreds.

Can someone suggest a way forward?

Thanks!

Corona688 · October 23, 2014, 12:15pm

It's hard to help you when you post data that's so obviously different from what the real data will look like. Obscuring is one thing, but this is altered perhaps too far to be a useful test.

Once again, my awk generic XML parser:

BEGIN {
        FS=">"; #       OFS=">";
        RS="<"; #       ORS="<"
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS
#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
        TAG=TAGS
        # Get the previous opened tag if any
        sub(/%.*/, "", TAG);
}

### Example of how to use it ###
# TAG is the name of the last open-tag
# TAGS is a %-separated string of open tag names, innermost first, like INNER%MIDDLE%OUTERMOST%
# $2 is CDATA inside the current tag
# ARGS is an array of arguments for the current tag
# Tag names are all converted to uppercase.
#
# So, when processing <a> in  <html><a href="index.html">Yay!</a></html>
# it would have:
# TAG="A"
# ARGS["HREF"]="index.html"
# TAGS="A%HTML%"
# $2="Yay!"

### Prints info on all open-tags and their CDATA whenever inside an <INPUT> tag.
### Tags with no CDATA are ignored.
(TAGS ~ /(^|%)INPUT%/) && ($2 ~ /[^ \r\n\t]/) {
        print "Data for tag " TAG" of " TAGS
        for(X in ARGS) print "\t"TAG"["X"]="ARGS[X]
        print "\tCDATA="$2
}

### Your Code Here ####

$ awk -f allinput.awk allinput.xml

Data for tag C of C%INPUT%ALLINPUT%
        CDATA=111

Data for tag D of D%INPUT%ALLINPUT%
        CDATA=222

Data for tag E of E%INPUT%ALLINPUT%
        CDATA=333

Data for tag G of G%INPUT%ALLINPUT%
        CDATA=444

Data for tag K of K%INPUT%ALLINPUT%
        CDATA=C,D,E,G

Data for tag L of L%INPUT%ALLINPUT%
        CDATA=C,D,E,G

Data for tag M of M%INPUT%ALLINPUT%
        CDATA=555

Data for tag C of C%INPUT%ALLINPUT%
        CDATA=999

Data for tag D of D%INPUT%ALLINPUT%
        CDATA=792

Data for tag G of G%INPUT%ALLINPUT%
        CDATA=990

Data for tag H of H%INPUT%ALLINPUT%
        CDATA=942

Data for tag I of I%INPUT%ALLINPUT%
        CDATA=992

Data for tag K of K%INPUT%ALLINPUT%
        CDATA=C,D,G,H,I

Data for tag L of L%INPUT%ALLINPUT%
        CDATA=C,D,G,H,I

Data for tag M of M%INPUT%ALLINPUT%
        CDATA=804

$

rahulmittal87 · October 23, 2014, 1:24pm

Thanks! I will try it.

---------- Post updated at 12:24 PM ---------- Previous update was at 12:00 PM ----------

Looks very good! Thanks a lot.

Just two things:

1) I also need the values of A and B.
2) How can I execute some shell commands after each group, using that group's values A to M?

Corona688 · October 23, 2014, 1:42pm

1) Easy enough, but what do you want to do with them?
2) Good, now we're going somewhere.

Getting the data out of awk, into the shell, is the question now. Imagine you made a loop in the shell.

while [reading xml file]
do
       # What variables do you need here, set to what, for each tag?
done

Tell me exactly how you need to use this data and I can help create a loop for you.

A little more detail on the nature of your data would be good as well. If it's not as pretty as your example -- tags and data full of newlines, etc -- that might need some mangling to fix.
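To give an idea of the kind of mangling meant here, below is a minimal sketch that collapses a multi-line tag value into a single line. The helper name flatten_value is made up for illustration, and it assumes that runs of spaces, tabs, carriage returns and newlines inside a value carry no meaning and can be squeezed to a single space.

```shell
# flatten_value is a hypothetical helper, not part of the parser in this thread.
# It joins all lines of its argument with single spaces, squeezes runs of
# blanks, and trims the ends.
flatten_value() {
    printf '%s\n' "$1" | awk '
        {
            gsub(/[ \t\r]+/, " ")                  # squeeze blanks within the line
            buf = buf (NR > 1 ? " " : "") $0       # join lines with one space
        }
        END {
            gsub(/^ +| +$/, "", buf)               # trim both ends
            gsub(/ +/, " ", buf)                   # squeeze again after joining
            print buf
        }'
}

# Example: a value spread over several indented lines
raw=$(printf '\n  hello\n   world \n')
flatten_value "$raw"
```

If whitespace inside the values is significant in the real data, this approach would destroy it, which is why the question matters.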

rahulmittal87 · October 23, 2014, 2:16pm

I want to run database queries based on the values of A to M. The value of B decides the type of query (insert, update, or delete), and the values of C to M are used to set the attributes of the database table.

I am sorry but I cannot expose the data fields.

Corona688 · October 23, 2014, 2:31pm

Telling me that your XML might be messy and full of extra newlines, which should be tossed before your script sees the data, reveals nothing about your customer credit card list or whatever. You could at least have answered that.

I don't need the actual data. I do need to know what you want to do with it. You want to run shell commands on "something" -- well, what shell commands would you be running, based on your mockup data? Assume each tag is a single column; you can do the splitting yourself.

Is there any safe separator I can use, anything that's not found in A through M? Does it ever contain quotes or tabs?
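Why the separator matters: tab counts as whitespace for the shell's field splitting, so consecutive tabs collapse and empty fields vanish, while a non-whitespace separator keeps them. A small sketch of the difference, where split_fields is a hypothetical helper and '|' merely stands in for any character known not to occur in the data:

```shell
# split_fields SEP LINE
# Hypothetical helper: splits LINE on SEP into three fields and shows them.
split_fields() {
    IFS=$1 read -r f1 f2 f3 <<EOF
$2
EOF
    printf 'a=[%s] b=[%s] c=[%s]\n' "$f1" "$f2" "$f3"
}

tab=$(printf '\t')

# Tab is IFS whitespace: the empty middle field is swallowed.
split_fields "$tab" "1${tab}${tab}3"

# '|' is not whitespace: the empty middle field survives.
split_fields "|" "1||3"
```

With tab the value 3 lands in the second variable; with '|' it stays in the third, which is the behaviour a fixed column layout needs.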

Corona688 · October 23, 2014, 3:09pm

The best I can do without more information:

$ cat allinput.awk

BEGIN {
        FS=">"; OFS="\t"
        RS="<";

        # INPUTA, as in tag "input" attribute "a".  They must be allcaps here.
        split("INPUTA INPUTB A B C D E F G H I J K L M", ORDER, " ");
}

# These should be special variables for match() but aren't.
function rbefore(STR)   { return(substr(STR, 1, RSTART-1)); }# before match
function rmid(STR)      { return(substr(STR, RSTART, 1)); }  # First char match
function rall(STR)      { return(substr(STR, RSTART, RLENGTH)); }# Entire match
function rafter(STR)    { return(substr(STR, RSTART+RLENGTH)); }# after match

function aquote(OUT, A, PFIX, TA) { # Turns Q SUBSEP R into A[PFIX":"Q]=R
        if(OUT)
        {
                if(PFIX) PFIX=PFIX":"
                split(OUT, TA, SUBSEP);
                A[toupper(PFIX) toupper(TA[1])]=TA[2];
        }

        return("");
}

# Intended to be less stupid about quoted text in XML/HTML.
# Splits a='b' c='d' e='f' into A[PFIX":"a]=b, A[PFIX":"c]=d, etc.
function qsplit(STR, A, PFIX, X, OUT) {
        while(STR && match(STR, /([ \n\t]+)|[\x27\x22=]/))
        {
                OUT = OUT rbefore(STR);
                RMID=rmid(STR);

                if((RMID == "'") || (RMID == "\""))     # Quote characters
                {
                        if(!Q)          Q=RMID;         # Begin quote section
                        else if(Q == RMID)      Q="";   # End quote section
                        else                    OUT = OUT RMID; # Quoted quote
                } else if(RMID == "=") {
                        if(Q)   OUT=OUT RMID; else OUT=OUT SUBSEP;
                } else if((RMID=="\r")||(RMID=="\n")||(RMID=="\t")||(RMID==" ")) {
                        if(Q)   OUT = OUT rall(STR); # Literal quoted whitespace
                        else    OUT = aquote(OUT, A, PFIX); # Unquoted WS, next block
                }
                STR=rafter(STR); # Strip off the text we've processed already.
        }

        aquote(OUT STR, A, PFIX); # Process any text we haven't already.
}


{ SPEC=0 ; TAG="" }

NR==1 {
        if(ORS == RS) print;
        next } # The first "line" is blank when RS=<

/^[!?]/ { SPEC=1    }   # XML specification junk

# Handle open-tags
match($1, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($1), RSTART, RLENGTH);
        if((!SPEC) && !($1 ~ /\/$/))
        {
                TAGS=TAG "%" TAGS;
                DEP++;
                LTAGS=TAGS
        }

        for(X in ARGS) delete ARGS[X];

        qsplit(rafter($1), ARGS);
}

# Handle close-tags
(!SPEC) && /^[\/]/ {
        sub(/^\//, "", $1);
        LTAGS=TAGS

#        sub("^.*" toupper($1) "%", "", TAGS);
        sub("^" toupper($1) "%", "", TAGS);
        $1="/"$1
        DEP=split(TAGS, TA, "%")-1;
        if(DEP < 0) DEP=0;
}

### Example of how to use it ###
# TAG is the name of the last open-tag
# TAGS is a %-separated string of open tag names, innermost first, like INNER%MIDDLE%OUTERMOST%
# $2 is CDATA inside the current tag
# ARGS is an array of arguments for the current tag
#
# So, when processing <a> in  <html><a href="index.html">Yay!</a></html>
# it would have:
# TAG="A"
# ARGS["HREF"]="index.html"
# TAGS="A%HTML%"
# $2="Yay!"

# Handle <input> tag
(TAGS ~ /^INPUT%/) {    for(X in ARGS)  DATA[TAG X]=ARGS[X]     }

# Parse <tags> inside <input> so DATA[TAGNAME]=CONTENTS
(TAGS ~ /(^|%)INPUT%/) && ($2 ~ /[^ \r\n\t]/) && !/^\// {
        # Clean up tag contents
        sub(/^[ \r\n]+/, "", $2);
        sub(/[ \r\n]+$/, "", $2);
        DATA[TAG]=$2
}

# Handle </input>, printing and clearing collected data
toupper($1) == "/INPUT" {
        PFIX=""
        for(M=1; M in ORDER; M++)
        {
                # Convert blank fields into single spaces, since the shell treats
                # two tabs in a row as one separator, skipping the blank field.
                if(DATA[ORDER[M]]=="") DATA[ORDER[M]]=" "
                printf("%s%s", PFIX, DATA[ORDER[M]]);
                PFIX=OFS;
        }

        printf("\n");

        for(X in DATA) delete DATA[X];
}

$ awk -f allinput.awk allinput.xml

2389906 install                 111     222     333             444                     C,D,E,G C,D,E,G 555
4732435 delete                  999     792                     990     942    992              C,D,G,H,I       C,D,G,H,I       804

$ awk -f allinput.awk allinput.xml |
while IFS=$'\t' read -r INPUTA INPUTB A B C D E F G H I J K L M
do
        # Convert all single-space fields into completely blank fields
        for X in INPUTA INPUTB A B C D E F G H I J K L M
        do
                [ "${!X}" = " " ] && read $X # Cheeky trick to set arbitrary variable contents
        done < /dev/null
        echo "doing something with $INPUTA $INPUTB $L $M"
done

doing something with 2389906 install C,D,E,G 555
doing something with 4732435 delete C,D,G,H,I 804

$

The best I can do without better information. It won't work if your data contains tabs anywhere. Tag and attribute names are hardcoded in the ORDER list in the BEGIN block, in the three rules that test for INPUT, and in the variable lists of the shell loop.
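Given the stated goal of choosing the query type from B, the echo in the loop above might be replaced by something like the sketch below. Everything database-related here is an assumption: the table name items, the columns id and c, and the mapping of install to INSERT are placeholders, since the real schema was not posted. Values are pasted into the SQL without escaping, so this only suits data known to be free of quotes.

```shell
# build_query ID ACTION CVALUE
# Hypothetical helper: prints one SQL statement chosen by ACTION.
# "items", "id" and "c" are placeholder names, not from the thread.
build_query() {
    id=$1
    action=$2
    c=$3
    case $action in
        install) printf "INSERT INTO items (id, c) VALUES ('%s', '%s');\n" "$id" "$c" ;;
        update)  printf "UPDATE items SET c = '%s' WHERE id = '%s';\n" "$c" "$id" ;;
        delete)  printf "DELETE FROM items WHERE id = '%s';\n" "$id" ;;
        *)       echo "unknown action: $action" >&2
                 return 1 ;;
    esac
}

# Example with the two groups from the sample XML
build_query 2389906 install 111
build_query 4732435 delete 999
```

Inside the loop it would be called as build_query "$INPUTA" "$INPUTB" "$C", with the output piped to whatever database client is in use.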