Processing XML with awk

cabrao · June 26, 2012, 8:48am

Given the following sample extracted from an XML file:

<rel ver="123">
    <mod name="on">
        <node env="ac" env="1">
            <ins ip="10.192.0.1"/>
            <ins ip="10.192.0.2"/>
        </node>
        <node env="ac" env="2">
            <ins ip="10.192.0.3"/>
            <ins ip="10.192.0.4"/>
        </node>
        <node env="pr">
            <ins ip="10.192.0.5"/>
            <ins ip="10.192.0.6"/>
        </node>
    </mod>
    <mod name="off">
        <node env="ac" env="1">
            <ins ip="10.192.0.7"/>
        </node>
        <node env="ac" env="2">
            <ins ip="10.192.0.8"/>
        </node>
        <node env="pr">
            <ins ip="10.192.0.9"/>
        </node>
    </mod>
</rel>

I was wondering if someone could help me produce the following output:

123     env     off             on
        ac1     10.192.0.7      10.192.0.1      10.192.0.2
        ac2     10.192.0.8      10.192.0.3      10.192.0.4
        pr      10.192.0.9      10.192.0.5      10.192.0.6

It's fairly easy to get rid of all the XML tags (e.g. with the code below), but I have no idea how to produce the desired output table:

awk -F '[\"/>]' '/rel ver/{print $2}/mod name/{print $2}/node env/{print $2, $4}/ins ip/{print $2}' file.xml

Thanks for your help

Corona688 · June 26, 2012, 12:56pm

I once wrote a generic XML scanner which produces output similar to what you want. It produces columns from tags in a generic way without hardcoding tags/attributes. It has a weakness in that it can't handle spaces inside tag attributes.
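
To illustrate that weakness: the script tokenizes each tag with split($1, A, " "), so a quoted value containing a space is presumably cut in two. A minimal sketch of that tokenizing step on its own (the function name tokenize and the sample tag are made up for the demonstration, they are not part of xmlg.awk):

```shell
# tokenize: hypothetical helper that splits a tag on spaces,
# the same way the scanner splits $1 into A[].
tokenize() {
    awk '{
        n = split($0, A, " ")
        for (i = 1; i <= n; i++)
            print i ": " A[i]      # one space-separated token per line
    }'
}

# The quoted value "big box" contains a space.
printf '%s\n' 'node name="big box" env="pr"' | tokenize
```

This prints four tokens instead of three, with name="big and box" as separate tokens 2 and 3, so the attribute value is no longer intact.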

Merging those two 'env' attributes into one can be done with sed.
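
As a quick check of what that substitution does on its own, here it is applied to two node lines from the sample (the wrapper function merge_env is only a name chosen for this sketch):

```shell
# merge_env: hypothetical wrapper around the sed substitution.
# Two adjacent env="..." attributes are joined into a single one.
merge_env() {
    sed 's/env="\([^"]*\)" env="\([^"]*\)"/env="\1\2"/g'
}

printf '%s\n' '<node env="ac" env="1">' '<node env="pr">' | merge_env
```

The first line becomes <node env="ac1">, and the second is left unchanged because it has only one env attribute.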

$ cat xmlg.awk
BEGIN { RS="<";         FS=">"; ORS="\r\n";

        # Change this to alter how many close-tags in a row are needed
        # before a row of data is printed.
        DEP=1
        SEP="\t"
        }

# Skip weird XML specification lines or blank records
/^\?/ || /^$/   {       next    }

# Handle close tags
/^[/]/  {
        N=D;    while((N>0) && ("/"STACK[N] != $1))     N--;

        if("/"STACK[N] == $1)   D=(N-1);
        POP++;

        if(POP == DEP)
        {
                if(!HEADER++)
                {
                        split(ARG[1], Z, SUBSEP);
                        printf("%s %s", Z[2], Z[3]);
                        for(N=2; N<=ARG_; N++)
                        {
                                split(ARG[N], Z, SUBSEP);
                                printf("%s%s %s", SEP, Z[2], Z[3]);
                        }

                        printf("\n");
                }

                printf("%s", DATA[ARG[1]]);
                for(N=2; N<=ARG_; N++)
                        printf("%s%s", SEP, DATA[ARG[N]]);
                printf("\n");
        }
        next
}

# Handle open tags
{
        gsub(/^[ \r\n\t]*/, "", $2);    # Whitespace isn't data
        gsub(/[ \r\n\t]*$/, "", $2);
        sub(/\/$/, "", $(NF-1));

        # Reset parameters
        POP=0;

        M=split($1, A, " ");
        STACK[++D]=A[1];

        if((!MAX) || (D>MAX)) MAX=D;    # Save max depth

        # Handle parameters
        Q=split(A[2], B, " ");
        for(N=1; N<=Q; N++)
        {
                split(B[N], C, "=");
                gsub(/['"]/,"", C[2]);

                I=D SUBSEP STACK[D] SUBSEP C[1];
                if(!SEEN[I]++)          # First time this column is seen
                        ARG[++ARG_]=I;

                DATA[I]=C[2];
        }

        if($2)
        {
                I=D SUBSEP STACK[D] SUBSEP "CDATA";
                if(!SEEN[I]++)
                        ARG[++ARG_]=I;

                DATA[I]=$2;
        }
}

$ sed 's/env="\([^"]*\)" env="\([^"]*\)"/env="\1\2"/g' 3.xml | awk -f xmlg.awk
rel ver mod name        node env        ins ip  ins ip
123     on      ac1     10.192.0.1      10.192.0.2
123     on      ac2     10.192.0.3      10.192.0.4
123     on      pr      10.192.0.5      10.192.0.6
123     off     ac1     10.192.0.7      10.192.0.6
123     off     ac2     10.192.0.8      10.192.0.6
123     off     pr      10.192.0.9      10.192.0.6

$