Improving XML-Parsing capabilities of AWK

joker · February 4, 2024, 6:29pm

Hi,

Even if awk is not really the ideal tool for parsing XML (better to use xmlstarlet or xmllint), sometimes one would like to avoid the complexity that XML handling can grow into. I noticed @Corona688 wrote an awk script for parsing XML quite some time ago. The copy I can find right now is here: processing xml with awk.

The downside of parsing XML that way is that awk is line-oriented, while XML is not. To increase the chance of getting the job done with awk, pipe the document through an XML pretty-printer like xmllint first:

xmllint --format data.xml | awk ... <xml handling here>
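
As a rough sketch of what the awk part could look like: once the document is pretty-printed, a leaf element with plain text normally sits on a single line, so a line-based match is enough. The function name print_leaf and the tag name "name" are made up for illustration, and the sketch assumes the element has no attributes and its text does not wrap onto a second line.

```shell
# print_leaf TAG
# Reads pretty-printed XML on stdin and prints the text of every
# <TAG>text</TAG> element that sits on a single line.
# (print_leaf is a hypothetical helper, not part of any tool.)
print_leaf() {
    awk -v tag="$1" '
    BEGIN { otag = "<" tag ">"; ctag = "</" tag ">" }
    {
        s = index($0, otag)            # position of the opening tag, 0 if absent
        e = index($0, ctag)            # position of the closing tag, 0 if absent
        if (s > 0 && e > s) {
            s += length(otag)          # first character after the opening tag
            print substr($0, s, e - s) # text between the two tags
        }
    }'
}

# usage: xmllint --format data.xml | print_leaf name
```

Elements with attributes or nested children are silently skipped by this sketch, which is the point where xmlstarlet or xmllint become the better choice.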

Regards,
joker

kshji · February 5, 2024, 6:30am

Awk is usually used in a line-oriented way, but that is only the default. Awk is really record-oriented: the RS variable sets the record delimiter, and its default is the newline. Together, the FS and RS variables give you the tools to parse XML if you want to use awk.
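
To make the record idea concrete, here is a small sketch (the function name text_elements is made up for illustration). With RS set to "<", every tag starts a new record, so even a document that is squeezed onto one line gets split up; with FS set to ">", $1 is then the tag and $2 the text that follows it.

```shell
# text_elements
# Reads XML on stdin and prints "tag=text" for every tag that is
# directly followed by non-whitespace text.
# (text_elements is a hypothetical helper for illustration.)
text_elements() {
    awk '
    BEGIN { RS = "<"; FS = ">" }
    # $1 is the tag (including any attributes), $2 the text after it
    $2 ~ /[^ \t\r\n]/ { print $1 "=" $2 }'
}

# Everything on one line, yet awk still sees one record per tag:
printf '<a><NUM>1</NUM><NUM>2</NUM></a>\n' | text_elements
# prints:
# NUM=1
# NUM=2
```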

My Awk XML-parser doc has examples of using awk and other XML-oriented tools.

A very simple method to get an element's value; in this case we want the values of the NUM elements:

awk -F'</?NUM>' 'NF>1{print $2}' some.xml

Or

awk -v elem=NUM '
BEGIN {
    RS="<"
    FS=">"
    }
$1 == elem { print $2 }
' example.xml
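
One caveat with the RS/FS variant above: $1 == elem only matches tags without attributes, because for <NUM id="1"> the first field is the whole string NUM id="1". A possible workaround, sketched here with the made-up function name elem_text, is to cut the tag off at the first whitespace before comparing:

```shell
# elem_text ELEM
# Reads XML on stdin and prints the text that follows every <ELEM ...> tag,
# whether or not the tag carries attributes.
# (elem_text is a hypothetical helper for illustration.)
elem_text() {
    awk -v elem="$1" '
    BEGIN { RS = "<"; FS = ">" }
    {
        name = $1
        sub(/[ \t\r\n].*/, "", name)   # drop attributes, keep the bare tag name
        if (name == elem) print $2
    }'
}

printf '<r><NUM id="1">10</NUM><X>no</X><NUM>20</NUM></r>\n' | elem_text NUM
# prints:
# 10
# 20
```

Self-closing tags such as <NUM/> are not matched by this sketch, since their name field ends in a slash.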

More generic:

awk -F '[<|>="]' '
# add rules that match the elements you are interested in, for example:
/NUM/ { print "FOUND:", $3 }

# debug printing: show every field with its index, so it is easier to see
# what you can get from each line
{
    for (f = 1; f <= NF; f++) printf "%d:%s ", f, $f
    printf "\n"
}
' example.xml

For "full XML" parsing with awk, I use getXML.awk, developed by Jan Weber.