Extracting the tag name from an xml file

Little · November 17, 2014, 1:12pm

Hi,

My requirement is as follows.
I have an XML file that contains some tags and nested tags:

<n:tag_name1>
       <n:sub_tag1>val1</n:sub_tag1> 
       <n:sub_tag2>val2</n:sub_tag2>
</n:tag_name1>
<n:tag_name2>
       <n:sub_tag1>value</n:sub_tag1>
       <n:sub_tag2>value</n:sub_tag2>
</n:tag_name2>

I need only the parent tag names, as a single string delimited by spaces.

The output should be:
tag_name="tag_name1 tag_name2"

I am having difficulty because the parent tags and the sub-tags both start with "<n:".

Corona688 · November 17, 2014, 1:16pm

Is this what the XML actually looks like, newlines and all, or has it been prettied up?

Little · November 17, 2014, 1:28pm

Not exactly; these are just sample tags.
The newlines are much the same as in the real file, but the number of spaces is not: it can be more or less. Each parent tag starts with a <> and ends with a </>, and the sub-tags sit in between these two.

junior-helper · November 17, 2014, 1:57pm

Out of curiosity... would

grep '^<n' xmlfile

catch the parent tag names only?
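
A quick way to check is to run it against the sample from the first post. This is only a sketch: the variable names xml and parents are made up for illustration, and it assumes the real file keeps parent tags in column 1 and sub-tags indented.

```shell
# Sample from the first post; parent tags start in column 1.
xml='<n:tag_name1>
       <n:sub_tag1>val1</n:sub_tag1>
       <n:sub_tag2>val2</n:sub_tag2>
</n:tag_name1>
<n:tag_name2>
       <n:sub_tag1>value</n:sub_tag1>
       <n:sub_tag2>value</n:sub_tag2>
</n:tag_name2>'

# Closing tags begin with "</" and sub-tags with whitespace, so neither matches.
parents=$(printf '%s\n' "$xml" | grep '^<n')
printf '%s\n' "$parents"
```

On this sample only the two opening parent tags are printed, still wrapped in <n: and >.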

ongoto · November 17, 2014, 4:09pm

grep is a good idea.

Try this on some real data...

#!/bin/bash
#
# tags.sh

if [ -z "$1" ]; then
    echo "usage: $(basename "$0") <filename.xml> [output file]"
    exit 1
fi

# Keep only lines that consist of a single opening tag, then strip "<", ">" and the "n:" prefix
tag_list=( $(grep -o -e '^\s*<\w.\w*>$' "$1" | tr -d ' <>' | sed 's/^n://' | sort -u) )
printf "\n%s%s\n" "tag_name=\"" "${tag_list[*]}\""

# Append to outfile
if [ -n "$2" ]; then
    touch "$2"
    printf "%s:\n%s%s\n\n" "$1" "tag_name=\"" "${tag_list[*]}\"" >> "$2"
fi

### eof #

output
--------
$ tags.sh filename.xml
tag_name="tag_name1 tag_name2"

shamrock · November 17, 2014, 9:52pm

A sed approach that assumes that parent tags start in column 1 while sub-tags are offset by whitespace from the beginning of the line...

sed -n 's;^\(<[^/].*:\)\(.*\)>$;\2;p' file
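
The sed command prints one parent tag name per line. To get the single space-delimited string the original question asks for, the lines can be joined with paste. A minimal sketch, assuming the sample layout from the first post; the variable names xml and tag_name are only illustrative:

```shell
# Sample from the first post; parent tags start in column 1.
xml='<n:tag_name1>
       <n:sub_tag1>val1</n:sub_tag1>
       <n:sub_tag2>val2</n:sub_tag2>
</n:tag_name1>
<n:tag_name2>
       <n:sub_tag1>value</n:sub_tag1>
       <n:sub_tag2>value</n:sub_tag2>
</n:tag_name2>'

# sed emits one parent tag name per line; paste -s joins them with spaces.
tag_name=$(printf '%s\n' "$xml" | sed -n 's;^\(<[^/].*:\)\(.*\)>$;\2;p' | paste -sd ' ' -)
printf 'tag_name="%s"\n' "$tag_name"
```

This prints tag_name="tag_name1 tag_name2". Note that a parent tag followed by trailing whitespace would not match, because the pattern requires > at the very end of the line.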

Chubler_XL · November 17, 2014, 10:24pm

This finds and prints all the root-level unique tags.

awk '/^\// { L-- ; next }
/\/>/ { next }
L++ == 1 && /^n:/ { gsub(/^n:/, x); gsub(/[ >].*/, x) ; tags=tags" "$0 }
END{
  printf "tag_name=\"%s\"\n",substr(tags,2)
}' RS=\<  infile

However, note that well-formed XML should have only 1 root node.

If your actual XML file has <XML> and </XML> (or <DATA> </DATA> or similar) around these tags of interest you will need to change the match above to L++ == 2 && /^n:/

As an example, the output tag_name="test1 test2" is produced from:

<n:test1><empty/><name>Testing XML tag</name><ignore></ignore></n:test1><n:test2><empty/></n:test2>
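
For the wrapped case described above, here is a sketch with the depth passed in as a variable. The <DATA> wrapper and the names xml, depth and tag_name are assumptions for illustration, not taken from the real file. With one enclosing element, the tags of interest sit at depth 2:

```shell
# Hypothetical input: the tags of interest wrapped in one <DATA> element.
xml='<DATA><n:test1><empty/><name>Testing XML tag</name></n:test1><n:test2><empty/></n:test2></DATA>'

# depth=2 because one enclosing element sits above the tags we want.
# The empty record before the first "<" accounts for the first increment of L.
tag_name=$(printf '%s\n' "$xml" | awk -v RS='<' -v depth=2 '
/^\//  { L--; next }          # closing tag: go up one level
/\/>/  { next }               # self-closing tag: depth unchanged
L++ == depth && /^n:/ {       # opening n: tag at the wanted depth
    gsub(/^n:/, ""); gsub(/[ >].*/, "")
    tags = tags " " $0
}
END { print substr(tags, 2) }')
printf 'tag_name="%s"\n' "$tag_name"
```

This prints tag_name="test1 test2". Each further level of wrapping would need depth raised by one.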