Extracting the tag name from an xml file

Hi,

My requirement is something like this:
I have an XML file that contains some tags and nested tags,

<n:tag_name1>
       <n:sub_tag1>val1</n:sub_tag1> 
       <n:sub_tag2>val2</n:sub_tag2>
</n:tag_name1>
<n:tag_name2>
       <n:sub_tag1>value</n:sub_tag1>
       <n:sub_tag2>value</n:sub_tag2>
</n:tag_name2>

I need only parent tag names as a single string delimited by spaces.

The output should be:
tag_name="tag_name1 tag_name2"

I'm having difficulty because the parent tags and the sub-tags both start with "<n:".

Is this what the XML actually looks like, newlines and all, or has it been prettied up?

Not exactly, these are just sample tags.
The newlines are pretty much the same, but the amount of whitespace can vary. Each parent tag starts with a <> and ends with a </>, and the sub-tags sit in between these two.

Out of curiosity... would

grep '^<n' xmlfile

catch the parent tag names only?
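
For what it's worth, on the sample layout from the first post it should: the sub-tags are indented and the closing tags start with "</", so neither matches '^<n'. A quick sketch to check it, assuming the real file keeps parent tags in column 1 (the temporary file and the variable names are only for illustration):

```shell
# Recreate the sample from the question in a throwaway file.
xmlfile=$(mktemp)
cat > "$xmlfile" <<'EOF'
<n:tag_name1>
       <n:sub_tag1>val1</n:sub_tag1>
       <n:sub_tag2>val2</n:sub_tag2>
</n:tag_name1>
<n:tag_name2>
       <n:sub_tag1>value</n:sub_tag1>
       <n:sub_tag2>value</n:sub_tag2>
</n:tag_name2>
EOF

# Only lines whose first two characters are "<n" survive.
matches=$(grep '^<n' "$xmlfile")
printf '%s\n' "$matches"

rm -f "$xmlfile"
```

That still leaves the "<n:" and ">" to strip and the names to join, and it breaks as soon as a parent tag is indented.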

grep is a good idea.

Try this on some real data...

#!/bin/bash
#
# tags.sh

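if [ -z "$1" ]; then
    echo "usage: $(basename "$0") <filename.xml> [output file]"
    exit 1
fi

# Keep lines that hold nothing but an opening tag, strip blanks, "<", ">"
# and the "n:" prefix, then de-duplicate (sort -u also sorts the names).
tag_list=( $(grep -o -e "^\s*<\w.\w*>$" "$1" | tr -d ' <>' | sed 's/^n://' | sort -u) )
printf "\n%s%s\n" "tag_name=\"" "${tag_list[*]}\""

# Append to outfile (>> creates the file if it does not exist yet)
if [ -n "$2" ]; then
    printf "%s:\n%s%s\n\n" "$1" "tag_name=\"" "${tag_list[*]}\"" >> "$2"
fi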

### eof #

output
--------
$ tags.sh filename.xml
tag_name="tag_name1 tag_name2"

A sed approach that assumes that parent tags start in column 1 while sub-tags are offset by whitespace from the beginning of the line...

sed -n 's;^\(<[^/].*:\)\(.*\)>$;\2;p' file
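
The sed command prints one name per line, so to get the single space-delimited string that was asked for, the output still has to be joined. A minimal sketch, assuming the same column-1 layout; the sample file, the paste step and the variable names are additions for illustration:

```shell
# Small sample: parent tags in column 1, sub-tags indented by varying amounts.
file=$(mktemp)
printf '%s\n' \
    '<n:tag_name1>' \
    '   <n:sub_tag1>val1</n:sub_tag1>' \
    '</n:tag_name1>' \
    '<n:tag_name2>' \
    '        <n:sub_tag1>value</n:sub_tag1>' \
    '</n:tag_name2>' > "$file"

# sed prints the part after the prefix for each opening parent tag;
# paste -s joins the lines, using a space as the delimiter.
tag_name=$(sed -n 's;^\(<[^/].*:\)\(.*\)>$;\2;p' "$file" | paste -sd' ' -)
printf 'tag_name="%s"\n' "$tag_name"

rm -f "$file"
```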

This finds and prints all the root-level tags, regardless of how the file is split across lines.

awk '/^\// { L-- ; next }    # closing tag: go back up one level
/\/>/ { next }               # self-closing tag: level unchanged
L++ == 1 && /^n:/ { gsub(/^n:/, x); gsub(/[ >].*/, x) ; tags=tags" "$0 }
END{
  printf "tag_name=\"%s\"\n",substr(tags,2)
}' RS=\<  infile

However, note that a well-formed XML document has exactly one root element.

If your actual XML file has <XML> and </XML> (or <DATA> </DATA> or similar) around these tags of interest, you will need to change the match above to L++ == 2 && /^n:/ because the tags then sit one level deeper.

As an example this output: tag_name="test1 test2" is produced from:

<n:test1><empty/><name>Testing XML tag</name><ignore></ignore></n:test1><n:test2><empty/></n:test2>
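
For the wrapped case, here is a rough sketch of the same awk program with the level test set to 2. The <DATA> wrapper, the temporary file and the variable names are assumptions for illustration, not taken from a real file:

```shell
# Same sample as above, but enclosed in a hypothetical <DATA> root element.
infile=$(mktemp)
printf '%s\n' '<DATA><n:test1><empty/><name>Testing XML tag</name><ignore></ignore></n:test1><n:test2><empty/></n:test2></DATA>' > "$infile"

# RS="<" makes every tag start a new record. L counts the nesting level:
# the empty record before the first "<" takes it to 1, <DATA> takes it to 2,
# so the tags of interest are seen while L is 2.
result=$(awk '/^\// { L-- ; next }
/\/>/ { next }
L++ == 2 && /^n:/ { gsub(/^n:/, x); gsub(/[ >].*/, x) ; tags=tags" "$0 }
END{
  printf "tag_name=\"%s\"\n",substr(tags,2)
}' RS='<' "$infile")
printf '%s\n' "$result"

rm -f "$infile"
```

Anything else that opens with "<" ahead of the root element would shift the count as well, so the number may need adjusting for a real file.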