bash extract all occurrences delimited by <name> and </name> tags from an xml file

ingalex · December 26, 2010, 12:30pm

I need to extract all text delimited by <name> and </name> tags from an XML file: not only the first occurrence, but all occurrences.
I've tried this command:

awk -F"<name>|</name>" 'NF>2{print $2}'

but it gives only the first occurrence. How can I modify it?

DGPickett · December 26, 2010, 1:20pm

Tell awk the record separator is < so every tag is a different record to be filtered.

ingalex · December 26, 2010, 1:47pm

How can I do that? I'm not an expert with awk. If I use < as the separator, it also picks up the other tags, not only the text delimited by the <name> </name> tags.

pravin27 · December 26, 2010, 1:50pm

Hi,
something like this,

sed 's/<name>/ /g;s/<\/name>/ /g'  inputfile

ingalex · December 26, 2010, 3:09pm

I've applied your suggested command to my xml file, but it doesn't show me only text between <name></name> tags. It shows the entire file.
I've applied it to this file h_tp://dl.dropbox.com/u/877248/info.xml
I need to obtain only a list.

frans · December 26, 2010, 5:03pm

Not very clean, but this one works (with your sample):

sed -e 's/name>/\n/g' info.xml | grep '^[^<]*<\/' | cut -d'<' -f1

The result is in the attachment.

cabrao · December 26, 2010, 5:34pm

If you want to do it with awk, try this:

awk '!/<.*>/' RS="<name>|</name>" file

DGPickett · December 26, 2010, 6:50pm

tr '<' '\12' < file | sed -n 's/^name>//p'

Narrative: Turn every < into a linefeed, then pipe to sed, which strips the leading name> from the name-tagged lines and prints only those lines.
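
A worked trace of the narrative above, as a small sketch. The sample line and the function name names_via_tr are made up for illustration; the sketch assumes the <name> tags carry no attributes and the text inside them contains no < character.

```shell
#!/bin/sh
# Hypothetical sample: two elements on one line, wrapped in another tag.
sample='<a><name>ho</name><name>ha</name></a>'

# Same pipeline as above, reading stdin instead of a file.
names_via_tr() {
    tr '<' '\12' | sed -n 's/^name>//p'
}

# After tr, every "<" has become a linefeed, so the stream is:
#   (empty line), a>, name>ho, /name>, name>ha, /name>, /a>
# sed -n prints only the lines where the substitution happened,
# i.e. the ones that began with "name>".
printf '%s\n' "$sample" | names_via_tr   # prints ho, then ha
```

Because the split happens on every <, it does not matter whether the elements sit on one line or on several.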

fpmurphy · December 26, 2010, 8:27pm

#!/bin/bash
TMP=file.$$

cat <<EOT >$TMP
<header>
   <name>first name is Santa</name>
   <name>second name Klaus</name>
</header>
EOT

# Note: sed regexes are greedy, so the match runs up to the last closing tag on the line
sed -n 's/\(<name>\)\([[:print:]]*\)<\/[^>]*>/\2/p' $TMP

rm $TMP

exit 0

gives

   first name is Santa
   second name Klaus
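
The greediness noted in the script comment matters once a line holds more than one element: the single substitution then spans from the first <name> to the last closing tag on that line. One possible workaround, sketched here and not part of the script above, is to let grep split out each element first. It assumes a grep that supports -o (GNU, BSD and busybox grep do; the option is not in POSIX) and element text that contains no < character. The function name names_via_grep is made up for illustration.

```shell
#!/bin/sh
names_via_grep() {
    # -o prints each match on its own line; [^<]* cannot run past a tag,
    # so two elements on one line give two separate matches.
    # The sed step then strips the surrounding tags from each match.
    grep -o '<name>[^<]*</name>' | sed 's/<[^>]*>//g'
}

printf '%s\n' '<name>ho</name><name>ha</name>' | names_via_grep   # prints ho, then ha
```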

ingalex · December 27, 2010, 2:46am

Thanks to all for your valuable help. All of these scripts work well for me, but I think this is the easiest command to use:

awk '!/<.*>/' RS="<name>|</name>" file

DGPickett · December 27, 2010, 9:21am

Note that none of these used bash itself, as text processing is more the realm of sed and awk!

m.d.ludwig · December 27, 2010, 2:29pm

Maybe not the "shell scripting" answer you are looking for:

perl -e '$/ = undef; $, = $\ = "\n"; $_ = <>; print m{<name>(.*?)</name>}g;' inputfile...

For an awk solution:

awk 'BEGIN { RS = "<"; FS = ">"; } $1 == "name" { print $2; }' inputfile....
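
To see why the awk one-liner catches every element, here is a small sketch run on a line with two elements. With RS set to <, each tag starts a new record, and with FS set to >, the tag name lands in $1 and the text after it in $2. The second function is a variation on the command from the original question: it loops over the even-numbered fields instead of printing only $2, and it assumes each element opens and closes on the same line. Both function names are made up for illustration.

```shell
#!/bin/sh
# Record-per-tag approach: the records for the sample line below are
#   "", "name>ho", "/name>", "name>ha", "/name>"
names_by_record() {
    awk 'BEGIN { RS = "<"; FS = ">" } $1 == "name" { print $2 }'
}

# Field-loop approach: with the tags as field separator, the text of
# each element sits in fields 2, 4, 6, ... of the line.
names_by_field() {
    awk -F'<name>|</name>' '{ for (i = 2; i < NF; i += 2) print $i }'
}

sample='<name>ho</name><name>ha</name>'
printf '%s\n' "$sample" | names_by_record   # prints ho, then ha
printf '%s\n' "$sample" | names_by_field    # prints ho, then ha
```

Both rely only on a single-character RS and a regular-expression FS, so neither needs the regular-expression RS that some awk implementations do not support.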

DGPickett · December 27, 2010, 5:08pm

Well, I am a ksh guy, so it would be a lot of 'read l' and "${l#*<}" and such, but definitely doable.

Don't become one of those managers that summons an expert but then tells their expert the right tool to use!

m.d.ludwig · December 27, 2010, 5:28pm

I am also a ksh guy ... since when it was first released (it's been a long strange trip...). But filtering text files, PERL just rocks. Put the two together, and we "try to take over the world!".

As a recommendation, O'Reilly's Mastering Regular Expressions is seriously meaty when it comes to regexes.

DGPickett · December 27, 2010, 9:15pm

Well, PERL is more a language than a script, although some use it more to call executables than PERL libs! It has a pretty unique place, being as full featured as C/C++/JAVA but more script-interpret-like than JAVA, where you start worrying about heap space. I just jump all the way back and forth from ksh to C/C++/JAVA without stopping in the middle. Some I know live mostly in the middle happily enough.

The REGEX people seem to have moved in a PERL direction, to the point of introducing incompatibilities, and you now have to test which era you are coding in! Is it '\b', '\<' or '(^|[^a-zA-Z0-9_])' for the left side of a word or identifier?

fpmurphy · December 28, 2010, 1:49am

ksh93 solution using regular expression with backreference:

re='(<name>)(?*)(</name>)'

while read line
do
   [[ $line == ${re} ]] && print ${line/${re}/\2}
done < infile

DGPickett · December 28, 2010, 12:32pm

Does that ksh93 bit work for 2+ name elements on the same line?

Usually, you have to trim the left one you find off the line and then reevaluate the rest of the line.
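
That trim-and-rescan idea can be sketched with nothing but parameter expansion, which behaves the same in ksh, bash and POSIX sh. This is only an illustration: the function name extract_names is made up, and the sketch assumes each element opens and closes on the same line and that the tags carry no attributes.

```shell
#!/bin/sh
extract_names() {
    while IFS= read -r line || [ -n "$line" ]; do
        # Keep going while the line still holds a complete element.
        while :; do
            case $line in
                *'<name>'*'</name>'*) ;;
                *) break ;;
            esac
            rest=${line#*'<name>'}       # drop everything up to the first <name>
            text=${rest%%'</name>'*}     # keep what precedes the next </name>
            printf '%s\n' "$text"
            line=${rest#*'</name>'}      # rescan whatever follows that </name>
        done
    done
}

printf '%s\n' '<name>ho</name><name>ha</name>' | extract_names   # prints ho, then ha
```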

m.d.ludwig · December 28, 2010, 4:39pm

@fpmurphy: I tried your ksh script, and it did not work running version "sh (AT&T Research) 93t+ 2010-06-21". Can you enlighten me as to what I did wrong?

DGPickett · December 29, 2010, 2:38pm

I used dtksh, which is a ksh 93 derivative, and got a blank line:

$ echo '<name>ho</name><name>ha</name>' >infile
$ while read line
do
   [[ $line == ${re} ]] && print ${line/${re}/\2}
done < infile
 
$