Read content between XML tags with awk, grep, sed or whatever...


I am trying to extract text that is surrounded by XML tags. I tried this:

cat tst.xml | egrep "<SERVER>.*</SERVER>" |sed -e "s/<SERVER>\(.*\)<\/SERVER>/\1/"|tr "|" " "

which works perfectly if the start tag and the end tag are on the same line, e.g.:

<tag1>Hello Linux-Users</tag1>

but if I have something like this:


it doesn't do anything. I think the problem is that the tools I used work line by line, so there's no way for them to recognize
the end tag... I'm not very experienced with awk, sed and grep, so I need some help...

Hope someone can help...


Hi, Sebi0815:

Perhaps you can change each newline to a space, so that the data appears as one long line. This is a naive approach, but if it doesn't affect the semantics of your data it may be sufficient.

tr '\n' ' ' < tst.xml | egrep...

Or delete them altogether:

tr -d '\n' < tst.xml | egrep...
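
One thing to watch once everything is on a single line: a greedy ".*" will run from the first start tag to the last end tag. A rough sketch of a complete pipeline follows, assuming the tag content contains no "<" and that your grep supports -o (GNU, BSD and busybox grep do; it is not in POSIX). The function name flatten_extract and the sample input are made up for illustration:

```shell
# Flatten stdin to one line, emit each <SERVER>...</SERVER> pair on its own
# line, then strip the tags. Assumes the content itself contains no "<".
flatten_extract() {
    tr '\n' ' ' |
        grep -o '<SERVER>[^<]*</SERVER>' |
        sed -e 's/<SERVER>\(.*\)<\/SERVER>/\1/'
}

# Hypothetical sample input: two elements, the second spanning two lines.
printf '<SERVER>alpha</SERVER> <SERVER>beta\ngamma</SERVER>\n' | flatten_extract
```

The line break inside the second element becomes a space, so this only helps if the newlines do not matter.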


Thanks for the fast answer Alister...

... but this solution won't work for me. I need to keep the newlines in the text.

Here's a Perl solution. Assume your file is as follows -

$ cat sample.xml
<?xml version="1.0"?>
   <book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <description>An in-depth look at creating applications with XML.</description>
   <book id="bk102">
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   <book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
      <description>After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.</description>
   <book id="bk104">
      <author>Corets, Eva</author>
      <title>Oberon's Legacy</title>
      <description>In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.</description>
   <book id="bk105">
      <author>Corets, Eva</author>
      <title>The Sundered Grail</title>
      <description>The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.</description>

You want to pick up the text between the "<description>" and "</description>" tags.

The first occurrence is on a single line. The rest of them span multiple lines and you want the newlines to be preserved. I shall assume that you want the whitespace to be preserved as well.

Here's the script -

$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){print $1}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.
After the collapse of a nanotechnology
      society in England, the young survivors lay the
      foundation for a new society.
In post-apocalypse England, the mysterious
      agent known only as Oberon helps to create a new life
      for the inhabitants of London. Sequel to Maeve
      Ascendant.
The two daughters of Maeve, half-sisters,
      battle one another for control of England. Sequel to
      Oberon's Legacy.

In case you want the newlines preserved, but want to remove the leading whitespace on each line, then -

$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\n\s*/\n/g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.
After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.
In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.
The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.

And in case you want neither the newlines nor the leading whitespace, i.e. each chunk between "<description>" tags on a single line, then -

$ perl -lne 'BEGIN{undef $/} while (/<description>(.*?)<\/description>/sg){($x = $1) =~ s/\s*\n\s*/ /g; print $x}' sample.xml
An in-depth look at creating applications with XML.
A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.
After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.
The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.



The following is about as smart as your original solution; it will not work correctly if this tag can be embedded within itself, nor if there are multiple instances of it on a single line. If you require more intelligence, perhaps it is time to step up to a tool that understands XML.

$ cat data

<tag2>Good Bye</tag2>

$ sed -n '/<tag2>/,/<\/tag2>/H; /<tag2>/h; /\/tag2/{x;s/<tag2>\(.*[^\n]\)\n*<\/tag2>/\1/p;}' data
Good Bye
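
If several instances on one line do matter, a small awk state machine is one way to cope with them while still keeping the newlines. This is only a sketch: it assumes tags without attributes and no nesting, and the function name extract_tag is invented for the example. It uses nothing beyond index() and substr(), so it should run in any POSIX awk.

```shell
# Print the content of every <tag>...</tag> element read from stdin.
# $1 is the tag name. Assumes no attributes and no nested occurrences.
extract_tag() {
    awk -v tag="$1" '
    BEGIN { otag = "<" tag ">"; ctag = "</" tag ">" }
    {
        line = $0
        while (1) {
            if (!inside) {
                i = index(line, otag)
                if (i == 0) break              # no opening tag left on this line
                line = substr(line, i + length(otag))
                inside = 1
                buf = ""
            }
            j = index(line, ctag)
            if (j == 0) {                      # element continues on the next line
                buf = buf line "\n"
                break
            }
            print buf substr(line, 1, j - 1)   # complete element: print its content
            line = substr(line, j + length(ctag))
            inside = 0
        }
    }'
}

# Hypothetical input: one element over two lines, then two on one line.
printf '<tag2>Hello\nLinux-Users</tag2>\n<tag2>a</tag2> x <tag2>b</tag2>\n' |
    extract_tag tag2
```

Text outside the tags (the " x " above) is dropped, and each element's content is printed with its internal line breaks intact.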


EDIT: I'm sorry, this won't really work: it also prints any other text it comes across. Someone with more awk experience may be able to fix that too.

Here's an awk line I got from someone here for a similar problem. I changed it to suit your problem, but it puts out some blank lines at the end and I don't know enough awk to fix that. Maybe someone else can perfect it.

It extracts everything between the opening and closing tags that you specify; it doesn't matter whether the content is on one line or spans several. You can also use "awk command file" to run it on a file.

# echo '<tag2>Hello
<tag2>Hello Linux-Users</tag2>' | awk 'BEGIN{ RS="</tag2>"}{gsub(/.*<tag2>/,"");print}'
Hello Linux-Users
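
One possible fix for the blank lines and the stray text is to print only those records that actually contain the opening tag. A sketch, assuming an awk that accepts a multi-character RS (gawk and mawk do; strict POSIX awk only looks at the first character). The function name between_tag2 and the sample input are made up:

```shell
# Split the input into records at each closing tag, then keep only records
# that contain an opening tag and cut away everything up to and including it.
between_tag2() {
    awk 'BEGIN { RS = "</tag2>" } /<tag2>/ { sub(/.*<tag2>/, ""); print }'
}

# Hypothetical input: one multi-line element followed by unrelated text.
printf '<tag2>Hello\nLinux-Users</tag2>\nother text\n' | between_tag2
```

The last record ("other text") has no opening tag, so it is skipped instead of being printed.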
