extract a number within an xml file

tret · October 6, 2008, 4:36pm

Hi Everyone, I have an sh script that I am working on and I have run into a little snag that I am hoping someone here can assist me with.

I am using wget to retrieve an xml file from thetvdb.com. This part works ok but what I need to be able to do is extract the series ID # from the xml and put the number into a variable for further use in my script.

Here is an example of an xml:

<Data>

<Series>
<seriesid>73545</seriesid>
<language>en</language>
<SeriesName>Battlestar Galactica (2003)</SeriesName>
<banner>graphical/73545-g11.jpg</banner>

<Overview>blah blah blah blah</Overview>
<FirstAired>2003-12-01</FirstAired>
<IMDB_ID>tt0407362</IMDB_ID>
<zap2it_id>SH710749</zap2it_id>
<id>73545</id>
</Series>
</Data>

The number in <seriesid>73545</seriesid> is what I am interested in and
in this case I need "73545" extracted and placed into a variable.

How can I accomplish this? I have been able to print out the entire line using grep, so it looks like this, but I need just the number:
<seriesid>73545</seriesid>

I haven't been able to figure out sed or awk enough to make this happen. Help?

Thanks!
Rob

vidyadhar85 · October 6, 2008, 4:47pm

Not sure, but try this and see:

variable=`sed -ne 's/\(^\<seriesid\>\)\(.*\)\(\<\/seriesid\>\)/\2/p' filename`

tret · October 6, 2008, 5:52pm

Beautiful my friend!

Many thanks

tret · October 6, 2008, 6:48pm

OK, I just ran into another snag. In some cases there are multiple hits for a given TV show:

<Data>

<Series>
<seriesid>77847</seriesid>
<language>en</language>
<SeriesName>MacGyver</SeriesName>
<banner>graphical/706-g2.jpg</banner>

<Overview>
blah blah blah
</Overview>
<FirstAired>1985-09-29</FirstAired>
<zap2it_id>SH002714</zap2it_id>
<id>77847</id>
</Series>

<Series>
<seriesid>83158</seriesid>
<language>en</language>
<SeriesName>Young MacGyver</SeriesName>

<Overview>
blah blah blah
</Overview>
<IMDB_ID>tt0352117</IMDB_ID>
<id>83158</id>
</Series>
</Data>

Is there a way to make it compare the name in <SeriesName></SeriesName> to a variable I provide and only capture the seriesid for the match?

Thanks

vidyadhar85 · October 6, 2008, 6:58pm

use head -1 after that command

variable=`sed -ne 's/\(^\<seriesid\>\)\(.*\)\(\<\/seriesid\>\)/\2/p' filename|head -1`

tret · October 6, 2008, 7:16pm

This did the trick, thanks again Vid!!! I really need to learn how to use sed!

Thanks

tret · October 6, 2008, 9:40pm

Hey vid!

OK, so I'm not sure why, but this worked just fine on a Mac OS X machine, yet when I use the exact same thing on a Linux box it just gives me blank output. I duplicated everything exactly.

I am running Ubuntu Hardy. Any ideas?

Thanks
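
A likely explanation, assuming the Ubuntu box runs GNU sed: there, \< and \> are word-boundary anchors rather than literal angle brackets, so ^\<seriesid\> only matches a line that starts with the word seriesid, and the line in the file starts with <. The sed on the Mac apparently treats them as plain < and >. Here is a sketch that drops the escapes and should behave the same on both (series.xml is a made-up file name standing in for the wget download):

```shell
# Hypothetical sample file standing in for the wget download.
cat > series.xml <<'EOF'
<Data>
<Series>
<seriesid>73545</seriesid>
<SeriesName>Battlestar Galactica (2003)</SeriesName>
<id>73545</id>
</Series>
</Data>
EOF

# Plain < and > need no escaping in a basic regular expression.
# Using | as the s-command delimiter avoids escaping the / in </seriesid>.
seriesid=$(sed -n 's|.*<seriesid>\([0-9]*\)</seriesid>.*|\1|p' series.xml)
echo "$seriesid"
```

With more than one <Series> block this still prints every seriesid, one per line, so it only answers the single-hit case.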

ghostdog74 · October 6, 2008, 10:29pm

If you have PHP (or another language that supports parsing XML):

<?php
$seriesname = $argv[1];
$result=file_get_contents("file");
$xml = new SimpleXMLElement($result);
foreach ($xml->Series as $series) {
   if ( strpos($series->SeriesName,$seriesname) !==FALSE ){
        echo $series->seriesid." ";
   }
}
?>

usage:

#!/bin/bash
seriesname=$1
seriesID=`php5 test.php "$seriesname"`
echo "$seriesID"

danmero · October 6, 2008, 10:55pm

Just because this is wrong:

The regexp will match any seriesid regardless of your condition(s),

and @vidyadhar85, piping through head -1 just prints (cheats with) the first match, even if it's not the right one.

One solution would be:

$ v="Young MacGyver"
$ awk -F'[<|>]' -v v="$v" '$2=="seriesid"{s=$3}$2=="SeriesName" && $3==v{print s}' file
83158

That is, keep the seriesid on hold and print it only if $3 on the SeriesName line matches.
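
To see why the tag name ends up in $2 and its content in $3, here is a small sketch that splits a single line with the same field separator (the variable names line, tag and value are only for illustration):

```shell
# One line as it appears in the XML file.
line='<seriesid>83158</seriesid>'

# With < and > as separators, $1 is the empty text before the first <,
# $2 is the tag name and $3 is the text between the tags.
tag=$(printf '%s\n' "$line" | awk -F'[<|>]' '{ print $2 }')
value=$(printf '%s\n' "$line" | awk -F'[<|>]' '{ print $3 }')
echo "tag=$tag value=$value"
```

This relies on each tag sitting on its own line, as in the samples above.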

tret · October 7, 2008, 12:07pm

Hey danmero, this works really well, thanks! I played around with it and it only takes the seriesid if the SeriesName matches. Very cool.

Is there any way to modify this so it isn't case sensitive? I have a little trouble with show names such as NCIS or ER or other abbreviated titles that may or may not be in caps.

Thanks again!

danmero · October 7, 2008, 7:10pm

Use awk's tolower function:

awk -F'[<|>]' -v v="$v" '$2=="seriesid"{s=$3}$2=="SeriesName" && tolower($3)==tolower(v){print s}' file
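
Since the original goal was to get the number into a variable, here is one way the case-insensitive match could be wrapped up in a script. It is only a sketch: the function name lookup_seriesid and the file name macgyver.xml are made up for the example, and the exit is an addition so that only the first matching series is printed.

```shell
# Hypothetical sample file mirroring the two-series example in this thread.
cat > macgyver.xml <<'EOF'
<Data>
<Series>
<seriesid>77847</seriesid>
<SeriesName>MacGyver</SeriesName>
<id>77847</id>
</Series>
<Series>
<seriesid>83158</seriesid>
<SeriesName>Young MacGyver</SeriesName>
<id>83158</id>
</Series>
</Data>
EOF

# lookup_seriesid NAME FILE (hypothetical helper name)
# Prints the seriesid of the first series whose name equals NAME, ignoring case.
# The separator is written [<>] here; the | in [<|>] is a literal pipe and is not needed.
lookup_seriesid() {
    awk -F'[<>]' -v v="$1" '
        $2 == "seriesid" { s = $3 }    # remember the most recent id
        $2 == "SeriesName" && tolower($3) == tolower(v) { print s; exit }
    ' "$2"
}

seriesID=$(lookup_seriesid "young macgyver" macgyver.xml)
echo "$seriesID"
```

The match is on the whole name, so "macgyver" picks the first series and "young macgyver" the second.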