Extract URL from RSS Feed in AWK

Hi,
I have the following data file:

<outline title="Matt Cutts" type="rss" version="RSS" xmlUrl="http://www.mattcutts.com/blog/feed/" htmlUrl="http://www.mattcutts.com/blog"/>
<outline title="Stone" text="Stone" type="rss" version="RSS" xmlUrl="http://feeds.feedburner.com/STC-Art" htmlUrl="http://www.stone.com/S.shtml"/>
<outline title="Stone" text="Stone" type="rss" version="RSS" ymlUrl="http://feeds.feedburner.com/STC-Art" htmlUrl="http://www.stone.com/S.shtml"/>
<outline title="Adam Leventhal's Weblog" text="Adam Leventhal's Weblog" type="rss" version="RSS" xmlUrl="http://blogs.sun.com/ahl/feed/entries/atom" htmlUrl="http://blogs.sun.com/ahl/"/>

I just want to extract the URL in the xmlUrl attribute of each line and save it to another file. I want to do it in awk.

Thanks for your time.

regards

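#!/bin/bash
# Open the data file on file descriptor 6
exec 6<"file"
while read -r LINE<&6
do
  case "$LINE" in
   *xmlUrl*)
      # Strip everything up to and including xmlUrl="
      LINE=${LINE##*xmlUrl=\"}
      # Strip everything from the closing quote onwards and print the URL
      echo "${LINE%%\" *}";;
  esac
done
# Close file descriptor 6
exec 6<&-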

awk 'BEGIN{RS=FS}/^xmlUrl/{print $2}' FS='"' infile

Output:

http://www.mattcutts.com/blog/feed/
http://feeds.feedburner.com/STC-Art
http://blogs.sun.com/ahl/feed/entries/atom
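
Since the question also asks to save the URLs to another file, here is a minimal sketch of the same one-liner with its output redirected. The scratch directory and the names infile and urls.txt are only assumptions for illustration; the sample data is copied from the question.

```shell
# Work in a scratch directory so no existing files are touched.
workdir=$(mktemp -d)

# Sample data from the question; the third line has ymlUrl instead of xmlUrl,
# so it is skipped.
cat > "$workdir/infile" <<'EOF'
<outline title="Matt Cutts" type="rss" version="RSS" xmlUrl="http://www.mattcutts.com/blog/feed/" htmlUrl="http://www.mattcutts.com/blog"/>
<outline title="Stone" text="Stone" type="rss" version="RSS" xmlUrl="http://feeds.feedburner.com/STC-Art" htmlUrl="http://www.stone.com/S.shtml"/>
<outline title="Stone" text="Stone" type="rss" version="RSS" ymlUrl="http://feeds.feedburner.com/STC-Art" htmlUrl="http://www.stone.com/S.shtml"/>
<outline title="Adam Leventhal's Weblog" text="Adam Leventhal's Weblog" type="rss" version="RSS" xmlUrl="http://blogs.sun.com/ahl/feed/entries/atom" htmlUrl="http://blogs.sun.com/ahl/"/>
EOF

# Same awk one-liner as above; the shell redirection writes the URLs to urls.txt.
awk 'BEGIN{RS=FS}/^xmlUrl/{print $2}' FS='"' "$workdir/infile" > "$workdir/urls.txt"

# Show what was saved.
cat "$workdir/urls.txt"
```

Note that this approach relies on the URLs containing no spaces, because a space ends a record.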

Hi Scrutinizer,
Thanks a lot. It works. Could you please do me a favor and explain your code in words?

regards

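Hi fahdmirza, this awk script sets the record separator (RS) to the value the field separator (FS) has in the BEGIN block, which is still the default single space, so that every space-separated field of a line becomes a record of its own. The FS='"' assignment on the command line only takes effect after BEGIN, so each of these new records is then split into fields separated by double quotes. The required values are in the second field of the new records that start with xmlUrl.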
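
To see the first step in isolation, this small sketch prints every record awk produces when the record separator is a single space, which is the value FS still has inside BEGIN. The variable names line and records are only for illustration.

```shell
# One line of the sample data from the question.
line='<outline title="Matt Cutts" type="rss" version="RSS" xmlUrl="http://www.mattcutts.com/blog/feed/" htmlUrl="http://www.mattcutts.com/blog"/>'

# RS=" " makes every space-separated chunk a record; print each with its number.
# printf '%s' avoids a trailing newline ending up inside the last record.
records=$(printf '%s' "$line" | awk 'BEGIN{RS=" "} {print NR ": " $0}')

printf '%s\n' "$records"
```

Record 6 is the one that starts with xmlUrl, which is what the /^xmlUrl/ pattern picks out.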

RS=FS, good idea

Hi Scrutinizer, thanks for the reply. Pardon my ignorance, but I am still a little confused.

For example take the following line from the data:

<outline title="Matt Cutts" type="rss" version="RSS" xmlUrl="http://www.mattcutts.com/blog/feed/" htmlUrl="http://www.mattcutts.com/blog"/>

First, your code turns the full line above (the record) into one field by doing RS=FS.

Then it matches xmlUrl at the start of the line above, and now the field separator is ".

My question is: how does $2 come to contain the required URL? Please explain.

Thanks.

It is the other way around: it takes the fields in the line and turns every field into a record.
Then it matches the records that start with xmlUrl.

If the separator is " then there are three fields in the record that we are looking for:
$1 contains the part to the left of the first double quote, xmlUrl=
$2 contains the url and
$3 contains the part to the right of the second double quote, which is an empty string...

Does that answer your question?
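
As a quick check of the three fields described above, this sketch runs the same RS=FS trick on a single line from the question and prints NF together with $1, $2 and $3 for the matching record. The variable names line and fields are only for illustration.

```shell
# One line of the sample data from the question.
line='<outline title="Matt Cutts" type="rss" version="RSS" xmlUrl="http://www.mattcutts.com/blog/feed/" htmlUrl="http://www.mattcutts.com/blog"/>'

# Records are split on spaces (RS=FS in BEGIN), fields on double quotes (FS='"').
# For the record that starts with xmlUrl, show the field count and each field.
fields=$(printf '%s\n' "$line" |
  awk 'BEGIN{RS=FS} /^xmlUrl/ {printf "NF=%d $1=[%s] $2=[%s] $3=[%s]\n", NF, $1, $2, $3}' FS='"')

printf '%s\n' "$fields"
# prints: NF=3 $1=[xmlUrl=] $2=[http://www.mattcutts.com/blog/feed/] $3=[]
```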


Crystal clear. Many, many thanks.

best regards