Extract a substring.

I have a shell script that uses wget to fetch some HTML from a URL.

URL_DATA=`wget -qO - "$URL1"`

I now have a string, $URL_DATA, that I need to pull substrings out of. Say the string contains the following:

<p><a href="/scooby/929011567.html">Dog pictures check them out! -</a><font size="-1"> (Silly)</font></p> <p><a href="/shaggey/928861647.html">Vacation -</a><font size="-1"> (boating)</font></p> <p><a href="/gopher/928782568.html">Garden -</a><font size="-1"> (winter)</font></p>

I want to extract the URL, title and description of every entry in the string, like the following:

/scooby/929011567.html
Dog pictures check them out!
(Silly)

/shaggey/928861647.html
Vacation
(boating)

/gopher/928782568.html
Garden
(winter)

and so on, for as many times as the pattern occurs in the string. How would I do this?

Hi,

This is not elegant, but it works. First split the long line into chunks. You can save the result in an array or a temporary file.

sed -n -e 's!</font></p>\s*!\n!pg' file >> tempfile 

and now extract the data from the temp file:

sed -n 's!^[^/]*\([^"]*\)..\([^-]*\)[^(]*\(.*\)!\1\n\2\n\3!p' tempfile
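The two steps can also be chained in one pipeline, which avoids the temporary file. This is only a sketch: it assumes GNU sed (for \s and for \n in the replacement), and the variable names html and result are made up for the example. The sample string from the question stands in for the downloaded page.

```shell
#!/bin/sh
# Sample input from the question, standing in for the downloaded page.
html='<p><a href="/scooby/929011567.html">Dog pictures check them out! -</a><font size="-1"> (Silly)</font></p> <p><a href="/shaggey/928861647.html">Vacation -</a><font size="-1"> (boating)</font></p> <p><a href="/gopher/928782568.html">Garden -</a><font size="-1"> (winter)</font></p>'

# Step 1: break the long line after every </font></p>.
# Step 2: pull the link, the title and the description out of each chunk.
# Assumes GNU sed (\s, and \n in the replacement text).
result=$(printf '%s\n' "$html" |
  sed -e 's!</font></p>\s*!\n!g' |
  sed -n 's!^[^/]*\([^"]*\)..\([^-]*\)[^(]*\(.*\)!\1\n\2\n\3!p')

printf '%s\n' "$result"
```

Each entry comes out as three consecutive lines. The title keeps the space that preceded the " -" in the HTML, so trim it if that matters.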

If your sed doesn't support "\n" in the replacement, you have to write "\
" instead (a backslash, then press Return).

HTH Chris

This seems to work from the command line, but when I put it into my shell script I get the error:
"sed: unrecognized option '-->'"

Here's how I have it in my script:

biglines=`sed -n -e 's!</font></p>\s*!\n!pg' $URL_DATA`
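The error comes from how the variable is handed to sed. Everything after the sed script is treated as an input file name or an option, so the unquoted $URL_DATA is split into words, and a word such as --> (the end of an HTML comment in the page) is taken for an option. One possible fix, sketched below, is to feed the variable to sed on standard input. It assumes GNU sed, as in the original command, and uses the sample string from the question in place of the wget output.

```shell
#!/bin/sh
# Sample data from the question, standing in for: URL_DATA=$(wget -qO - "$URL1")
URL_DATA='<p><a href="/scooby/929011567.html">Dog pictures check them out! -</a><font size="-1"> (Silly)</font></p> <p><a href="/shaggey/928861647.html">Vacation -</a><font size="-1"> (boating)</font></p> <p><a href="/gopher/928782568.html">Garden -</a><font size="-1"> (winter)</font></p>'

# Pipe the (quoted) variable into sed so its contents are read as input,
# not parsed as file names or options. Assumes GNU sed (\s and \n).
biglines=$(printf '%s\n' "$URL_DATA" | sed -n -e 's!</font></p>\s*!\n!pg')

printf '%s\n' "$biglines"
```

Quoting "$URL_DATA" also keeps the shell from glob-expanding any * or ? that appears in the HTML.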

You can do it with awk:
echo $URL_DATA | awk -F '<p><a href="|">|-</a><font size="-1"> |</font>' 'BEGIN{RS="</p> "; OFS="\n"; ORS="\n\n"} {print $2,$3,$4}'
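Note that the one-liner depends on the unquoted echo turning every run of whitespace (including newlines) into a single space, which is what makes RS="</p> " match. A variant that quotes the variable and lets the record separator absorb the whitespace itself might look like the sketch below. It assumes an awk that accepts a regular expression as RS (gawk and mawk do, POSIX does not require it). The variable name records and the NF guard are additions for the example, and the field separator has a leading space added so the title comes out without a trailing blank.

```shell
#!/bin/sh
# Sample data from the question, standing in for the wget output.
URL_DATA='<p><a href="/scooby/929011567.html">Dog pictures check them out! -</a><font size="-1"> (Silly)</font></p> <p><a href="/shaggey/928861647.html">Vacation -</a><font size="-1"> (boating)</font></p> <p><a href="/gopher/928782568.html">Garden -</a><font size="-1"> (winter)</font></p>'

# RS is used as a regular expression: a closing </p> plus any whitespace after it.
# NF >= 4 skips records that do not contain a complete entry.
records=$(printf '%s\n' "$URL_DATA" |
  awk -F '<p><a href="|">| -</a><font size="-1"> |</font>' '
    BEGIN { RS = "</p>[ \t\n]*"; OFS = "\n"; ORS = "\n\n" }
    NF >= 4 { print $2, $3, $4 }')

printf '%s\n' "$records"
```

Entries are printed as three lines each, separated by a blank line, which matches the layout asked for in the question.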