Extract a substring.

shellpower · November 22, 2008, 12:16am

I have a shell script that uses wget to grab a bunch of html from a url.

URL_DATA=`wget -qO - "$URL1"`

I now have a string $URL_DATA that I need to pull substrings out of. Say I had the following in my string:

<a href="/scooby/929011567.html">Dog pictures check them out! -</a> (Silly) <a href="/shaggey/928861647.html">Vacation -</a> (boating) <a href="/gopher/928782568.html">Garden -</a> (winter)

I want to extract the URL, title and description of every link in the string, like the following:

/scooby/929011567.html
Dog pictures check them out!
(Silly)

/shaggey/928861647.html
Vacation
(boating)

/gopher/928782568.html
Garden
(winter)

and keep going with that pattern for as many times as it occurs in the string. How would I do this?

Christoph_Spohr · November 22, 2008, 4:13am

Hi,

Not elegant, but it works. First split the long line into chunks. You can save the result in an array or a temporary file.

sed -n -e 's!</font></p>\s*!\n!pg' file > tempfile

and now extract the data from the temp file:

sed -n 's!^[^/]*\([^"]*\)..\([^-]*\)[^(]*\(.*\)!\1\n\2\n\3!p' tempfile

If your sed doesn't support "\n" you have to write "\
" instead. (Backslash, then press return)

HTH Chris
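
The two steps can also be chained in a pipeline, so no temp file is needed. This is only a rough sketch: it assumes GNU sed (for "\n" in the replacement) and the sample markup from the question, it splits before each <a href= rather than on the closing tags, and extract_links is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed and the sample markup from the question.
# extract_links is a hypothetical helper name, not part of any tool.
extract_links() {
    # Step 1: start a new line before every anchor tag.
    # Step 2: capture URL, title and "(description)" from each line;
    #         the trailing \n leaves a blank line between records.
    printf '%s\n' "$1" |
        sed 's!<a href=!\n&!g' |
        sed -n 's!^[^/]*\([^"]*\)">\([^-]*\) -</a> *\(([^)]*)\).*!\1\n\2\n\3\n!p'
}

URL_DATA='<a href="/scooby/929011567.html">Dog pictures check them out! -</a> (Silly) <a href="/shaggey/928861647.html">Vacation -</a> (boating) <a href="/gopher/928782568.html">Garden -</a> (winter)'
extract_links "$URL_DATA"
```

As with the command above, a title that itself contains a "-" will not match, because the title is captured with [^-]*.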

shellpower · November 22, 2008, 11:02am

This seems to work from the command line, but when I put it into my shell script I get the error
"sed: unrecognized option '-->'"

here's how I have it in my script

biglines=`sed -n -e 's!\s*!\n!pg' $URL_DATA`

rc7 · November 22, 2008, 9:32pm

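You can do it with awk (this needs an awk that accepts a multi-character RS, such as gawk or mawk):
printf '%s\n' "$URL_DATA" | awk -F '">| -</a> ' 'BEGIN{RS="<a href=\""; OFS="\n"; ORS="\n\n"} NF>=3 {sub(/[ \n]+$/, "", $3); print $1,$2,$3}'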
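
Piping the variable in also sidesteps the sed error reported earlier in the thread. That error most likely comes from sed receiving the contents of $URL_DATA as command-line arguments, so a word such as --> in the page gets parsed as an option. A minimal sketch of the difference, assuming GNU sed; the sample string with an HTML comment is made up, and page_lines is a hypothetical helper name:

```shell
#!/bin/sh
# Made-up sample: an HTML comment followed by one link.
URL_DATA='<!-- nav --> <a href="/scooby/929011567.html">Dog pictures -</a> (Silly)'

# Broken: the unquoted variable is word-split into arguments, so sed
# sees "-->" as an option and the remaining words as file names.
if sed -n 'p' $URL_DATA >/dev/null 2>&1; then
    echo "unexpected: sed accepted the page text as arguments"
else
    echo "sed failed when given the page text as arguments"
fi

# Working: quote the variable and feed it to sed on standard input.
page_lines() {
    printf '%s\n' "$1" | sed 's!<a href=!\n&!g'
}
page_lines "$URL_DATA"
```

Using printf with a quoted variable instead of an unquoted echo also keeps the page's whitespace intact and avoids surprises if the text happens to start with a dash.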