Extract URLs from index.html downloaded using wget

Hi,
I basically need to get a list of all the tarballs located at a URI.
I am currently doing a wget on the URI to get the index.html page.

This index page contains the list of URIs that I want to use in my bash script.

Can someone please guide me?

I am new to Linux and shell scripting.

Thanks,
M

You want to look at wget's recursive download options, in particular -r (recursive) and -l (level).

Typically wget -r -l 1 http://my.site.com/index.html
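
If you only care about the tarballs, wget's accept list can limit what the recursive fetch grabs. A minimal sketch, assuming the archives end in .tar.gz (http://my.site.com/ is just a placeholder):

# limit a one-level recursive fetch to .tar.gz files; -np stops it climbing to parent directories
wget -r -l 1 -np -A '*.tar.gz' http://my.site.com/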

This creates a directory structure of the site itself. I do not want to create a directory structure. Basically, just like index.html, I want another text file that contains all the URLs present on the site.

Thanks,
M

Oh I see, how about this:

awk 'BEGIN{ RS="<a *href *= *\""} NR>2 {sub(/".*/,"");print; }' index.html
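
Since the end goal is a list of tarballs to use from a bash script, you can pipe that awk output through a filter and loop over the result. A minimal sketch, assuming gawk (treating RS as a regular expression is a gawk extension), archives ending in .tar.gz or .tar.bz2, and absolute URLs in the hrefs; tarball_urls.txt is just a name chosen for illustration:

# pull every href out of index.html (NR>1 keeps the first link as well), keep only tarballs
awk 'BEGIN{ RS="<a *href *= *\"" } NR>1 { sub(/".*/, ""); print }' index.html | grep -E '\.tar\.(gz|bz2)$' > tarball_urls.txt

# download each one
while read -r url; do
    wget "$url"
done < tarball_urls.txt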

Thank you! That helped a lot. :)

Another way to do it, with lynx:

lynx -dump http://www.domain.com | grep -A999 "^References$" | tail -n +3 | awk '{print $2}'
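
lynx -dump appends a numbered "References" section listing every link on the page, which is why the URL is the second field of each entry. To keep only the tarballs, a minimal sketch along the same lines (again assuming .tar.gz archives and a placeholder URL):

# print only the reference URLs that end in .tar.gz
lynx -dump http://my.site.com/ | grep -A999 '^References$' | awk '/\.tar\.gz$/ {print $2}'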