Extract URLs from index.html downloaded using wget

Hi,
I basically need to get a list of all the tarballs located at a URI.
I am currently doing a wget on the URI to get the index.html page.

This index page contains the list of URIs that I want to use in my bash script.

Can someone please guide me?

I am new to Linux and shell scripting.

Thanks,
M

You want to look at wget's recursive download options, in particular -r (recursive) and -l (level).

Typically wget -r -l 1 http://my.site.com/index.html
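
If you only care about the tarballs, wget's accept list can limit what the recursive fetch grabs. A minimal sketch, assuming the archives end in .tar.gz (http://my.site.com/ is just a placeholder):

# limit a one-level recursive fetch to .tar.gz files; -np stops it climbing to parent directories
wget -r -l 1 -np -A '*.tar.gz' http://my.site.com/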

This creates a directory structure of the site itself. I do not want to create a directory structure. Basically, just like index.html, I want another text file that contains all the URLs present on the site.

Thanks,
M

Oh I see, how about this:

awk 'BEGIN{ RS="<a *href *= *\""} NR>2 {sub(/".*/,"");print; }' index.html
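
Since the end goal is a list of tarballs to use from a bash script, you can pipe that awk output through a filter and loop over the result. A minimal sketch, assuming gawk (treating RS as a regular expression is a gawk extension), archives ending in .tar.gz or .tar.bz2, and absolute URLs in the hrefs; tarball_urls.txt is just a name chosen for illustration:

# pull every href out of index.html (NR>1 keeps the first link as well), keep only tarballs
awk 'BEGIN{ RS="<a *href *= *\"" } NR>1 { sub(/".*/, ""); print }' index.html | grep -E '\.tar\.(gz|bz2)$' > tarball_urls.txt

# download each one
while read -r url; do
    wget "$url"
done < tarball_urls.txt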

Thank you! That helped a lot. :)

Another way to do it, with lynx:

lynx -dump http://www.domain.com | grep -A999 "^References$" | tail -n +3 | awk '{print $2}'
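
lynx -dump appends a numbered "References" section listing every link on the page, which is why the URL is the second field of each entry. To keep only the tarballs, a minimal sketch along the same lines (again assuming .tar.gz archives and a placeholder URL):

# print only the reference URLs that end in .tar.gz
lynx -dump http://my.site.com/ | grep -A999 '^References$' | awk '/\.tar\.gz$/ {print $2}'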