Grabbing URLs from a website

Hi everyone,

I need your help creating a script that downloads a website after I enter its URL (e.g. Google), parses its contents for URLs, and appends the output to a text file in this format:

http://grabbed-url-from-downloaded-website.com/
http://another-url-from-downloaded-website.com/

I was thinking maybe wget, plus a script that greps for http* or www* and lists everything it can find in a flat file.
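
Something along these lines is what I have in mind (just a rough sketch; "http://www.example.com/", page.html and urls.txt are placeholders I made up):

#!/bin/sh
# fetch the page quietly to a file, then pull out anything that looks like a URL
wget -qO page.html "http://www.example.com/"
grep -Eo '(http|https)://[^"<> ]+|www\.[^"<> ]+' page.html >> urls.txt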

Thanks!
Ogoy

Hi

Exactly. What you are trying to do is fine. Have you tried writing such a script yet? I have done it a couple of times, and it is fairly simple. Let me know if you need any help parsing the contents.

~$u)hir

You might want to try lwp-request; see man lwp-request.
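
If I remember right, something like this will dump the links it finds in an HTML page (the -o option needs the HTML-handling Perl modules installed, and the exact output format may need a little cleanup); www.example.com and urls.txt are just placeholders:

lwp-request -o links "http://www.example.com/" >> urls.txt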

I've used the following with fairly good results:

links -dump "http://www.cnn.com" | egrep -o "http:.*"

I append the output with > urlist.list, but I am unsure how reliable this is. In other words: yes, I do need help with parsing, haha!
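
One variation I have been playing with pulls the href attributes out directly, but I am not sure it catches everything (relative links come out as-is, for example); www.example.com is just a placeholder here:

wget -qO- "http://www.example.com/" | grep -Eo 'href="[^"]+"' | sed -e 's/^href="//' -e 's/"$//' >> urlist.list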

Thanks for the quick responses, guys. :)

Ogoy