How to extract URLs from an HTML page?

14th · October 16, 2010, 12:36am

For example, I have an HTML file that contains

<a href="http://awebsite"  id="awebsite" class="first">website</a>

and sometimes a line contains more than one link, for example

<a href="http://awebsite"  id="awebsite" class="first">website</a><a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>

How can I extract the links so that the output looks something like this?

http://awebsite website
http://bwebsite websiteb

I only know how to get the text between <a> and </a> using

sed -e 's/<[^<]*>//g'

but I don't know how to get the link.
Thanks.

kurumi · October 16, 2010, 1:09am

#!/usr/bin/env ruby -Ku

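require 'hpricot'
# Parse the file into an Hpricot document
doc = open("file"){|f|Hpricot(f)}
# Visit every <a> element and print its href and inner text
(doc/"a").each do |x|
 print "-->#{x.get_attribute("href")}, #{x.inner_text}\n"
end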

$ ruby geturl.rb
-->http://awebsite, website
-->http://bwebsite, websiteb

14th · October 16, 2010, 1:32am

kurumi:

http://hpricot.com/

#!/usr/bin/env ruby -Ku

require 'hpricot'
doc = open("file"){|f|Hpricot(f)}
(doc/"a").each do |x|
 print "-->#{x.get_attribute("href")}, #{x.inner_text}\n"
end

$ ruby geturl.rb
-->http://awebsite, website
-->http://bwebsite, websiteb

Thanks for the answer, but I don't really know what hpricot is or how to use it.
I'm new to shell programming and I don't know Ruby at all.

Do you know of any solution using sed, awk, or grep?
Thanks.

malcomex999 · October 16, 2010, 4:18am

Try...

 
awk -F'href="|"  |">|</' '{for(i=2;i<=NF;i=i+4) print $i,$(i+2)}' infile

kurumi · October 16, 2010, 5:08am

$ cat file
<a href="http://awebsite"  id="awebsite" class="first" someattribute="last" > website</a>
<a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>

$ awk -F'href="|"  |">|</' '{for(i=2;i<=NF;i=i+4) print $i,$(i+2)}' file
http://awebsite a>
http://bwebsite websiteb

$ ruby test.rb
-->http://awebsite,  website
-->http://bwebsite, websiteb

Scrutinizer · October 16, 2010, 5:13am

grep -o 'http://[^"]*'

kurumi · October 16, 2010, 5:22am

The OP also needs the inner text, not just the URLs.

Scrutinizer · October 16, 2010, 5:31am

Oops...

awk '/^id/{print $2}/^href/{printf $2 RS}' FS='"' RS=" " infile

malcomex999 · October 16, 2010, 5:53am

<a href="http://awebsite"  id="awebsite" class="first" someattribute="last" > website</a>
<a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>

Nice one, but I guess the OP needs only the link text (website, websiteb), not $2 after id.


kurumi:

$ cat file
<a href="http://awebsite"  id="awebsite" class="first" someattribute="last" > website</a>
<a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>
 
$ awk -F'href="|"  |">|</' '{for(i=2;i<=NF;i=i+4) print $i,$(i+2)}' file
http://awebsite a>
http://bwebsite websiteb
 
$ ruby test.rb
-->http://awebsite,  website
-->http://bwebsite, websiteb

I guess there is no space before the closing > in the OP's original requirement, so it works fine, as shown below, once you remove that space:

$ cat file
<a href="http://awebsite"  id="awebsite" class="first" someattribute="last"> website</a>
<a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>
 
$ awk -F'href="|"  |">|</' '{for(i=2;i<=NF;i=i+4) print $i,$(i+2)}' file
http://awebsite  website
http://bwebsite websiteb

kurumi · October 16, 2010, 6:15am

HTML is not regular. You can't know whether a page will have more or fewer spaces in the future as it changes. Ideally, we should use a parser.
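
As a rough illustration of why fixed separators are fragile, here is a minimal Ruby sketch that uses only the standard library. It is still a pattern match rather than a real parser, so treat it as a stopgap; extract_links is a made-up helper name, and the sample lines are the ones posted earlier in this thread. It assumes href values are double-quoted and that no attribute value contains a > character.

```ruby
# Hypothetical helper: returns [href, text] pairs for each <a ...>...</a>.
# This is a tolerant regular expression, not a real HTML parser.
def extract_links(html)
  links = []
  # "<a", its attributes, then the body up to the closing </a>
  html.scan(/<a\s+([^>]*)>(.*?)<\/a\s*>/im) do |attrs, body|
    # Look href up by name, wherever it sits among the attributes.
    href = attrs[/(?:\A|\s)href\s*=\s*"([^"]*)"/i, 1]
    next if href.nil?
    text = body.gsub(/<[^>]*>/, "").strip # drop nested tags, trim spaces
    links << [href, text]
  end
  links
end

sample = <<~HTML
  <a href="http://awebsite"  id="awebsite" class="first" someattribute="last" > website</a>
  <a href="http://awebsite"  id="awebsite" class="first">website</a><a href="http://bwebsite"  id="bwebsite" class="first">websiteb</a>
HTML

extract_links(sample).each { |href, text| puts "#{href} #{text}" }
```

Because the href is found by name and the text is trimmed, the extra space before > in the first sample line no longer changes the result. A proper parser is still the safer choice for real pages.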

Scrutinizer · October 16, 2010, 6:16am

Wow I am really having trouble reading today... This then maybe?

awk '$2=="a href="{printf $3; getline; print OFS $1}' RS='>' FS='["<]'

kurumi · October 16, 2010, 6:36am

Nice. It works with the OP's mini sample. But when I used it to parse a page like Google's, for example, I hit an error.

awk: (FILENAME=file FNR=239) fatal: not enough arguments to satisfy format string
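
This message is what gawk reports when the first argument to printf, which is treated as a format string, contains a % sequence with no matching argument. URLs with percent-escapes such as %3F trigger it, and the usual awk remedy is to write printf "%s", $3 instead of printf $3. The same class of mistake can be reproduced in Ruby, as the small sketch below shows; the URL is a shortened form of one from the Google output in this thread, used here only for illustration.

```ruby
# Shortened URL for illustration; it contains percent-escapes such as %3F.
url = "/url?q=http://www.google.com/ig%3Fhl%3Den"

# Unsafe: the data itself is used as the format string.
begin
  printf(url)
rescue ArgumentError => e
  puts "format error: #{e.message}"
end

# Safe: a fixed format string, with the data passed as an argument.
printf("%s\n", url)
```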

Scrutinizer · October 16, 2010, 6:46am

OK, the next attempt then:

awk -F'>' '/^a href/{split($1,F,"\"");print F[2],$NF}' RS='<'

kurumi · October 16, 2010, 6:59am

Nice. It worked. Now, how about taking care of HTML entities, such as the one for the down-arrow symbol?
Gawk doesn't return it, but my parser does. Is there any way to handle it with gawk?
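
For reference, decoding an entity is a separate step from locating the link, and it can be sketched in plain Ruby. Everything named here is illustrative: decode_entities is a hypothetical helper, the table covers only a handful of named entities, and &#9660; (a down-pointing triangle) is a stand-in, since the exact entity on the page is not shown in this thread.

```ruby
# Illustrative table; real pages may use many more named entities.
NAMED_ENTITIES = {
  "amp" => "&", "lt" => "<", "gt" => ">", "quot" => '"', "nbsp" => "\u00A0"
}.freeze

# Hypothetical helper: decodes numeric entities (decimal and hex) and the
# few named entities listed above; unknown names are left untouched.
def decode_entities(text)
  text.gsub(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/i) do
    ref = Regexp.last_match(1)
    if ref.downcase.start_with?("#x")
      [ref[2..-1].hex].pack("U")   # hexadecimal code point
    elsif ref.start_with?("#")
      [ref[1..-1].to_i].pack("U")  # decimal code point
    else
      NAMED_ENTITIES.fetch(ref) { "&" + ref + ";" }
    end
  end
end

puts decode_entities("more &#9660;")
puts decode_entities("Advertising&nbsp;Programs")
```

The sketch does not validate code points, so an out-of-range numeric entity would raise an error rather than be passed through.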

Scrutinizer · October 16, 2010, 7:27am

Do you have a sample url?

kurumi · October 16, 2010, 12:21pm

It's just Google. Google is just an example.

Scrutinizer · October 16, 2010, 12:50pm

When I do wget www.google.com, I see no such character.

kurumi · October 16, 2010, 11:48pm

OK, the first time I ran my test, I copied the HTML using the browser's view source. But since you mentioned wget, here's how I did my next test, using wget to download Google:

$ wget 209.85.132.104

$ awk -F'>' '/^a href/{split($1,F,"\"");print F[2],$NF}' RS='<' index.html
http://mail.google.com/mail/?hl=en&tab=wm Gmail
http://www.google.com/intl/en/options/
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg iGoogle
/preferences?hl=en Settings
https://www.google.com/accounts/Login?hl=en&continue=http://209.85.132.104/ Sign in
/advanced_search?hl=en Advanced Search
/language_tools?hl=en Language Tools
/intl/en/ads/ Advertising�Programs
/services/ Business Solutions
/intl/en/about.html About Google
http://www.google.com/ncr Go to Google.com
/intl/en/privacy.html Privacy


$ ruby test.rb
-->http://www.google.com/imghp?hl=en&tab=wi, Images
-->http://video.google.com/?hl=en&tab=wv, Videos
-->http://maps.google.com/maps?hl=en&tab=wl, Maps
-->http://news.google.com/nwshp?hl=en&tab=wn, News
-->http://www.google.com/prdhp?hl=en&tab=wf, Shopping
-->http://mail.google.com/mail/?hl=en&tab=wm, Gmail
-->http://www.google.com/intl/en/options/, more �
-->/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg, iGoogle
-->/preferences?hl=en, Settings
-->https://www.google.com/accounts/Login?hl=en&continue=http://209.85.132.104/, Sign in
-->/advanced_search?hl=en, Advanced Search
-->/language_tools?hl=en, Language Tools
-->/intl/en/ads/, Advertising*Programs
-->/services/, Business Solutions
-->/intl/en/about.html, About Google
-->http://www.google.com/ncr, Go to Google.com
-->/intl/en/privacy.html, Privacy

If you look at the Google main page, there is a link called "more" right at the top with a down-arrow symbol next to it, which is reflected in the Ruby output as

-->http://www.google.com/intl/en/options/, more �

The down-arrow symbol is itself a URL link.

Scrutinizer · October 17, 2010, 1:56am

Thanks Kurumi, it was because of event definitions in the tag.

awk -F'>' '/[ \t]href=/{N=split($1,F,"\""); i=1; while(F[i++]!~/[ \t]href=/); print F[i],$NF}' RS='<'
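
The idea behind the awk one-liner above is to split the opening tag on double quotes, find the piece that ends in href=, and take the piece right after it, instead of assuming the URL is always the second piece. A rough Ruby sketch of the same logic follows; href_from_tag is a hypothetical name, and the sample tag is made up, with onclick standing in for any attribute that appears before href.

```ruby
# Hypothetical helper mirroring the awk logic: split the opening tag on
# double quotes, find the piece ending in href=, return the piece after it.
def href_from_tag(tag)
  pieces = tag.split('"')
  idx = pieces.index { |piece| piece =~ /[ \t]href=\z/ }
  idx && pieces[idx + 1] # nil when the tag has no href
end

# Made-up sample: an attribute precedes href, so the URL is no longer
# the second quote-delimited piece.
tag = '<a onclick="track(this)" href="http://awebsite" class="first"'

puts href_from_tag(tag)   # prints http://awebsite
puts tag.split('"')[1]    # prints track(this), what a fixed position returns
```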

kurumi · October 17, 2010, 2:42am

Thanks, Scrutinizer, for following up on this. You are almost there.

$ awk -F'>' '/[ \t]href=/{N=split($1,F,"\""); i=1; while(F[i++]!~/[ \t]href=/); print F[i],$NF}' RS='<' index.html |sort
/advanced_search?hl=en Advanced Search
http://mail.google.com/mail/?hl=en&tab=wm Gmail
http://maps.google.com/maps?hl=en&tab=wl Maps
http://news.google.com/nwshp?hl=en&tab=wn News
https://www.google.com/accounts/Login?hl=en&continue=http://209.85.132.104/ Sign in
http://video.google.com/?hl=en&tab=wv Videos
http://www.google.com/imghp?hl=en&tab=wi Images
http://www.google.com/intl/en/options/
http://www.google.com/ncr Go to Google.com
http://www.google.com/prdhp?hl=en&tab=wf Shopping
/intl/en/about.html About Google
/intl/en/ads/ Advertising�Programs
/intl/en/privacy.html Privacy
/language_tools?hl=en Language Tools
/preferences?hl=en Settings
/services/ Business Solutions
/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg iGoogle

$ ruby test.rb index.html |sort
-->/advanced_search?hl=en, Advanced Search
-->http://mail.google.com/mail/?hl=en&tab=wm, Gmail
-->http://maps.google.com/maps?hl=en&tab=wl, Maps
-->http://news.google.com/nwshp?hl=en&tab=wn, News
-->https://www.google.com/accounts/Login?hl=en&continue=http://209.85.132.104/, Sign in
-->http://video.google.com/?hl=en&tab=wv, Videos
-->http://www.google.com/imghp?hl=en&tab=wi, Images
-->http://www.google.com/intl/en/options/, more �
-->http://www.google.com/ncr, Go to Google.com
-->http://www.google.com/prdhp?hl=en&tab=wf, Shopping
-->/intl/en/about.html, About Google
-->/intl/en/ads/, Advertising*Programs
-->/intl/en/privacy.html, Privacy
-->/language_tools?hl=en, Language Tools
-->/preferences?hl=en, Settings
-->/services/, Business Solutions
-->/url?sa=p&pref=ig&pval=3&q=http://www.google.com/ig%3Fhl%3Den%26source%3Diglk&usg=AFQjCNFA18XPfgb7dKnXfKz7x7g1GDH1tg, iGoogle