Extracting URLs from curl output

Hello.

I use curl to fetch a website, and then I want to extract the URLs from curl's output.

I tried both sed and grep, but couldn't figure it out.

I've tried:

sed -n 's/href="\([^"]*\).*/\1/p' results.txt

and grep -o

grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">' results.txt

What pattern should I use, and what's wrong with mine?

EDIT:

Added some of the data I use

EDIT 2:
Removed the data sample because it ruins the thread width; just curl any website and use that output as data.

Hello jozo95,

Could you please try the following and let me know if it helps?

awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Output will be as follows.

<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">

Thanks,
R. Singh

Doesn't work.

I get output like this along the way:

g">Studievägledning</a></li><li><a href="http://www.bth.se/jobb">Lediga tjänster</a></li></ul></div><div class="footer-section set"><h5><br></h5><ul class="footer-list"><li><a href="http://www.bth.se/web/ombth.nsf/sidor/organisation">Organisation</a></li><li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">BTH Career Gate</a></li><li><a href="http://www.bth.se/for/Sakerhet.nsf/sidor/593cb6bf948640dac1257f1f00365b42?OpenDocument">I händelse av kris</a></li><li><a href=""></a></li><li><a href=""></a></li></ul> </div></span></div></div></div><div class="footer-info"><a class="footer-logo" href="http://www.bth.se">

Hello jozo95,

Sorry, I hadn't seen links without <img, so it didn't match them properly.
Could you please try the following and let me know if it helps?

awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file

Output will be as follows.

<a href="http://www.bth.se/web/nyheter.nsf/AllaDok?OpenView">
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
<a href="/web/kalendarium.nsf/sidor/52CB572F173DEE64C1257F3400428859?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/2743A2376777BC7BC1257F3400530744?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/0F9CAD034B2DD920C1257F3400533F5A?OpenDocument">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf/AllaDok?OpenView">
<a href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument">
<a href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView">
<a href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument">

Thanks,
R. Singh


Any href:

perl -nle 'while(/(href="[^"]*")/g){print $1}' curl_href
[...]
href="#"
href="#"
href="#"
href="#"
href="#"
href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
href="http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
href="http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
href="http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Hrefs starting with / or http:

perl -nle 'while(/href=("(?:http|\/)[^"]*")/g){print $1}' curl_href
[...]
"http://greencharge.se/?p=5691"
"/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
"/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
"http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
"/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
"/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
"/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
"/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
"http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
"http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
"http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Only domain names:

perl -nle 'while(m|href="(http://[^/"]*)|g){print $1}' curl_href
[...]
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.youtube.com
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://singingsingapore.wordpress.com
http://singingsingapore.wordpress.com
[...]

Unique domain names:

perl -nle 'while(m|href="(http://[^/"]*)|g){$sites{$1}++}END{for(keys %sites){print $_}}' curl_href
http://twitter.com
http://www.bth.se
http://greencharge.se
http://www.flickr.com
http://singingsingapore.wordpress.com
http://edu.bth.se
http://careergate.bth.se
http://se.linkedin.com
http://www.youtube.com
http://www.facebook.com
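
For anyone who would rather stay with grep and sed than Perl, the unique-domain step can probably be done with a plain pipeline too. This is only a rough sketch under assumptions: the sample HTML is made up for illustration, the helper name unique_domains is hypothetical, and unlike the Perl one-liner it also accepts https:// links.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct scheme://host found in href attributes.
unique_domains() {
    grep -oE 'href="https?://[^/"]+' "$@" |   # keep only href="scheme://host
        sed 's/^href="//' |                    # drop the href=" prefix
        LC_ALL=C sort -u                       # de-duplicate, byte-order sort
}

# Made-up sample input (not the real page from this thread).
sample='<a href="http://www.bth.se/jobb">A</a><a href="/web/kalendarium.nsf">B</a>
<a href="https://www.antagning.se/se/mypages">C</a><a href="http://edu.bth.se/utbildning/">D</a>
<a href="http://www.bth.se/bib">E</a>'

domains=$(printf '%s\n' "$sample" | unique_domains)
printf '%s\n' "$domains"
```

The function reads stdin when called without arguments, so something like curl -s URL | unique_domains should work as well. Relative links such as /web/... are skipped because they carry no host.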

That works well.

I solved it using this code:

grep -o '<a href="[a-z]\+[^>"]*' results.txt | sed -ne 's/^<a href="\(.*\)/\1/p'
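
To make it easier to see what that pipeline keeps and what it drops, here is a small sketch run against a made-up one-line sample (the sample and the variable names are assumptions, not the real page). The "loose" variant is not from this thread; it is just one possible way to also catch relative links and anchors whose href is not the first attribute.

```shell
#!/bin/sh
# Made-up one-line sample in the style of the page discussed in this thread.
sample='<li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a href="/web/kalendarium.nsf">Cal</a></li><li><a href=""></a></li><a class="footer-logo" href="http://www.bth.se">'

# Same two-stage pipeline as above, reading from stdin.
strict=$(printf '%s\n' "$sample" |
    grep -o '<a href="[a-z]\+[^>"]*' |
    sed -ne 's/^<a href="\(.*\)/\1/p')

# Looser variant (an assumption, not from the thread): any non-empty href value.
loose=$(printf '%s\n' "$sample" |
    grep -oE 'href="[^"]+"' |
    sed -e 's/^href="//' -e 's/"$//')

printf 'strict:\n%s\n\nloose:\n%s\n' "$strict" "$loose"
```

The strict pattern only matches anchors that begin with <a href=" followed by a lowercase letter, so relative links starting with / and the footer-logo anchor are left out; whether that is wanted depends on the use case.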


Unfortunately I don't know Perl yet, but thanks for your input anyway, much appreciated.

If you can use lynx, then

lynx -dump URL

produces good text output of a page. Every link on the page is listed in the References section at the end of the output.
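
If only the links are wanted, the References section can be filtered out of the dump. The following is a minimal sketch, assuming the dump ends with a "References" heading followed by numbered lines; the sample text is hand-written to imitate that layout and the helper name extract_refs is hypothetical, so real lynx output may need a slightly different pattern.

```shell
#!/bin/sh
# Hypothetical helper: print the URLs listed under "References" in lynx -dump output.
extract_refs() {
    awk '
        /^References/ { in_refs = 1; next }        # start of the link list
        in_refs && $1 ~ /^[0-9]+\.$/ { print $2 }   # lines like "   1. http://..."
    ' "$@"
}

# Hand-written sample imitating the tail of a lynx -dump; real output may differ.
dump='   [1]Organisation [2]Bibliotek

References

   1. http://www.bth.se/web/ombth.nsf/sidor/organisation
   2. http://www.bth.se/bib'

links=$(printf '%s\n' "$dump" | extract_refs)
printf '%s\n' "$links"

# With lynx installed, the same filter could be fed directly:
#   lynx -dump URL | extract_refs
```

lynx also has a -listonly option that, combined with -dump, prints just the link list, which may make the awk step unnecessary.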