Extracting URLs from curl output

jozo95 · January 16, 2016, 12:03pm

Hello.

I use curl to fetch a website, and then I want to extract the URLs from curl's output.

I tried both sed and grep, but couldn't figure it out.

I've tried:

sed -n 's/href="\([^"]*\).*/\1/p' results.txt

and grep -o:

grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">' results.txt

What pattern should I use, and what's wrong with mine?

EDIT:

Added some of the data I use

EDIT 2:
Removed the data sample because it ruins the thread width. Just curl any website and use that output as data.

RavinderSingh13 · January 16, 2016, 1:09pm

Hello jozo95,

Could you please try the following and let me know if it helps?

awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}'  Input_file

Output will be as follows.

<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">

Thanks,
R. Singh

jozo95 · January 16, 2016, 1:21pm

That doesn't work.

I get output like this along the way:

g">Studievägledning</a></li><li><a href="http://www.bth.se/jobb">Lediga tjänster</a></li></ul></div><div class="footer-section set"><h5><br></h5><ul class="footer-list"><li><a href="http://www.bth.se/web/ombth.nsf/sidor/organisation">Organisation</a></li><li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">BTH Career Gate</a></li><li><a href="http://www.bth.se/for/Sakerhet.nsf/sidor/593cb6bf948640dac1257f1f00365b42?OpenDocument">I händelse av kris</a></li><li><a href=""></a></li><li><a href=""></a></li></ul> </div></span></div></div></div><div class="footer-info"><a class="footer-logo" href="http://www.bth.se">

RavinderSingh13 · January 16, 2016, 2:10pm

Hello jozo95,

Sorry, I hadn't seen links without <img, so it didn't match them properly.
Could you please try the following and let me know if it helps?

awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}'   Input_file

Output will be as follows.

<a href="http://www.bth.se/web/nyheter.nsf/AllaDok?OpenView">
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
<a href="/web/kalendarium.nsf/sidor/52CB572F173DEE64C1257F3400428859?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/2743A2376777BC7BC1257F3400530744?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/0F9CAD034B2DD920C1257F3400533F5A?OpenDocument">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf/AllaDok?OpenView">
<a href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument">
<a href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView">
<a href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument">

Thanks,
R. Singh
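
The command above prints the whole opening tag. If only the bare URL is wanted, one possible variant splits on the same `<` and `>` delimiters and then cuts the quoted value out of each matching field. This is only a sketch and not part of the original reply: the function name `bare_hrefs` and the sample line are made up for illustration, it assumes double-quoted href values, and unlike the command above it does not require a slash in the URL.

```shell
# Hypothetical helper: print the value of every double-quoted href
# found in <a ...> tags on stdin, one per line.
bare_hrefs() {
  awk -F'[<>]' '{
    for (i = 1; i <= NF; i++) {
      # A field is the text between "<" and ">", e.g.: a href="http://x/"
      if ($i ~ /^a[ \t].*href="[^"]*"/) {
        url = $i
        sub(/^.*href="/, "", url)   # drop everything up to the opening quote
        sub(/".*$/, "", url)        # drop the closing quote and anything after
        if (url != "") print url
      }
    }
  }'
}

# Illustrative input: two links on one line, one of them relative.
printf '%s\n' '<li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a class="x" href="/web/kalendarium.nsf">Cal</a></li>' | bare_hrefs
# prints:
#   http://www.bth.se/jobb
#   /web/kalendarium.nsf
```

Empty values such as `href=""` are skipped. A URL that itself contains `>` would be cut short, because the field separator splits there.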

Aia · January 16, 2016, 2:16pm

Any href:

perl -nle 'while(/(href="[^"]*")/g){print $1}' curl_href

[...]
href="#"
href="#"
href="#"
href="#"
href="#"
href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
href="http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
href="http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
href="http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]

Hrefs starting with / or http:

perl -nle 'while(/href=("(?:http|\/)[^"]*")/g){print $1}' curl_href

[...]
"http://greencharge.se/?p=5691"
"/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
"/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
"http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
"/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
"/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
"/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
"/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
"http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
"http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
"http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]
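
For readers without Perl, roughly the same two lists can be produced with grep and sed. This is a sketch, assuming a grep that supports `-o` (GNU and BSD grep do; it is not required by POSIX) and double-quoted href values. The function names and the sample line are made up for illustration.

```shell
# Hypothetical helpers; both read HTML on stdin.

# Every double-quoted href attribute, like the first Perl one-liner.
any_href() {
  grep -o 'href="[^"]*"'
}

# Only values starting with "http" or "/", printed with their quotes
# but without the leading href=, like the second Perl one-liner.
http_or_rooted_href() {
  grep -oE 'href="(http|/)[^"]*"' | sed 's/^href=//'
}

sample='<a href="#">top</a> <a href="/web/x">x</a> <a href="http://www.bth.se/bib">lib</a>'

printf '%s\n' "$sample" | any_href
# prints:
#   href="#"
#   href="/web/x"
#   href="http://www.bth.se/bib"

printf '%s\n' "$sample" | http_or_rooted_href
# prints:
#   "/web/x"
#   "http://www.bth.se/bib"
```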

Only domain names:

perl -nle 'while(m|href="(http://[^/"]*)|g){print $1}' curl_href

[...]
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.youtube.com
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://singingsingapore.wordpress.com
http://singingsingapore.wordpress.com
[...]

Unique domain names:

perl -nle 'while(m|href="(http://[^/"]*)|g){$sites{$1}++}END{for(keys %sites){print $_}}' curl_href

http://twitter.com
http://www.bth.se
http://greencharge.se
http://www.flickr.com
http://singingsingapore.wordpress.com
http://edu.bth.se
http://careergate.bth.se
http://se.linkedin.com
http://www.youtube.com
http://www.facebook.com

jozo95 · January 16, 2016, 4:14pm

RavinderSingh13, your second awk command works well.

I solved it using this code:

grep -o '<a href="[a-z]\+[^>"]*' results.txt | sed -ne 's/^<a href="\(.*\)/\1/p'
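
The `\+` in that grep pattern is a GNU extension to basic regular expressions. A more portable spelling of the same idea might look like the sketch below; the function name and the sample line are made up for illustration. Because `grep -o` prints each match on its own line, several links on one input line come out separately and the text in front of `href` is discarded. The sed-only attempt from the first post did neither of these. Like the pipeline above, the sketch keeps only values that begin with a lowercase letter, so relative links such as `/web/...` and bare `#` anchors are dropped, and it only sees tags written exactly as `<a href="`.

```shell
# Hypothetical helper: print href values from <a href="..."> tags on stdin,
# keeping only values that start with a lowercase letter.
links_from_anchors() {
  grep -o '<a href="[a-z][a-z]*[^>"]*' | sed 's/^<a href="//'
}

# Illustrative input: one absolute link, one relative link, one anchor.
printf '%s\n' '<li><a href="http://www.bth.se/jobb">Jobs</a></li><li><a href="/web/x">X</a></li><li><a href="#">top</a></li>' | links_from_anchors
# prints:
#   http://www.bth.se/jobb
```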

Unfortunately I don't know Perl yet, but thanks for your input anyway. Much appreciated.

yazu · January 16, 2016, 8:59pm

If you can use lynx, then

lynx -dump URL

produces a good text rendering of a page. Every link on the page is listed in the References section at the end of the output.
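
To get only the link list, lynx also has a `-listonly` option that can be combined with `-dump`. The sketch below is not from the original post. It filters the numbered entries out of dump-style text read from standard input, so it can be tried without lynx installed. The function name and sample text are illustrative, and the layout of the References section is assumed from typical lynx output.

```shell
# Hypothetical helper: print the URL of every numbered entry that follows
# a "References" heading in `lynx -dump`-style text on stdin.
dump_refs() {
  awk '
    NF == 1 && $1 == "References" { in_refs = 1; next }
    in_refs && $1 ~ /^[0-9]+\.$/  { print $2 }
  '
}

# Illustrative text laid out like the end of a lynx dump.
printf '%s\n' 'Some page text [1]Jobs [2]Library' '' 'References' '' \
  '   1. http://www.bth.se/jobb' '   2. http://www.bth.se/bib' | dump_refs
# prints:
#   http://www.bth.se/jobb
#   http://www.bth.se/bib

# With lynx installed, real use might look like:
#   lynx -dump "$URL" | dump_refs
#   lynx -dump -listonly "$URL"
```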