jozo95
1
Hello.
I use curl to fetch a website, and then I want to extract the URLs from curl's output.
I tried both sed and grep, but couldn't figure it out.
I've tried:
sed -n 's/href="\([^"]*\).*/\1/p' results.txt
and grep -o
grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">' results.txt
What pattern should I use, and what's wrong with mine?
EDIT:
Added some of the data I use
EDIT 2:
Removed the data sample because it ruins the thread width, but just curl any website and use that output as data.
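On what goes wrong with the two attempts, here is a minimal sketch on a made-up one-line sample (the HTML and the example.com URLs are hypothetical, not taken from the real page). The sed substitution rewrites only from the first href=" to the end of the line, so it keeps the text before the match and yields at most one URL per line. The grep pattern allows only lowercase letters and a single path segment, so URLs with digits, query strings, or deeper paths are skipped. Letting grep -o isolate each href first, and only then stripping the prefix, appears to avoid both problems:

```shell
# Hypothetical sample: two links on one line, as minified HTML often has.
sample='<li><a href="http://www.example.com/a">A</a></li><li><a href="http://www.example.com/b?x=1">B</a></li>'

# Original sed attempt: the leading text survives and only the first link is kept.
sed_result=$(printf '%s\n' "$sample" | sed -n 's/href="\([^"]*\).*/\1/p')
printf '%s\n' "$sed_result"

# Original grep attempt: the second link is skipped because of "?x=1".
grep_result=$(printf '%s\n' "$sample" | grep -o '<a href="http://[a-z]*.[a-z]*.[a-z]*/[a-z]*">')
printf '%s\n' "$grep_result"

# grep -o prints each match on its own line; sed then strips the href=" prefix.
urls=$(printf '%s\n' "$sample" | grep -o 'href="[^"]*' | sed 's/^href="//')
printf '%s\n' "$urls"
```

With real data, the printf would be replaced by reading results.txt or by piping curl straight into grep.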
Hello jozo95,
Could you please try the following and let me know if it helps?
awk '{match($0,/<a href=\"http.*><img/);A=substr($0,RSTART,RLENGTH-4);if(A){print A;A=""}}' Input_file
Output will be as follows.
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
Thanks,
R. Singh
jozo95
3
That doesn't work.
I get output like this along the way:
g">Studievägledning</a></li><li><a href="http://www.bth.se/jobb">Lediga tjänster</a></li></ul></div><div class="footer-section set"><h5><br></h5><ul class="footer-list"><li><a href="http://www.bth.se/web/ombth.nsf/sidor/organisation">Organisation</a></li><li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="http://careergate.bth.se/">BTH Career Gate</a></li><li><a href="http://www.bth.se/for/Sakerhet.nsf/sidor/593cb6bf948640dac1257f1f00365b42?OpenDocument">I händelse av kris</a></li><li><a href=""></a></li><li><a href=""></a></li></ul> </div></span></div></div></div><div class="footer-info"><a class="footer-logo" href="http://www.bth.se">
Hello jozo95,
Sorry, I hadn't accounted for links without <img, so the pattern didn't match them properly.
Could you please try the following and let me know if it helps?
awk -F"[><]" '{for(i=1;i<=NF;i++){if($i ~ /a href=.*\//){print "<" $i ">"}}}' Input_file
Output will be as follows.
<a href="http://www.bth.se/web/nyheter.nsf/AllaDok?OpenView">
<a href="http://www.bth.se/web/utbildning.nsf/sidor/program?OpenDocument&expand=int">
<a href="https://www.antagning.se/se/triggerlogin?triggerloginurl=/se/mypages">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20161&PtStartTermin=20161&vy=hitta">
<a href="https://www.hogskoleprov.nu ">
<a href="http://www.bth.se/web/nyheter.nsf/sidor/8F5E44896F091A3AC1257E9F0045AC7D?OpenDocument">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?lang=sv&KtTermin=20152&PtStartTermin=20152&vy=hitta&sortering=amne&sortering=installd&grupperingar=1">
<a href="http://www.bth.se/info/ophus.nsf/sidor/oppet-hus-pa-bth">
<a href="http://edu.bth.se/utbildning/utb_sok_resultat.asp?KtTermin=inne&PtStartTermin=inne&KtTyp=SOMM&lang=sv">
<a href="/web/kalendarium.nsf/sidor/52CB572F173DEE64C1257F3400428859?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/2743A2376777BC7BC1257F3400530744?OpenDocument">
<a href="/web/kalendarium.nsf/sidor/0F9CAD034B2DD920C1257F3400533F5A?OpenDocument">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf">
<a href="http://www.bth.se/web/kalendarium.nsf/AllaDok?OpenView">
<a href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument">
<a href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument">
<a href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView">
<a href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument">
<a href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument">
Thanks,
R. Singh
Aia
5
Any href:
perl -nle 'while(/(href="[^"]*")/g){print $1}' curl_href
[...]
href="#"
href="#"
href="#"
href="#"
href="#"
href="/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
href="/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
href="http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
href="/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
href="/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
href="/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
href="/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
href="http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
href="http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
href="http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]
Hrefs starting with / or http:
perl -nle 'while(/href=("(?:http|\/)[^"]*")/g){print $1}' curl_href
[...]
"http://greencharge.se/?p=5691"
"/web/pressmeddelande.nsf/sidor/8422F16DAC76024FC1257F390042E05C?OpenDocument"
"/web/pressmeddelande.nsf/sidor/8410EC0AA8C20BD5C1257F39004301F0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/EA2119AB45CE9648C1257F1E002D44E0?OpenDocument"
"/web/pressmeddelande.nsf/sidor/5992F8120E2655F0C1257F22002CCD89?OpenDocument"
"http://www.bth.se/web/pressmeddelande.nsf/AllaDok?OpenView"
"/web/utmarkelser.nsf/sidor/4CC79392B8F8D211C1257D88003709CB?OpenDocument"
"/web/utmarkelser.nsf/sidor/C5C8D8F87E6EC6DCC1257D39004CE1D0?OpenDocument"
"/web/utmarkelser.nsf/sidor/6121811FF55C891AC1257D8800366D5C?OpenDocument"
"/web/utmarkelser.nsf/sidor/936596A2A8C92FBEC1257D6300322897?OpenDocument"
"http://www.bth.se/web/utmarkelser.nsf/AllaDok?OpenView"
"http://www.bth.se/for/address-book.nsf/addressbook.xsp?lang=sv"
"http://www.bth.se/web/ombth.nsf/sidor/hitta-till-bth"
[...]
Only domain names:
perl -nle 'while(m|href="(http://[^/"]*)|g){print $1}' curl_href
[...]
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.bth.se
http://edu.bth.se
http://www.youtube.com
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://www.bth.se
http://singingsingapore.wordpress.com
http://singingsingapore.wordpress.com
[...]
Unique domain names:
perl -nle 'while(m|href="(http://[^/"]*)|g){$sites{$1}++}END{for(keys %sites){print $_}}' curl_href
http://twitter.com
http://www.bth.se
http://greencharge.se
http://www.flickr.com
http://singingsingapore.wordpress.com
http://edu.bth.se
http://careergate.bth.se
http://se.linkedin.com
http://www.youtube.com
http://www.facebook.com
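For anyone who would rather stay with grep and sed, a rough equivalent of the unique-domain listing can be sketched as follows. The sample text is hypothetical and stands in for the curl_href file; the variable names are only for illustration. Note that sort -u prints the domains in sorted order, whereas the Perl hash prints them in no particular order.

```shell
# Hypothetical sample standing in for the saved curl output (curl_href above).
sample='<a href="http://www.bth.se/jobb">x</a><a href="http://edu.bth.se/a?b=1">y</a>
<a href="/web/local">z</a><a href="http://www.bth.se/bib">w</a>'

# grep -o isolates each absolute http link up to the first / or " after the host,
# sed drops the href=" prefix, and sort -u removes duplicates.
domains=$(printf '%s\n' "$sample" |
  grep -o 'href="http://[^/"]*' |
  sed 's/^href="//' |
  sort -u)
printf '%s\n' "$domains"
```

To run it on the real file, feed curl_href to grep instead of the printf.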
jozo95
6
That works well.
I solved it using this code:
grep -o '<a href="[a-z]\+[^>"]*' | sed -ne 's/^<a href="\(.*\)/\1/p'
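A small sketch of what this pipeline keeps and drops, run on a made-up sample (the HTML below is hypothetical). Because the pattern requires a lowercase letter right after href=", relative links such as /web/x and bare # anchors are skipped, and so is any anchor where href is not the first attribute. The \+ in the grep pattern is a GNU extension to basic regular expressions, so this assumes GNU grep.

```shell
# Hypothetical sample; in practice the input would come from curl.
sample='<li><a href="http://www.bth.se/bib">Bibliotek</a></li><li><a href="/web/x">X</a></li>
<li><a href="#">top</a></li><a class="footer-logo" href="http://www.bth.se">'

# Same pipeline as above: grep isolates the links, sed strips the <a href=" prefix.
links=$(printf '%s\n' "$sample" |
  grep -o '<a href="[a-z]\+[^>"]*' |
  sed -ne 's/^<a href="\(.*\)/\1/p')
printf '%s\n' "$links"
```

Only http://www.bth.se/bib comes out of this sample; the other three anchors are filtered out by the pattern.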
Unfortunately I don't know Perl yet, but thanks for your input anyway; much appreciated.
yazu
7
If you can use lynx, then
lynx -dump URL
produces good text output of a page. Every link on the page is listed in the References section at the end of the output.