awk script to search an html file and output links

kyris · May 9, 2008, 12:15am

Hello. I want to write an awk script that searches an HTML file and outputs all the links inside it (e.g. .html, .htm, .jpg, .doc, .pdf, etc.). I also want the links that are output to be split into 3 groups, separated by an empty line: the first group with links to other web pages (.html, .htm, etc.), the second group with links to images (.jpg, .jpeg), and the third group with links to .pdf, .doc or other downloadable files. Next to each link I want to output how many times it occurs in the HTML file.

please, any help would be greatly appreciated!!

kyris

Franklin52 · May 9, 2008, 5:28am

To write a script in awk you need at least some knowledge of awk. Do you have any?
What have you done to attempt to solve this problem yourself?
Post your sample script, and we'll see how we can assist.

Regards

amatuer_lee_3 · May 9, 2008, 7:07am

I too want some help with the AWK command. I want to output data in a similar way.

I want to have a search using the month and year to output numerical data.

What I know so far is:

$ awk '{print $1, $2, $3, $4}' hits

the result of this is:

123.45.6.7 NOV 2006 1805GMT

Now, for me this code prints out the two columns I want, but I need to figure out a way to create the search criteria for these columns.

The search criteria need to be the month and year, and it should be possible to enter them either numerically or as text.
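
A rough sketch of one way such a filter could look, assuming the month sits in field 2 and the year in field 3, as in the sample line above. The function name hits_for and its argument order are made up for illustration, and the month-number mapping assumes the file uses three-letter English month names:

```shell
# Hypothetical helper: print the lines of a log file that match a month and year.
#   $1 = month, either a name such as NOV or a number 1-12
#   $2 = year
#   $3 = file to search
hits_for() {
    awk -v m="$1" -v y="$2" '
        BEGIN {
            # Map month numbers to the three-letter names assumed to be in the file.
            split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
            if (m ~ /^[0-9]+$/) m = name[m + 0]
            m = toupper(m)
        }
        # Field 2 is the month and field 3 is the year in the sample line.
        toupper($2) == m && $3 == y { print $1, $2, $3, $4 }
    ' "$3"
}
```

With that defined, hits_for NOV 2006 hits and hits_for 11 2006 hits should both select the November 2006 lines.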

any ideas?

Franklin52 · May 9, 2008, 7:17am

amatuer_lee_3,

Please don't hijack someone else's thread; start your own thread if you have a question.

Regards

kyris · May 9, 2008, 12:04pm

No, I haven't done much. I only know a few things... Actually I have just thought about setting FS or RS to something like FS="< >" and then searching within the fields for /http/ or something, and then for /html/. But I don't know a lot, so I just want to do the basics.
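
A minimal sketch of how the RS idea might work, assuming links are written as href="..." with double quotes inside <a> tags. The function name list_hrefs is hypothetical. Setting RS to "<" makes every tag start its own record, so an anchor tag is simply a record that begins with "a":

```shell
# Hypothetical helper: print the target of every <a href="..."> in the given files.
list_hrefs() {
    awk 'BEGIN { RS = "<" }          # each record now starts right after a "<"
         /^[aA][ \t\n]/ {            # record begins with "a " so it is an anchor tag
             # Look for href="..." inside the tag (double quotes assumed).
             if (match($0, /[hH][rR][eE][fF]="[^"]*"/)) {
                 link = substr($0, RSTART, RLENGTH)
                 sub(/^[^"]*"/, "", link)   # drop everything up to the opening quote
                 sub(/"$/, "", link)        # drop the closing quote
                 print link
             }
         }' "$@"
}
```

Called as list_hrefs page.html, this prints one link per line; unquoted or single-quoted attributes are not handled.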

thanks for any help..!

matrixmadhan · May 10, 2008, 5:16am

You could happily do that with awk, sed or perl, but it's difficult to maintain and to make scalable.

Rather, I would suggest an existing CPAN module such as:

HTML::LinkExtractor - Extract links from an HTML document (search.cpan.org)

Some gurus have already written those, so we don't have to reinvent the wheel (I'm not lazy).

kyris · May 10, 2008, 10:36am

The thing is, I need it specifically in awk, not perl!

risby · May 10, 2008, 11:32am

Here's something to start you off:
awk '/href/ {for (i=1; i<=NF; i++) {if ($i ~ /^href/) {print $i}}}' *.html

Take that, extend it, and come back for help when you can show you've made some sort of effort.
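
For reference, one possible direction for extending that one-liner toward the original request (three groups separated by an empty line, with a count next to each link) is sketched below. It is only a sketch under some assumptions: links appear as href="..." or src="..." with double quotes, anything not ending in .html/.htm or .jpg/.jpeg is treated as a download, and the name link_report is made up. Note that the one-liner above prints whole whitespace-separated fields, so tag text stays attached; match() is used here to cut out just the quoted value.

```shell
# Hypothetical helper: list links in three groups (pages, images, other) with counts.
link_report() {
    awk '
    {
        line = $0
        # Pull out every href="..." or src="..." on the line, left to right.
        while (match(line, /([hH][rR][eE][fF]|[sS][rR][cC])="[^"]*"/)) {
            link = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)   # continue after this match
            sub(/^[^"]*"/, "", link)                # strip attribute name and opening quote
            sub(/"$/, "", link)                     # strip closing quote
            if (link == "") continue
            if (!(link in count)) order[++n] = link # remember first-seen order
            count[link]++
        }
    }
    END {
        for (g = 1; g <= 3; g++) {
            if (g > 1) print ""                     # empty line between groups
            for (i = 1; i <= n; i++)
                if (group(order[i]) == g) print order[i], count[order[i]]
        }
    }
    # 1 = web pages, 2 = images, 3 = everything else (assumed to be downloads)
    function group(link,   l) {
        l = tolower(link)
        if (l ~ /\.(html|htm)$/) return 1
        if (l ~ /\.(jpg|jpeg)$/) return 2
        return 3
    }' "$@"
}
```

Usage would be along the lines of link_report page.html. Links are printed in the order they first appear within each group.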

matrixmadhan · May 11, 2008, 4:25am

No offence intended.

I'm just showing you a way that is much easier to implement, maintain and scale.

If it's something to practise awk with, or there is some constraint that it must be done in awk, then I will shut up.

Using supported modules is always a much easier way to get your job done.

All the best