awk script to search an html file and output links

Hello. I want to make an awk script that searches an HTML file and outputs all the links inside it (e.g. .html, .htm, .jpg, .doc, .pdf, etc.). I also want the output split into three groups separated by an empty line: the first group with links to other web pages (.html, .htm, etc.), the second with links to images (.jpg, .jpeg), and the third with links to .pdf, .doc, or other downloadable files. Next to each link I want to print how many times it occurs in the HTML file.

Please, any help would be greatly appreciated!

kyris

To write a script in awk you need at least some knowledge of awk. Do you have any?
What have you done to attempt to solve this problem yourself?
Post your sample script, and we'll see how we can assist.

Regards

I too want some help with the AWK command. I want to output data in a similar way.

I want to search by month and year and output numerical data.

What I have so far is:

$ awk '{print $1, $2, $3, $4}' hits

The result of this is:

123.45.6.7 NOV 2006 1805GMT

Now, this code prints out the two columns I want, but I need to figure out a way to create the search criteria for these columns.

The search criteria need to be the month and date, and it should be possible to enter them either numerically or as text.

Any ideas?
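
One way such a search could look is sketched below. It assumes the month is in column 2 and the year in column 3, as in the line shown above, and that "date" here means the year. The extra lines in the hits file and the function name find_hits are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample of the hits file, modelled on the one line shown above.
cat > hits <<'EOF'
123.45.6.7 NOV 2006 1805GMT
98.76.5.4 OCT 2006 0930GMT
123.45.6.8 NOV 2005 2210GMT
10.0.0.1 NOV 2006 0015GMT
EOF

# find_hits MONTH YEAR [FILE]
# MONTH may be a name (NOV, nov) or a number (11). FILE defaults to "hits".
find_hits() {
    awk -v mon="$1" -v yr="$2" '
    BEGIN {
        split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", name, " ")
        # A purely numeric month is translated to its three-letter name.
        if (mon ~ /^[0-9]+$/) mon = name[mon + 0]
        mon = toupper(mon)
    }
    # Print the four columns of every line whose month and year both match.
    toupper($2) == mon && $3 == yr { print $1, $2, $3, $4 }
    ' "${3:-hits}"
}

find_hits NOV 2006
find_hits 11 2006
```

Both calls select the same two NOV 2006 lines from the sample file. The -v option is what passes the search values from the shell into awk.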

amatuer_lee_3,

Please don't hijack someone else's thread; start your own thread if you have a question.

Regards

No, I haven't done much. I only know a few things. Actually, I have just thought about declaring an FS or an RS with something like FS="< >" and then searching within the fields for /http/ or something, and then for /html/. But I don't know a lot, so I just want to do the basics.

Thanks for any help!
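
A rough sketch of how that FS/RS idea might look, under some assumptions: RS is set to "<" so that every tag starts a new record, and FS is set to a double quote so that the quoted URL becomes field 2. It only catches tags where href or src is the first attribute and is double-quoted. The file page.html and the function name links_by_tag are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample page, only for illustration.
cat > page.html <<'EOF'
<p>See <a href="http://example.com/a.html">A</a> and
<a href="b.html">B</a> or <img src="c.jpg"></p>
EOF

# Each record begins just after a "<". Fields are split on double quotes,
# so for a record such as:  a href="b.html">B   the URL is field 2.
links_by_tag() {
    awk 'BEGIN { RS = "<"; FS = "\"" }
         # Field 1 must be an a or img tag ending in href= or src=
         $1 ~ /^(a|img) .*(href|src)=$/ { print $2 }' "$@"
}

links_by_tag page.html
```

On the sample page this prints the three URLs, one per line, in the order they appear.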

You could happily do that with awk, sed, or perl,

but it's difficult to maintain and to make scalable.

Rather, I would suggest existing CPAN modules like

HTML::LinkExtractor
HTML::LinkExtractor - Extract links from an HTML document - search.cpan.org

Some gurus have already written those :wink: so we don't have to reinvent the wheel (I'm not lazy :wink:)

The thing is, I need it specifically in awk, not perl! :confused:

Here's something to start you off:
awk '/href/ {for (i=1; i<=NF; i++) {if ($i ~ /^href/) {print $i}}}' *.html

Take that, extend it, and come back for help when you can show you've made some sort of effort.
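
For reference, one direction that extension could take is sketched below. It is only a sketch under some assumptions: links are taken from double-quoted, lowercase href="..." and src="..." attributes, and they are grouped by file extension alone, with anything that is not a page or an image landing in the third group. The file sample.html and the function name list_links are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample page, only for illustration.
cat > sample.html <<'EOF'
<html><body>
<a href="index.html">Home</a> <a href="docs/manual.pdf">Manual</a>
<img src="logo.jpg"> <a href="index.html">Home again</a>
<a href="about.htm">About</a> <img src="photo.jpeg">
<a href="report.doc">Report</a>
</body></html>
EOF

# list_links FILE...
# Prints page links, image links and all other links, each followed by its
# occurrence count, as three groups separated by an empty line.
list_links() {
    awk '
    {
        line = $0
        # Find each href="..." or src="..." on the line, left to right.
        while (match(line, /(href|src)="[^"]*"/)) {
            attr = substr(line, RSTART, RLENGTH)
            line = substr(line, RSTART + RLENGTH)
            sub(/^[a-z]+="/, "", attr)   # strip the attribute name and opening quote
            sub(/"$/, "", attr)          # strip the closing quote
            if (!(attr in count)) order[++n] = attr   # remember first-seen order
            count[attr]++
        }
    }
    END {
        for (g = 1; g <= 3; g++) {
            if (g > 1) print ""          # empty line between groups
            for (i = 1; i <= n; i++) {
                url = order[i]
                low = tolower(url)
                if (low ~ /\.html?$/)           grp = 1   # web pages
                else if (low ~ /\.(jpg|jpeg)$/) grp = 2   # images
                else                            grp = 3   # everything else
                if (grp == g) print url, count[url]
            }
        }
    }' "$@"
}

list_links sample.html
```

On the sample page this prints:

index.html 2
about.htm 1

logo.jpg 1
photo.jpeg 1

docs/manual.pdf 1
report.doc 1

Unlike the one-liner above, this uses match() in a loop rather than whitespace-separated fields, so several links on one line are each counted.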

Not to offend :slight_smile:

I'm just showing you a way that is much easier to do, maintain, and scale.

If it's for practice in awk, or there's some constraint that it be done in awk, then I will shut up.

Using supported modules is always a much easier way to get your job done.

All the best :slight_smile: