Investigating web pages in awk

Hello. I want to write an awk script that searches an HTML file and outputs all the links (e.g. .html, .htm, .jpg, .doc, .pdf, etc.) inside it. I also want the output split into three groups, separated by an empty line: the first group with links to other web pages (.html, .htm, etc.), the second with links to images (.jpg, .jpeg), and the third with links to .pdf, .doc, or other downloadable files. Next to each link I want to output how many times it occurs in the HTML file.
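Something like this, say (made-up file names; the number next to each link is its occurrence count):

 index.html 3
 contact.htm 1

 logo.jpg 2

 report.pdf 1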

(I am only doing the links first; once I have cracked this, I will be able to do the other formats easily.)

So far I have got...

BEGIN { FS = " " }              # split each line on whitespace (awk's default)
{
    for (i = 1; i <= NF; i++)
        if ($i ~ /^href/)       # keep fields that start with "href"
            print $i
}

which prints out fields like href="index.html">, but I would like it to print out just index.html, along with the number of times it appears in the web page.

Any help or hints on how I could achieve what I described in the first paragraph would be a great help.

You're on the right track. If you know about the FS variable, do you know about the RS variable?

 awk 'BEGIN { RS = "[<>]"; FS = "\"" } /^a href/ { print $2 }' | sort | uniq -c

That will give you a list of each href in the file, along with the number of times it appears.
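For instance, with a couple of duplicate anchors (a made-up snippet; this needs an awk such as gawk that treats RS as a regular expression, so each tag body becomes one record and the quote-delimited field $2 is the URL):

 echo '<a href="index.html">Home</a> <a href="index.html">Top</a>' |
 awk 'BEGIN { RS = "[<>]"; FS = "\"" } /^a href/ { print $2 }' | sort | uniq -c
       2 index.html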

PS: Older awks won't handle the \" escape above. If you're on Sun, you'll need mawk, nawk, or gawk.
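Once you have the href values, the grouping part of your question is just a matter of testing each URL's extension and printing the groups from the END block. A rough sketch along those lines (which extensions belong in each group beyond .html/.htm, .jpg/.jpeg, and .pdf/.doc is your call; here everything unrecognised falls into the third group):

 BEGIN { RS = "[<>]"; FS = "\"" }

 /^a href/ {
     url = $2
     count[url]++                     # occurrences of each link
     if (url ~ /\.html?$/)            grp[url] = 1   # web pages
     else if (url ~ /\.jpe?g$/)       grp[url] = 2   # images
     else                             grp[url] = 3   # .pdf, .doc, the rest
 }

 END {
     for (g = 1; g <= 3; g++) {
         for (url in count)
             if (grp[url] == g)
                 print url, count[url]
         if (g < 3) print ""          # empty line between groups
     }
 }

Save it as, say, grouplinks.awk and run it with gawk -f grouplinks.awk file.html. The order within each group is whatever order awk walks the array in; pipe the output through sort afterwards if that matters to you.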