Scan for anchor tags in Perl?

sldsand · May 10, 2013, 12:32pm

Hello all,

I have some .html files on my hard drive and am trying to figure out (if it's possible) how to scan the files in the directory for <a> anchor tags to find linked files. I know how to bring the files in with Perl, but only as text. I'm wondering if there's a way to probe the file for information.

Thank you

spacebar · May 11, 2013, 5:46pm

I believe this will find all the <a> tags for you:

Match the characters "<a" literally
Match a word boundary, so that other tags such as <abbr> are not matched
Match any character that is not ">", between zero and unlimited times (the tag's attributes)
Match the character ">" literally
Match any single character, between zero and unlimited times, as few times as possible (lazy)
Match the characters "</a>" literally

Options: case insensitive (i); dot matches newlines (s)

if ( $line =~ m!<a\b[^>]*>.*?</a>!is ) {
    # perform code on match
}
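
If you want the link targets themselves rather than a yes/no match, a loop with the /g modifier can collect every href in a string. This is only a minimal sketch: the helper name extract_hrefs and the sample HTML are made up for illustration, and it assumes the href values are quoted. A regex will miss unusual markup, so for anything serious a real parser such as the CPAN module HTML::LinkExtor is probably the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: return the href value of every <a> tag in a string.
sub extract_hrefs {
    my ($html) = @_;
    my @hrefs;
    # /g steps through every match, /i ignores case, /s lets "." cross newlines.
    # $1 is the opening quote character, $2 is the text between the quotes.
    while ( $html =~ m!<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1!gis ) {
        push @hrefs, $2;
    }
    return @hrefs;
}

# Made-up sample input; the second tag is upper case and spans two lines.
my $sample = qq{<p><a href="page2.html">Next</a> and\n<A\n HREF='http://example.com/'>Example</A></p>};
print "$_\n" for extract_hrefs($sample);
```

Note that this only works across line breaks if the whole file is in one string; when reading line by line, a tag split over two lines would be missed.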

sldsand · May 11, 2013, 6:03pm

Thanks, my problem was bigger than that. I didn't even know how to scan the muck of information inside the .html file. But a day later I thought, why not try the Unix "cat" command, and boom, it gave me the muck I was after. Then I pawed through how to do the same thing in Perl, and finally got to the stuff you're talking about.
I went with

($line =~ /.*<a.*http.*$/)

To get the external links, but I see your code has some value too, which I will be taking a look at. Thanks for your help!
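
For anyone landing here later, the whole job (read every .html file in a directory, keep the external links) might look roughly like the sketch below. The helper name external_links is invented for this example, and it assumes the directory path contains no spaces (glob splits on whitespace), that the hrefs are quoted, and that "external" simply means the URL starts with http:// or https://.

```perl
use strict;
use warnings;

# Hypothetical helper: map each .html file in a directory to the
# http/https URLs found in its <a href="..."> attributes.
sub external_links {
    my ($dir) = @_;
    my %links;    # file name => array ref of URLs
    for my $file ( sort glob("$dir/*.html") ) {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        my $html = do { local $/; <$fh> };    # slurp the whole file, like cat
        close $fh;
        $html = '' unless defined $html;
        # In list context, /g returns every captured URL at once.
        my @urls = $html =~ m!<a\b[^>]*?\bhref\s*=\s*["'](https?://[^"']+)["']!gi;
        $links{$file} = \@urls if @urls;
    }
    return %links;
}

# Scan the directory given on the command line, or the current one.
my $dir   = @ARGV ? $ARGV[0] : '.';
my %found = external_links($dir);
for my $file ( sort keys %found ) {
    print "$file: $_\n" for @{ $found{$file} };
}
```

Slurping each file into one string means a tag that wraps onto a second line is still found, which a line-by-line match would miss.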