"links -dump" output format issue

Hi All,

I searched a lot for this, but to no avail. I have an HTML file, on which I ran:

links -dump file_page.html > text_html.txt

The above command gave me the text of the HTML file with the tags removed. The output looked something like this:

This is a [1]test HTML [2]file.
References

         Visible Links
         1. file:///feed/rss.cgi?ChanKey=PubMedNews
         2. file:///corehtml/query/static/pubmedsearch.xml

The above output means that "test" is hyperlinked to file:///feed/rss.cgi?ChanKey=PubMedNews and "file" is hyperlinked to file:///corehtml/query/static/pubmedsearch.xml.

My question is: can "links" give me output where the [1] and [2] markers are replaced by the actual link targets? That is, I would like my output to look like this:

This is a [file:///feed/rss.cgi?ChanKey=PubMedNews]test HTML [file:///corehtml/query/static/pubmedsearch.xml]file.

Or is there any other way of accomplishing my task?

I am using Linux with Bash.

The following should work; however, it will break the standard-width formatting of links -dump, because the embedded URLs make the lines longer than the width they were wrapped to.

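links -dump file_page.html | perl -e '
# Text lines are collected until the "References" heading is reached.
$on_page=1;
while(<STDIN>){
   $on_page=0 if $on_page && /^References$/;
   push @output,$_ if $on_page;
   # Inside the link list, remember each "N. URL" entry.
   $links{$1}=$2 if $in_links && /^\s+(\d+)\.\s+(.+)$/;
   # Match case-insensitively: the dump above prints "Visible Links".
   $in_links=1 if (!$on_page && /^\s+Visible links\s*$/i);
}
for (@output){
   # Swap each [N] marker for [URL]; unknown numbers are left untouched.
   s/\[(\d+)\]/exists $links{$1} ? "[$links{$1}]" : $&/ge;
   print;
}'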
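
If Perl is not at hand, the same idea can probably be done with awk alone. This is only a rough sketch: the function name embed_links is made up for illustration, and it assumes the dump has the same layout as in the question (a "References" line followed by numbered "N. URL" entries). Unlike the Perl version, it does not look for the "Visible Links" heading.

```shell
# Hypothetical helper: reads `links -dump` output on stdin and prints the
# page text with each [N] marker replaced by [URL].
embed_links() {
    awk '
        # After "References", numbered entries fill the lookup table.
        in_refs && /^[ \t]*[0-9]+\.[ \t]/ {
            num = $1; sub(/\.$/, "", num)
            url = $0; sub(/^[ \t]*[0-9]+\.[ \t]+/, "", url)
            links[num] = url
            next
        }
        # Everything from the "References" line onward is the link list.
        /^References$/ { in_refs = 1; next }
        # Before that, lines are page text and are buffered for later.
        !in_refs { page[++n] = $0 }
        END {
            for (i = 1; i <= n; i++) {
                line = page[i]; out = ""
                # Replace markers left to right; unknown numbers are kept.
                while (match(line, /\[[0-9]+\]/)) {
                    num = substr(line, RSTART + 1, RLENGTH - 2)
                    rep = (num in links) ? "[" links[num] "]" : substr(line, RSTART, RLENGTH)
                    out = out substr(line, 1, RSTART - 1) rep
                    line = substr(line, RSTART + RLENGTH)
                }
                print out line
            }
        }
    '
}
```

Usage would be something like:

links -dump file_page.html | embed_links > text_html.txt

As with the Perl version, the page text has to be buffered until the link list at the end has been read, and the lines will no longer be wrapped to the standard width.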