Perl script to copy contents of a web page

Hi All,

Sorry to ask this question; I am not sure whether it is possible. Please reply if you can.

I need a Perl script (or any Linux-compatible script) to copy the graphical contents of a web page into a WordPad document.

Say, for example, I have a documentation site with 20 pages; I need to copy the contents of those 20 web pages into a single Word file.

My requirement can be met manually with Ctrl+A, Ctrl+C (on the web page) and Ctrl+V (into the Word document). For 20 pages the manual process is workable, but think of 50+ pages or more.

Thanks in advance.

Did you check lynx or curl?

You may take a look at the curl command.
Simple usage:

curl http://www.somesite.com/
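
If you'd rather stay in Perl, here's a minimal equivalent sketch, assuming the LWP::Simple module is installed (the URL is just a placeholder):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;            # provides get()

my $url  = 'http://www.somesite.com/';                 # placeholder URL
my $html = get($url) or die "Could not fetch $url\n";
print $html;                # raw HTML source of the page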

Hi thanhdat,

I am not looking for the source code of the web page; I need the graphical contents of the site. Sorry if I am wrong.

Question: why do you need the visual content of a page (with 50+ printed pages) in a word processor document? If it's for offline reading, why not convert it to PDF? If it's for archiving, why not save the HTML source?

Answer: You are right, Pludi. Redhat.com has some (70+) web pages that are purely documentation. These are the web pages I need.

So, my question is: I need these web pages as a Word document (or a PDF is fine) for printing. Instead of printing them page by page, I need a script to copy these (70+) pages locally to my machine and print them.

This is the URL I am trying to do this with.

"Red Hat Customer Portal - Access to 24x7 support and knowledge"

Thanks in advance.

Erm, why? Open the page in any web browser, select File->Print, and off you go.
Or, if you cannot print from the machine accessing the Internet but have CUPS available, there's CUPS-PDF.

Hi Pludi,

So you want me to print each page manually? Is there no other way, such as a script to make a local copy of all the pages as a single document?

Sorry, I didn't see that RH went with putting each section on a page of its own. But by clicking around I found a link here that points to a PDF of it.

OK, that's fine, Pludi. But consider sites that don't provide PDFs; what should we do in those scenarios?

I can't think of an easy way to do this, since HTML doesn't have any definite way of marking where the actual content of a page starts and where it ends. The RH pages, for example, include multiple navigation links at the top and the bottom, which might be easy to filter out here, but that might break on another page.

The hard way would be to use a parser, parse the HTML, filter out everything above or below certain elements (which have to be unique), and write the result out again. But this, again, will fail as soon as the beginning and the end aren't unique, or if the page itself isn't HTML but XML with XSLT.
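
For what it's worth, here's a rough sketch of that approach in Perl, assuming the LWP::Simple and HTML::TreeBuilder modules are installed, and assuming every page keeps its main text in an element with id="content" (that id is a guess and will differ from site to site):

#!/usr/bin/perl
# Rough sketch: pull one "unique" element out of each page and
# concatenate the results into a single HTML file that can then be
# opened in a browser or word processor and printed.
# The id="content" filter below is an assumption; adjust it per site.
use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

my @urls = @ARGV;                   # pass the page URLs on the command line

print "<html><body>\n";
for my $url (@urls) {
    my $html = get($url)
        or do { warn "Could not fetch $url, skipping\n"; next };
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $main = $tree->look_down( id => 'content' );    # the "unique" element
    if ($main) {
        print $main->as_HTML, "\n";
    } else {
        warn "No id=\"content\" element found in $url\n";
    }
    $tree->delete;                  # free the parse tree
}
print "</body></html>\n";

Run it as, say, perl extract.pl URL1 URL2 ... > combined.html (extract.pl is just a name I made up), then open combined.html and print it. As soon as the chosen element isn't unique, or the page isn't plain HTML, it falls apart exactly as described above.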

In short: if you do encounter such a site, there may well be a printable version on the site itself, or the author might be able to provide one.
