Download PDFs using wget and convert to txt

cmccabe · August 13, 2014, 10:17am

wget -i genedx.txt

The command above will download multiple PDF files from a site, but how can I download them and convert them to .txt?

I have attached the master list (genedx.txt, which contains the URLs and file names) as well as the two PDFs that are downloaded. I would like those two files to end up as text files. Thank you.

achenle · August 13, 2014, 10:45am

pdftotext

cmccabe · August 13, 2014, 10:48am

Is that a separate command, or can it be used with the wget command? Thanks.

Corona688 · August 13, 2014, 2:08pm

It is a separate command, which -- like any other separate command -- you can use with wget, either by piping the output or by feeding the resulting file into it once wget is done.
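
A minimal sketch of the second option (convert once wget is done), assuming genedx.txt holds one URL per line and that pdftotext (from Poppler or Xpdf) is installed. The function name fetch_and_convert is made up for illustration:

```shell
# Hypothetical helper: download every URL in a list, then convert the PDFs.
fetch_and_convert() {
    # -i tells wget to read the URLs from the given file
    wget -i "$1" || return 1
    for f in *.pdf; do
        [ -e "$f" ] || continue          # the glob matched nothing
        pdftotext "$f" "${f%.pdf}.txt"   # e.g. a.pdf -> a.txt
    done
}
```

Calling `fetch_and_convert genedx.txt` in an otherwise empty directory should leave one .txt file next to each downloaded PDF.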

cmccabe · August 14, 2014, 10:11am

So would the command be:

 wget -i genedx.txt | info_sheet_ube.pdf Info_Sheet_XomeDx.pdf

And where do I download pdftotext? Thanks.

Corona688 · August 14, 2014, 11:18am

No, pipes do not work that way.
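
As a rough illustration: the right-hand side of a pipe has to be a command that reads standard input, not a file name. In this sketch the URLs and the function name count_pdf_urls are invented for the example:

```shell
# Hypothetical helper: count the lines on standard input that end in .pdf.
count_pdf_urls() {
    grep -c '\.pdf$'
}

# printf writes two made-up URLs into the pipe; count_pdf_urls reads them.
printf '%s\n' 'http://example.com/a.pdf' 'http://example.com/b.html' | count_pdf_urls
```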

What you would actually do depends on the contents of genedx.txt, and what you want to do with it.

Here is the second Google hit.

cwchen123 · August 17, 2014, 7:48am

After installing PDFMiner, do the batch conversion with a for loop. This has nothing to do with pipes.

$ for f in *.pdf; do pdf2txt.py "$f" > "${f%.pdf}.txt"; done
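
For reference, the `${f%.pdf}` expansion in the loop removes a trailing `.pdf` from the value of f, which is how each output name is derived. A small sketch, with a made-up function name txt_name:

```shell
# Hypothetical helper: derive the .txt name for a given PDF file name.
txt_name() {
    # ${1%.pdf} strips a trailing ".pdf" from the first argument
    printf '%s\n' "${1%.pdf}.txt"
}

txt_name "Info_Sheet_XomeDx.pdf"   # prints Info_Sheet_XomeDx.txt
```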

cmccabe · September 9, 2014, 1:35pm

So just change to the directory containing the 4 PDF files:

 cd "C:\Users\cmccabe\Desktop\PDF"

followed by:

 for f in *.pdf; do pdf2txt.py "$f" > "${f%.pdf}.txt"; done

Thanks.