Pdf to text

cmccabe · September 12, 2014, 4:13pm

Is there a way to use the pdftotext utility to convert all the PDF files in a given directory?

So instead of one at a time:

pdftotext hp-manual.pdf hp-manual.txt

a directory of 50 pdf files would be converted:

 pdftotext /home/dnascopev/Desktop/PDF.pdf /home/dnascopev/Desktop/PDF.txt

Thank you.

blackrageous · September 12, 2014, 4:22pm

Take a look at the portable bitmap utilities, like pdftopbm, and at gocr, the tool that converts images to text. You could convert to pbm, jpg, etc., and then use gocr to get the text.

I am not sure if gocr works on pdf files, but if not you can use pdftopbm.

cmccabe · September 12, 2014, 4:26pm

pdftotext works great for converting PDF files to text, but it only seems to do one at a time. Can the command be modified to handle a directory? Thanks.

Don_Cragun · September 12, 2014, 4:54pm

If you have the source for pdftotext, you can change it to do anything you want. If you don't have the source, or if you want a simple solution, write a shell script that calls pdftotext for each PDF file in your current directory:

for file in *.pdf
do      pdftotext "$file" "${file%.pdf}".txt
done
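
The ${file%.pdf} expansion in the loop above removes a trailing .pdf from the file name, which is how each output name is derived. A minimal sketch, using the file name from the original question (the variable out is introduced here only for illustration):

```shell
file="hp-manual.pdf"
# ${file%.pdf} strips a trailing ".pdf"; ".txt" is then appended.
out="${file%.pdf}.txt"
printf '%s\n' "$out"    # prints: hp-manual.txt
```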

cmccabe · September 12, 2014, 5:00pm

 for file in *.pdf
do      pdftotext "$file" "${file%.pdf}".txt
done

So, if the directory is /home/dnascopev/Desktop/PDF, are you saying I can put that in the shell script, or each PDF name, and if so where? Thank you :).

Corona688 · September 12, 2014, 5:30pm

You would use cd to change directory.

Also I'd use [pP][dD][fF] in case any of the file names have wonky capitalization in the extension.

cd /home/dnascopev/Desktop/PDF
for file in *.[pP][dD][fF]
do
...
done
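
To see which names the bracket pattern accepts, the same pattern can be tried in a case statement. This is only a sketch: the function name is_pdf_name and the sample file names are made up for illustration.

```shell
# Hypothetical helper: succeeds when the name ends in .pdf in any letter case.
is_pdf_name() {
    case $1 in
    *.[pP][dD][fF]) return 0 ;;
    *)              return 1 ;;
    esac
}

# Sample names (made up) to show which ones the pattern picks up.
for name in report.pdf SCAN.PDF Notes.Pdf readme.txt
do
    if is_pdf_name "$name"
    then    echo "$name: matched"
    else    echo "$name: skipped"
    fi
done
```

The first three names are reported as matched and readme.txt as skipped.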

Don_Cragun · September 12, 2014, 5:47pm

Sorry. By posting in the Shell Programming and Scripting forum, I assumed that you knew how to write and run a shell script.

Making more wild assumptions:

you are using a UNIX or Linux system,
you have more than one directory that contains files you want to process,
you have a bin directory in your home directory, and
$HOME/bin is in your command search path:

then create a file named pdftotextdir in $HOME/bin containing:

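#!/bin/ksh
# Require exactly one argument: the directory to process.
if [ $# -eq 1 ]
then    cd "$1" || exit 1       # stop if the directory cannot be entered
else    printf 'Usage: %s directory\n' "${0##*/}" >&2
        exit 1
fi
# Convert each PDF (any letter case in the extension) to a .txt file of the same name.
for file in *.[Pp][Dd][Ff]
do      pdftotext "$file" "${file%.[Pp][Dd][Ff]}".txt
done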

(If you don't have a Korn shell, you can change /bin/ksh to /bin/bash or the pathname of any shell that understands the POSIX-required shell variable expansions.)

Then issue the command:

chmod +x $HOME/bin/pdftotextdir
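
The assumption that $HOME/bin is in your command search path can be checked before going further. A minimal sketch, in which the function name in_path is hypothetical:

```shell
# Hypothetical helper: succeeds when directory $1 is one of the entries in $PATH.
in_path() {
    # Wrapping both sides in colons lets the first and last entries match too.
    case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *)        return 1 ;;
    esac
}

if in_path "$HOME/bin"
then    echo "$HOME/bin is in PATH"
else    echo "$HOME/bin is not in PATH"
fi
```

If it is not in PATH, the script can still be run by giving its full pathname, $HOME/bin/pdftotextdir.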

Then you can run your new utility to use pdftotext on every PDF file in whatever directory you want to process by issuing the command:

pdftotextdir directory

which for your latest request would be:

pdftotextdir /home/dnascopev/Desktop/PDF
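
One case the loop does not cover: if the directory holds no PDF files, the shell leaves the pattern unexpanded and pdftotext is called once with the literal text *.[Pp][Dd][Ff]. A possible variant that skips that case is sketched below. The function name convert_dir and the PDFTOTEXT override variable are assumptions made for illustration, not part of pdftotext itself.

```shell
#!/bin/sh
# Hypothetical helper: convert every PDF in directory $1.
# PDFTOTEXT is an assumed override that defaults to the real pdftotext command.
convert_dir() {
    (
        # The subshell keeps the caller's working directory unchanged.
        cd "$1" || exit 1
        for file in *.[Pp][Dd][Ff]
        do
            # An unmatched pattern is left as literal text; skip it.
            [ -e "$file" ] || continue
            "${PDFTOTEXT:-pdftotext}" "$file" "${file%.[Pp][Dd][Ff]}.txt"
        done
    )
}
```

It would be called as:

convert_dir /home/dnascopev/Desktop/PDF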