How to find a watermark in a PDF

rahulkav · October 23, 2008, 9:25am

Hi All,

I have a set of PDF invoices, some of which carry a watermark, and my task is to find out which ones have it without actually printing them.
Is there any way to do this in Unix? strings is not helping.
Any idea how I could read the binary file and grep for the watermark? The watermark contains the text "XYZ".

Regards...

jim_mcnamara · October 23, 2008, 9:43am

PDF files can be in compressed format, in which case this will not help at all. Depending on the PDF engine, the X, Y and Z can also be written as separate operations. For the simple case:

for file in *.pdf
do
   grep -l '(XYZ)Tj$' "$file"
done

You could also try:

for file in *.pdf
do
   # test grep's exit status directly; backticks would capture its (empty) output
   if grep -q '(X)Tj$' "$file" &&
      grep -q '(Y)Tj$' "$file" &&
      grep -q '(Z)Tj$' "$file" ; then
       echo "$file"
   fi
done

This last one is an EXTREMELY inefficient method, since it can scan each file up to three times, but one character at a time is usually the way watermarks are generated.

rahulkav · October 23, 2008, 9:56am

Thanks for your answer. Could you explain how '(X)Tj$' works? A PDF is a binary file, isn't it? On my version it gives the following error:
grep: illegal option -- q
Usage: grep -hblcnsviw pattern file . . .

It is SunOS 5.10.

Regards,.
Rahul

jim_mcnamara · October 23, 2008, 10:11am

Figures... the default grep on Solaris 10 (SunOS 5.10) isn't the POSIX one. If it is installed, /usr/xpg4/bin/grep should accept -q.
grep -q is the 'silent' way to look for a pattern: it suppresses the output but returns a status value ($?) that indicates whether it succeeded or not. See if one of your options does that.

( stuff in here )Tj is how the show-text operator, the PDF counterpart of the PostScript show command, is written in uncompressed PDF files. It displays "stuff in here".

'(X)Tj$' is the pattern for the command that prints the single character 'X', again in uncompressed format.
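
To make the pattern concrete, here is a small sketch that writes a fragment shaped like an uncompressed PDF content stream to a scratch file and greps it. The fragment, the font name /F1, the coordinates and the scratch file name are all made up for illustration; a real PDF has a lot more around it.

```shell
# Hypothetical sample: a fragment shaped like an uncompressed content stream.
sample=${TMPDIR:-/tmp}/tj_sample.$$
cat > "$sample" <<'EOF'
BT
/F1 48 Tf
100 400 Td
(XYZ)Tj
ET
EOF

# In a basic regular expression the parentheses are literal characters,
# and $ anchors the Tj operator to the end of the line.
if grep '(XYZ)Tj$' "$sample" >/dev/null; then
    tj_result="watermark text found"
else
    tj_result="watermark text not found"
fi
echo "$tj_result"
rm -f "$sample"
```

Some generators put whitespace between the string and the operator, so if nothing matches it may be worth trying a looser pattern such as '(XYZ) *Tj' before giving up.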

If you literally cannot read your PDF in a text viewer because it is full of really weird characters, then it is compressed and this method will not work.
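
One rough way to check this without opening each file is to look for the /FlateDecode filter name, which is what a PDF normally declares on zlib-compressed streams. This is only a sketch: is_compressed is a made-up helper name, other filters (for example /LZWDecode) exist, and a file can mix compressed and uncompressed streams.

```shell
# is_compressed FILE: succeed if FILE declares Flate-compressed streams.
# (Hypothetical helper name, not a standard tool.)
is_compressed() {
    # LC_ALL=C makes grep treat the file as plain bytes.
    LC_ALL=C grep '/FlateDecode' "$1" >/dev/null 2>&1
}

for file in *.pdf
do
    [ -f "$file" ] || continue      # skip when no PDFs are present
    if is_compressed "$file"; then
        echo "$file: compressed streams, grepping for Tj will likely miss text"
    else
        echo "$file: no FlateDecode filter seen"
    fi
done
```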

vimes · October 23, 2008, 10:34am

Instead of -q on Solaris you can just do:

grep value file >/dev/null

Just test grep's exit status directly in your if, without backticks or [[ ]] (0 = match found, 1 = no match, anything higher = error).

So in your code it'd be like:

   if grep '(X)Tj$' "$file" >/dev/null ; then

And so on...
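
Put together, the whole loop without -q might look like the sketch below. has_pattern and has_xyz_glyphs are hypothetical helper names, and it still assumes uncompressed files where each glyph gets its own Tj.

```shell
# has_pattern PATTERN FILE: succeed if FILE has a line matching PATTERN.
# Output is discarded, so only the exit status matters (no -q needed).
has_pattern() {
    grep "$1" "$2" >/dev/null 2>&1
}

# has_xyz_glyphs FILE: succeed only if X, Y and Z each appear as their
# own Tj operand somewhere in FILE.
has_xyz_glyphs() {
    has_pattern '(X)Tj$' "$1" &&
    has_pattern '(Y)Tj$' "$1" &&
    has_pattern '(Z)Tj$' "$1"
}

for file in *.pdf
do
    [ -f "$file" ] || continue      # skip when no PDFs are present
    if has_xyz_glyphs "$file"; then
        echo "$file"
    fi
done
```

The && chain stops at the first pattern that is missing, so files without the watermark are usually scanned only once.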

jim_mcnamara · October 23, 2008, 10:41am

The real rationale behind the -q option is efficiency: it stops searching at the first hit. It's there to say 'Hey, this pattern is/is not in the file' with the least overhead.

fpmurphy · October 23, 2008, 1:37pm

Most text-to-PDF converters, e.g. easyPDF, the QuickBooks PDF converter, etc., use (typically compressed) stream objects and do not store the "text" within the file in a form you can grep for.
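
For files like that, one option is to let a PDF-aware tool decode the streams and grep its output instead. The sketch below assumes pdftotext (from xpdf or poppler) is installed; has_watermark_text and the PDF_TO_TEXT variable are made-up names, the latter only there so another extractor can be swapped in. Whether the watermark shows up in the extracted text depends on how it was drawn: an image watermark, or glyphs placed one at a time, may not come out as a contiguous "XYZ".

```shell
# Extractor command; pdftotext is assumed to be installed.
PDF_TO_TEXT=${PDF_TO_TEXT:-pdftotext}

# has_watermark_text FILE TEXT: succeed if TEXT appears in the text
# extracted from FILE. (Hypothetical helper name.)
has_watermark_text() {
    # "-" as the output name makes pdftotext write to standard output.
    "$PDF_TO_TEXT" "$1" - 2>/dev/null | grep "$2" >/dev/null
}

for file in *.pdf
do
    [ -f "$file" ] || continue      # skip when no PDFs are present
    if has_watermark_text "$file" 'XYZ'; then
        echo "$file"
    fi
done
```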

jim_mcnamara · October 23, 2008, 1:47pm

FWIW - Metavante, Extreme, and other high-end products that produce PDF, PS, etc. let you control whether the text is compressed, stream, PDF v1.1, PDF v1.2, etc., so that you can do text processing if you want.

We routinely program PDF and PS templates to fully automate customer FAX and email requests out of a CIS system. Works great.