I have a few PDF files, some of which have a watermark on them, and my task is to find out which invoices have the watermark without actually printing them.
Is there any way to do this in Unix? strings is not helping.
Any idea how I would read a binary file and grep for the watermark? The watermark contains the text "XYZ".
PDF files can be stored in a compressed format, so this may not help at all, and depending on the PDF engine the X, Y and Z can be written as separate strings:
for file in *.pdf
do
grep -l '(XYZ)Tj$' "$file"
done
You could also try:
for file in *.pdf
do
    # grep -q prints nothing; test its exit status directly
    if grep -q '(X)Tj$' "$file" &&
       grep -q '(Y)Tj$' "$file" &&
       grep -q '(Z)Tj$' "$file"
    then
        echo "$file"
    fi
done
This last one is an EXTREMELY inefficient method, since it scans each file up to three times, but it matches the way watermarks are usually generated: one character at a time.
Thanks for your answer. Could you explain how '(X)Tj$' works? A PDF is a binary file, isn't it? On my version of grep it gives the following error:
grep: illegal option -- q
Usage: grep -hblcnsviw pattern file . . .
Figures. The stock grep on Solaris 5.10 isn't close to POSIX.
grep -q is the 'silent' way to look for a pattern: it suppresses the display but returns a status value ($?) to indicate whether it succeeded or not. See if one of your options does that.
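If none of the listed options is silent, the same effect can be had by throwing the output away and testing the exit status. This is only a rough sketch: has_pattern is a made-up helper name, and it assumes your grep returns 0 on a match and non-zero otherwise, which is the usual behaviour.

```shell
#!/bin/sh
# has_pattern PATTERN FILE   (hypothetical helper, not a standard command)
# Succeeds (exit status 0) if PATTERN occurs in FILE and prints nothing.
# Output is discarded by redirection instead of relying on grep -q.
has_pattern() {
    grep "$1" "$2" >/dev/null 2>&1
}

# List every PDF in the current directory that contains "(XYZ)Tj"
# at the end of a line.
for file in *.pdf
do
    [ -f "$file" ] || continue    # skip the literal *.pdf when nothing matches
    if has_pattern '(XYZ)Tj$' "$file"
    then
        echo "$file"
    fi
done
```

Unlike -q, this reads the whole file rather than stopping at the first hit. On Solaris you may also find a POSIX-style grep at /usr/xpg4/bin/grep that accepts -q, if that package is installed on your box.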
( stuff in here )Tj is the way a PostScript-style ()show command is written in an uncompressed PDF file. It displays "stuff in here".
'(X)Tj$' is the pattern for the command that prints the single character 'X', again in uncompressed format.
If you literally cannot read your PDF because it is full of really weird characters, then it is compressed and this method will not work.
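Rather than eyeballing each file, you can get a rough idea of which ones are compressed by looking for the /FlateDecode filter name, which PDF writers normally put in the dictionary of a deflate-compressed stream. Treat this as a heuristic sketch: looks_compressed is a made-up helper name, other filters (LZWDecode, for example) are not checked, and a file can mix compressed and uncompressed streams.

```shell
#!/bin/sh
# looks_compressed FILE   (hypothetical helper, heuristic only)
# Succeeds if FILE mentions the /FlateDecode stream filter, which
# suggests that at least some of its streams are compressed.
looks_compressed() {
    grep '/FlateDecode' "$1" >/dev/null 2>&1
}

for file in *.pdf
do
    [ -f "$file" ] || continue    # skip the literal *.pdf when nothing matches
    if looks_compressed "$file"
    then
        echo "$file: compressed streams found, grep for (XYZ)Tj is unreliable"
    else
        echo "$file: no /FlateDecode found, plain grep may work"
    fi
done
```

Files reported as compressed would have to be uncompressed with a PDF tool before the Tj patterns above have any chance of matching.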
The real rationale behind the -q option is to be more efficient - it stops searching on the first hit. It's there to say 'Hey this pattern is/is not in the file' with the least overhead.
Most text-to-PDF converters, e.g. easyPDF, the QuickBooks PDF converter, etc., use stream objects and do not store the "text" within the file in a format which you can grep for.
FWIW - Metavante, Extreme, and other high-end products that produce PDF, PS, etc. let you control whether the text is compressed, stream, PDF v1.1, PDF 1.2, etc., so that you can do text processing if you want.
We routinely program pdf and ps templates for fully automating customer FAX and email requests out of a CIS system. Works great.