Scanning a pdf file in Linux shell

SK33 · September 12, 2015, 7:30am

I want to search for a keyword in a list of PDF files, and when I find a match I want to write the title and author of that PDF file to another file. How can I do this with a Linux shell script?

RavinderSingh13 · September 12, 2015, 7:50am

Hello sk33,

Welcome to the forums. The following may help.
I- If you want to search for a word, say test, in the current directory and everything below it, the following may help:

 find . -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null

II- If you want to search under a specific path, the following may help:

 find /tmp/test/Singh/weekend -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null

Hope this helps. Welcome to the forum again, and have a nice weekend.

Thanks,
R. Singh

SK33 · September 12, 2015, 8:48am

But how do I print the title and author's name of the matched PDF files?

protocomm · September 12, 2015, 11:09am

Convert your PDF file to text with the command:

pdftotext

and parse the title and author's name.
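
A minimal sketch of that approach, under some assumptions: pdftotext and pdfinfo (both from poppler-utils) are installed, the keyword is treated as a fixed string, and the title and author are taken from the PDF metadata that pdfinfo reports rather than parsed from the page text. The function name report_matches and the tab-separated output layout are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: write path, title and author of every PDF whose text contains
# a keyword. Assumes pdftotext and pdfinfo (poppler-utils) are installed.
# report_matches is a hypothetical helper name, not part of any tool.
# Limitation: file names containing newlines are not handled.

report_matches() (
    keyword=$1 dir=$2 out=$3
    : > "$out"                          # start with an empty report
    find "$dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        # "-" makes pdftotext write the extracted text to stdout
        if pdftotext "$pdf" - 2>/dev/null | grep -qF -- "$keyword"; then
            info=$(pdfinfo "$pdf" 2>/dev/null)
            # pdfinfo prints lines such as "Title:   ..." and "Author:   ..."
            title=$(printf '%s\n' "$info" | sed -n 's/^Title:[[:space:]]*//p')
            author=$(printf '%s\n' "$info" | sed -n 's/^Author:[[:space:]]*//p')
            printf '%s\t%s\t%s\n' "$pdf" "${title:-unknown}" "${author:-unknown}" >> "$out"
        fi
    done
)
```

For example, report_matches test . matches.tsv would scan the current directory for the word test and write one line per matching file to matches.tsv. PDFs that carry no Title or Author metadata show up as unknown.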

drl · September 12, 2015, 12:11pm

Hi.

Possibly:

pdfgrep - search in pdf files for strings matching a regular expression
 Pdfgrep is a tool to search text in PDF files. It works similar to
 `grep'.

 Features:
  - search for regular expressions.
  - support for some important grep options, including:
    + filename output.
    + page number output.
    + optional case insensitivity.
    + count occurrences.
  - and the most important feature: color output!

Seen in the repository for:

OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.1 (jessie)

See also: https://pdfgrep.org/

Good luck ... cheers, drl

drl · September 13, 2015, 1:22pm

Hi.

Here is a demonstration of pdfgrep :

#!/usr/bin/env bash

# @(#) s1	Demonstrate search PDF, regular expressions, pdfgrep.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f "$C" ] && "$C" pdfgrep

FILE=${1-pdfgrep.pdf}

pl " Input data file $FILE (a sample pdf file, as created by pandoc):"
file "$FILE"

pl " Results:"
# Search the file named on the command line, not a hard-coded name.
pdfgrep --color never "AUTHOR|NAME" "$FILE"

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.1 (jessie) 
bash GNU bash 4.3.30
pdfgrep (local) 1.3.1

-----
 Input data file pdfgrep.pdf (a sample pdf file, as created by pandoc):
pdfgrep.pdf: PDF document, version 1.5

-----
 Results:
NAME pdfgrep - search pdf files for a regular expression
AUTHOR Hans-Peter Deifel

Of minor importance: the pdf is of the man page for pdfgrep itself. It was created by man writing a text file and then pandoc creating the pdf.

Note that protocomm's suggestion would allow you to use the full power of your native grep, which may be a significant advantage in some cases.
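
The pdfgrep demonstration above searches a single file. A small sketch of running it over several files follows, using the count and case-insensitivity options from the feature list (-c and -i). The function name count_matches and the tab-separated output are hypothetical. The sketch assumes pdfgrep prints a plain number, or file:number, for each input.

```shell
#!/bin/sh
# Sketch: print each PDF that contains the keyword, with its match count.
# Assumes pdfgrep is installed; count_matches is a hypothetical helper name.

count_matches() (
    keyword=$1; shift
    for pdf in "$@"; do
        # -c prints the number of matches, -i ignores case
        n=$(pdfgrep -c -i "$keyword" "$pdf" 2>/dev/null)
        n=${n##*:}                      # drop a "file:" prefix if present
        case $n in ''|*[!0-9]*) continue ;; esac   # skip unreadable files
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$pdf" "$n"
        fi
    done
)
```

Calling count_matches author *.pdf would list only the files in which the word occurs, one per line.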

Best wishes ... cheers, drl

drl · September 13, 2015, 2:52pm

Hi.

In reviewing this, I'm wondering if the OP was interested in the PDF meta-information. The first and last lines of one rendition of a PDF look like:

%PDF-1.1
1 0 obj
<<
/CreationDate (D:20150913125458)
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)
   ---
/Root 2 0 R
/Info 1 0 R
>>
startxref
6452
%%EOF

In which case, a simple grep would probably suffice:

$ egrep 'Producer|Title' pdf-from-text2pdf.pdf
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)

as has been posted by several responders here. I don't know enough about PDFs to say that Producer is/might be the same as Author. However, some PDFs seem to have binary data, so grep might not work as desired on those.
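
For PDFs laid out like the sample above, the grep idea can be sketched as a small function. In the PDF Info dictionary, /Author is its own key, separate from /Producer, so the sketch looks for /Title and /Author. It assumes the dictionary is stored uncompressed, one key per line, with plain parenthesised strings. Many PDFs do not meet those assumptions, and for them a metadata tool such as pdfinfo is the safer choice. The name raw_pdf_info is hypothetical, and the -a option (treat binary data as text) assumes GNU or BSD grep.

```shell
#!/bin/sh
# Sketch: pull /Title and /Author out of an uncompressed Info dictionary.
# raw_pdf_info is a hypothetical helper name. It only works when each key
# sits on its own line as "/Key (value)".

raw_pdf_info() {
    # -a makes grep treat the file as text even if it holds binary data
    LC_ALL=C grep -aE '^/(Title|Author) ?\(' "$1" |
        LC_ALL=C sed -E 's#^/(Title|Author) ?\((.*)\)[[:space:]]*$#\1: \2#'
}
```

Run as raw_pdf_info pdf-from-text2pdf.pdf, this prints lines of the form Title: ... and Author: ... for whichever of the two keys the file contains, and nothing at all when the metadata is compressed or encoded differently.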

Best wishes ... cheers, drl

drl · September 19, 2015, 10:21am

Hi.

SK33 also started a thread on this at LQ: "Scanning a pdf file in linux shell".

Best wishes ... cheers, drl