I want to search for a keyword in a list of PDF files, and when I find a match I want to write the title and author of that PDF file to another file. How can I do this with a Linux shell script?
Hello sk33,
Welcome to the forums. The following may help.
I- If you want to search for a word, say test, in the current directory and everything below it, the following may help.
find . -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null
II- If you want to search under a specific path instead, the following may help.
find /tmp/test/Singh/weekend -type f -name "*.pdf" -exec grep -l "test" {} \+ 2>/dev/null
Hope this helps. Welcome to the forum again and have a nice weekend.
Thanks,
R. Singh
But how will I print the title and author's name of the matched PDF files?
Convert your PDF file to text with the command:
pdftotext
and parse the title and author's name from the result.
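A minimal sketch of how that could look, assuming pdftotext and pdfinfo from poppler-utils are installed. It takes the title and author from the PDF metadata that pdfinfo reports, not from the page text, so it only works when the document actually carries those fields. The names info_field, report_matches and pdfmeta.sh are made up for illustration:

```shell
#!/usr/bin/env bash
# Sketch: record the Title and Author of every PDF under a directory whose
# text contains a keyword. Assumes pdftotext and pdfinfo (poppler-utils).

# Print the value of one field (e.g. Title) from pdfinfo output on stdin.
info_field() {
  sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}

# Usage: report_matches KEYWORD DIR OUTFILE
report_matches() {
  keyword=$1 dir=$2 out=$3
  : > "$out"    # start with an empty output file
  find "$dir" -type f -name '*.pdf' -print | while IFS= read -r pdf; do
    # "-" makes pdftotext write the extracted text to standard output.
    if pdftotext "$pdf" - 2>/dev/null | grep -q -- "$keyword"; then
      info=$(pdfinfo "$pdf" 2>/dev/null)
      title=$(printf '%s\n' "$info" | info_field Title)
      author=$(printf '%s\n' "$info" | info_field Author)
      # One tab-separated line per match: file, title, author.
      printf '%s\t%s\t%s\n' "$pdf" "${title:-unknown}" "${author:-unknown}" >> "$out"
    fi
  done
}

# Run only when three arguments are supplied,
# e.g.: ./pdfmeta.sh test . results.txt
if [ "$#" -eq 3 ]; then
  report_matches "$1" "$2" "$3"
fi
```

PDFs without a Title or Author entry are written as "unknown", and file names containing newlines are not handled.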
Hi.
Possibly:
pdfgrep - search in pdf files for strings matching a regular expression
Pdfgrep is a tool to search text in PDF files. It works similar to `grep'.
Features:
- search for regular expressions.
- support for some important grep options, including:
+ filename output.
+ page number output.
+ optional case insensitivity.
+ count occurrences.
- and the most important feature: color output!
Seen in the repository for:
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.1 (jessie)
See also: https://pdfgrep.org/
Good luck ... cheers, drl
Hi.
Here is a demonstration of pdfgrep:
#!/usr/bin/env bash
# @(#) s1 Demonstrate search PDF, regular expressions, pdfgrep.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C pdfgrep
FILE=${1-pdfgrep.pdf}
pl " Input data file $FILE (a sample pdf file, as created by pandoc):"
file "$FILE"
pl " Results:"
pdfgrep --color never "AUTHOR|NAME" "$FILE"
exit 0
producing:
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.1 (jessie)
bash GNU bash 4.3.30
pdfgrep (local) 1.3.1
-----
Input data file pdfgrep.pdf (a sample pdf file, as created by pandoc):
pdfgrep.pdf: PDF document, version 1.5
-----
Results:
NAME pdfgrep - search pdf files for a regular expression
AUTHOR Hans-Peter Deifel
Of minor importance: the pdf is of the man page for pdfgrep itself. It was created by man writing a text file and then pandoc creating the pdf.
Note that protocomm's suggestion would allow you to use the full power of your native grep, which may be a significant advantage in some cases.
Best wishes ... cheers, drl
Hi.
In reviewing this, I'm wondering if the OP was interested in the PDF meta-information. The first and last lines of one rendition of a PDF look like:
%PDF-1.1
1 0 obj
<<
/CreationDate (D:20150913125458)
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)
---
/Root 2 0 R
/Info 1 0 R
>>
startxref
6452
%%EOF
In which case, a simple grep would probably suffice:
$ egrep 'Producer|Title' pdf-from-text2pdf.pdf
/Producer (text2pdf v1.1 (\251 Phil Smith, 1996))
/Title (pdfgrep.txt)
as has been posted by several responders here. I don't know enough about PDFs to say that Producer is/might be the same as Author. However, some PDFs seem to have binary data, so grep might not work as desired on those.
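For files that do contain binary data, one thing to try is the -a option, which tells GNU grep to treat the file as text. A rough sketch, assuming the meta-information is stored uncompressed as plain /Key (value) entries like the ones shown above. That is not true of every PDF, so an empty result does not prove the fields are absent. The function name pdf_info_keys is made up for illustration:

```shell
# Print lines holding /Title, /Author or /Producer entries from a PDF.
# -a: process the file as text even if it contains binary streams.
# -E: extended regular expressions, as with egrep.
pdf_info_keys() {
  grep -a -E '/(Title|Author|Producer)[[:space:]]*\(' -- "$1"
}

# Example call (file name taken from the post above):
# pdf_info_keys pdf-from-text2pdf.pdf
```

A match prints the whole line, so any binary bytes on that line are printed as well.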
Best wishes ... cheers, drl
Hi.
See also a thread started by sk33 at LQ: "Scanning a pdf file in linux shell".
Best wishes ... cheers, drl