Searching for a string in .PDF files inside .RAR & .ZIP archives.

lewk · March 28, 2011, 1:50am

Hi,

I have a large number of .PDF files archived in .RAR and .ZIP files across various directories, and I would like to search for strings inside the PDF files.

I think you would need something that can recursively read directories, extract each .RAR/.ZIP file in memory, read each PDF in memory, search the PDF for the given string, and display the result along with the .RAR/.ZIP filename and the PDF it was found in. It should then discard everything to /dev/null, so that you don't sit with everything extracted on your hard drive after the script is done, and move on to the next .RAR/.ZIP file, and so on until done.

Are there any shell scripting wizards who could assist me with this?

Thanks

pkabali · March 28, 2011, 2:08am

This is my first post, so I hope I don't screw up.

I think this should work:

mkdir testfolder
cp test.zip testfolder/
cd testfolder/
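unzip test.zip
# -print has to follow the -exec test, so find prints only the files in which grep found a match
find . -type f -exec grep -q teststring {} \; -print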
cd ..
rm -rf testfolder/

You would have to insert a statement to unpack the .RAR files.
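
The same idea can be extended to both archive types and to whole directory trees. What follows is only a minimal sketch, not a tested solution: it assumes unzip and unrar are installed, search_archives is a made-up function name, and it greps the raw extracted files, so text stored in compressed PDF streams may well be missed.

```shell
# Sketch only: search_archives is a hypothetical name; assumes unzip and unrar are installed.
# Usage: search_archives DIRECTORY STRING
search_archives() {
    local root pattern archive tmp hit
    root=$1
    pattern=$2
    # Walk the tree for archives (file names containing newlines are not handled).
    find "$root" -type f \( -iname '*.zip' -o -iname '*.rar' \) |
    while IFS= read -r archive; do
        # Unpack into a throwaway directory instead of next to the archive.
        tmp=$(mktemp -d) || continue
        # </dev/null keeps the extractors from swallowing the list of archives on stdin.
        case $archive in
            *.[zZ][iI][pP]) unzip -qq -o "$archive" -d "$tmp" </dev/null ;;
            *.[rR][aA][rR]) unrar x -inul -y "$archive" "$tmp/" </dev/null ;;
        esac
        # -r: recurse, -l: print only the names of matching files, -F: fixed string.
        grep -rlF -- "$pattern" "$tmp" |
        while IFS= read -r hit; do
            printf '%s: %s\n' "$archive" "${hit#"$tmp"/}"
        done
        # Remove the extracted files before moving on to the next archive.
        rm -rf "$tmp"
    done
}
```

Called as, say, search_archives /path/to/archives teststring (the path is a placeholder), it should print one line per hit in the form "archive: file inside the archive", and nothing stays extracted once it finishes.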

lewk · March 28, 2011, 2:43am

Welcome pkabali! There are some very knowledgeable people on here.

I think what you are suggesting is good, but I am not sure how well find and grep cope with the contents of a .PDF. I am still searching, but there is probably a CLI app that can read a .PDF and print its text on the command line.

I found another script that goes in the direction of what I am looking for; I am just asking the guy for permission to post it here.

pkabali · March 28, 2011, 3:06am

Great!

I knew the moment I posted that this issue might come up :o . You might want to look into the pdftotext utility, though the overhead might be too much.
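
For what it is worth, pdftotext can write the extracted text to standard output when the output file is given as "-", so it can be piped straight into grep without leaving a text file behind. A rough sketch, assuming pdftotext (from poppler-utils or xpdf) is installed; pdf_grep is a made-up helper name:

```shell
# Sketch only: pdf_grep is a hypothetical helper; assumes pdftotext is installed.
# Usage: pdf_grep STRING FILE.pdf [FILE.pdf ...]
pdf_grep() {
    local pattern status pdf
    pattern=$1
    shift
    status=1
    for pdf in "$@"; do
        # "-" sends the extracted text to stdout, -q silences pdftotext's messages.
        if pdftotext -q "$pdf" - 2>/dev/null | grep -qF -- "$pattern"; then
            printf '%s\n' "$pdf"
            status=0
        fi
    done
    # Like grep: return 0 if at least one file matched, 1 otherwise.
    return "$status"
}
```

Something like pdf_grep teststring *.pdf would then list only the PDFs whose text contains the string. Every PDF is converted in full on each run, which is the overhead mentioned above.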