I have a large number of PDF files that are archived in .RAR and .ZIP files in various directories, and I would like to search for strings inside the PDF files.
I would think you need something that can recursively read the directories, extract each .RAR/.ZIP file in memory, read each PDF in memory, search it for the given string, and display the result along with the .RAR/.ZIP filename and the PDF it was found in. Everything extracted should then be discarded (to /dev/null, so to speak) so that you don't sit with it all on your hard drive after the script is done, and the script should move on to the next .RAR/.ZIP file until done.
Are there any shell scripting wizards who could assist me with this?
Welcome pkabali! There are some very knowledgeable people on here.
I think what you are suggesting is good, but I am not sure how well find reads PDF metadata. I am still searching, but there is probably a CLI app that can read a PDF and print it on the command line.
I found another script that goes in the direction of what I am looking for; I am just asking the author for permission to post it here.
I knew the moment I posted that this issue might come up :o. You might want to look into the pdftotext utility, although the overhead might be too much.
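For what it is worth, here is a rough sketch of the pipeline described in the first post, written in Python rather than shell because the standard zipfile module can read archive members straight into memory. It is only an illustration under some assumptions: the function names (pdf_to_text, search_zip_archives) are made up for this example, it assumes the poppler build of pdftotext is on the PATH (that build accepts "-" for stdin and stdout), it matches case-insensitively, and it only handles .ZIP files. RAR support is left out; something like "unrar p" printing a member to stdout could probably feed the same text extractor, but that is untested here.

```python
import os
import subprocess
import zipfile


def pdf_to_text(pdf_bytes):
    """Turn PDF bytes into text by piping them through pdftotext.

    Assumption: poppler's pdftotext is installed; "-" "-" means
    read the PDF from stdin and write the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-", "-"],
        input=pdf_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")


def search_zip_archives(root, needle, extract_text=pdf_to_text):
    """Yield (archive_path, pdf_name) for every PDF inside a .zip under
    root whose extracted text contains needle (case-insensitive)."""
    needle = needle.lower()
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.lower().endswith(".zip"):
                continue
            archive_path = os.path.join(dirpath, filename)
            try:
                with zipfile.ZipFile(archive_path) as archive:
                    for member in archive.namelist():
                        if not member.lower().endswith(".pdf"):
                            continue
                        # The member is read into memory only; nothing
                        # is extracted to disk.
                        text = extract_text(archive.read(member))
                        if needle in text.lower():
                            yield archive_path, member
            except zipfile.BadZipFile:
                # Skip files that have a .zip name but are not valid archives.
                continue
```

To try it, loop over search_zip_archives("/path/to/archives", "some string") and print each (archive, PDF) pair it yields. The extract_text parameter is there so a different converter can be swapped in if pdftotext turns out to be too slow for a large collection.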