Shell Script to Dynamically Extract File Content Based on Parameters from a PDF File

DIps · May 2, 2013, 10:39pm

Hi Gurus,
I am new to shell scripting and have an unusual requirement.
The system generates a single PDF file (/tmp/ABC.pdf) containing invoices for multiple customers. The layout is something like this:
Page 1 >> Customer 1 >> Invoice 1 + Invoice 2 >> Page 1 end
Page 2 >> Customer 2 >> Invoice 3 + Invoice 4 >> continues on Page 3 >> Page 3 end

I have to email each customer an individual statement, based on the customer number/email address that appears in the file.
How can I achieve this using a shell script?

Thanks in advance.
Regards,
Dips

DGPickett · May 7, 2013, 4:28pm

Start with something like pdf2txt so shell tools can see the pdf strings.
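
Once the text is visible, the customer number or email address on each page can be picked out with a regular expression and used to decide which pages belong together. The following is only a minimal Python sketch: the label formats ("Customer No:" / "Customer Number:"), the helper names find_customer_key and group_pages, and the idea that the text arrives as one string per page are all assumptions to check against the real dump.

```python
import re

# Hypothetical label formats; adjust after inspecting the real text dump.
CUSTOMER_RE = re.compile(r"Customer\s+(?:Number|No)\b\.?\s*:?\s*(\w+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_customer_key(page_text):
    """Return the customer number on a page, else an email address, else None."""
    match = CUSTOMER_RE.search(page_text)
    if match:
        return match.group(1)
    match = EMAIL_RE.search(page_text)
    return match.group(0) if match else None


def group_pages(page_texts):
    """Map each customer key to the list of page indices that belong to it.

    A page with no key is treated as a continuation of the previous page.
    """
    groups = {}
    current = None
    for index, text in enumerate(page_texts):
        key = find_customer_key(text) or current
        if key is None:
            continue  # leading pages before any customer is seen are skipped
        groups.setdefault(key, []).append(index)
        current = key
    return groups
```

With the layout from the question (customer 1 on page 1, customer 2 on pages 2 and 3), group_pages would return page index 0 for the first customer and indices 1 and 2 for the second, because the continuation page carries no customer label of its own.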

Chubler_XL · May 8, 2013, 12:38am

If you have Python on your system, you could try the PyPDF2 library.

This assumes the last page of each invoice contains some text you can match on, such as "Total Due:".

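#!/usr/bin/env python3
# Split a multi-customer invoice PDF into one file per customer.
# Written for Python 3 and the PdfReader/PdfWriter API of recent PyPDF2
# releases; older releases named these PdfFileReader/PdfFileWriter.
import sys
from PyPDF2 import PdfReader, PdfWriter

MARKER = "Total Due:"  # text assumed to appear on the last page of each invoice

def save(writer, filenum):
    # Write the pages collected so far to Cust_<n>.pdf.
    with open("Cust_%d.pdf" % filenum, "wb") as out:
        writer.write(out)

reader = PdfReader(sys.argv[1])
writer = PdfWriter()
filenum = 1
page_added = False

for page in reader.pages:
    writer.add_page(page)
    page_added = True
    # The marker ends the current statement: write it out and start a new one.
    if MARKER in (page.extract_text() or ""):
        save(writer, filenum)
        filenum += 1
        writer = PdfWriter()
        page_added = False

# Write any trailing pages that never reached a marker.
if page_added:
    save(writer, filenum)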

Note: indentation is part of Python syntax, so make sure you keep the indent levels correct. Call the script like this:

$ ./split_invoice.py Invoice_file.pdf
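
Once the split has produced files such as Cust_1.pdf, the mailing step from the original question could be handled with Python's standard email and smtplib modules. This is a rough sketch, not a finished mailer: the function names, the subject and body text, and the SMTP host "localhost" are assumptions, and the recipient address would still have to be looked up for each customer.

```python
import smtplib
from email.message import EmailMessage


def build_statement_email(sender, recipient, pdf_bytes, filename):
    """Return an EmailMessage carrying one customer's statement as a PDF attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your statement"  # placeholder wording
    msg.set_content("Please find your statement attached.")
    msg.add_attachment(pdf_bytes, maintype="application", subtype="pdf",
                       filename=filename)
    return msg


def send_statement(msg, host="localhost"):
    """Hand the message to an SMTP server; the host is an assumption."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

A caller would read each Cust_N.pdf in binary mode, pass the bytes to build_statement_email, and then call send_statement. Keeping message construction separate from sending makes it possible to inspect the message before anything is actually mailed.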

DGPickett · May 8, 2013, 1:31pm

See what the structure is after you dump it to text, or have Python tell you what it finds. Then we can figure out how to reprocess that.
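
One way to have Python report what it finds is to print a one-line summary per page. The sketch below assumes the text has already been extracted into a list with one string per page; summarize_pages and its markers parameter are illustrative names, and "Total Due:" is just the marker suggested earlier in the thread.

```python
def summarize_pages(page_texts, markers=("Total Due:",)):
    """Return one summary line per page: first non-blank line and markers found."""
    lines = []
    for number, text in enumerate(page_texts, start=1):
        stripped = [ln.strip() for ln in text.splitlines() if ln.strip()]
        first = stripped[0] if stripped else "(no text extracted)"
        found = [m for m in markers if m in text]
        lines.append("page %d: %r markers=%s" % (number, first, found))
    return lines


if __name__ == "__main__":
    # Made-up sample text, only to show the shape of the report.
    for line in summarize_pages(["Customer 1\nTotal Due: 10", ""]):
        print(line)
```

A page that reports "(no text extracted)" is worth a closer look, since it may be a scanned image with no text to match on.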