Shell script to dynamically extract content from a PDF file based on parameters

Hi Gurus,
I am new to shell scripting and have an unusual requirement.
The system generates a single PDF file (/tmp/ABC.pdf) containing invoices for multiple customers. The layout is something like this:
Page 1 >> Customer 1 >> Invoice 1 + Invoice 2 >> end of page 1
Page 2 >> Customer 2 >> Invoice 3 + Invoice 4 >> continues on page 3 >> end of page 3

I have to email each customer their own statement, based on the customer number/email address that appears in the file.
How can I achieve this using a shell script?

Thanks in Advance.
Regards,
Dips

Start with something like pdf2txt to convert the PDF to text, so that shell tools can see the strings in it.

If you have Python on your system, you could try the PyPDF2 library.

The assumption is that the last page of each invoice contains some text you can match on, such as "Total Due:".

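#!/usr/bin/env python
# Split a multi-customer invoice PDF into one file per invoice run.
# Uses the PyPDF2 1.x/2.x names; PyPDF2 3.x removed them in favour of
# PdfReader/PdfWriter.
import sys
from PyPDF2 import PdfFileReader, PdfFileWriter

MARKER = "Total Due:"  # text assumed to appear on the last page of each invoice


def write_pdf(writer, filenum):
    # open() replaces the Python 2-only file() builtin
    with open("Cust_" + str(filenum) + ".pdf", "wb") as output_stream:
        writer.write(output_stream)


filenum = 1
page_added = False
output_pdf = PdfFileWriter()

# The input file must stay open until every output file has been written.
with open(sys.argv[1], "rb") as input_stream:
    input_pdf = PdfFileReader(input_stream)

    for i in range(input_pdf.getNumPages()):
        page = input_pdf.getPage(i)
        output_pdf.addPage(page)
        page_added = True

        if MARKER in page.extractText():
            # Last page of this invoice: write it out and start a new file
            write_pdf(output_pdf, filenum)
            filenum += 1
            output_pdf = PdfFileWriter()
            page_added = False

    # Write any trailing pages that never hit the marker
    if page_added:
        write_pdf(output_pdf, filenum)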

Note: indentation is part of Python syntax, so make sure you keep the indent levels correct. Call the script like this:

$ ./split_invoice.py Invoice_file.pdf
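
The split files still have to be mailed out, which the script does not do. Here is a minimal sketch of that step using only the standard library, assuming you already know which address goes with which Cust_N.pdf. The helper names build_invoice_mail and send_mail, the subject line and the SMTP host are placeholders of mine, not anything from the original post:

```python
import smtplib
from email.message import EmailMessage


def build_invoice_mail(sender, recipient, pdf_bytes, filename):
    """Build a message with one PDF attached (hypothetical helper)."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Your invoice statement"  # placeholder subject
    msg.set_content("Please find your statement attached.")
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=filename)
    return msg


def send_mail(msg, host="localhost"):
    """Hand the message to an SMTP relay; the host is an assumption."""
    with smtplib.SMTP(host) as smtp:
        smtp.send_message(msg)
```

You would read each Cust_N.pdf in binary mode, pass the bytes to build_invoice_mail, and then call send_mail on the result. Building and sending are kept separate so you can inspect the message before anything leaves the machine.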

See what the structure is after you dump it to text, or have Python tell you what it finds. Then we can figure out how to reprocess that.
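
Once you have the text of each page, something along these lines could pick out the customer number and email address and work out which pages belong to whom. This is only a sketch: the "Customer No:" label and both regular expressions are guesses at what the dump will contain, and group_pages is a name I made up, so adjust them to the real text:

```python
import re

# Hypothetical patterns: change them to match the real dumped text.
CUSTOMER_RE = re.compile(r"Customer\s+No\s*:\s*(\w+)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def group_pages(page_texts):
    """Map customer number -> {"email": ..., "pages": [page indexes]}.

    A page without a customer number is treated as a continuation
    of the previous customer's statement.
    """
    groups = {}
    current = None
    for index, text in enumerate(page_texts):
        match = CUSTOMER_RE.search(text)
        if match:
            current = match.group(1)
        if current is None:
            continue  # no customer identified yet; skip leading pages
        entry = groups.setdefault(current, {"email": None, "pages": []})
        entry["pages"].append(index)
        email = EMAIL_RE.search(text)
        if email and entry["email"] is None:
            entry["email"] = email.group(0)
    return groups
```

Feed it a list with one string per page, in page order. The page indexes it returns are the ones you would pass to getPage() when building each customer's PDF, and the email value is where that PDF should go.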