Working with OCR text inside PDF files

dorcas · January 15, 2009, 5:14pm

I'm trying to find a way to automate cleanup of OCR for a large number of scanned pages. Due to limitations of the access mechanism where these are to end up, I need to create PDF files that include the background text for searching.

Going in, I have TIFF images too dirty to OCR and re-keyed text that matches page for page. From reading here I can see plenty of ways to turn the TIFF files into PDF; what I can't find is a way to put this text into the PDF file. I'm guessing this calls for some reverse-engineering of whatever mapping scheme PDF uses for the coordinates of words or characters. Does anyone know of a tool for getting access to this text, for writing as well as reading? I'm looking at pdftk, but so far all I can get is a dump of the "metadata" fields, not the text with position mapping...
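
For reference, here is a rough sketch of what the coordinate mapping looks like at the PDF level, as far as the PDF specification describes it. Text is drawn by operators in the page's content stream: `BT`/`ET` bracket a text object, `Tf` selects a font and size, `Tm` sets the position, `Tj` shows a string, and text rendering mode 3 (`3 Tr`) makes the text invisible, which is how a searchable layer can sit on top of a scanned image. The function names and the font resource name `F1` are assumptions for illustration, and this only builds the operator text. Embedding it in a real PDF page (with a matching font in the page resources) still needs a PDF library or tool, which is not shown.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_stream(words, font_name="F1"):
    """Build a content-stream fragment that places each word invisibly.

    `words` is a list of (text, x, y, font_size) tuples. x and y are in PDF
    user-space units (points), with the origin at the bottom-left of the page.
    `font_name` is a hypothetical resource name that the page would have to define.
    """
    lines = ["BT", "3 Tr"]  # begin text object; rendering mode 3 = invisible
    for text, x, y, size in words:
        lines.append(f"/{font_name} {size} Tf")
        # Tm sets an absolute text matrix, so every word is positioned on its own
        lines.append(f"1 0 0 1 {x} {y} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")  # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up word positions, purely for illustration
    print(invisible_text_stream([("scanned", 72, 700, 12), ("(page)", 120, 700, 12)]))
```

Note that scanned images usually measure from the top-left corner in pixels, so word positions taken from a TIFF would need to be scaled by 72/DPI and flipped vertically before being used here.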

fpmurphy · January 16, 2009, 9:18am

Are you looking to map each word in the manually generated page text to its corresponding position in the OCR image of the page?

dorcas · January 16, 2009, 9:29am

There is an XML-based metadata standard for this called METS-ALTO, but what I'm trying to get at is the proprietary one that is inside a PDF: the piece of the PDF file that is created when you run OCR within Adobe Acrobat.
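
To show the kind of word-to-position mapping I mean, here is a minimal sketch of reading it out of ALTO. In ALTO each word is a `String` element carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample document and the function name `alto_words` are made up for illustration, and the units of the coordinates depend on the measurement unit declared in the real file, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Invented sample data, trimmed down to the elements that matter here
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Scanned" HPOS="120" VPOS="300" WIDTH="210" HEIGHT="40"/>
    <SP/>
    <String CONTENT="page" HPOS="350" VPOS="300" WIDTH="120" HEIGHT="40"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_words(xml_text):
    """Return a list of (content, hpos, vpos, width, height) for each String."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        # Drop the "{namespace}" prefix so the schema version does not matter
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        words.append((
            elem.get("CONTENT"),
            float(elem.get("HPOS")),
            float(elem.get("VPOS")),
            float(elem.get("WIDTH")),
            float(elem.get("HEIGHT")),
        ))
    return words


if __name__ == "__main__":
    for word in alto_words(SAMPLE_ALTO):
        print(word)
```

What I'm after is the equivalent of this list for the text layer inside the PDF, in both directions.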