I'm trying to automate OCR cleanup for a large number of scanned pages. Because of limitations in the access mechanism where these will end up, I need to create PDF files that include hidden background text for searching.
Going in, I have TIFF images too dirty to OCR and re-keyed text that matches them page for page. From reading here I can see plenty of ways to turn the TIFF files into PDF; what I can't find is a way to put this text into the PDF file. I'm guessing this calls for some reverse-engineering of whatever scheme PDF uses to map the coordinates of words or characters. Does anyone know of a tool that gives access to this text, for writing as well as reading? I'm looking at pdftk, but so far all I can get is a dump of the "metadata" fields, not the text with its position mapping...
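
To make concrete what I'm after: as far as I understand, text in a PDF page is placed by operators in the page's content stream, with coordinates in points (1/72 inch) measured from the bottom-left corner, and text rendering mode 3 draws text invisibly. Below is a rough Python sketch of the fragment I'd want to generate per page. The function names, the `(text, x_px, y_px)` input format, and the `/F1` font name are my own assumptions for illustration, not something a particular tool provides, and I haven't worked out how to get such a fragment into a finished PDF.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(words, dpi, page_height_px, font_size=10):
    """Build content-stream operators that place each word invisibly.

    words: iterable of (text, x_px, y_px) tuples, where the pixel origin is
    the top-left of the scan and y_px is the word's baseline (hypothetical
    input format). Assumes plain ASCII text and that the page resources
    define a font under the name /F1.
    """
    scale = 72.0 / dpi  # default PDF user space is 72 units per inch
    lines = ["BT", f"/F1 {font_size} Tf", "3 Tr"]  # 3 Tr = invisible text
    for text, x_px, y_px in words:
        x_pt = x_px * scale
        # Flip the vertical axis: scans count down from the top,
        # PDF counts up from the bottom.
        y_pt = (page_height_px - y_px) * scale
        # Tm sets an absolute text position, so word order doesn't matter.
        lines.append(f"1 0 0 1 {x_pt:.2f} {y_pt:.2f} Tm")
        lines.append(f"({escape_pdf_string(text)}) Tj")
    lines.append("ET")
    return "\n".join(lines)
```

For a 300 dpi scan that is 3300 pixels tall, a word whose baseline starts at pixel (300, 300) would land at (72.00, 720.00) in PDF units. What I still need is a tool that can read and write this kind of positioned text in an existing file.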