Limitations of 'pdftotext' in Linux...

kenlenard · November 20, 2019, 4:33pm

Guys: I have a customer using the 'pdftotext' utility under Linux. PDFs are received via email, converted to text, etc. and it has worked nicely for years. They received a PDF from a customer and the utility will not read it. The text file is created but it's either empty or has 1-2 bytes of garbage in it. Acrobat renders the document correctly. I did a FILE -> SAVE AS TEXT inside of Acrobat and the same thing happened... an empty file. I tried it on another PDF and it worked. So why would pdftotext have an issue with a certain PDF? Could it be encrypted? Most of these files are 17 to 20kb in size. The one that will not process is a whopping 426kb. Is there a size limitation? I have used pdftotext but I have only 'used it'... I am not an expert on all its abilities. Thanks for reading and thanks for the help.

EDIT: I can upload the PDF here if anyone is interested. Thx.
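
A minimal sketch of how the conversion step could flag this symptom (an empty file, or one holding only a byte or two of garbage) automatically. The function name `looks_empty` and the 16-character threshold are hypothetical choices made for illustration; they are not part of pdftotext.

```python
def looks_empty(text: str, min_chars: int = 16) -> bool:
    """Return True when extracted text has too few readable characters.

    min_chars is an arbitrary illustrative threshold, not a pdftotext setting.
    """
    # Count only characters a person could read: printable and not whitespace.
    # Form feeds (page separators) and other control characters are not counted.
    visible = sum(1 for ch in text if ch.isprintable() and not ch.isspace())
    return visible < min_chars
```

In the mail-processing script this could be called on the contents of the text file right after pdftotext returns, so that a silent failure is routed to a person instead of being processed as an empty document.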

jim_mcnamara · November 21, 2019, 5:55am

No. The file itself has problems. If you can read it with Acrobat, try saving the file under a different filename. If Acrobat cannot save it, there is not much you can do.

apmcd47 · November 21, 2019, 6:12am

This is probably a stupid question, but is your PDF rendering text or an image, such as a scanned-in page, or text rendered to an image before being imported into the PDF? I'm not quite sure how you can check that. If your copy of Acrobat has the capability, can you perform an OCR on the document?

Andrew

Neo · November 21, 2019, 6:52am

The file is more than likely corrupted, but we need more info.

Have you run that PDF file through a PDF checker, like this one:

https://www.datalogics.com/products/pdftools/pdf-checker/

That site will provide you a zip file with the text results.

kenlenard · November 21, 2019, 9:37am

Thank you all for the reply. I did wonder if the entire document is an image but it's very hard to tell. It does not look like a scan. There are some barcode-like images in it as well as some text that is overlapping which is a bit unusual. I will try to save it to another filename. I will also attach it here in case any of you would like to see it.

kenlenard · November 21, 2019, 9:39am

Here is the PDF in question. Thanks again. I will report back after some further testing.

kenlenard · November 21, 2019, 9:46am

Mmm, apparently my version of Acrobat will not let me check the accessibility of the document. I feel like the barcodes are NOT the issue because we have PDFs come in with logos, etc. and those items are just ignored by pdftotext and the actual readable text is what comes out in the text file. Just the sheer size of the file suggests that it could be an image/scan which I could see causing problems... no actual text exists in the document. Also, I saved the document in Acrobat as TTS_NEW2.pdf and it saved without issue.

Neo · November 21, 2019, 11:36am

Sorry, but for security reasons I cannot permit you to post a PDF attachment which may be corrupted.

Did you follow my recommendation to run it through a PDF checker?

apmcd47 · November 21, 2019, 11:36am

As I said in my earlier post, you will have to see if you have a way to convert to text, such as the "Edit Text and Images" menu option in Acrobat.

Andrew

Neo · November 21, 2019, 11:52am

As I said in my earlier post... and I am the founder and lead admin here:

Did you follow my recommendation to run it through a PDF checker?

Note: Do not post PDF files here which our team has indicated they believe may be corrupted, especially if you have not validated the PDF is not corrupted. Thanks.

kenlenard · November 21, 2019, 2:46pm

Guys, wow... I did not know that I should not upload a PDF. I apologize. I did not expect that it would get the original thread closed. I was up looking at this issue late last night and did not run the PDF checker until this afternoon. I was not able to check the accessibility of the PDF using my version of Acrobat. I have the PDF checker results. Is that file permissible to upload or is there something from the report that I can post here? The results suggest that the file is a normal PDF but using this utility is out of my experience.

Neo · November 21, 2019, 9:17pm

It is important that you and everyone who posts here follow our instructions.

I asked you directly to run your PDF file (a file which you indicated had issues, and which our team members advised may have issues) through a PDF integrity checker, and you did not do that, and then uploaded it to our site.

You must follow moderator instructions, and especially admin instructions. This is a requirement, not an option.

So, to be frank, I do not understand your "Guys, wow..." gee-whiz reply. I am the creator, lead admin and the person responsible for the integrity of this site for nearly two decades. If I ask you to run your file through a PDF checker, you should do so; but instead, you uploaded a potentially problematic file to this site. It would be less than responsible of me not to delete this file.

But moreover, you did not follow my instructions.

Deleting the PDF which you uploaded and closing your thread was an "act of kindness" on my part, as I did not issue you any infraction, nor did I change your status to read-only for not following my instructions.

This site gets over 1 million visitors a month. Quite frankly, and it is not personal toward you or your good self, I do not have time for those who post here and do not follow my instructions as the admin for this site.

I hope this is clear. Please follow my requests and instructions.

Thank you.

Yes, you can post the results of your file integrity check (in code tags) which in all frankness you should have done before posting the file in the first place, per my request. Thanks.

kenlenard · November 21, 2019, 9:27pm

Okay, here are the PDF_Checker results.

PDF Checker 1.5.0  Copyright 2018-2019 Datalogics, Inc. All Rights Reserved

Thu Nov 21 13:39:51 2019

JSON Profile: everything.json

Input Document: TTSNEW.pdf

File Size: 426 KB

<<=CHECKER_SUMMARY_START=>>
general:born-digital
images:color:resolution-too-low
sizeInBytes:435947
<<=CHECKER_SUMMARY_END=>>

Optimization Assessment
    Document is appropriately optimized

General Results
    Errors:
        None
    Information:
        Document was born digital.  It was produced from PDF authoring software and so it may contain text, images, tables, forms, and other objects.  These types of PDFs typically do not require OCR.
    Checks Completed:
        born-digital
        claims-pdfa-conformance
        claims-pdfe-conformance
        claims-pdfua-conformance
        claims-pdfvt-conformance
        claims-pdfx-conformance
        contains-owner-password
        contains-signature
        damaged
        image-only
        password-protected
        pdf-v2
        unable-to-open
        xfa-type

Userdata Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-annots
        contains-annots-not-for-printing
        contains-annots-not-for-viewing
        contains-annots-without-normal-appearances
        contains-embedded-files
        contains-metadata
        contains-optional-content
        contains-private-data
        contains-transparency

Fonts Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        fontdescriptor-missing-capheight
        fontdescriptor-missing-fields
        uses-base14fonts-not-embedded
        uses-fonts-fully-embedded
        uses-fonts-not-embedded

Objects Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        contains-javascript-actions
        contains-thumbnails

Cleanup Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        suboptimal-compression

Image Results
    Errors:
        None
    Information:
        None
    Checks Completed:
        alternate-images

    Color Images
    Errors:
        None
    Information:
        Low resolution color image(s) present: 
            Total: (1 instance)
    Checks Completed:
        image-depth
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Grayscale Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jpeg2000-compression

    Monochrome Images
    Errors:
        None
    Information:
        None
    Checks Completed:
        resolution-too-high
        resolution-too-low
        uses-jbig2-compression

My apologies again. I have been up working on a number of different emergencies this week until about 3am each night. This PDF issue is just one problem I am having at the moment and my attention is divided. I'm not trying to rile anyone up. Thank you again for looking at this.

Neo · November 21, 2019, 9:45pm

No worries. I completely understand the stress of working many IT coding issues at once and juggling many balls all up in the air at the same time.

I will reopen the original thread and merge this post into it.

OBTW, I am not "riled" or upset or angry in any way. I am like an "admin robot"... I just ensure this site is healthy, running fast and smooth, protect the site from harm, and ensure our mission, rules and guidelines are followed.

SOAP BOX COMMENT: Sidebar (not specific to your post):

Sometimes I ask a question or ask for input to ensure that questions are clear, not only for me, but for future generations who visit the site and have similar questions. This site is not a "put a nickel in and get an answer out" site, as some would like it to be. Our mission is to teach people to solve their own problems, not to do others' work for them, like the old saying (paraphrasing) which I am sure you have heard before:

"Give a person a fish and you feed them for a day. Teach that same person to fish, and you feed them for a lifetime."

In the age of the Internet and social media, people have become too dependent on others to do their problem solving (and thinking) for them. When I created this site decades ago, long before FB, reddit, stack*, medium, and more, our goals were always to have a very high "signal to noise" ratio, to never become a "put a nickel in and get an answer out" site, and to encourage people to describe and solve their own problems with our help.

I will continue to encourage all users in that direction, even if we are the last site on the Internet to be this way.

END OF SOAP BOX COMMENT: Sidebar (not specific to your post):

kenlenard · November 21, 2019, 10:20pm

If it helps, I can run the PDF_Checker on a PDF from this same trading partner that actually processes properly. With that file, pdftotext creates a readable text file, and FILE -> SAVE AS TEXT inside of Acrobat works as well. The trading partner updated their PDF and this latest one is the result. Maybe a comparison between old and new would point to the answer. Thanks again for the help.

Neo · November 21, 2019, 10:59pm

Thanks for the update. Did you try Jim's suggestion here?

https://www.unix.com/303041312-post2.html

Neo · November 21, 2019, 11:06pm

Also, according to the pdftotext man page:

BUGS

       Some  PDF  files  contain  fonts whose encodings have been mangled beyond recognition.  There is no way (short of OCR) to extract text from
       these files.

EXIT CODES

       The Xpdf tools use the following exit codes:

       0      No error.

       1      Error opening a PDF file.

       2      Error opening an output file.

       3      Error related to PDF permissions.

       99     Other error.

This would indicate that the first place to look would be at the fonts, since the man page says:

BUGS -- Some PDF files contain fonts whose encodings have been mangled beyond recognition. There is no way (short of OCR) to extract text from these files.

Did you check the file and list all the fonts and compare that list of fonts to a working PDF file (which converts to text properly)?
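
A minimal sketch of how a calling script could act on the exit codes quoted above instead of assuming the conversion worked. The dictionary simply restates the man page text; the function names `describe_exit_code` and `convert` are hypothetical, and the sketch assumes the `pdftotext` binary is on the PATH.

```python
import subprocess

# Meanings copied from the EXIT CODES section of the man page quoted above.
EXIT_CODES = {
    0: "No error.",
    1: "Error opening a PDF file.",
    2: "Error opening an output file.",
    3: "Error related to PDF permissions.",
    99: "Other error.",
}


def describe_exit_code(code: int) -> str:
    """Translate a pdftotext exit status into the man page wording."""
    return EXIT_CODES.get(code, f"Undocumented exit code {code}.")


def convert(pdf_path: str, txt_path: str) -> str:
    """Run pdftotext and return a description of its exit status.

    Raises FileNotFoundError if pdftotext is not installed.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, txt_path],
        capture_output=True,
        text=True,
    )
    return describe_exit_code(result.returncode)
```

An exit code of 0 only says that the tool ran without reporting an error; the output file may still be empty, so the size or content of the text file is worth checking as well.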

kenlenard · November 21, 2019, 11:14pm

I did. I reported back that my version of Acrobat does not have the accessibility tool (apparently). When I click on it it shows that it's a "pro" feature that I do not have. But I have tried to save the document in Acrobat and it will save it under another filename without issue.

--- Post updated at 05:14 AM ---

I was just looking at that and comparing the old version to the new version. The PDF_Checker report for the old version (which DOES convert) says that there are font errors...

Fonts Results
    Errors:
        Uses Base 14 fonts not embedded in document: 
            Helvetica (1 instance)
            Helvetica-Bold (1 instance)

I'm in a bit of deep water here because I'm an application programmer and rarely lift the hood on PDF structure. On this project where I use 'pdftotext', I simply use the command line instructions, take my text file and move on. Once the utility doesn't work (for whatever reason), I'm at a loss. My guess is that the size of the PDF (425kb for the bad one compared to about 17kb for the ones that work properly) suggests that it's actually an image. Does the PDF_Checker information I posted earlier tell us that or no? Thanks again.
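
One possible way to do the font comparison discussed above is with `pdffonts`, a companion utility to pdftotext that lists the fonts in a PDF. The sketch below is an assumption-laden illustration, not a tested recipe for this file: it assumes the Poppler column layout, in which each row ends with the emb, sub and uni flags followed by the object number and generation, and it assumes font names contain no spaces. Other builds of the tool may print different columns. The function names are hypothetical.

```python
import subprocess


def parse_pdffonts(output: str) -> dict:
    """Map font name -> flags from the text that pdffonts prints.

    Assumes each font row ends with: emb sub uni <object> <generation>.
    """
    fonts = {}
    in_table = False
    for line in output.splitlines():
        if line.startswith("---"):
            in_table = True  # rows follow the dashed separator line
            continue
        tokens = line.split()
        if not in_table or len(tokens) < 6:
            continue  # header, blank line, or something that is not a font row
        emb, sub, uni = tokens[-5], tokens[-4], tokens[-3]
        fonts[tokens[0]] = {
            "emb": emb == "yes",  # font program embedded in the PDF
            "sub": sub == "yes",  # font is a subset
            "uni": uni == "yes",  # explicit ToUnicode map present
        }
    return fonts


def list_fonts(pdf_path: str) -> dict:
    """Run pdffonts on a file (assumes the binary is on the PATH)."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return parse_pdffonts(result.stdout)


def compare_fonts(good: dict, bad: dict) -> dict:
    """Summarise how the failing file's fonts differ from a working file's."""
    return {
        "only_in_good": sorted(set(good) - set(bad)),
        "only_in_bad": sorted(set(bad) - set(good)),
        "no_unicode_map_in_bad": sorted(
            name for name, flags in bad.items() if not flags["uni"]
        ),
    }
```

Calling `compare_fonts(list_fonts("old.pdf"), list_fonts("TTSNEW.pdf"))` (the first filename is a placeholder) would show which fonts are new in the failing file. A font without a ToUnicode map is not proof that extraction must fail, but an embedded subset font with a custom encoding and no such map is a plausible candidate for the "mangled beyond recognition" case in the man page.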

Neo · November 21, 2019, 11:37pm

I have not looked into it but I doubt that particular PDF checker checks for fonts not compatible with the Linux pdftotext utility.

My guess is that you will need to preprocess your PDF files and strip out any fonts which are causing issues or are not compatible with pdftotext.

Or... less likely,

You could instruct everyone who provides PDFs not to use unsupported fonts. LOL, but controlling users usually does not work... so that "administrative" option may not help and you will need a technical solution to preprocess.

What do you think?

kenlenard · November 21, 2019, 11:41pm

Also, I saw that bug report about font encodings being mangled beyond recognition. What does that suggest? That the fonts are unusual and unable to be picked up? I have seen that statement on a number of 'pdftotext' websites but I'm not sure what they're trying to say unless it just comes down to some fonts being unusable by the utility. The font in this particular PDF does not seem to be unusual but I have no real reference.

--- Post updated at 05:41 AM ---

I think we posted at the same time there. Yeah, asking the trading partner to conform to something is dicey to say the least. What I find unusual is that this structure has been in place for quite a while and AFAIK, this is the first time that a PDF simply will not process using 'pdftotext'. That, along with the size, suggests that this particular PDF was created under unusual circumstances. What I need to do is tell my customer that this PDF is incompatible, but I would like to tell them WHY so that the trading partner might be able to do something different. I dislike mysteries and I don't like to say that something doesn't work without understanding why. It's definitely mysterious. Thanks again for the help. I appreciate it.