Problem in refreshing a text editor (gedit) for scanned pdf

anespa · May 27, 2016, 5:08am

Dear Friends,
I am using Ubuntu 15.10, 34 bit system. I added a Nautilus-Actions action that runs a shell script to convert PDF files to text. There are 2 types of PDF:

Scanned PDF (not OCR type) -- The conversion to text works, but as the last step the text file should open in gedit, and there I see a blank file,
even though the text did come through in the real file...
Normal PDF (searchable one) -- it works fine

I have added my code for your reference. Please advise what I should do to avoid this issue.

#!/bin/bash
cd "$1" || exit 1
if [[ $2 = *.pdf ]]; then
  #echo pdf > "anes.txt"
  MYFONTS=$(pdffonts -l 5 "$3" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
  if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    #Scanned PDF
    convert -density 300 "$3" "${3%.*}.tiff"
    tesseract "${3%.*}.tiff" "$3"
    sleep 2
    rm -f "${3%.*}.tiff"
    gedit "${3/%.*}.txt"
  else
    pdftotext "$3"
    gedit "${3/%.pdf/.txt}"
  fi
elif [[ $2 = *.tif ]] ||  [[ $2 = *.tiff ]] || [[ $2 = *.jpg ]] || [[ $2 = *.jpeg ]] || [[ $2 = *.png ]] || [[ $2 = *.gif ]]; then
   tesseract "$3" "${3%.*}"
   gedit "${3/%.*}.txt"
else
  # Not implemented case...
  #echo Nothing to do > "anes.txt"
  :  # no-op, since bash rejects an else branch that contains only comments
fi

Waiting for your fast response

Thanks

Anes

Corona688 · May 27, 2016, 11:19am

Just because you had a real file doesn't mean it had real contents. If it's a scanned PDF, something has to do optical character recognition, and pdftotext does not. Hence the text files you get are empty.
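
One way to tell "file exists" apart from "file has contents" before opening the editor is the shell's -s test. This is only a rough sketch: has_contents and open_if_not_empty are made-up helper names, not part of the script above, and the editor is passed as a parameter so it can be swapped out. Note that -s only checks for a size above zero bytes, so a file holding nothing but whitespace still counts as non-empty.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the file exists
# and is larger than zero bytes.
has_contents() {
  [ -s "$1" ]
}

# Hypothetical helper: open the file in an editor (gedit by default)
# only when it actually holds something; otherwise report on stderr.
open_if_not_empty() {
  local file=$1 editor=${2:-gedit}
  if has_contents "$file"; then
    "$editor" "$file"
  else
    echo "No text in '$file' (missing or empty)" >&2
    return 1
  fi
}
```

In the script this could replace a bare gedit call, for example open_if_not_empty "${3%.*}.txt".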

neutronscott · May 27, 2016, 11:58am

It's not clear what $2 and $3 are ..

It looks like you convert $3, so I assume that's file.pdf

tesseract's output would be in file.pdf.txt, since you passed "$3" as the output base instead of ${3%.*} as you did elsewhere. But then you try to open ${3/%.*}.txt instead, which is file.txt...
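
To make the mismatch visible, here is a small sketch that only prints the names involved, without running tesseract or gedit. It assumes, as guessed above, that $3 is something like file.pdf. The three function names are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helpers; the argument stands in for "$3" in the script above.

# tesseract appends ".txt" to whatever output base it is given,
# so passing "$3" itself as the base yields "<name>.pdf.txt".
tesseract_output_for_base() {
  printf '%s.txt\n' "$1"
}

# What the gedit line in the scanned-PDF branch opens:
# ${var/%.*} deletes a suffix matching the glob ".*".
name_opened_by_gedit() {
  printf '%s.txt\n' "${1/%.*}"
}

# A consistent choice: strip the last extension with ${var%.*}
# and use that same base for both tesseract and gedit.
text_name() {
  printf '%s.txt\n' "${1%.*}"
}

f="file.pdf"
echo "tesseract writes: $(tesseract_output_for_base "$f")"   # file.pdf.txt
echo "gedit opens:      $(name_opened_by_gedit "$f")"        # file.txt
echo "consistent name:  $(text_name "$f")"                   # file.txt
```

So in the scanned branch, something like tesseract "${3%.*}.tiff" "${3%.*}" followed by gedit "${3%.*}.txt" should make both commands agree on the file name. That is untested against your setup, and ${3%.*} is probably the safer form for names containing more than one dot.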