Regular expression for finding OCR mistakes.

gencon · May 17, 2012, 2:03pm

I have a large plain-text file created with OCR software. Inevitably, some words have been misread. I've been trying to write grep or sed regular expressions to find them, but haven't quite managed to get them right. Here's what I'm trying to achieve:

Output all lines containing a word that begins with, or contains, a digit or a non-alphanumeric character. E.g. th1s, mi|k, !nert, etc.

Output all lines containing a word that ends with a digit, or with a non-alphanumeric character that is not a common punctuation mark such as '.' or ','. E.g. Cra6, Chemica(, etc.

If possible, it would be great to have the line numbers printed as well, but that's not essential.

Can you gurus help please? Thanks.

Corona688 · May 17, 2012, 2:18pm

$ cat data
This line contains a 1 but is not a mistake
The small brown fox jumped over the lazy dog.
This line contains a 1 but is a m1stake
How are you today?
mi|k
That's fine;  this isn't.
!nert
Hey hey hey!
cra6
chemica(

$ cat ocr.awk
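{
        P=0
        for(N=1; (!P) && (N<=NF); N++)
        {
                # Skip words that are pure numbers (assumed not to be mistakes)
                if($N ~ /^[0-9]*$/) continue;
                # Flag words with a character other than a-zA-Z' anywhere except
                # the last position (the trailing . requires a following character)
                if($N ~ /[^a-zA-Z']./) P=1;
                # Flag words whose last character is not one of a-zA-Z.,;?!
                if($N ~ /[^a-zA-Z.,;?!]$/) P=1;
        }

        # Prepend the line number; the trailing pattern P prints only flagged lines
        $0=NR"\t"$0;
} P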

$ awk -f ocr.awk data
3       This line contains a 1 but is a m1stake
5       mi|k
7       !nert
9       cra6
10      chemica(

$
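
When there are a lot of flagged lines to check, it may help to see which word tripped the test rather than the whole line. The following is only a sketch of one way to do that, reusing the two tests from ocr.awk unchanged. The function name ocr_words is made up for illustration and is not part of the answer above.

```shell
#!/bin/sh
# ocr_words is a hypothetical helper, not part of the original answer.
# It prints "line number<TAB>word" for each suspect word in the named
# file, or in standard input when no file is given.
ocr_words() {
    awk '
    {
        for (N = 1; N <= NF; N++) {
            # Skip pure numbers, as ocr.awk does.
            if ($N ~ /^[0-9]+$/) continue
            # Same two tests as ocr.awk. \047 is the octal escape for an
            # apostrophe, used because this program sits inside single quotes.
            if ($N ~ /[^a-zA-Z\047]./ || $N ~ /[^a-zA-Z.,;?!]$/)
                print NR "\t" $N
        }
    }' "$@"
}
```

Run as `ocr_words data` on the sample file above, this should list m1stake, mi|k, !nert, cra6 and chemica( against lines 3, 5, 7, 9 and 10. Unlike ocr.awk it does not stop at the first bad word, so a line with several suspect words appears once per word.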

gencon · May 17, 2012, 4:08pm

Thank you so much, Corona, I really appreciate it. That works brilliantly, with a few modifications for things I hadn't mentioned, but those are just minor details. Now I've got to plough through the results. Oh well, it's just a few hours' work, and that's instead of reading the whole thing. Many, many thanks; that's saved me hours. Cheers.