OCR text that needs cleaning

safran · September 22, 2016, 3:41am

Hi,

I have OCR'ed text that needs cleaning.
Each line contains a part-of-speech (POS) tag, for example
adj. OR s. f. OR s. m. etc.
I need to uppercase all text before the POS,
but all text within parentheses should be lowercase.
Text after (and including) the POS should remain as is.

filename: munge

fuiASSO, FIEIASSO (b.), fuluasso (a. l.), foulhasso (for.), (b. lat. folz�acia), s. f. grosse feuille,
FUMFULHUT  (l.), felhut (g.), FOULhuolhut, (it.) FOGLIUTO, adj. Feuillu, ue, v. uiaru, pampous,
FUIEMT, fuiret  (rh.), fulheiret, ramoner  (l.), fulhoret (rouerg.), s. m. Feuilleret, petit rabot qui sert faire des feuillures.
FULmjnacioun, FULMINACIEN  (m.), fulminacieu  (l.),  (rom. lat. fulminatzo, cat. fulminaci�, esp. fulminacion, it. fwlminasione), s. f. Fulmination, v. trounado.
FULMINANT, ANTO  (port. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.

I have uppercased everything before POS with

sed -r -i -f doup.sed munge

doup.sed

s/ n. de l. /^ n. de l. /
s/ s. m. /^ s. m. /
s/ s. f. /^ s. f. /
s/ adj. /^ adj. /
s/ n. p. /^ n. p. /
s/ v. n. /^ v. n. /
s/ v. a. /^ v. a. /
s/ adv. /^ adv. /
s/^(.*)\^/\U\1\E/

and tried to lowercase between the parentheses with

sed -r -i 's/\((.*)\)/\L&/g' munge

but this keeps the uppercasing only up to the first parenthesis and lowercases everything else up to the POS, like:

FUIASSO, FIEIASSO (b.), fuluasso (a. l.), foulhasso (for.), (b. lat. folz�acia), s. f. grosse feuille,
etc
etc

Any GNU sed 4.2.2 or GAWK 4.1.3 solutions please
Thanks in advance

RavinderSingh13 · September 22, 2016, 4:31am

Hello safran,

Could you please try the following and let me know if it helps?

awk '{match($0,/.*s\. f\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)};match($0,/.*s\. m\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)};match($0,/.*adj\./);if(substr($0,RSTART,RLENGTH)){print toupper(substr($0,RSTART,RLENGTH-4)) substr($0,RLENGTH-3)};}'  Input_file
OR a non-one-liner form of the above solution:
awk '{
       # "s. f." is 5 characters long, so the text before it has length RLENGTH-5
       match($0,/.*s\. f\./);
       if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)
       };
       # same for "s. m."
       match($0,/.*s\. m\./);
       if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-5)) substr($0,RLENGTH-4)
       };
       # "adj." is 4 characters long
       match($0,/.*adj\./);
       if(substr($0,RSTART,RLENGTH)) {
          print toupper(substr($0,RSTART,RLENGTH-4)) substr($0,RLENGTH-3)
       };
     }
    '  Input_file

In case you need everything up to and including the POS in upper case, the following may help.

awk '{match($0,/.*s\. f\.|.*adj\.|.*s\. m\./);print toupper(substr($0,RSTART,RLENGTH)) substr($0,RLENGTH+1)}'  Input_file

NOTE: I am trying to do it with a function, will post when able to do so.

Thanks,
R. Singh

safran · September 22, 2016, 4:51am

Hi,

Thanks for the quick response but your AWK one-liners just uppercase everything before the POS.
I'm already doing this uppercasing when I run doup.sed
The code I'm stuck on is the lowercasing of everything within the parentheses before the POS

Thanks
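
One way to limit the lowercasing to the part before the POS is to split each line at the tag first. The following is only a minimal awk sketch of that idea: the function name fix_case is hypothetical, the tag list covers just the tags in doup.sed, and it assumes that every tag has a single space on each side and that parentheses are not nested.

```shell
# Hypothetical helper: uppercase the text before the last POS tag,
# lowercase the parenthesised parts of that text, leave the rest untouched.
fix_case() {
    awk '
    BEGIN {
        # Tags taken from doup.sed; extend the alternation as needed.
        pos = " (n\\. de l\\.|s\\. [mf]\\.|adj\\.|n\\. p\\.|v\\. [na]\\.|adv\\.) "
    }
    match($0, ".*" pos) {
        # Longest match: the prefix ends with the last " POS " on the line.
        prefix = substr($0, RSTART, RLENGTH)
        # Find where that trailing tag begins.
        match(prefix, pos "$")
        head = toupper(substr($0, 1, RSTART - 1))
        tail = substr($0, RSTART)
        out = ""
        # Walk through the head and lowercase each (...) group in turn.
        while (match(head, /\([^)]*\)/)) {
            out = out substr(head, 1, RSTART - 1) tolower(substr(head, RSTART, RLENGTH))
            head = substr(head, RSTART + RLENGTH)
        }
        $0 = out head tail
    }
    { print }   # lines without a known tag pass through unchanged
    '
}
```

It could be run as fix_case < munge > munge.new, which leaves the original file in place for comparison.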

RudiC · September 22, 2016, 5:14am

Try

s/^(.*) (n. de l|s. m|s. f|adj|n. p|v. n|v. a|adv)/\U\1\E \2/
s/\([^)]*\)/\L&/g

for doup.sed to result in

FUIASSO, FIEIASSO (b.), FULUASSO (a. l.), FOULHASSO (for.), (b. lat. folz�acia), s. f. grosse feuille,
FUMFULHUT  (l.), FELHUT (g.), FOULHUOLHUT, (it.) FOGLIUTO, adj. Feuillu, ue, v. uiaru, pampous,
FUIEMT, FUIRET  (rh.), FULHEIRET, RAMONER  (l.), FULHORET (rouerg.), s. m. Feuilleret, petit rabot qui sert faire des feuillures.
FULMJNACIOUN, FULMINACIEN  (m.), FULMINACIEU  (l.),  (rom. lat. fulminatzo, cat. fulminaci�, esp. fulminacion, it. fwlminasione), s. f. Fulmination, v. trounado.
FULMINANT, ANTO  (port. fulminante), adj. Fulminant, ante, v. trounant. R. fulmana.
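
The difference from the earlier attempt is greediness: \((.*)\) lets .* run from the first "(" to the last ")" on the line, while \([^)]*\) stops at the nearest ")". A small sketch of the two, assuming GNU sed and using a made-up ASCII line and hypothetical function names:

```shell
# Greedy version from the original attempt: one match spans the first "(" to the last ")".
lower_greedy()  { sed -r 's/\((.*)\)/\L&/g'; }

# Negated-class version: each match ends at the nearest ")".
lower_nearest() { sed -r 's/\([^)]*\)/\L&/g'; }

# Made-up sample line for illustration only.
line='FOO (A.), BAR (B. C.), s. f. Baz'

printf '%s\n' "$line" | lower_greedy    # FOO (a.), bar (b. c.), s. f. Baz
printf '%s\n' "$line" | lower_nearest   # FOO (a.), BAR (b. c.), s. f. Baz
```

Note that the second expression also lowercases any parentheses that occur after the POS, since it is applied to the whole line.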

safran · September 22, 2016, 6:50am

Thank you RudiC, that works fine.

RudiC · September 22, 2016, 7:03am

You could still try improving the regex, e.g. like

s/^(.*) (n\. (de l|p)|s\. [mf]|ad[jv]|v\. [na])/\U\1\E \2/

This is interesting especially when the list of POS tags gets longer and longer.
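
When compacting an alternation like this, it is easy to drop a tag by accident. A quick check, sketched here with a hypothetical function name, is to feed the original eight tags (without their trailing dot, as in the regex) through grep and print any that the compact form no longer matches:

```shell
# The compact alternation from above, anchored so each tag must match in full.
compact='^(n\. (de l|p)|s\. [mf]|ad[jv]|v\. [na])$'

# Prints every tag from the original list that the compact regex fails to match.
unmatched_tags() {
    printf '%s\n' 'n. de l' 's. m' 's. f' 'adj' 'n. p' 'v. n' 'v. a' 'adv' |
        grep -Ev "$compact"
}

# grep -v exits non-zero when it prints nothing, i.e. when every tag matched.
unmatched_tags || echo 'all eight tags are covered'
```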

safran · September 22, 2016, 9:18am

Hi RudiC,
I prefer to keep the POS list as

s/ n. de l. /^ n. de l. /
s/ s. m. /^ s. m. /
s/ s. f. /^ s. f. /
s/ adj. /^ adj. /
s/ n. p. /^ n. p. /
s/ v. n. /^ v. n. /
s/ v. a. /^ v. a. /
s/ adv. /^ adv. /
s/^(.*)\^/\U\1\E/

adding to it as need be, and use your second sed line

s/\([^)]*\)/\L&/g

to do the lowercasing.

There will be more items to add to the POS list. As with English, words can have different parts of speech, for example,
in English, 'fast' can be a verb, a noun, an adjective and an adverb.

I've already seen examples such as
s. m. pl. and s. f. pl. (the plural forms of masculine/feminine nouns),
but I don't think there will be more than 20 to 25 categories.

Thanks again for your help
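
If the list is expected to grow to 20 or 25 tags, one option is to keep the tags in a plain file, one per line, and build the sed expression from it. This is only a sketch under assumptions: the function name clean_entries and the file name pos.list are hypothetical, GNU sed is assumed (for -r, \U, \L and \E), and the tags are assumed to contain only letters, spaces and dots.

```shell
# Hypothetical helper: $1 is a file with one POS tag per line, e.g. "s. m." or "s. m. pl."
# Reads dictionary lines on stdin and writes the cleaned lines to stdout.
clean_entries() {
    pos_file=$1
    # Escape the dots, then join the tags with "|" into one alternation.
    alt=$(sed 's/\./\\./g' "$pos_file" | paste -sd'|' -)
    # First expression uppercases everything before the last " POS ";
    # second expression lowercases each parenthesised group.
    sed -r -e "s/^(.*) ($alt) /\U\1\E \2 /" -e 's/\([^)]*\)/\L&/g'
}
```

With the tags saved in pos.list, it could be run as clean_entries pos.list < munge > munge.clean. Adding a new category then only means appending a line to pos.list.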