Find and replace in a file from another file

gimley · October 24, 2017, 12:26am

Hello,
I am writing a phonetic converter that turns written French into IPA. French has the convention of dropping the final vowel of certain short words (usually an e) before a word that starts with a vowel or a mute h, and joining the two words with an apostrophe:

l'homme
d'air
s'est

Loading all such words into my dictionary just overloads the database.
Apart from this, the engine I have written does not convert words followed by punctuation marks.
What I need is an awk or Perl script that can act as a preprocessor and separate the apostrophe from the preceding word with a space.

l' becomes l[space]'
d' becomes d[space]'

I am giving below a sample rule file, in UTF-8, which needs to handle all such cases. The convention is that the left-hand side string is converted to the right-hand side string, with = as the delimiter:

'= ' 
.= . 
,= ,
"= " 
;= ; 
:= : 
!= ! 
?= ? 
(= ( 
)= ) 
l'=l 
-= -

Some samples for testing are given below

s'est
l'air
d'homme
l'issue
bleu-blanc-rouge
(SDF)
a-t-il?

Many thanks for your kind help. Caveat: I work under Windows.
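
For the rule-file approach described in this post, a minimal sketch in shell and awk might look like the following. It assumes each rule line has the form left=right with the first = as the delimiter, that rules are literal strings rather than regular expressions, and that they are applied in file order. The function name apply_rules and the sample file names are hypothetical; this is not the poster's Java tool.

```shell
# apply_rules RULE_FILE TEXT_FILE  (hypothetical helper, not from the thread)
# Assumes one rule per line in the form LEFT=RIGHT, split at the first "=".
apply_rules() {
    awk '
    # Replace every literal occurrence of "from" in "s" with "to".
    # Text that was already replaced is never rescanned, so the loop ends.
    function replace_all(s, from, to,    out, i) {
        out = ""
        while ((i = index(s, from)) > 0) {
            out = out substr(s, 1, i - 1) to
            s = substr(s, i + length(from))
        }
        return out s
    }
    # First file: load the rules, keeping their order.
    NR == FNR {
        sub(/\r$/, "")                 # tolerate Windows line endings
        p = index($0, "=")
        if (p > 1) {                   # skip lines with no "=" or an empty left side
            n++
            lhs[n] = substr($0, 1, p - 1)
            rhs[n] = substr($0, p + 1)
        }
        next
    }
    # Second file: apply each rule in turn to the current line.
    {
        for (k = 1; k <= n; k++) $0 = replace_all($0, lhs[k], rhs[k])
        print
    }
    ' "$1" "$2"
}

# Small demonstration with throwaway files.
demo=$(mktemp -d)
printf '%s\n' "'= '" "-= -" "?= ?" > "$demo/preprocess.rul"
printf '%s\n' "l'air" "a-t-il?" > "$demo/sample.txt"
apply_rules "$demo/preprocess.rul" "$demo/sample.txt"
```

Because the rules run in file order, an earlier rule can change text that a later rule was meant to match. In the rule file above, the rule for ' would already have split l' before the l' rule is reached.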

RudiC · October 24, 2017, 4:38am

Any attempts / ideas / thoughts from your side?

Is the list given complete, or does your request apply to ALL punctuation chars?

gimley · October 24, 2017, 5:02am

Many thanks for your queries. My answers to both:

I tried to write a search and replace tool in Java. Unfortunately it does not accept punctuation marks, so the preprocess.rul file fails and the required replacement is not done. In fact, the rule file I posted is the one I used for the Java search and replace.
No, the list is not complete; it is partial, and as I explore the database some more cases may arise. But on the whole, every punctuation mark needs to be replaced by a space followed by the punctuation mark.

Is there any script in Awk or Perl which can do this?
Thanks a lot

RudiC · October 24, 2017, 5:16am

How about

sed 's/[[:punct:]]/ &/g' file
s 'est
l 'air
d 'homme
l 'issue
bleu -blanc -rouge
 (SDF )
a -t -il ?

gimley · October 24, 2017, 5:22am

Hello,
I am not too familiar with sed.
If I run the sed script you have given

sed 's/[[:punct:]]/ &/g' file

would it do the job?
A naive question maybe, but how do I run the script?
Thanks a lot for your kind help.

RudiC · October 24, 2017, 5:27am

You tell me if "it does the job". Enter the command as given on the terminal command line, replacing "file" with your input file name. Redirect the output to a destination file if happy with what you see. You may want to test it on a small sample of the input, though.
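
The steps described here (try a small sample first, then redirect the full output) can be sketched as follows. The file names input.txt and output.txt are hypothetical, and the sample data is created only so the example runs on its own.

```shell
# Throwaway working directory with a small sample input (hypothetical names).
workdir=$(mktemp -d)
printf '%s\n' "s'est" "l'air" "bleu-blanc-rouge" "(SDF)" "a-t-il?" > "$workdir/input.txt"

# 1. Preview: run the command on the first few lines only.
#    The result goes to the terminal; nothing is written to disk.
head -n 3 "$workdir/input.txt" | sed 's/[[:punct:]]/ &/g'

# 2. Once the preview looks right, process the whole file
#    and redirect the output to a destination file.
sed 's/[[:punct:]]/ &/g' "$workdir/input.txt" > "$workdir/output.txt"
```

In the classic Windows Command Prompt, single quotes are not quoting characters, so the expression usually has to be wrapped in double quotes there instead, as in sed "s/[[:punct:]]/ &/g" input.txt > output.txt. Unix-like shells on Windows accept the single-quoted form as given.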

gimley · October 24, 2017, 5:56am

Sorry for the delay.
I had to download a sed which works on Windows 10.
Yes, it works and "does the job". It "cleans" up all the punctuation.
Thanks a lot.