Hello,
I am writing a phonetic converter that turns written French into IPA. French elides the final vowel (usually an e) of certain short words when the next word begins with a vowel or a mute h, replaces it with an apostrophe, and joins the two words:
l'homme
d'air
s'est
Loading all such words into my dictionary just overloads the database.
Apart from this, the engine I have written does not convert words that are followed by punctuation marks.
What I need is an awk or Perl script that can act as a preprocessor and separate the apostrophe from the preceding word with a space:
l' becomes l[space]'
d' becomes d[space]'
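As an illustration of the rule above, and only as a sketch: the apostrophe case amounts to a single substitution that puts a space in front of each apostrophe. The function name `split_apostrophes` is hypothetical, and handling the typographic apostrophe (’) alongside the plain ASCII one is an assumption about what the input may contain.

```sh
# Hypothetical helper: insert a space before every apostrophe.
# Reads from the files given as arguments, or from standard input if none.
split_apostrophes() {
  # First expression: plain ASCII apostrophe.
  # Second expression: typographic apostrophe (assumed to occur in UTF-8 input).
  sed -e "s/'/ '/g" -e "s/’/ ’/g" "$@"
}

printf '%s\n' "l'homme" "d'air" "s'est" | split_apostrophes
# l 'homme
# d 'air
# s 'est
```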
Below is a sample rule file, in UTF-8, which needs to handle all such cases. The convention is that the left-hand side string is converted to the right-hand side string with
I tried to write a search-and-replace tool in Java. Unfortunately, it does not accept punctuation marks, so the preprocess.rul file fails and the required replacements are not made. In fact, the rule file I posted is the one I used for the Java search and replace.
No, the list is not complete; it is partial, and more cases may arise as I explore the database. On the whole, though, every punctuation mark needs to be replaced by a space followed by that punctuation mark.
Is there any script in awk or Perl that can do this?
Thanks a lot
Tell me whether "it does the job". Enter the command as given on the terminal command line, replacing "file" with your input file name. If you are happy with what you see, redirect the output to a destination file. You may want to test it on a small sample of the input first, though.
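A minimal sketch of a sed command that fits this description. It is an assumption and not necessarily the exact command from this reply. It puts a space in front of every character in the POSIX class `[[:punct:]]` and then collapses runs of spaces, so that punctuation already preceded by a space does not end up with two. The function name `split_punctuation` is hypothetical.

```sh
# Hypothetical helper: put a space before each punctuation character.
# Reads from the files given as arguments, or from standard input if none.
split_punctuation() {
  # 1st expression: "&" stands for the matched punctuation character.
  # 2nd expression: squeeze any run of spaces down to a single space.
  sed -e 's/[[:punct:]]/ &/g' -e 's/  */ /g' "$@"
}

printf '%s\n' "Bonjour, l'homme." | split_punctuation
# Bonjour , l 'homme .
```

To run it on a file and keep the result, something like `split_punctuation file > file.out` should work. Two caveats: hyphens inside words (as in "peut-être") are split as well, and whether `[[:punct:]]` also matches typographic marks such as « » or ’ may depend on the sed build and the locale. The second expression also collapses runs of spaces that were already in the input.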
Sorry for the delay.
I had to download a version of sed that works on Windows 10.
Yes, it works and "does the job": it cleans up all the punctuation.
Thanks a lot.