Hello,
I want to create a test bed for Urdu ligatural forms. One of the main components is a delimiter list: the forms after which no connection to the following letter can be made.
What I need is a tool which will take running text or a list of words in a file and split each word immediately after every delimiter it contains. A sample will explain the process:
I am using Latin script for ease of illustration.
DELIMITERS: Let us assume that the delimiters are:
a,e,i,o,u
Each delimiter is separated from the next by a comma.
INPUT:
baker
convoluted
perspicacity
EXPECTED OUTPUT:
ba ke r
co nvo lu te d
pe rspi ca ci ty
i.e. the string is split after each delimiter and a space is inserted.
Please note that if I had also listed
aeo
as a delimiter, then a string such as:
archaeological
would be split as
a rchaeo lo gi ca l
At present I use a macro to do the job, but the process is extremely slow.
An AWK or Perl script would be of great help. My OS is Windows.
Many thanks
p.s. Just in case someone is interested in tweaking Urdu, a sample delimiter list is provided below:
[user@host ~]$ cat file
baker
convoluted
perspicacity
[user@host ~]$ cat test.pl
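#!/usr/bin/perl
use strict;
use warnings;

# Characters after which a space is inserted.
my @delims = qw/ a e i o u /;

open my $in, '<', 'file' or die "Cannot open file: $!";
while (my $str = <$in>) {
    chomp $str;
    # Walk the line one character at a time.
    for my $x (split //, $str) {
        print $x, ((grep { $_ eq $x } @delims) ? ' ' : '');
    }
    print "\n";
}
close $in;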
[user@host ~]$
[user@host ~]$ ./test.pl
ba ke r
co nvo lu te d
pe rspi ca ci ty
[user@host ~]$
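The script above compares one character at a time, so it cannot honour a multi-character delimiter such as the aeo mentioned in the question. A minimal sketch of one way to handle that, assuming the longest delimiter should win when several match at the same position: build a regex alternation sorted by length. The function name split_after_delims is made up for illustration. Unlike the script above, it also drops the trailing space left after a word-final delimiter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Insert a space after every delimiter, preferring the longest delimiter
# when several could match at the same position.
# (split_after_delims is a hypothetical name, not from the thread.)
sub split_after_delims {
    my ($word, @delims) = @_;
    # Longest first, so that "aeo" wins over "a" at the same position.
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } @delims;
    (my $out = $word) =~ s/($alt)/$1 /g;
    $out =~ s/ $//;    # drop the space left after a word-final delimiter
    return $out;
}

# Delimiters given as a comma-separated list, as in the question.
my @delims = split /,/, 'a,e,i,o,u,aeo';
print split_after_delims($_, @delims), "\n" for qw(baker archaeological);
```

With the delimiter list a,e,i,o,u,aeo this should give "ba ke r" and "a rchaeo lo gi ca l", matching the expected output in the question.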
Hello,
It worked beautifully for the English samples. However, the moment I plugged in the Urdu delimiters, it stopped working.
I suppose this is because Perl does not support UTF-8. I even tried saving the script as UTF-8 with no byte order mark, but that did not help.
The only change I made in the script was to replace the delimiters with my own:
my @delims = qw / /;
each separated by a space, as in your case.
Just for testing, here is a small sample on which I tried it:
Basically, even if the script is alien to you, you should see a space between the ligatural forms, but the Perl script outputs the sample file unchanged.
How do you get around this issue?
Any help or suggestions, please.
Many thanks
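Perl does handle UTF-8, but it has to be told where to expect it. By default the script source and the input file are both treated as bytes, so a multi-byte Urdu letter never equals a single "character" produced by split. A minimal sketch of one way around this, assuming the script and the input are both saved as UTF-8: `use utf8` makes the delimiter literals character strings, and the `:encoding(UTF-8)` layers decode input and encode output. The four delimiters below (alef, dal, reh, waw) are only an illustrative assumption, not the poster's list, and split_after is a made-up name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                              # the source file contains UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';    # encode characters on output

# Hypothetical delimiter list, for illustration only: a few Urdu letters
# that do not join to the following letter (alef, dal, reh, waw).
my @delims   = qw/ ا د ر و /;
my %is_delim = map { $_ => 1 } @delims;

# Insert a space after each delimiter character.
# (split_after is a made-up name, not from the thread.)
sub split_after {
    my ($text) = @_;
    return join '', map { $is_delim{$_} ? "$_ " : $_ } split //, $text;
}

# Process every file named on the command line, decoding it as UTF-8.
for my $path (@ARGV) {
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        chomp $line;
        print split_after($line), "\n";
    }
    close $in;
}
```

Run it as `perl test.pl file`. On Windows the console may not display Urdu correctly, so redirecting the output to a file and opening it in a UTF-8 aware editor is probably the easier way to check the result.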