I am working on a database of a language written in Arabic script. One major issue is that the shape of a character changes according to its initial, medial or final position. Another is the clustering of vowels within a word: a cluster completely changes the pronunciation.
What I am looking for is a concordance of such clusters, read from a file, showing each cluster in initial, medial or final position with a couple of examples taken from the database.
Two files will be provided:
A look-up file called clusters and a database termed dictionary
An example will make this clear: (I will use English to make this understandable)
The clusters file will be a repertoire of single characters or sequences of two or more characters, as in the example below:
Clusters
a
oi
oa
ai
ea
ui
The dictionary will consist of each word followed by its mapping, delimited by an equals sign, as in the example below. The mappings are pseudo-transcriptions, since in the real dictionary they will be in the International Phonetic Alphabet.
Dictionary
act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut
The expected output would be as follows:
Keyword from the clusters file
Position: Initial, Medial or Final [in case no example is found, just a dash]
Frequency of occurrence
Two or three examples of the word from the database
Only one example is given below
a
Init 3 act=akt,approach=eproch,already=alredi
Mid 1 ball=ball
Fin 1 beta=bita
There is one condition: only the longest matching string from the clusters file is considered. If a character already forms part of a longer cluster, it is not counted on its own. Thus
a in final position also occurs in sea, but it is ignored because it belongs to the cluster ea.
Similarly, a in medial position has only one example, since everywhere else it occurs as part of a longer cluster (oa, ai, ea).
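To make the condition concrete, here is a minimal sketch of how it might be checked in Perl. It assumes the clusters are tried longest first at each point in the word, and that the word has already been decoded into characters (relevant for the real Perso-Arabic data). The name find_clusters is purely illustrative and not part of any existing script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return [cluster, position] pairs for one headword.
sub find_clusters {
    my ($word, @clusters) = @_;
    # Longest alternatives first, so that "ea" is tried before "a".
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } @clusters;
    my @hits;
    while ($word =~ /($alt)/g) {
        my $end   = pos($word);            # offset just past the match
        my $start = $end - length($1);     # offset where the match began
        my $where = $start == 0           ? 'Init'
                  : $end == length($word) ? 'Fin'
                  :                         'Mid';
        push @hits, [$1, $where];
    }
    return @hits;
}

my @clusters = qw(a oi oa ai ea ui);
for my $word (qw(sea ball already)) {
    my @parts;
    push @parts, "$_->[0]/$_->[1]" for find_clusters($word, @clusters);
    print "$word: @parts\n";
}
```

With the sample clusters this reports ea/Fin for sea (so the a is not counted separately), a/Mid for ball, and a/Init ea/Mid for already. A cluster that makes up the whole word is reported as Init in this sketch.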
Since I work under Windows, a Perl or Awk script would help. I do write scripts in Perl and Awk, but this is beyond my skill set.
Any help would be greatly appreciated, since the final output will help create standards for that particular linguistic community and this work will be put up free for use.
---------- Post updated 08-07-15 at 03:27 AM ---------- Previous update was 08-06-15 at 08:35 PM ----------
My sincere apologies to all who took pains to read the request. I guess my memory isn't what it used to be (I am nearly 70 years old). Still, I should have checked on the forum before posting, which I did not. I will be more careful next time.
I found that I had already written similar code in Perl, which was improved by folks on the forum. Here is the code that was posted:
#! /usr/bin/perl
use strict;   # These two lines save you endless trouble:
use warnings; # without them, typos and similar errors get missed
open (my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!"; # Test corpus with just the contained lines
# $/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while (<$syllables_file>) {
    chomp(my $syllable = $_);
    my $count = 0; # Number of positions for which an example has been found
    my $init = my $med = my $fin = my $stdalone = "NONE";
    for my $word (@corpus) {
        # \Q...\E keeps regex metacharacters in the syllable literal
        if ($word =~ /^\Q$syllable\E.+/) {
            if ($init eq "NONE") {
                $init = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+\Q$syllable\E.+/) {
            if ($med eq "NONE") {
                $med = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+\Q$syllable\E$/) {
            if ($fin eq "NONE") {
                $fin = $word;
                $count++;
            }
        }
        elsif ($word =~ /^\Q$syllable\E$/) {
            if ($stdalone eq "NONE") {
                $stdalone = $word;
                $count++;
            }
        }
        last if $count == 4; # Stop once all four positions have an example
    }
    print "$syllable\nInitial $init\nMedial $med\nFinal $fin\nStandalone $stdalone\n";
    #print "$init\t$med\t$fin\t$stdalone\n";
}
However, I would still appreciate it if the two changes I requested earlier could be incorporated.
Since the data contains the Perso-Arabic script and its IPA, delimited by an equals sign, the present code does not correctly identify the initial syllables. This may be because of the delimiter and the IPA string that follows.
If the output could contain the frequency, that would also be a great help, as would increasing the number of sample occurrences to at least 4 or 5.
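To illustrate the two changes, here is a rough sketch rather than a finished script. It assumes that each dictionary line is split at the first equals sign, so that only the headword is matched and the IPA is kept for display. The in-memory lists stand in for the two files, and the names $MAX_EXAMPLES and %tally are illustrative assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $MAX_EXAMPLES = 5;                     # assumed cap on examples per position
my @clusters     = qw(a oi oa ai ea ui);  # stand-in for the clusters file
my @dictionary   = qw(act=akt ball=ball beta=bita sea=si
                      approach=eproch already=alredi);

# Longest clusters first, so that "ea" is tried before "a".
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) } @clusters;

my %tally;  # $tally{cluster}{position} = { count => N, examples => [...] }
for my $entry (@dictionary) {
    # Match against the headword only; the IPA after "=" is display-only.
    my ($word) = split /=/, $entry, 2;
    while ($word =~ /($alt)/g) {
        my $end   = pos($word);
        my $where = $end - length($1) == 0 ? 'Init'
                  : $end == length($word)  ? 'Fin'
                  :                          'Mid';
        my $slot = $tally{$1}{$where} //= { count => 0, examples => [] };
        $slot->{count}++;                 # frequency counts every occurrence
        my $examples = $slot->{examples};
        push @$examples, $entry
            if @$examples < $MAX_EXAMPLES && !grep { $_ eq $entry } @$examples;
    }
}

for my $cluster (@clusters) {
    print "$cluster\n";
    for my $where (qw(Init Mid Fin)) {
        my $slot = $tally{$cluster}{$where};
        my $line = $slot
                 ? "$where $slot->{count} " . join(',', @{ $slot->{examples} })
                 : "$where -";
        print "$line\n";
    }
}
```

For the cluster a, this shortened sample dictionary gives Init 3 with act=akt,approach=eproch,already=alredi, then Mid 1 with ball=ball and Fin 1 with beta=bita. For the real data the files would need to be read as UTF-8 so that lengths and offsets are counted in characters rather than bytes.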
Sorry once more for the lapse of memory, and many thanks for your understanding.