CREATING A SYLLABLE CONCORDANCE WITH POSITIONAL VARIANTS

Hello,
Some time back I posted a request for a syllable concordance: given a file of syllables, the program would extract, for each syllable, a word containing it from a file named "Corpus".
The following script was provided. It did the job, and I am really thankful for it:

#! /usr/bin/perl

use strict;   # These two lines save you endless trouble 
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!"; # Created a test corpus with just the contained lines
$/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>));  # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $found = 0;
    for my $word (@corpus){
        if ( $word =~ /$syllable/){  # use a regular expression to find a match for the syllable
            print "$syllable=$word\n";
            $found = 1;
            last; #Stop processing the array of words as we have an example
        }
    }
    print "$syllable wasn't matched in the supplied corpus\n" if (! $found);
}

However, I need one more refinement.
I need to modify the program so that it finds each syllable in four different environments: initial, medial, final, and standalone (whole word).
An example (theoretical; I know somebody will say that "a" here is not a syllable, but I am working with Indian languages):
Syllable "a"
Initial  Medial  Final  Standalone
ago      bare    gonna  a
A syllable may not appear in every environment, as in the case of "stri":
Initial  Medial  Final  Standalone
strip    Astrid  NONE   NONE
I have tried to factor in the environmental constraints using regexes, but the results are disastrous.
Please help. I have spent quite a few hours and the results get more ludicrous each time.
Many thanks and my gratitude to the generous people on the forum who give their time and energy to helping out tyros like me.

Well, the regex syntax for word boundaries varies between tools; see: Regex Tutorial - \b Word Boundaries

I used to write \< and \> for the start and end of a word, as in vi and GNU grep, but Perl does not support those. In Perl, both boundaries are written \b.

So, you need to check for

  • standalone \<a\> (in Perl: \ba\b)
  • initial \<a[a-z] (in Perl: \ba[a-z])
  • final [a-z]a\> (in Perl: [a-z]a\b)
  • medial [a-z]a[a-z]

Since the [a-z] checks are more expensive, you may be able to skip them by testing in this order: if the word is not standalone (\<a\>), then a match for \<a means initial and a match for a\> means final, and any remaining match is medial.
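
The ordered checks above can be sketched in Perl roughly as follows. This is only an illustration, assuming one word per input string and \b in place of \< and \>. The helper name classify_syllable is hypothetical and does not come from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report where $syllable occurs
# in a single $word, trying the cheap boundary patterns first.
sub classify_syllable {
    my ($syllable, $word) = @_;
    my $s = quotemeta $syllable;    # treat the syllable as literal text

    return 'standalone' if $word =~ /\b$s\b/;  # boundary on both sides
    return 'initial'    if $word =~ /\b$s/;    # not standalone, boundary before
    return 'final'      if $word =~ /$s\b/;    # not standalone, boundary after
    return 'medial'     if $word =~ /$s/;      # present, but no boundary touches it
    return 'none';                             # syllable does not occur at all
}

for my $pair (['a', 'ago'], ['a', 'bare'], ['a', 'gonna'], ['a', 'a'], ['stri', 'Astrid']) {
    my ($syllable, $word) = @$pair;
    printf "%-5s in %-7s => %s\n", $syllable, $word, classify_syllable($syllable, $word);
}
```

When a word contains the syllable more than once, the first test that succeeds decides the label, so "banana" would be reported as final for "a". For non-ASCII data the files would also have to be read as decoded text (for example with an ':encoding(UTF-8)' layer on open) for \b to work on characters rather than bytes.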

Hello,
With a little help from colleagues, I finally managed to get the concordance going. Here is the code in case someone else would like to use it:

#! /usr/bin/perl

use strict;  # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus') or die "Cannot open Corpus: $!"; # Created a test corpus with just the contained lines
# $/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables') or die "Cannot open Syllables: $!";
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $count = 0;
    my $init = my $med = my $fin = my $stdalone = "NONE";
    for my $word (@corpus) {
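        # \Q...\E makes the syllable match literally, even if it contains regex metacharacters
        if ( $word =~ /^\Q$syllable\E.+/) {  # syllable at the start, more text after it
            if ($init eq "NONE") {
                $init = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+\Q$syllable\E.+/) {  # text on both sides of the syllable
            if ($med eq "NONE") {
                $med = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+\Q$syllable\E$/) {  # syllable at the end, text before it
            if ($fin eq "NONE") {
                $fin = $word;
                $count++;
            }
        }
        elsif ($word =~ /^\Q$syllable\E$/) {  # the word is exactly the syllable
            if ($stdalone eq "NONE") {
                $stdalone = $word;
                $count++;
            }
        }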
        last if $count == 4;
    }
    print "$syllable\nInitial $init\nMedial $med\nFinal $fin\nStandalone $stdalone\n";
    #print "$init\t$med\t$fin\t$stdalone\n";
}

Many thanks for the information re. Regex.

A logic tree that removes redundant tests saves time. If a character precedes the syllable, the match is medial or final; otherwise it is initial or standalone. In the prefixed case, a match that is not medial is always final, so no further test is needed.
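
One way to sketch that logic tree in Perl is shown below. It swaps the regexes for index() and length(), which is my own assumption rather than something stated in the thread, and it looks only at the first occurrence of the syllable. The name classify_by_position is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): two yes/no questions decide
# the environment, so no test is ever repeated.
sub classify_by_position {
    my ($syllable, $word) = @_;

    my $pos = index($word, $syllable);      # first occurrence, -1 if absent
    return 'none' if $pos < 0;

    my $has_prefix = $pos > 0;                                   # text before the syllable?
    my $has_suffix = $pos + length($syllable) < length($word);   # text after it?

    if ($has_prefix) {
        return $has_suffix ? 'medial' : 'final';       # prefixed: medial, else final
    }
    return $has_suffix ? 'initial' : 'standalone';     # unprefixed: initial, else standalone
}

for my $word (qw(ago bare gonna a)) {
    printf "%-6s => %s\n", $word, classify_by_position('a', $word);
}
```

Because only the first occurrence is examined, "banana" would be labelled medial for "a" here, while a regex tested against the end of the word would call it final. Which behaviour is wanted depends on how the concordance should treat repeated syllables.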
