Writing a clustering concordance for a Perso-Arabic script

gimley · August 7, 2015, 4:27am

I am working on a database of a language written in Arabic script. One major issue is that the shape of a character changes according to its initial, medial or final position. Another is the clustering of vowels within the word: clustering completely changes the pronunciation.
What I am looking for is a concordance of such clusters, read from a file, showing each cluster in initial, medial or final position with a couple of examples read from the database.
Two files will be provided:

A look-up file called clusters and a database termed dictionary

An example will make this clear: (I will use English to make this understandable)
The cluster file will be a repertoire of single characters or clusters of two or more letters, as in the example below:

Clusters
a
oi
oa 
ai
ea
ui

The dictionary will consist of each word followed by its mapping, delimited by an equals sign, as in the example below. The mappings are pseudo-mappings, since in the real dictionary these will be in the International Phonetic Alphabet.

Dictionary
act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut

The expected output would be as under.

keyword from cluster
position Initial Medial or Final [In case no example is found just a dash]
Frequency of occurrence
Two or three examples of the word from the database

Only one example is given below

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

There is one condition. Only the largest string from the clusters file will be considered. If the character is already found in the large cluster it will be ignored. Thus

a in final position also occurs in sea but is ignored because the cluster ea is already there.

Similarly a in medial position has only one example, since it occurs elsewhere in different combinations.
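The "largest string wins" condition can be sketched in awk roughly as follows. This is only an illustration on the English sample data, not a full solution: the function name mask_clusters and the "#" placeholder are assumptions made for the example, and clusters are compared as literal strings (with index) instead of regular expressions, so characters that are special in a regex need no escaping.

```shell
# mask_clusters WORD CLUSTER...
# Hypothetical helper: prints "cluster hits" for each cluster found in
# WORD. Longer clusters are tried first, and every hit is overwritten
# with "#" so that a shorter cluster cannot match inside it.
# Assumption: "#" never occurs in the words or clusters themselves.
mask_clusters() {
    word=$1; shift
    printf '%s\n' "$@" | awk -v word="$word" '
    # Build a string of n "#" characters (s is a local variable).
    function pad(n,   s) { while (n-- > 0) s = s "#"; return s }
    {
        cl[++n] = $1
        if (length($1) > max) max = length($1)
    }
    END {
        for (len = max; len >= 1; len--)
            for (i = 1; i <= n; i++) {
                if (length(cl[i]) != len) continue
                hits = 0
                while ((p = index(word, cl[i])) > 0) {
                    # Blank out the hit, keeping the word length intact.
                    word = substr(word, 1, p - 1) pad(len) substr(word, p + len)
                    hits++
                }
                if (hits) print cl[i], hits
            }
    }'
}
```

With the clusters a and ea, the word sea yields only a hit for ea, while already yields one hit for ea and one for the leading a, which is the behaviour asked for above.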

Since I work under Windows a Perl or Awk script could help. I do write scripts in Perl and Awk, but this is beyond my skill-set.
Any help would be greatly appreciated, since the final output will help create standards for that particular linguistic community and this work will be put up free for use.

---------- Post updated 08-07-15 at 03:27 AM ---------- Previous update was 08-06-15 at 08:35 PM ----------

My sincere apologies to all who took pains to read the request. I guess my memory isn't what it used to be (I am nearly 70 years old). Still, I should have checked on the forum before posting, which I did not. I will be more careful next time.
I found that I had already written similar code in Perl, which was improved by folks on the forum. Here is the code that was put up:

#! /usr/bin/perl

use strict;  # These two lines save you endless trouble
use warnings; # without them typos and such errors get missed

open (my $corpus_file, '<', 'Corpus'); # Created a test corpus with just the contained lines
# $/="\r\n"; # Again with the DOS files
chomp(my @corpus = (<$corpus_file>)); # Load the corpus file into an array for faster access
open (my $syllables_file, '<', 'Syllables');
while(<$syllables_file>){
    chomp(my $syllable = $_);
    my $count = 0;
    my $init = my $med = my $fin = my $stdalone = "NONE";
    for my $word (@corpus) {
        if ( $word =~ /^$syllable.+/) {
            if ($init eq "NONE") {
                $init = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable.+/) {
            if ($med eq "NONE") {
                $med = $word;
                $count++;
            }
        }
        elsif ($word =~ /.+$syllable$/) {
            if ($fin eq "NONE") {
                $fin = $word;
                $count++;
            }
        }
        elsif ($word =~ /^$syllable$/) {
            if ($stdalone eq "NONE") {
                $stdalone = $word;
                $count++;
            }
        }
        last if $count == 4;
    }
    print "$syllable\nInitial $init\nMedial $med\nFinal $fin\nStandalone $stdalone\n";
    #print "$init\t$med\t$fin\t$stdalone\n";
}

However, I would still appreciate it if, as I had requested earlier, two changes could be incorporated.
Since the data contains the Perso-Arabic script and its IPA delimited by an equals sign, the present code does not correctly identify the initial syllables. This may be because of the delimiter and the IPA string that follows.
If the output could contain the frequency, that would also be a great help, as would increasing the number of sample occurrences to at least 4 or 5.
Sorry once more for the lapse of memory and many thanks for your comprehension.
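One possible way to keep the IPA side out of the match is to split each line at the equals sign and test only the first field. The sketch below does that in awk for a single syllable and also prints a frequency and a capped list of examples. The function name position_report and its arguments are made up for this illustration; like the Perl script, it classifies each word once, and it does not apply the "largest cluster" rule.

```shell
# position_report SYLLABLE [MAX_EXAMPLES] < corpus
# Hypothetical helper: counts how often SYLLABLE is initial, medial,
# final or standalone, looking only at the text left of "=".
position_report() {
    awk -F= -v syl="$1" -v max="${2:-5}" '
    # Count one occurrence and remember the line while under the cap.
    function note(pos) {
        count[pos]++
        if (count[pos] <= max)
            ex[pos] = ex[pos] (count[pos] > 1 ? "," : "") $0
    }
    {
        sub(/\r$/, "")          # tolerate DOS line endings
        w = $1                  # the word only, never the IPA side
        n = length(syl)
        if (w == syl)
            note("Standalone")
        else if (substr(w, 1, n) == syl)
            note("Initial")
        else if (length(w) >= n && substr(w, length(w) - n + 1) == syl)
            note("Final")
        else if (index(w, syl) > 0)
            note("Medial")
    }
    END {
        split("Initial Medial Final Standalone", order, " ")
        for (i = 1; i <= 4; i++)
            printf "%s\t%d\t%s\n", order[i], count[order[i]],
                (ex[order[i]] == "" ? "-" : ex[order[i]])
    }'
}
```

Called as position_report a 5 < Corpus, it prints one tab-separated line per position: the label, the frequency, and up to five full word=IPA entries (or a dash when there are none).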

Don_Cragun · August 9, 2015, 12:16am

Are the words Clusters and Dictionary in your sample input files intended to be the names of those files, or are they headers that actually appear as the first line in those files? When you first mentioned those files, you said they had the names clusters and dictionary (with the 1st character being a lowercase letter). In the code in your update to your post, they had the names Syllables and Corpus , respectively. So what are your actual filenames (note that case does matter in filenames on standards-conforming UNIX and Linux filesystems).

Does the order in which the clusters appear in the output matter? If it does matter, will the lines in the cluster input file always be in an order such that all lines with the same number of characters are adjacent and the lines with fewer characters come before lines with more characters? (The code I'm playing with now will produce output with the shortest clusters first and, within each cluster length, will output the clusters in the order in which they appear in the input file. Extra code will be needed if that is not acceptable.)

In your sample output:

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball
Fin	1	beta=bita

why is the format of the 3rd line different than the format of the 2nd and 4th lines? Why aren't the 2nd and 4th lines:

a
Init	3	act,approach,already
Mid	1	ball
Fin	1	beta

or, why isn't the 3rd line:

a
Init	3	act=akt,approach=eproch,already=alredi
Mid	1	ball=ball
Fin	1	beta=bita

so all of the output is in the same format?

gimley · August 9, 2015, 1:20am

Many thanks for responding.
I understand your queries and in fact I would like to clarify the details so that the Script is more comprehensible
DETAILS
The script invokes 2 files:

Syllables: A list of all the syllables.
Corpus: A list of words in Arabic script followed by their Indic equivalent, delimited by =

EXPECTED FORMAT
In each case the output is supposed to spew out
a. The syllable in question whether it is Initial Medial or Final.
b. At least 6 to 10 examples (at present only one is spewed out)
c. Additional Bells and whistles: A frequency count of all the words [not present in my script: I don't know how to tailor two sets of counts]
In other words the output should be as under:

SYLLABLE: FREQUENCY 
Initial 6 EXAMPLES 
Medial 6 EXAMPLES 
Final 6 EXAMPLES 
Standalone 6 EXAMPLES

The example should have the String in Arabic and also in Indic script.
If there are none or less, then it should specify the same. At present only one example is spewed out
It does work to a certain extent but the following major problems are there
PROBLEMS
1. The script should address only the Perso-Arabic side, using the = delimiter, and ignore the Indic side. It does not do that, as a result of which not all final occurrences are shown. This is because of the delimiter, and therefore valid final occurrences in Arabic are not detected. I don't know how to instruct the program to limit analysis to the Arabic side of the corpus and ignore the rest.
2. I need at least 6-10 instances of tokens from the corpus file. At present only one is given.
3. If possible the frequency should be provided. [I don't know how to tailor two sets of counts]
I have racked my brains over this and all attempts to get this type of output have failed.
To make the scenario more clear I am attaching the data files as well as the script file.
I have tried again and again to modify the script but the desired formatted output is not spewed out.
One is never too old to learn and I still feel at my age I can master the intricacies of Perl and handle strings.
Many thanks for your help

Don_Cragun · August 9, 2015, 2:19am

Thanks for the information. You didn't answer all of the questions, and I now have a new question. What is "Standalone"? Does it mean that the syllable appeared as an entire word? Am I correct in assuming that with a in Syllables and with a=a in Corpus , the standalone count (and only that count) should be incremented by 1 and with the word abracadabra=whatever in Corpus , the initial and final counts should each be incremented by 1, the medial counter should be incremented by 3, and the standalone counter should not change?
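The counting assumed in the question above can be sketched like this. It is an illustration of that reading only (one possible interpretation, not a confirmed specification); the function name count_positions is made up, and the syllable is compared literally, first at the start of the word, then at the end, then in what remains in between.

```shell
# count_positions SYLLABLE WORD
# Hypothetical helper: reports how many times SYLLABLE occurs in WORD
# as initial, medial, final or standalone (no overlapping matches).
count_positions() {
    awk -v syl="$1" -v w="$2" 'BEGIN {
        n = length(syl)
        if (n == 0) exit 1              # an empty syllable makes no sense
        init = med = fin = alone = 0
        if (w == syl)
            alone = 1
        else {
            lo = 1; hi = length(w)      # bounds of the unclaimed part
            if (substr(w, 1, n) == syl) { init = 1; lo = n + 1 }
            if (hi - lo + 1 >= n && substr(w, hi - n + 1, n) == syl) {
                fin = 1; hi -= n
            }
            for (p = lo; p + n - 1 <= hi; )
                if (substr(w, p, n) == syl) { med++; p += n }
                else p++
        }
        printf "init=%d med=%d fin=%d alone=%d\n", init, med, fin, alone
    }'
}
```

For example, count_positions a abracadabra prints init=1 med=3 fin=1 alone=0, and count_positions a a prints init=0 med=0 fin=0 alone=1.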

I am much more fluent with awk than perl . Will you accept an awk script instead of a perl script?

You didn't answer the question about output order. I assume the output order doesn't matter.

You didn't answer the question about headings in your input files. I assume that there are no headings in either of your input files.

gimley · August 9, 2015, 5:48am

Many thanks for your interest.
A Standalone means that the particular syllable is also a word. To take an example from English, of is a standalone syllable in the word of, but an initial in the word office.

Coming to your second question what I needed and tried to do was identify syllables in the corpus in terms of their positions
An Initial syllable would be a string from the Syllables list which comes in the beginning of the word.
A Medial would be a string from the Syllables list which comes in the middle of the word.
A Final would be a string from the Syllables list which comes at the end of the word.
In all cases I would be working only with the Arabic data and ignore all data to the right hand side of the delimiter

The Script would identify each syllable as per its position and identify the frequency and then provide for each at least six examples [in some cases there would be None or less than 6].
I trust the above clarifies a bit more the issue.
I do not mind an Awk script. In fact I started with AWK but felt the problem at hand was too complex for Awk to handle. I seem to be mistaken.
Many thanks once again for your help which eventually will help the community to develop a standard.

RudiC · August 9, 2015, 8:39am

Here is an awk attempt that addresses some but not all of your conditions/problems:

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if ($1 ~ "^"s)
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s"$")
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," $0
                                         next
                                        }
                         else if ($1 ~ s)
                                        {MID[s]+=gsub(s,s)
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," $0
                                         next
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables FS="=" FREQMX=6 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    1    Example: act=akt
Mid:     2    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

It is tested on an extended version of your samples in post#1; the condition that "Only the largest string from the clusters file will be considered." is covered by having the larger clusters in front of the smaller ones, i.e. "a" is analysed after "ea" and "oa" etc. Unfortunately, some hits on "a" (already, approach) are lost as they are already counted in those clusters ("ea", "oa"). However, if you think this a promising approach, one could try to refine...

---------- Post updated at 14:30 ---------- Previous update was at 14:17 ----------

OK, this one

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         else if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         else if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         else if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=8 corpus
oi
Init:    0    Example: 
Mid:     0    Example: 
Fin:     0    Example: 
Alone:   0
oa
Init:    0    Example: 
Mid:     4    Example: coat=kot,load=lod,approach=eproch,goal=gol
Fin:     0    Example: 
Alone:   0
ai
Init:    0    Example: 
Mid:     4    Example: rain=ren,paint=pent,rail=rel,failure=felyer
Fin:     0    Example: 
Alone:   0
ea
Init:    2    Example: easy=izi,early=erli
Mid:     9    Example: beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred,heading=heding
Fin:     1    Example: sea=si
Alone:   0
ui
Init:    0    Example: 
Mid:     3    Example: juice=jus,fruit=frut,suit=sut
Fin:     0    Example: 
Alone:   0
a
Init:    3    Example: act=akt,approach=eproch,already=alredi
Mid:     1    Example: ball=ball
Fin:     1    Example: beta=bita
Alone:   1    Example: a

should cover the above mentioned problem. Please report back!

---------- Post updated at 14:39 ---------- Previous update was at 14:30 ----------

And this one

awk '
FNR==NR         {SYL[FNR]=$1
                 MX=FNR
                 next  
                }
                {TOTLINE=$0
                 for (c=1; c<=MX; c++)
                        {s=SYL[c]
                         if ($1==s)     {STDAL[s]++
                                         next
                                        }
                         if (gsub ("^"s, "@", $1))
                                        {if (++INIT[s] <= FREQMX)
                                                EXI[s]=EXI[s] "," TOTLINE
                                        }
                         if (gsub (s"$", "@", $1))
                                        {if (++FIN[s] <= FREQMX)
                                                EXF[s]=EXF[s] "," TOTLINE
                                        }
                         if (n=gsub (s, "@", $1))
                                        {MID[s]+=n
                                         if (MID[s] <= FREQMX)
                                                EXM[s]=EXM[s] "," TOTLINE
                                        }

                        }
                }
END             {for (c=1; c<=MX; c++)
                                {s=SYL[c]
                                 print s
                                 printf "Init:  %3d\tExample: %s\n", INIT[s], substr (EXI[s],2)
                                 printf "Mid:   %3d\tExample: %s\n", MID[s],  substr (EXM[s],2)
                                 printf "Fin:   %3d\tExample: %s\n", FIN[s],  substr (EXF[s],2)
                                 printf "Alone: %3d", STDAL[s]
                                        if (STDAL[s] > 0) printf "\tExample: %s", s
                                        printf "\n"
                                }
                }

' syllables~ FS="=" FREQMX=10 corpus
a
Init:    4    Example: act=akt,approach=eproch,alabama=asdfjg,already=alredi
Mid:     3    Example: ball=ball,alabama=asdfjg
Fin:     2    Example: beta=bita,alabama=asdfjg
Alone:   1    Example: a

would cover even the case of "alabama".

Don_Cragun · August 9, 2015, 9:28am

The following seems to address all of your issues (although I had to guess on the format of some things):

#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=${3:-10}
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i > 0 && len > 0; i--)
		for(j = 1; j <= lenc[i]; j++) {
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s, ++sasc[s]] = $0
			} else {if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len--
					if(insc[s] < sm)
						insam[s, ++insc[s]] = $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len--
					if(fisc[s] < sm)
						fisam[s, ++fisc[s]] = $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c
					if(mesc[s] < sm)
						mesam[s, ++mesc[s]] = $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
				sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s", "Initial",
				init[s] ? init[s] : "NONE",
				init[s] ? "\t" : "\n")
			for(k = 1; k <= insc[s]; k++)
				printf("%s%s", insam[s, k],
					k == insc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Medial",
				med[s] ? med[s] : "NONE",
				med[s] ? "\t" : "\n")
			for(k = 1; k <= mesc[s]; k++)
				printf("%s%s", mesam[s, k],
					k == mesc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Final",
				fin[s] ? fin[s] : "NONE",
				fin[s] ? "\t" : "\n")
			for(k = 1; k <= fisc[s]; k++)
				printf("%s%s", fisam[s, k],
					k == fisc[s] ? "\n" : ",")
			printf("%-11s %7s%s", "Standalone",
				sa[s] ? sa[s] : "NONE",
				sa[s] ? "\t" : "\n")
			for(k = 1; k <= sasc[s]; k++)
				printf("%s%s", sasam[s, k],
					k == sasc[s] ? "\n" : ",")
			print ""
		}
	}
}' "$sf" FS="[=]" "$cf"

Note that this uses the default FS when processing syllables, so it ignores extraneous spaces (such as the space after oa ) in your sample syllables file, and uses the = as the field separator for the corpus file.
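The mechanism behind that note is that awk handles a var=value operand at the moment it reaches it on the command line, so an FS assignment placed between two file names only affects the second file. A minimal sketch of just that behaviour, with a made-up function name first_fields:

```shell
# first_fields SYLLABLES_FILE CORPUS_FILE
# Hypothetical helper: prints the first field of every line of both
# files. The first file is split on blanks (default FS); FS="=" takes
# effect only when awk reaches that operand, i.e. before the second file.
first_fields() {
    awk '{ sub(/\r$/, ""); print FILENAME ": [" $1 "]" }' "$1" FS="=" "$2"
}
```

With a syllables line "oa " (trailing space) and a corpus line coat=kot, this prints [oa] for the first file and [coat] for the second, showing that neither the stray space nor the IPA side reaches the first field.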

With the file syllables containing:

a
ue
ueue
oi
oa 
ai
ea
ui

(the lines ue and ueue are additions to your sample). And, with the corpus file containing:

act=akt
ball=ball
beta=bita
coat=kot
load=lod
approach=eproch
goal=gol
rain=ren
paint=pent
rail=rel
failure=felyer
sea=si
beans=bins
easy=izi
please=pliz
beach=bich
leather=lethar
already=alredi
early=erli
break=brek
bread=bred
juice=jus
fruit=frut
suit=sut
queue=ku
query=kwiri
a=a1
a=a2
a=a3
a=a4
a=a5
a=a6
a=a7
a=a8
a=a9
a=a10
a=a11
abracadabra=abracadabra

(again with additions: the lines from queue=ku onward), and with the code above stored in a file named conc that has been made executable, you can see the results from running that code below:

$ ./conc
a:               21
Initial           4	act=akt,approach=eproch,already=alredi,abracadabra=abracadabra
Medial            4	ball=ball,abracadabra=abracadabra
Final             2	beta=bita,abracadabra=abracadabra
Standalone       11	a=a1,a=a2,a=a3,a=a4,a=a5,a=a6,a=a7,a=a8,a=a9,a=a10

ue:               1
Initial        NONE
Medial            1	query=kwiri
Final          NONE
Standalone     NONE

oi:               0
Initial        NONE
Medial         NONE
Final          NONE
Standalone     NONE

oa:               4
Initial        NONE
Medial            4	coat=kot,load=lod,approach=eproch,goal=gol
Final          NONE
Standalone     NONE

ai:               4
Initial        NONE
Medial            4	rain=ren,paint=pent,rail=rel,failure=felyer
Final          NONE
Standalone     NONE

ea:              10
Initial           2	easy=izi,early=erli
Medial            7	beans=bins,please=pliz,beach=bich,leather=lethar,already=alredi,break=brek,bread=bred
Final             1	sea=si
Standalone     NONE

ui:               3
Initial        NONE
Medial            3	juice=jus,fruit=frut,suit=sut
Final          NONE
Standalone     NONE

ueue:             1
Initial        NONE
Medial         NONE
Final             1	queue=ku
Standalone     NONE

$

which I think is close to (if not exactly) what you want.

Unfortunately, it doesn't even come close to working with the data you supplied in the zip file you uploaded. The code has been set up to remove the DOS format <carriage-return> characters in both of your input files, but that can't make up for the fact that your uploaded corpus file contains LOTS of lines with no equal sign characters and LOTS of lines that have a 1st character that is an equal sign. Both files also contain lots of byte sequences that do not form valid UTF-8 characters. So, with the two files you uploaded, it produces the following output:

:             555
Initial         548	=,=,=,=,=,=,=,=,=,=
Medial            6	=,=,=,=,=,=
Final          NONE
Standalone        1	=

:              1
Initial           1	=
Medial         NONE
Final          NONE
Standalone     NONE

:          1367
Initial        1352	=,=,=,=,=,=,=,=,=,=
Medial           14	=,=,=,=,=,=,=,=,=,=
Final             1	=
Standalone     NONE

:           942
Initial         426	=,=,=,=,=,=,=,=,=,=
Medial          449	=,=,=,=,=,=,=,=,=,=
Final            67	=,=,=,=,=,=,=,=,=,=
Standalone     NONE

:          1086
Initial         318	=,=,=,=,=,=,=,=,=,=
Medial          737	=,=,=,=,=,=,=,=,=,=
Final            31	=,=,=,=,=*,=,=,=,=,=
Standalone     NONE

:          7941
Initial        NONE
Medial         4998	=,=,=,=,=,=,=,=,=,=
Final          2943	=,=,=,=,=,=,=,=,=,=
Standalone     NONE

which to me seems to be garbage. I don't know enough about Arabic or Indic to make any guess at whether or not your input files could be cleaned up programmatically. If they can be, I don't have the expertise to do it unless you can provide explicit directions on how to do it.
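A quick structural check of the corpus can at least locate the lines described above. The sketch below is an assumption about what "well formed" means here (exactly one equals sign, with text on both sides); the function name check_corpus is made up, and it does not attempt to detect invalid UTF-8.

```shell
# check_corpus < corpus
# Hypothetical checker: prints "LINE_NUMBER: reason" for each line that
# does not look like word=mapping.
check_corpus() {
    awk '
    {
        sub(/\r$/, "")                  # ignore DOS line endings
        n = gsub(/=/, "=")              # count "=" without changing the line
        if (n == 0)
            print NR ": no equal sign"
        else if (n > 1)
            print NR ": more than one equal sign"
        else if (substr($0, 1, 1) == "=")
            print NR ": nothing before the equal sign"
        else if (substr($0, length($0)) == "=")
            print NR ": nothing after the equal sign"
    }'
}
```

Running check_corpus < corpus prints nothing for a clean file; any line numbers it does print are candidates for manual correction.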

gimley · August 9, 2015, 9:44am

Many thanks for the awk script. It seems to be running. Initially I used a very large syllable list in which some syllables were part of larger syllables and hence did not give any output, since once a larger syllable was admitted a subset of that syllable would automatically be excluded. I assume this is why the large syllable list did not yield results. I also sorted the syllable list on length with the largest first and this did improve the output.
Thanks a lot
p.s. I just got a mail from Mr Don Cragun who has also proposed an awk solution saying that my corpus had flaws. I corrected the same and the output is as desired. Many thanks for your help
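Sorting the syllable list with the longest entries first, as described above, can be sketched in awk without relying on any particular sort options. The function name sort_by_length is made up for the example; entries of equal length keep their original order. Note that some awk implementations count bytes instead of characters in length(), which matters for Arabic script in UTF-8, so the result should be checked against the awk actually in use.

```shell
# sort_by_length < syllables
# Hypothetical helper: prints the syllables longest first, preserving
# the input order among syllables of the same length.
sort_by_length() {
    awk '
    {
        sub(/\r$/, "")
        if ($1 == "") next                 # skip blank lines
        len = length($1)
        if (len > max) max = len
        bucket[len] = bucket[len] $1 "\n"  # append to this length group
    }
    END {
        for (len = max; len >= 1; len--)
            printf "%s", bucket[len]
    }'
}
```

The output can be redirected to a new file and used in place of the original syllable list.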

---------- Post updated at 08:44 AM ---------- Previous update was at 08:39 AM ----------

Thanks a lot. I am so sorry that the corpus data was faulty. I guess I should have checked it out. I removed all garbage from the data and the script ran just fine. I had a syllable list of over 300 syllables and a corpus of around 37000 and the script sent by The data was sent by the community and since it was entered by hand it had flaws. The responsibility is entirely mine.
Your solution as well as RudiC's worked just great. Many thanks once more to the forum for their generous help.

Don_Cragun · August 9, 2015, 2:48pm

You might also want to try the following:

#!/bin/ksh
sf="syllables"
cf="corpus"
sample_max=10
awk -v sm="$sample_max" '
{	gsub(/\r/, "")
}
FNR == NR {
	# Read the list of syllables to be processed...
	len = length($1)
	if(NR == 1) 
		l = L = len
	else	if(l > len)
			l = len
		else if(L < len)
			L = len
	syl[len, ++lenc[len]] = $1
	next
}
{	# Accumulate data from dictionary entries...
	len = length(word = $1)
	for(i = (L > len) ? len : L; i >= l && len >= l; i--)
		for(j = 1; j <= lenc[i]; j++) {
			# If syllables we have matched leave fewer unmatched
			# characters in word than we are currently trying to
			# match, short circuit to a shorter syllable length...
			if(len < i)
				break
			s = syl[i, j]
			if(s == word) {
				# Process standalone match...
				++sa[s]
				len = 0
				if(sasc[s] < sm)
					sasam[s] = (sasc[s]++ ? \
					    sasam[s] "," : "\t") $0
			} else {if(sub("^"s, "\a", word)) {
					# Process initial match...
					++init[s]
					len -= i
					if(insc[s] < sm)
						insam[s] = (insc[s]++ ? \
						    insam[s] "," : "\t") $0
				}
				if(sub(s"$", "\a", word)) {
					# Process finish match...
					++fin[s]
					len -= i
					if(fisc[s] < sm)
						fisam[s] = (fisc[s]++ ? \
						    fisam[s] "," : "\t") $0
				}
				if(c = gsub(s, "\a", word)) {
					# Process medial matches...
					med[s] += c
					len -= c * i
					if(mesc[s] < sm)
						mesam[s] = (mesc[s]++ ? \
						    mesam[s] "," : "\t") $0
				}
			}
		}
}
END {	# Dump collected data...
	for(i = l; i <= L; i++) {
		for(j = 1; j <= lenc[i]; j++) {
			printf("%-11s %7s\n", (s = syl[i, j]) ":",
			    sa[s] + init[s] + fin[s] + med[s])
			printf("%-11s %7s%s\n", "Initial",
			    init[s] ? init[s] : "NONE", insam[s])
			printf("%-11s %7s%s\n", "Medial",
			    med[s] ? med[s] : "NONE", mesam[s])
			printf("%-11s %7s%s\n", "Final",
			    fin[s] ? fin[s] : "NONE", fisam[s])
			printf("%-11s %7s%s\n\n", "Standalone",
			    sa[s] ? sa[s] : "NONE", sasam[s])
		}
	}
}' "$sf" FS="[=]" "$cf"

It produces exactly the same output as my earlier suggestion, but incorporates improvements from RudiC's suggestion and finishes performance enhancements that were incomplete in my earlier post. If you have a lot of relatively short words and some relatively long syllables, words that match relatively long syllables (leaving a small number of unmatched characters), and words that contain a few medium syllables that have been matched (leaving a small number of unmatched characters), this script will run faster. (Further improvements could be made by keeping track of the longest sequence of unmatched characters instead of just keeping track of the number of unmatched characters, but I'll leave that as an exercise for the reader.)
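As a starting point for that exercise, the longest run of unmatched characters can be found by splitting the partly matched word at the placeholder. The sketch below assumes matched syllables have been replaced by "#" (the script above uses "\a" instead) and uses a made-up function name; it is not part of the script above.

```shell
# longest_unmatched MASKED_WORD
# Hypothetical helper: prints the length of the longest stretch of
# characters in MASKED_WORD that contains no "#" placeholder.
longest_unmatched() {
    awk -v w="$1" 'BEGIN {
        n = split(w, run, /#+/)         # pieces between placeholders
        best = 0
        for (i = 1; i <= n; i++)
            if (length(run[i]) > best)
                best = length(run[i])
        print best
    }'
}
```

A syllable of length i cannot match anywhere in the word once this value drops below i, which is a tighter test than comparing i with the total number of unmatched characters.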

Cheers,
Don

gimley · August 9, 2015, 9:35pm

Sorry for the late response: it was night here when you posted this solution. I tested this script and it does give better results. Many thanks for all your help.