Modifying an awk script for syllable splitting

I found this syllable splitter written in awk; the code is given below. The script cuts words and names into syllables. However, it fails when a word contains two consonants that together form a single unit (a digraph such as sh or ph). For example, given these two inputs:

ashford
raphael

The output is:

ashford	as-hford	2	 VC-CCVrC
raphael	rap-ha-el	3	 rVC-CV-VC

instead of

ashford	ash-ford	2	 VCC-CVrC
raphael	ra-pha-el	3	 rV-CCV-VC

How do I modify the code so that sh or ph is treated as a single unit?
I contacted the authors, who have not responded; the code is old and maybe they do not see any merit in changing it.
A single example of the modification, for either ph or sh, would help. I can then extend the code to all other such combinations.
Out of respect for the authors I have removed their names from the script.
Many thanks.
The awk script follows:

# This script reads a tab-separated file and syllabifies the column pointed to by the variable 'phons' (or the first column, by default).
# Usage: gawk -f syll.gk fn > fn.out

BEGIN {
  FS="\t"; 
  OFS="\t";
  
  if (code=="brulex") {
    V="[aiouy�����^eE�AO_]"; # vowels
    C="[ptkbdgfs/vzjmn/shN�]"; # consonants except liquids & semivowels
    C1="[pkbgfs/vzj]";
    L="[lR]"; # liquids 
    Y="[��\377]"; # semi-vowels \377 stands for y-umlaut
    X="[ptkbdgfs/vzjmnN�xlR��\377]"; # all consonants 
  } else { # code == LAIPTTS
    V="[iYeE2591a@oO�uy*]";   # Vowels
    C="[pbmfvtdnNkgszxSZGh/sh]";  # Consonants except liquids & semivowels
    C1="[pkbgfsSvzZ]";
    L="[lR]"; # liquids
    Y="[j8w]"; # semi-vowels
    X="[pbmfvtdnNkgszSZGlRrhxGj8w]";   # all consonants, including semivowels
  }
  if (phons==0) phons=1;
}

{
 a=$phons;
 n=1
}

{
   while (i= match (a, V V)) {
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++; }

  while (i= match(a, V X V)) { 
    a=substr(a,1,i) "-" substr(a,i+1,length(a)); n++}

  while (i=match(a, V Y Y V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++} 

  while (i=match(a, V C Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V L Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++}

  while (i=match(a, V "[td]R" V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V "[td]R" Y V)) {
    a=substr(a,1,i) "-" substr(a,i+1, length(a)); n++} 

  while (i=match(a, V C1 L V)) {
    a=substr(a,1,i) "-" substr (a,i+1,length(a)); n++}

  while (i=match(a, V X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2, length(a)); n++}

  while (i= match(a, V X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

  while (i=match(a, V X X X X X V)) {
    a=substr(a,1,i+1) "-" substr(a,i+2,length(a)); n++}

# suppress the final schwa (^) in some multisyllabic words 
# notr^ -> notR
# ar-bR^   =>  aRbR
  b=gensub(/-([^-]+)\^$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }
# same thing when schwa='*'
  b=gensub(/-([^-]+)\*$/,"\\1",1,a) ;  
  if (b!=a) { # there is a schwa to delete
    a=b; 
    $phons=substr($phons,1,length($phons)-1);
    n--;
      }


# compute the CVY skeleton
  sk= " ";
  for (i=1;i<=length(a);i++) {
    ph=substr(a,i,1);
    if (ph~V) sk=sk"V";
    else if ((ph~C)||(ph~L)) sk=sk"C";
    else if (ph~Y) sk=sk"Y";
    else sk=sk ph;
  }
}

{ print $0,a,n,sk }
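
A side note on the character classes in the BEGIN block: V, C, X and friends are bracket expressions, and a bracket expression always matches exactly one character. So if the /sh at the end of C was meant to add the digraph, it only lists the single characters /, s and h again. A minimal check of that, wrapped in a shell function (the name first_class_match is made up for this illustration):

```shell
first_class_match() {
  awk '{
    C = "[pbmfvtdnNkgszxSZGh/sh]"   # class copied from the LAIPTTS branch above
    if (match($0, C))               # leftmost character that belongs to the class
      print substr($0, RSTART, RLENGTH), RLENGTH
  }'
}

printf 'ashford\n' | first_class_match   # prints "s 1": one character, never "sh"
```

That is why a digraph has to be handled outside the classes, either by inserting the break directly or by mapping the pair to a single symbol first.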

Well, THAT is some piece o' code! I have only a remote idea of how it works and what it does, and I don't pretend this is a generally correct solution, but adding

  gsub (/sh/, "&-",a)
  gsub (/ph/, "-&",a)

just above the first "while (i= match ..." loop will result in

ashford    ash-ford    1     VCC-CVrC
raphael    ra-pha-el    2     rV-CCV-VC

---------- Post updated at 15:53 ---------- Previous update was at 15:36 ----------

And this will correct for the syllable count:

  n+=gsub (/sh/, "&-",a)
  n+=gsub (/ph/, "-&",a)

resulting in

ashford    ash-ford    2     VCC-CVrC
raphael    ra-pha-el    3     rV-CCV-VC
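
The count fix works because gsub returns the number of substitutions it made. A minimal standalone check, assuming one lowercase word per line (the function name mark_digraphs is made up for this illustration). It runs only the two gsub calls, so raphael shows 2 here: the later V V rule that splits a-el is not part of the sketch.

```shell
mark_digraphs() {
  awk '{
    a = $1; n = 1                  # every word starts with one syllable
    n += gsub(/sh/, "&-", a)       # break after each "sh", count the breaks
    n += gsub(/ph/, "-&", a)       # break before each "ph", count the breaks
    print a, n
  }'
}

printf 'ashford\nraphael\n' | mark_digraphs
# ash-ford 2
# ra-phael 2
```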

---------- Post updated at 15:57 ---------- Previous update was at 15:53 ----------

However, with the overall algorithm, YMMV:

reel    re-el    2     rV-VC
real    re-al    2     rV-VC
cooperation    co-o-pe-ra-ti-on    6     cV-V-CV-rV-CV-VC
Liverpool    Li-ver-po-ol    4     LV-CVr-CV-VC

---------- Post updated at 16:07 ---------- Previous update was at 15:57 ----------

And I'm not sure you will like the way it now hyphenates shepherd:

shepherd    sh-e-pherd    3     CC-V-CCVrC
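
One possible way around the fixed break positions, offered as a sketch and not as a tested change to the original script: replace each digraph with a single placeholder character, let the ordinary V/X rules decide where the break goes, then restore the digraph. Everything here is an assumption for illustration: the toy classes V and X, the placeholders S and F, lowercase-only input, and the function name split_with_placeholders. Only three of the original rules are reproduced.

```shell
split_with_placeholders() {
  awk '
  BEGIN {
    V = "[aeiou]"                     # toy vowel class (assumption)
    X = "[bcdfghjklmnpqrstvwxyzSF]"   # toy consonant class plus the two placeholders
  }
  {
    a = $1; n = 1
    gsub(/sh/, "S", a)                # digraph -> single placeholder character
    gsub(/ph/, "F", a)
    # same loop shape as the original rules: V-V, V-XV, VX-XV
    while (i = match(a, V V))     { a = substr(a, 1, i) "-" substr(a, i + 1); n++ }
    while (i = match(a, V X V))   { a = substr(a, 1, i) "-" substr(a, i + 1); n++ }
    while (i = match(a, V X X V)) { a = substr(a, 1, i + 1) "-" substr(a, i + 2); n++ }
    gsub(/S/, "sh", a)                # restore the digraphs
    gsub(/F/, "ph", a)
    print $1, a, n
  }'
}

printf 'ashford\nraphael\nshepherd\n' | split_with_placeholders
# ashford ash-ford 2
# raphael ra-pha-el 3
# shepherd she-pherd 2
```

In the real script the placeholders would have to be characters that are unused in the input and listed in the relevant consonant classes; whether that interacts well with the remaining rules is untested.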

Thanks for the help. I agree that

Shepherd    sh-e-pherd

gets split incorrectly, but at least the pointers you gave allow for a better split.

Hi.

  • This is a file of consonant combinations that I occasionally use:
*       This file contains legitimate letter combinations.
*       I should probably add vowel combinations: "ie", "ey", etc.
*
*       First section, beginning of word.
pt ps
-
*       Second section, beginning and middle.
bl br
ch cl cr
dr
fl fr
gl gn gr
kl kr
pl pr pt
qu
sc sh sl sm sn sp sr st str sw
th tr
wh
-
*       Third section, middle and end word 
bj bs
ct
dg ds
ft
gh
ks
lch lk ls lv
mp ms
nd ng ns nt
ps
rch rk rg rs rt
tch ts
-
*       Fourth section, end of word.
dst dth ght nth rst
-
*       Fifth section, doubled letters.
bb cc dd gg ll mm nn pp ss tt

I use these to compose English-like words, but they may be useful for splitting as well. There might be others that could be added, e.g. "ff".
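
As a rough sketch of how such a file could feed a splitter, the function below loads it into an awk array and reports which listed combinations occur in each input word. It assumes the layout shown above (lines starting with * are comments, a lone - separates sections, everything else is a space-separated list), treats all sections alike, and only looks at substrings of length 2 and 3. The function name find_clusters and the file name in the usage line are made up for this illustration.

```shell
# usage: printf 'strength\n' | find_clusters clusters.txt
find_clusters() {
  awk '
  NR == FNR {                                # first operand: the combinations file
    if ($0 ~ /^\*/ || $0 ~ /^-$/) next       # skip comment and separator lines
    for (k = 1; k <= NF; k++) known[$k] = 1  # remember every listed combination
    next
  }
  {                                          # standard input: one word per line
    out = ""
    for (len = 3; len >= 2; len--)           # longer combinations are reported first
      for (p = 1; p + len - 1 <= length($1); p++) {
        s = substr($1, p, len)
        if (s in known) out = out (out == "" ? "" : " ") s
      }
    print $1 ": " out
  }' "$1" -
}
```

A splitter could then refuse to put a break inside any reported combination, which is the same idea as the sh and ph handling earlier in the thread.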

See also the results of a web search for "split by syllable", such as the page "Syllable Rules: Divide Into Syllables".

Best wishes ... cheers, drl


Many thanks for the useful pointers. I am trying to divide words in Indian languages which are romanised into English. These follow slightly different rules. But some of the rules you have provided apply to the transliterations also. The rules you have provided have given me a better insight into how the splitter should work.