Splitting Concatenated Words With Largest Strings First

gimley · April 5, 2011, 9:09pm

hello,
I had posted earlier help for a script for splitting concatenated words . The script was supposed to read words from a master file and split concatenated words in the slave/input file.
Thanks to the help I got, the following script which works very well was posted. It detects residues by placing a ! before the residual element.
However the script does not take the largest string for splitting which leads to problems.
An example will help:
given that the master file has

narayan 
narayana 
prakash
aprak
ash

In the case of narayanaprakash, I get:

narayan, aprak and ash

instead of

narayana prakash.

How do I get the script to produce the second instead of the first?

Many thanks for all the earlier help and hope this problem of largest string first can be resolved:

#Util to split names which are conjoined
NR==FNR{a[$1]; next}
function lsr(c,p) {
    for(p=length(c);p;p--)
        if(tolower(substr(c,1,p)) in a) break;
    if (p) return substr(c,1,p);
    return "";
}
{while(length) {
    s=lsr($0);
    if (!s) printf "!";
    while (!s && length) {
        printf substr($0,1,1);
        $0=substr($0,2);
        s=lsr($0);
        if (s) printf "! ";
    }
    printf "%s ", s;
    $0=substr($0,length(s)+1)
}
printf "\n"; }

alister · April 5, 2011, 9:18pm

It's a lot easier to read code and sample data if they're surrounded by code tags.

gimley · April 5, 2011, 10:38pm

Sorry am a newbie. Will try and post the tags next time.

Chubler_XL · April 6, 2011, 1:02am

Seems to be working fine for me:

$ cat lookup
narayan 
narayana 
prakash
aprak
ash
 
$ cat raw
narayanaprakash
 
$ awk 'NR==FNR{a[$1]; next}
>  function lsr(c,p) {
>     for(p=length(c);p;p--)
>            if(tolower(substr(c,1,p)) in a) break;
>         if (p) return substr(c,1,p);
>         return "";
>  }
>  {while(length) {
>      s=lsr($0);
>          if(!s) printf "!";
>          while (!s && length) {
>            printf substr($0,1,1);
>            $0=substr($0,2);
>        s=lsr($0);
>            if (s) printf "! ";
>          }
>          printf "%s ", s;
>          $0=substr($0,length(s)+1)
>   }
>   printf "\n" }' lookup raw
narayana prakash

gimley · April 6, 2011, 11:24pm

Hello,
It seems to be working fine. Can I test it thoroughly out and get back to you.
Many thanks.

---------- Post updated at 10:24 PM ---------- Previous update was at 11:09 AM ----------

Hello,
I tested the script and it works fine. I know I asked for the largest string extraction but this proves troublesome at times since the valid words within the master dictionary are eaten up and treated as residue. An example will make this clear. Imagine that the master has the following:
hassam
alikhan
ali
khan
khana

and the slave is made up of a single word to be split: hassamalikhan
Technically with the present script I get hassam alikhan !a
What I need (after checking carefully the data) is an output such as: hassam ali khana
I am sorry the goof up was mine. Instead of asking for a large string search, I should have asked for a valid string search.
Help in the script to get valid strings and split accordingly would be most appreciated.
Many thanks.

Chubler_XL · April 7, 2011, 1:12am

lol, I remember mentioning this exact problem in this post nearly 2 months ago.

The best way to tackle this is to try all possible substrings and select solution with smallest number of residual characters. Let me try and throw something together.

---------- Post updated at 02:36 PM ---------- Previous update was at 01:52 PM ----------

OK have a solution but it's much slower as it has to try all possible combinations:

awk 'NR==FNR{a[$1]; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   t=cs(substr(s,i+1));
   split(t,V,SUBSEP);
   if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
  }
  return bs;
}
{ print clean($0) }' lookup raw

---------- Post updated at 03:12 PM ---------- Previous update was at 02:36 PM ----------

Couple of Performance improvement

No need to check strings longer than longest word
Skip if current mismatch is worse than best found so far

awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s),S,SUBSEP);
   return S[1]
}
function cs(s,i,p,r,b,bs,t) {
 b=9999;
 for(i=length(s)>m?m:length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1));
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
   }
  }
  return bs;
}
{ print clean($0) }'

gimley · April 7, 2011, 12:37pm

Many thanks.
Many thanks. It works pretty well. Sorry could not respond earlier. Am in review meetings the whole day and the only free time-slot is late night (now) or early morning (when I posted the message. Will be free this weekend. But will have tested the code by then.
Many thanks once again for all your hard efforts.

---------- Post updated at 11:37 AM ---------- Previous update was at 11:17 AM ----------

Have tested it on a dictionary (master of around 108527 sorted words) and gave it an input file of around (20 words. It is slow. The first split is spewed out fast but the succeeding ones just don't come out.
I know that it has to test a lot of possible combos and hence the slow speed. Any method of speeding up the script.
Many thanks once more for all your help.

Chubler_XL · April 7, 2011, 7:59pm

Here are some more performance tweaks (it can still be slow if there are lots of residual characters in the string):

awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s,9999),S,SUBSEP);
   return S[1]
}
function cs(s,b,i,p,r,bs,t) {
 for(i=length(s)>m?m:length(s);b && i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1),b-r);
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs=substr(s,1,i)" "V[1] SUBSEP b }
   }
  }
  if (bs == "") return s SUBSEP length(s);
  return bs;
}
{ print clean($0) }' lookup raw

gimley · April 7, 2011, 8:42pm

Hello,
Many thanks for the tweak but all the awk programs which I have signal an error in execution:

'C:\Users\XP-HOME\Desktop>gawk -f splitternew.gk en.dic slave
gawk: splitternew.gk:7:  for(i=length(s)>m?m:length(s);b&&amp;amp;amp;i;i--) {
gawk: splitternew.gk:7:                                          ^ parse error'

The ; is consistently treated as a parse error.
I have been testing the earlier script quite intensively and have found that that the spliter slows down only when
a. the strings are long and
b. when an unknown residue is located in the string to be parsed.
Many thanks for all the help.

Chubler_XL · April 7, 2011, 9:26pm

sorry the forum is expanding & to & for some reason - I fixed original post by putting spaces around the &&.

gimley · April 7, 2011, 11:03pm

Wow, it works and is blazing fast. Many thanks. Will get back to you in case of a bug, but I doubt if there is any.

---------- Post updated at 10:03 PM ---------- Previous update was at 08:36 PM ----------

Hello,
Tested the script
here is the speed: 22000 word split in around half a minute and pretty accurate.

C:\Users\XP-HOME\Desktop>time
The current time is:  8:29:27.99
C:\Users\XP-HOME\Desktop>gawk -f splitternew.gk en.lng hyd 1>telu.txt
C:\Users\XP-HOME\Desktop>time
The current time is:  8:29:55.85

Many thanks. One last request: Tried flagging the residues with a !> But there is no marker for this. Would appreciate if the residues were flagged.
Many thanks once more,

Chubler_XL · April 8, 2011, 12:18am

Flagging residues:

awk 'NR==FNR{a[$1]; m=m<length?length:m; next}
function clean(s) {
   split(cs(s,9999),S,SUBSEP);
   return S[1]
}
function cs(s,b,i,p,r,bs,t) {
 for(i=length(s)>m?m:length(s);b&&i;i--) {
   r=0;
   p=tolower(substr(s,1,i));
   gsub("[-0-9=,():_?\\.]", "",p);
   if(!(p in a)) r=i;
   if(r<b) {
     t=cs(substr(s,i+1),b-r);
     split(t,V,SUBSEP);
     if(r+V[2]<b) { b=r+V[2]; bs= (r?"!":"") substr(s,1,i) (r?"!":"") " "V[1] SUBSEP b }
   }
  }
  if (bs == "") return (s?"!"s"!":"") SUBSEP length(s);
  return bs;
}
{ print clean($0) }' lookup raw

gimley · April 8, 2011, 12:14pm

Hello,
Many thanks. Sorry for the delay, but was testing extensively the script. The script works just fine and the output is perfect, especially when the residue is flagged with a !.
Speedwise it is really blazing fast and can handle all types of long strings.
Many thanks once more for all your help.

ctsgnb · April 8, 2011, 12:48pm

This problem can be ambiguous for example if

you have some valid pattern like :

ana
anak
ali
kali

And you must parse

anakali

you will not be able to determine wheter it should be

ana!kali!

or

anak!ali!

...and in both case the matching is exact (0 residue)

So inconsistencies may theoritically occure however the chosen parsing rules

gimley · April 9, 2011, 2:41am

Hello,
I agree that if there are two correct representations and splits, ONLY one will be picked up and the other ignored/bypassed. To do this, the script would require to make multiple passes and sort out the possible breaks possible and store them in the file. This would mean memory issues.
I don't know how an AWK script could do this. When I made the request for a splitter, this was on my "wish lisT' but I guess it will remain as a wish rather than a reality. I am thankful to the community and especially CHUBLER_XL who helped me get this far.