Splitting Concatenated Words in Input File with Words from a Master File

gimley · February 23, 2011, 6:44pm

Hello,
I have a complex problem. I have a file in which words have been joined together:
Theboy ranslowly
I want to be able to correctly split the words using a lookup file in which all the words occur:
the
boy
ran
slowly
slow
put
child
ly
The lookup file is huge; it is used to correctly segment the input file, which contains "run-on" words. The input file could also be very large.
A run-on word could contain up to three or four words concatenated together.
I have 2 requirements:

1. Only the largest string should be used for splitting. Thus, given that both slow and ly occur, I do not want the split to be:
the boy ran slow ly
but rather:
the boy ran slowly
2. In case a word is not found in the master list, all the other largest strings should still be split out.
E.g., assuming that boy is not in the lookup file, I would still want the cut to be:
The boy ran slowly
i.e. "boy" is flagged as residue and tagged as such if possible.
I have tried to write a program which does this (both in Perl and in AWK), but it just fails and spews out incorrect forms, especially when I try to meet condition 1.
I am still a tyro at Perl and AWK, since all my experience for the past 20 years has been in C, and I am fascinated by AWK as well as Perl because of their speed and elegance.
Help would be most appreciated and gratefully acknowledged, as it would help me learn a new skill. Commented code would be a great learning experience, if someone has the patience to do that for me as well as for other learners like me.
Manythanks, (Many thanks)

GIMLEY

yinyuemi · February 23, 2011, 7:15pm

try:
First, sort your lookup file by word length, longest to shortest, like:
slowly
child
slow
put
the
boy
ran
ly

Then run it:

 awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw
The boy ran slowly
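
The length sort described at the top of this post could be done with a small pipeline. This is only a sketch: the function name sort_by_length and the scratch directory are made up for illustration, and it assumes one word per line in the lookup file.

```shell
# sort_by_length FILE: print the words in FILE, longest first.
# Hypothetical helper, not part of the original post.
sort_by_length() {
  # 1) prefix each non-blank line with the length of its first word
  # 2) sort numerically on that length, descending
  # 3) strip the length column again
  awk 'NF { print length($1), $1 }' "$1" | sort -k1,1nr | cut -d' ' -f2
}

# Demo with the sample word list from the first post.
demo_dir=$(mktemp -d)
printf '%s\n' the boy ran slowly slow put child ly > "$demo_dir/lookup"
sort_by_length "$demo_dir/lookup" > "$demo_dir/new_lookup"
head -n 1 "$demo_dir/new_lookup"    # prints: slowly
```

Words of equal length come out in whatever order sort puts them, which should not matter here because only the longest-first ordering is needed.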

gimley · February 23, 2011, 7:36pm

Hello,
Many thanks for the prompt reply. It does work
But there are two issues:
1. A small glitch: instead of using the largest string,
slowly
it takes slow and ly and breaks up the concatenated sentence as:
the boy ran slow ly.

2. Residual data at the end is identified correctly, but when the unknown word is in the middle, things seem to go wrong:

When I gave the string
theboyranthroughslowly
The output was:
the boy ran throughs low ly
Since low is not in the small lookup file, I am perplexed how it was generated.
Many thanks once more for the script and I hope these two bugs are soluble.

Gimley
This is precisely the problem I have not been able to solve, apart from the residue issue.
An add on to the awk script to handle this would be of great help.

yinyuemi · February 23, 2011, 7:47pm

Hi Gimley,

I have modified the code above; please try it and see how it goes.

Best,

Y

gimley · February 23, 2011, 7:53pm

Hi Yinyuemi,
Many thanks for the timely help. The residue problem seems to be sorted with the new code. However the largest string issue still remains.
I used the code which you had posted (reproduced below)

awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw

And I still get
The boy ran through slow ly
for
theboyranthroughslowly

Sorry to hassle you, but the largest string split is vital for the dictionary work I am doing.
Many thanks once again and hoping to read you,
Best regards,
Gimley

yinyuemi · February 23, 2011, 7:57pm

It seems to work on my computer :confused:

cat lookup
slowly
child
slow
put
the 
boy 
ran
ly
 
cat raw
theboyranthroughslowly
 
awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw
The boy ran through slowly

gimley · February 23, 2011, 8:03pm

Sorry, I did not see the sort step and just blindly copied the code. I get the idea now: it grabs the largest string first.
Two questions:

1. Is there an awk command to handle the largest-to-smallest sort?
2. I gave three sentences, and it worked only on the first. How do I get the code to loop through the whole input file?
Sorry for such stupid questions, but I am still learning awk programming.
Many thanks, and my apologies for not reading through your mail.

Best regards,

Gimley

Chubler_XL · February 23, 2011, 8:06pm

I know this is longer, but I feel it should be safer (no gsub calls):

awk 'NR==FNR{a[$1]; next}
 function lsr(c,p) {
    for(p=length(c);p;p--)
           if(substr(c,1,p) in a) break;
    if (p) return substr(c,1,p);
    return "";
 }
 {IGNORECASE=1;
  while(length) {
     s=lsr($0);
     while (!s) {
         printf substr($0,1,1);
         $0=substr($0,2);
         s=lsr($0);
         if (s) printf " ";
     }
     printf "%s ", s;
     $0=substr($0,length(s)+1)
  }
  printf "\n"; }' lookup raw

gimley · February 23, 2011, 8:10pm

It works beautifully. Many thanks to you and Y for your timely help.
I'll walk through the code and, in case I don't get something, I'll try and hassle the forum for an answer.
Many thanks once again,

Gimley

yinyuemi · February 23, 2011, 8:14pm

awk '{a[$1]=length($1)}END{for(i in a) print a[i],i|"sort -nr"}' lookup |awk '{print $2}' >new_lookup

awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR;next}
{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}{$1=toupper(substr($1,1,1))substr($1,2);print}' new_lookup raw

hopefully it works for you

gimley · February 23, 2011, 8:19pm

Sorry for the hassle. While it worked beautifully for the earlier strings I tried the following:
LOOKUP
subramanian
raghava
rajendra
manian
prasad

INPUT
rajendraprasadsubramaniam
perisubramaniam
rajendraperisubramaniam

The program gave the first answer and then did not progress further. I had to CTRL C to get out of the dos prompt.
Any answer to that please. Many thanks
Gimley

Chubler_XL · February 23, 2011, 8:25pm

Y, your revised solution works for me now, but it doesn't work for this:

lookup

slowball
slowly
play
child
quick
slow
not
put
the 
boy 
ran
ly
is

raw

theboyranthroughslowly
heistoslowtoplayslowball

---------- Post updated at 11:25 AM ---------- Previous update was at 11:20 AM ----------

OK, fixed it now. In my script, change

while (!s) {

to

while (length && !s) {

so that the inner loop stops when unmatched text runs to the end of the line.

yinyuemi · February 23, 2011, 8:37pm

A small change: "gsub" becomes "sub", and "$0=$0" is added so that the record is re-split and NF is updated.

awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR;next}
{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{sub(a[j]," "a[j]" ",$i);$0=$0}}}}{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw

Chubler_XL · February 23, 2011, 9:01pm

Y, did you try it against my posted test data? It's outputting "slow ly" again.

Gimley, this method is still pretty poor. I loaded an English dictionary into lookup (140K words) and ran a test against the ls manual with all spaces taken out. It was quick, but the result is pretty average:

SEE ALSO 
The full documentation for l si s maintained as aTe xi n fo manual . If 
thein fo and l sprog rams are properly installed at yours it e , the com  
man d

This test was useful because I found that IGNORECASE wasn't working properly: it has no effect on array subscript lookups, so the fix below uses tolower instead:

awk 'NR==FNR{a[$1]; next}          # first file: load the lookup words
 # lsr: return the longest prefix of c that is in the lookup, or "" if none
 function lsr(c,p) {
    for(p=length(c);p;p--)
           if(tolower(substr(c,1,p)) in a) break;
    if (p) return substr(c,1,p);
    return "";
 }
 {while(length) {
     s=lsr($0);
     while (!s && length) {          # no word starts here: emit one residue character
         printf "%s", substr($0,1,1);
         $0=substr($0,2);
         s=lsr($0);
         if (s) printf " ";
     }
     printf "%s ", s;
     $0=substr($0,length(s)+1)
  }
  printf "\n"; }' lookup raw

yinyuemi · February 23, 2011, 9:08pm

Hi Chubler XL,

I have modified the code again (above). Can you give it a test? Thanks!

Chubler_XL · February 23, 2011, 9:25pm

I get core dump if input file has a blank line in it:

assertion "(n->flags & WSTRCUR) == 0" failed: file "../gawk-3.1.8/field.c", line 217, function: rebuild_record
Aborted (core dumped)

gimley · February 23, 2011, 9:27pm

Hi Chubler_XL,
Many thanks, but the code still does not handle residue. I gave it as sample:
LOOKUP

prasad
manian
raghava
rajendra
subramanian

INPUT:

rajendraprasadsubramaniam
perisubramaniam
rajendraperisubramaniam

where peri was a "residual element".
The output was

Rajendra prasad subramaniam
Perisubramaniam
Rajendra perisubramaniam

showing that peri was attached to one of the two strings. Any way around this problem ?
Best regards and thanks,

Gimley

yinyuemi · February 23, 2011, 9:31pm

I guess the problem is the word "subramanian" in the lookup file; it should be "subramaniam". Is that right?

Chubler_XL · February 23, 2011, 9:36pm

Yes, lol just found it myself and was posting the same thing.

Gimley, did you read my warning in post #14? The solution still might not be 100% accurate.

Perhaps some sort of least-cost analysis of each replacement could improve the results. It would involve recursive code and may be a fair bit slower.
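
One way such a least-cost analysis might look is sketched below. It is not code from this thread: it swaps the recursion for an iterative table (dynamic programming), and the function name least_cost_split, the cost of 1 per dictionary word, and the penalty RES=100 per unmatched character are all assumptions chosen for illustration. Every split of the line is scored and the cheapest one wins, so unmatched text is minimised first and the number of words second.

```shell
# least_cost_split LOOKUP INPUT: segment each line of INPUT at minimal total cost.
# Hypothetical helper; the cost values are assumptions, not from the thread.
least_cost_split() {
  awk -v RES=100 '
    # First file: load the dictionary and remember the longest word length.
    NR == FNR {
      w = tolower($1)
      if (w != "") { dict[w]; if (length(w) > maxlen) maxlen = length(w) }
      next
    }
    {
      line = $0; n = length(line)
      best[0] = 0                  # best[i] = cheapest cost for the first i characters
      for (i = 1; i <= n; i++) {
        # Option 1: treat character i as residue (expensive).
        best[i] = best[i-1] + RES; back[i] = i - 1; isword[i] = 0
        # Option 2: end a dictionary word at position i (cost 1 per word).
        lo = (i > maxlen) ? i - maxlen : 0
        for (j = lo; j < i; j++)
          if ((tolower(substr(line, j + 1, i - j)) in dict) && best[j] + 1 < best[i]) {
            best[i] = best[j] + 1; back[i] = j; isword[i] = 1
          }
      }
      # Walk the back pointers from the end, gluing residue characters together.
      out = ""; prev_res = 0
      for (i = n; i > 0; i = j) {
        j = back[i]
        tok = substr(line, j + 1, i - j)
        if (!isword[i] && prev_res) out = tok out
        else out = tok (out == "" ? "" : " " out)
        prev_res = !isword[i]
      }
      print out
    }
  ' "$1" "$2"
}

# Demo: a case where taking the longest prefix first goes wrong.
demo_dir=$(mktemp -d)
printf '%s\n' the them men > "$demo_dir/lookup"
printf '%s\n' themen > "$demo_dir/raw"
least_cost_split "$demo_dir/lookup" "$demo_dir/raw"    # prints: the men
```

On this input a longest-prefix split would take "them" first and leave "en" unmatched, whereas the cheapest overall split is "the men". The work per line grows with the line length times the longest dictionary word, so it should be slower than the greedy version but not dramatically so.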

gimley · February 24, 2011, 4:43am

Sorry for the goof-up. I have been up since 4.00 a.m. and I guess that is where the goof-up occurred.
Sorry for the bother. I tested the code and it is working. I will test it against a huge file and let you all know the results.

Many thanks once again for the timely help,

Gimley

---------- Post updated 02-24-11 at 04:43 AM ---------- Previous update was 02-23-11 at 09:38 PM ----------

Dear Chubler_XL,
I have safely tested the script and it runs beautifully. I have also digested the thinking and the commands. One last request: is it possible to add to the code a flag when a residual element is detected, i.e. an element which is not found in the dictionary? This would help me speed up analysis of the data.
Sorry for imposing once again and many many thanks for all the help given. It has been a great learning experience.
p.s.
I am posting the "final" code which is working:

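NR==FNR{a[$1]; next}                 # first file: load the lookup words
# lsr: return the longest prefix of c that is in the lookup, or "" if none
function lsr(c,p) {
    for(p=length(c);p;p--)
        if(tolower(substr(c,1,p)) in a) break;
    if (p) return substr(c,1,p);
    return "";
}
{while(length) {
    s=lsr($0);
    while (!s && length) {           # no word starts here: emit one residue character
        printf "%s", substr($0,1,1);
        $0=substr($0,2);
        s=lsr($0);
        if (s) printf " ";
    }
    printf "%s ", s;
    $0=substr($0,length(s)+1)
 }
 printf "\n"; }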
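
The flag for residual elements requested above could be added along the following lines. This is a sketch under assumptions rather than a tested answer from the thread: the function name split_flag and the square-bracket marker are made up, the lookup is assumed to hold one word per line, and the sample uses the corrected spelling "subramaniam". It keeps the same longest-prefix approach but collects unmatched characters and prints each run in brackets.

```shell
# split_flag LOOKUP INPUT: longest-prefix split, with unmatched text in [brackets].
# Hypothetical helper; the bracket marker is an assumption.
split_flag() {
  awk '
    NR == FNR { dict[tolower($1)]; next }        # first file: the word list
    # Longest prefix of c that is in the word list, or "" if there is none.
    function lsr(c,   p) {
      for (p = length(c); p > 0; p--)
        if (tolower(substr(c, 1, p)) in dict) return substr(c, 1, p)
      return ""
    }
    {
      out = ""
      for (f = 1; f <= NF; f++) {                # handle each run-on word in turn
        rest = $f; residue = ""
        while (rest != "") {
          s = lsr(rest)
          if (s == "") {                         # nothing matches here
            residue = residue substr(rest, 1, 1) # move one character into the residue
            rest = substr(rest, 2)
          } else {
            if (residue != "") { out = out " [" residue "]"; residue = "" }
            out = out " " s
            rest = substr(rest, length(s) + 1)
          }
        }
        if (residue != "") out = out " [" residue "]"   # residue at the end of a word
      }
      print substr(out, 2)                       # drop the leading space
    }
  ' "$1" "$2"
}

# Demo with the names used earlier in the thread.
demo_dir=$(mktemp -d)
printf '%s\n' prasad manian raghava rajendra subramaniam > "$demo_dir/lookup"
printf '%s\n' rajendraprasadsubramaniam perisubramaniam rajendraperisubramaniam > "$demo_dir/raw"
split_flag "$demo_dir/lookup" "$demo_dir/raw"
# prints:
# rajendra prasad subramaniam
# [peri] subramaniam
# rajendra [peri] subramaniam
```

The flagged lines can then be pulled out for review with something like grep -F '[' on the output.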