Splitting concatenated words in input file with words from the same file

gimley · May 2, 2012, 8:59pm

Dear all,
I am working with names and have a large file of names in which some words are written together (up to 4 or 5), while their corresponding single forms are also present in the word list.
An example should make this clear:

annamarie
mariechristine
johnsmith
johnjosephsmith
john
smith
anna
marie
mary
christine

The program should split the words in the list based on the single forms that are present. Thus:

annamarie anna marie
mariechristine marie christine
johnsmith john smith
johnjosephsmith

In the case of the last, since joseph is missing, the program could suitably tag the missing element and show the word as

john !joseph! smith

The script would prove especially helpful in separating words in languages such as German, which have a large number of compound words.
Could the awk script posted on this site (thanks to yinyuemi), which I am posting below and which does something similar but takes its words from an external dictionary, be modified to work within the same file instead? I have tried to modify it, but it just does not work.

awk 'NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{IGNORECASE=1;{for (j=1;j<=x;j++){for(i=1;i<=NF;i++) if(length($i)>length(a[j]) && !($i in b) && $i~a[j] && $i!=a[j])
{gsub(a[j]," "a[j]" ",$0)}}}}END{$1=toupper(substr($1,1,1))substr($1,2);print}' lookup raw

Any help given would be gratefully acknowledged.

Chubler_XL · May 2, 2012, 11:31pm

How about this:

awk '
NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
  IGNORECASE=1;
  for(j=1;j<=x;j++){
      for(i=1;i<=NF;i++) {
          if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
             gsub(a[j]," "a[j]" ",$0)
          }
      }
      for(i=1;i<=NF;i++)
         printf (i>1?" ":"") (($i in b)?$i:"!"$i"!")
         print ""
}' infile infile

gimley · May 3, 2012, 12:08am

Many thanks. I copied the script and ran it on the file which I had proposed as a sample, but I got no results.
Have I done something wrong? I am on Windows and maybe this is the cause, but awk/gawk should run in any environment.
It is tantalising to see a solution and not be able to use it.
Many thanks once more for your kind help.

Chubler_XL · May 3, 2012, 12:21am

Make sure you copy the solution exactly as it appears (including the file name twice at the end of the line):

Output:

anna marie
marie christine
john smith
john !joseph! smith
john
smith
anna
marie
mary
christine

gimley · May 3, 2012, 2:20am

Many thanks for taking the trouble to help me out. I copied the program as such retaining the instructions you had given:

When I retain the

, I get the following error message:

gawk: singlesplit.awk:13: }' infile infile
gawk: singlesplit.awk:13:  ^ Invalid char ''' in expression

When I do away with the

, I get no response: the output file does not pop up on the screen.
Is there a problem in how I copied the code? This is what I copied:

NR==FNR{a[NR]=$1;b[$1]=1;x=NR}
NR>FNR{
  IGNORECASE=1;
  for(j=1;j<=x;j++){
      for(i=1;i<=NF;i++) {
          if(length($i)>length(a[j]) && $i~a[j] && $i!=a[j])
             gsub(a[j]," "a[j]" ",$0)
          }
      }
      for(i=1;i<=NF;i++)
         printf (i>1?" ":"") (($i in b)?$i:"!"$i"!")
         print ""
}' infile infile

Sorry to hassle you like this, but I am really desperate to get the solution.
Many thanks once again for your patience and kindness.

Chubler_XL · May 3, 2012, 6:43pm

OK, I think I know what is going on: you are putting the awk code in a file and then calling it with the awk -f progfile option.

Remove ' infile infile from your singlesplit.awk program file and call awk like this:

awk -f singlesplit.awk infile infile
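
To make the difference concrete, here is a minimal sketch of the same tiny two-pass program run both ways. The temporary files and the variable names (words, prog, inline_out, file_out) are made up for this illustration and are not part of the original script. The point is that the single quotes and the file names belong to the shell command line, so they must not appear inside the program file.

```shell
# Hypothetical temporary files, used only for this illustration.
words=$(mktemp)
prog=$(mktemp)
printf '%s\n' john smith johnsmith > "$words"

# 1) Inline: the single quotes are shell syntax wrapping the awk program,
#    and the data file is named twice so awk reads it in two passes.
inline_out=$(awk 'NR==FNR { n++; next } FNR==1 { print n " words read in pass 1" }' "$words" "$words")

# 2) Program file: the same awk text, without the quotes or the file names.
printf '%s\n' 'NR==FNR { n++; next } FNR==1 { print n " words read in pass 1" }' > "$prog"
file_out=$(awk -f "$prog" "$words" "$words")

echo "$inline_out"
echo "$file_out"
rm -f "$words" "$prog"
```

Both invocations should print the same line, because awk sees identical program text in each case.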

gimley · May 3, 2012, 11:05pm

Many thanks. You made my day. The script works. I should have thought of removing the infile infile and giving them at command prompt.
Many thanks once again for all your kind help and your patience.

---------- Post updated at 10:05 PM ---------- Previous update was at 07:42 PM ----------

Sorry to sound ungrateful. The script works, but my file is around 300 thousand words and the script is very slow.
Is there any means of speeding it up, such as an array or some such device? Many thanks for all your help and sorry to pester you like this.

shamrock · May 3, 2012, 11:07pm

Yet another way to do the same thing without reading the input file twice...

awk '{
    a[$0] = $0    # every input line, used as the dictionary
    x[$0] = $0    # working copy that receives the inserted spaces
} END {
    for (i in a) {
        for (j in x)
            if (length(a[i]) < length(x[j]))
               if (gsub(a[i], " " a[i] " ", x[j]))
                  z[j] = x[j]
    }
    for (i in z) {
        r = ""    # reset the output line for each word
        m = split(z[i], u, " ")
        for (j = 1; j <= m; j++) {
            if (u[j] in a)
               r = r ? r " " u[j] : u[j]
            else
               r = r ? r " !" u[j] "!" : "!" u[j] "!"
        }
        print r
    }
}' file

Chubler_XL · May 8, 2012, 12:13am

This is because the solution involves many sequential searches through the whole array.

Remember the solution we used in your Splitting concatenated words thread? We split each input line up into a number of substrings and did a direct lookup on each. I think a similar solution is required here.
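
As a rough illustration of the direct-lookup idea (a simplified sketch, not the full solution): store every word as an array key, then test prefixes of each word with the in operator instead of looping over the whole word list. The function name split_once and the array name known are made up for this example, and it only handles a split into exactly two known parts.

```shell
# split_once FILE: print each word of FILE, split in two when both halves
# are themselves words in FILE. Hypothetical helper, two-part splits only.
split_once() {
  awk '
    NR==FNR { known[$1]; next }          # pass 1: store every word as an array key
    {
      for (p = 1; p < length($1); p++)   # pass 2: try each proper prefix
        if ((substr($1, 1, p) in known) && (substr($1, p + 1) in known)) {
          print substr($1, 1, p), substr($1, p + 1)
          next
        }
      print $1                           # no two-part split found
    }' "$1" "$1"
}
```

Each word then costs at most length-1 array lookups, however many words the file holds, whereas the earlier script ran a regex match against every entry for every line.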

Try this:

NR==FNR{for(i=1;i<=NF;i++)a[$i]; next}
function lsr(c,p) {
    for(p=1;p<=length(c);p++)
        if(tolower(substr(c,1,p)) in a) break;
    if (p<=length(c)) return substr(c,1,p);
    return "";
}
{
 for(i=1;i<=NF;i++) {
    A=$i
    while(length(A)) {
      s=lsr(A);
      if (!s) printf "!";
      while (!s && length(A)) {
        printf "%s", substr(A,1,1);
        A=substr(A,2);
        s=lsr(A);
        if (s || !length(A)) printf "! ";
      }
      printf "%s ", s;
      A=substr(A,length(s)+1)
    }
  }
  printf "\n";