Cleaning a stemmer dictionary with perl or awk

gimley · May 26, 2013, 9:24pm

Hello,
I work under Windows Vista and I am compiling an open-source stemmer dictionary for English and eventually for other Indian languages. The engine I have written outputs all lemmatised/expanded forms of the words: nouns, adjectives, adverbs etc. Each set of expanded forms is separated from the next by a blank line. Since each root word was treated as a separate entity according to its grammatical function, the same root sometimes appears in more than one set, with overlapping forms.
An example will make this clear:

coil
coiled
coiling
coils

coil
coils

coin's
coin
coins
coins'

coin
coined
coining
coins

As can be seen, two sets each have been created for coil and coin. Since each pair shares the same root word, the two sets should have been merged, but for the reason given above they are treated as separate entities.
Is it possible to write a script that goes through the sets and, whenever a word occurs in both set A and set B, merges the two sets, removes the duplicate forms and, if possible, sorts the result?
The output of the above would look something like this:

coil
coiled
coiling
coils

coin's
coin
coins
coins'
coined
coining

The sets are not necessarily contiguous and at times could be separated by another set of words.
Since the data is huge, a perl or awk script would go a long way in speeding up the process.
Many thanks in advance for helping with work that will aid researchers in creating better stemmers for English and other languages.

Chubler_XL · May 26, 2013, 10:37pm

No sorting, but this should merge your forms and remove duplicates:

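awk '
BEGIN {RS=""}    # paragraph mode: each blank-line-separated set is one record
{ root=$1;
  # reuse the root of a word already seen in an earlier set
  for(i=1;i<=NF;i++) if($i in LEM) root=LEM[$i]
  # file every word not seen before under that root
  for(i=1;i<=NF;i++) if(!($i in LEM)) {
      LEM[$i]=root
      base[root]=base[root] OFS $i
  }
}
END {
  for(w in base) {
    forms=split(base[w], form);
    for(i=1;i<=forms;i++)
      if(length(form[i])) print form[i];
    print "";
  }
}' infile > outfile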
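
For anyone wondering how the script sees the sets: with RS set to the empty string, awk reads in paragraph mode, so each blank-line-separated set arrives as one record and each word as one field. A minimal sketch on a made-up two-set sample (the sample words and the printed message are only for illustration, not part of the script above):

```shell
# Two tiny sets separated by a blank line, piped straight into awk.
out=$(printf 'coil\ncoiled\n\ncoin\ncoins\n' |
  awk 'BEGIN { RS = "" } { print "set " NR ": " NF " words, first = " $1 }')
printf '%s\n' "$out"
```

Each set is reported as a single record, which is why the script can loop over $1..$NF to visit every form in a set.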

gimley · May 26, 2013, 10:44pm

Many thanks. It worked beautifully. No hassles about the sort. I can do that very easily by creating a new script.

Chubler_XL · May 26, 2013, 10:47pm

Here is a version that sorts:

awk '
BEGIN {RS=""}
{ root=$1;
  for(i=1;i<=NF;i++) if($i in LEM) root=LEM[$i]
  for(i=1;i<=NF;i++) if(!($i in LEM)) {
      LEM[$i]=root
      base[root]=base[root] OFS $i
  }
}
END {
  for(w in base) {
    forms=split(base[w], form);
    # prefix each form with its root so that sort keeps every group together
    for(i=1;i<=forms;i++)
      if(length(form[i])) print w","form[i];
    # this line has no comma, so the last awk turns it into a blank separator
    print w"?";
  }
}' infile | sort | awk -F, '{ print $2}'
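
One case to watch for with non-contiguous sets: as far as I can tell, the scripts above file a set under the root of one word they have already seen, so a later set that contains words from two existing groups does not join those groups. If your data can contain such linking sets, a union-find style variant may help. The following is only a sketch under assumptions: the file name bridge_sample.txt, the array name parent and the function find are made up for illustration, words are assumed to contain no tab characters, and there is no path compression, so very long chains could be slow on huge data.

```shell
# Sample input: the first two sets share no word until the third set links them.
cat > bridge_sample.txt <<'EOF'
coin
coins

coined
coining

coins
coined
EOF

merged=$(awk '
function find(x) {               # follow parent links up to the group representative
  while (parent[x] != x) x = parent[x]
  return x
}
BEGIN { RS = "" }                # paragraph mode: one set per record
{
  for (i = 1; i <= NF; i++)
    if (!($i in parent)) parent[$i] = $i   # a new word starts as its own group
  r = find($1)
  for (i = 2; i <= NF; i++) {    # join the group of every word to the group of the first
    s = find($i)
    if (s != r) parent[s] = r
  }
}
END {
  for (w in parent) print find(w) "\t" w   # representative, tab, word
}' bridge_sample.txt | LC_ALL=C sort -u | awk -F'\t' '
  NR > 1 && $1 != prev { print "" }        # blank line between groups
  { print $2; prev = $1 }')
printf '%s\n' "$merged"
```

On this sample all four forms should come out as a single sorted group. The tab separator is used because it sorts before any printable character in the C locale, which keeps each group contiguous after the sort.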

gimley · May 26, 2013, 10:49pm

Many thanks. I tested it out on a small sample and it sorts just great.