Identifying suffixes in a file and printing them out

gimley · April 4, 2012, 12:14pm

Hello,
I am interested in finding and identifying suffixes for Indian names through an awk script or a perl program. Suffixes normally are found at the end of a word as is shown in the sample given below.
What I need is a Perl script which will identify suffixes of a defined length, given on the command line, and write them out to a separate file (if possible with their frequency).
My perl and awk scripting skills do not go that far and hence this request for help for an interesting problem which could have utility in other cases also.
The script should identify suffixes more than two characters in length.
A sample is given below:

chandrashekhar
hansa
hansaben
hemant
hemantbhai
hemaprasad
mohanchandra
raj
rajchandra
rajprasad
rajshekhar
sharadbhai
shardaben

The expected output in a separate file would be

ben	2
bhai	2
chandra	2
prasad	2
shekhar	2

Many thanks for any help given. The database is large: around 80,000 words.
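The request as stated (endings of a given length, with their frequency) can be sketched directly in awk. This is only a minimal sketch under stated assumptions: the function name `suffix_freq`, the rule that a name must be longer than the ending, and the cut-off of two occurrences are illustrative choices, not something specified in the post.

```shell
# suffix_freq LEN: read one name per line on stdin and print every
# LEN-character ending that occurs in at least two names, with its count.
# (Function name and the "at least two" threshold are assumptions.)
suffix_freq() {
  awk -v len="$1" '
    length($1) > len {                 # skip names too short to carry an ending of this size
      count[substr($1, length($1) - len + 1)]++
    }
    END {
      for (s in count)
        if (count[s] >= 2) print s "\t" count[s]
    }
  ' | LC_ALL=C sort
}
```

Called as `suffix_freq 4 < names.txt > suffixes.txt` (both file names hypothetical), it writes the four-letter endings to a separate file. On the sample above that includes `bhai`, but also fragments such as `ndra`, so the result still has to be weeded by hand.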

balajesuri · April 4, 2012, 12:23pm

How would one know which suffixes to look for? For example, you're looking for the suffix 'ben' in a file containing a list of names. How would one know it's 'ben' or 'bhai' that is to be looked for? Are these suffixes defined in a separate file?

gimley · April 4, 2012, 8:18pm

Many thanks for your query.
The answer is: unfortunately, no. I agree that this is a major issue. I have written a rev sort in Perl which sorts the words in reverse order, and tried to extract the suffixes from the list: a laborious and tedious task. This is why I thought of trying to get results programmatically. I know that I will not always get the right answers, but the false positives can always be weeded out.
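A reverse sort of this kind can also be put together from standard tools. The following is a minimal sketch, assuming "rev sort" means ordering the names by their reversed spelling so that names with the same ending land next to each other; `rev_sort` is a made-up name and this is not the Perl program mentioned above.

```shell
# rev_sort: read one name per line on stdin and print the names ordered
# by their reversed spelling, so that shared endings become adjacent.
rev_sort() {
  awk '{
    r = ""
    for (i = length($0); i > 0; i--)   # build the reversed name character by character
      r = r substr($0, i, 1)
    print r "\t" $0                    # reversed key, a tab, then the original name
  }' | LC_ALL=C sort | cut -f2-        # sort on the key, then drop it again
}
```

For the sample list, `rev_sort < infile` starts with rajchandra and mohanchandra and puts shardaben next to hansaben, which makes the shared endings easier to spot by eye.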

balajesuri · April 4, 2012, 9:02pm

What logic did you use to extract suffixes from names?

They're going to be of variable length.
There could be names without a suffix, e.g., 'raj'. How would one make the computer understand "raj doesn't contain a suffix, so leave it" ?

It would be easier if you can get the list of suffixes you're looking for. You don't suppose it would be a huge list, do you?

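#!/usr/bin/perl
use strict;
use warnings;

# Place all the required suffixes in this list.
# (A '#' inside qw// is not a comment, so the note has to stay outside it.)
my @suffixes = qw/
ben
bhai
chandra
prasad
shekhar
/;

my %count;

# This file contains the names in which suffixes are to be looked for.
open my $in, '<', 'inputfile.txt' or die "Cannot open inputfile.txt: $!";
while (my $name = <$in>) {
    chomp $name;
    for my $suffix (@suffixes) {
        # \Q...\E matches the suffix literally; $ anchors it to the end of the name.
        $count{$suffix}++ if $name =~ /\Q$suffix\E$/;
    }
}
close $in;

for my $suffix (sort keys %count) {
    print "$suffix $count{$suffix}\n";
}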

$ ./test.pl
ben 2
bhai 2
chandra 2
prasad 2
shekhar 2

Scrutinizer · April 5, 2012, 1:36am

Or perhaps the other way around: we can find suffixes if the list also contains the names without those suffixes. So ben, bhai and chandra can easily be found, but finding prasad and shekhar is more difficult, since there is no name without those suffixes. Another complication would be a name like hemaprasad, which I presume is short for hemantprasad (I am just guessing, I am not Indian), but how does the algorithm know?

Anyway, as long as there are "names without suffixes" for every "name with suffix" present ("easy ones"), this algorithm might find the right result:

sort -u infile |                       # first do a unique sort of the input file and pipe that into awk as the first file (at the point of - )
awk '
NR==FNR{                               # if we are processing the first file
  if(p && $1~"^"p){                    # if a previous name exists and there is a match at the beginning
    sub(p,x,$1)                        # then delete the match from the word (x is unset, so it is replaced by nothing)
    S[$1]                              # and store the result as a suffix in array S
  }
  else
    p=$1                               # else set the previous name to $1
  next                                 # process the next line
}
{
  for(i in S) if($1~i"$"){             # This is the second file: for every name, if there is a partial match at the end with
    S[i]++                             # one of the suffixes then increase its incidence..
    next                               # process the next line
  }
}
END{
  for(i in S)print i,S[i]              # Print out all the suffixes and the incidences..
}
' - infile                             # use the unique sort as the first file (-) and the file itself as the second.

output:

prasad 2
ben 2
bhai 2
chandra 2
shekhar 2

This algorithm could be further optimized by sorting the suffix array such that a longest match in the second part of the script is always found first..
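One way to get the "longest match first" behaviour is sketched below. It swaps the technique: instead of sorting the suffix array, it scans all candidate suffixes for each name and keeps the longest one that fits, which has the same effect and does not depend on awk's unspecified `for (i in S)` order. The function name `count_longest` and the two-file interface (a suffix list, then the names) are assumptions for illustration.

```shell
# count_longest SUFFIXES NAMES: credit each name to the longest suffix
# it ends with. SUFFIXES holds one candidate suffix per line.
count_longest() {
  awk '
    NR == FNR { suf[$1]; next }        # first file: remember every candidate suffix
    {
      best = ""
      for (s in suf) {
        n = length(s)
        # literal comparison of the last n characters, no regex involved
        if (n < length($1) && n > length(best) && substr($1, length($1) - n + 1) == s)
          best = s
      }
      if (best != "") count[best]++
    }
    END { for (s in count) print s, count[s] }
  ' "$1" "$2" | LC_ALL=C sort
}
```

With both `ra` and `chandra` in the suffix list, rajchandra is then counted under chandra only, whereas a first-match loop could count it under either, depending on traversal order.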