Hello,
I am writing an open-source stemmer in Java for Indic languages which admit a large number of suffixes.
The Java stemmer requires that each suffix string be sorted as per its length and that all strings of the same length are arranged in a single group, sorted alphabetically. Moreover as a header I need to specify the numeric value of the string, say
5
6
7
8
etc.
Since the languages in question have over 300 and more suffixes, trying to sort on length and identifying the length of each string and counting it becomes a difficult issue.
An example will make this clear.
Input:
Expected output
1
2
3
4
5
Since handling such a large database is laborious, is it possible to write a script in AWK or PERL which would enable the above output.
Your help would go a long way in putting java-based stemmers in different languages in the open-source community.
Many thanks in advance for your kind help