Sorting on length with identification of number of characters

gimley · January 19, 2013, 7:59am

Hello,
I am writing an open-source stemmer in Java for Indic languages, which admit a large number of suffixes.
The stemmer requires the suffix strings to be sorted by length, with all strings of the same length arranged in a single group and sorted alphabetically. Moreover, each group needs a header giving the length of its strings, say

5
6
7
8
etc.

Since the languages in question have more than 300 suffixes, sorting them by length and working out the length of each string by hand becomes difficult.
An example will make this clear.
Input:

Expected output

Since handling such a large database by hand is laborious, is it possible to write a script in AWK or Perl that would produce the above output?
Your help would go a long way towards making Java-based stemmers for different languages available to the open-source community.
Many thanks in advance for your kind help.

Scrutinizer · January 19, 2013, 8:17am

Try:

gawk '{print length, $1}' infile | sort -n | gawk '$1!=p{print $1}{print $2; p=$1}'

You would need to use a version of awk that correctly counts multi-byte characters:
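A quick way to check a given awk is to feed it one non-ASCII character and print its length. The sketch below is only an illustration: the helper name awk_length_mode is made up for this post, and it assumes the input is UTF-8 and that the locale in effect is a UTF-8 one. As far as I know, gawk counts characters in a UTF-8 locale, while some other awks (mawk, for example) count bytes.

```shell
# Hypothetical helper (not from the thread): report whether an awk
# measures a non-ASCII string in characters or in bytes.
# Usage: awk_length_mode [awk-command]   (defaults to "awk")
awk_length_mode() {
  # The octal escapes are the UTF-8 encoding of the single
  # Devanagari letter KA (U+0915): one character, three bytes.
  n=$(printf '\340\244\225\n' | "${1:-awk}" '{ print length($0) }')
  case "$n" in
    1) echo "characters" ;;
    3) echo "bytes" ;;
    *) echo "unknown ($n)" ;;
  esac
}

awk_length_mode
```

If this prints "bytes", the length headers produced by the one-liner above will be too large for Indic suffixes, so it is worth checking before trusting the output.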

gimley · January 19, 2013, 11:36pm

Many thanks, it worked. However, since gawk under Windows does not allow pipes, I had to create three scripts:
one for computing the length, another for sorting, and a third for printing out the length headers.
Desperate situations call for desperate measures. Is there any way I can handle pipes under Windows?

drl · January 20, 2013, 12:22pm

Hi.

I thought cmd and command in MS Windows could handle simple pipes, like dir | more. Even if they do, you may not find sort and the other utilities available.

So ... see Cygwin for a very complete solution.
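If installing Cygwin is not an option, the whole job can also be done inside a single awk program, so that neither pipes nor an external sort are needed. What follows is a minimal sketch, not a tested replacement for the pipeline earlier in the thread: the function name group_by_length is invented here, the sort is a plain insertion sort (fine for a few hundred suffixes), and "alphabetical" means awk's own string comparison, which may not match the collation you want for a particular script. The multi-byte caveat about length still applies.

```shell
# Hypothetical helper (not from the thread): group strings by length
# using awk alone, with no pipes and no external sort.
# Usage: group_by_length FILE...   (reads stdin when no file is given)
group_by_length() {
  awk '
    # Collect the first field of every non-empty line.
    NF { n++; s[n] = $1 }

    END {
      # Insertion sort: shorter strings first, ties in string order.
      for (i = 2; i <= n; i++) {
        v = s[i]
        for (j = i - 1; j >= 1 && after(s[j], v); j--)
          s[j + 1] = s[j]
        s[j + 1] = v
      }
      # Print each length once as a header, then the strings of that length.
      prev = -1
      for (i = 1; i <= n; i++) {
        len = length(s[i])
        if (len != prev) { print len; prev = len }
        print s[i]
      }
    }

    # True when string a must be placed after string b.
    function after(a, b) {
      if (length(a) != length(b)) return length(a) > length(b)
      # Appending "" forces a string (not numeric) comparison.
      return (a "") > (b "")
    }
  ' "$@"
}
```

Under Windows, the awk program between the single quotes could be saved to a file of its own and run with gawk -f, which avoids both the pipe and the cmd quoting problems. The file names are up to you, for example group_by_length suffixes.txt > grouped.txt in a Unix-like shell.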

Best wishes ... cheers, drl