Sorting by length

khoremand · October 31, 2012, 11:53pm

Hello,
I have a very large file: a dictionary of around 40,000 headwords. I would like to sort it by headword length, i.e. the longest string first and the shortest last.
I have hunted on the forum for a Perl or awk script that can do the job, but found none.
I am a newbie to Perl and more accustomed to C programming. The C program I wrote takes ages, and I believe Perl or awk are blazing fast.
Could anybody provide me with a script and, if possible, add some useful comments so that I can start writing scripts on my own?
Many thanks for your kind help.

Yoda · November 1, 2012, 12:12am

This thread might help.

Note that for reverse order sorting you can use sort -nr

drl · November 1, 2012, 7:14am

Hi.

The available utility msort allows a number of different comparison types, and among them is string length. Here's an example on a sample of dictionary data:

#!/usr/bin/env bash

# @(#) s1	Demonstrate sort lines by length, msort.
# See: http://freecode.com/projects/msort

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C msort

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results from msort:"
msort -q --line -n 1,1 --comparison-type size $FILE

exit 0

producing:

% ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
bash GNU bash 3.2.39
msort 8.44

-----
 Input data file data1:
camera
excitatory
incense
liken
offertory
peregrine
prairie
proportion
redwood
Riemannian

-----
 Results from msort:
liken
camera
redwood
incense
prairie
offertory
peregrine
excitatory
proportion
Riemannian

If msort is not in your system repository, see the URL in the script comments.

Best wishes ... cheers, drl

ripat · November 1, 2012, 8:08am

awk oneliner alternative

awk '{w[length] = w[length] ? w[length]"\n"$0 : $0} END {for(l in w) print w[l]}' file

khoremand · November 1, 2012, 8:45am

Hello,
I would really appreciate it if you could comment the first part of the code:

{w[length] = w[length] ? w[length]"\n"$0 : $0} END

Suppose I wanted to tweak it to sort from smallest to largest, how would I do it?
I am still a newbie in awk and Perl, and this is a great learning experience.

ripat · November 1, 2012, 8:53am

For the reverse order, just pipe the output into tac:

awk '{w[length] = w[length] ? w[length]"\n"$0 : $0} END {for(l in w) print w[l]}' file | tac

length() is an awk function that returns the, ahem, length of a string. With no argument, it returns the length of $0.

w is an array whose indexes are word lengths. Every time a word of a given length is seen, it is appended to the w entry for that index, unless that length is seen for the first time, in which case w[length] is initialized with $0. The ternary expression w[length] = w[length] ? w[length]"\n"$0 : $0 could be written as:

if (w[length]) {
    w[length]=w[length]"\n"$0
} else {
    w[length]=$0
}
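
To make the grouping visible, here is a small sketch that prints what each w entry ends up holding. The function name group_by_length and the comma separator are just for illustration (they are not part of the one-liner above), and the lengths are walked with a counting loop so the output order does not depend on how awk orders array indexes.

```shell
#!/bin/sh
# Sketch only: group_by_length is a hypothetical helper, not from the thread.
# It reads one word per line on stdin and prints "length: words" per group.
group_by_length() {
    awk '
    {
        n = length($0)              # length of the current line
        if (n > max) max = n        # remember the longest length seen so far
        if (n in w)
            w[n] = w[n] "," $0      # this length was seen before: append
        else
            w[n] = $0               # first word of this length: initialize
    }
    END {
        # Count upwards instead of using for (n in w)
        for (n = 1; n <= max; n++)
            if (n in w) print n ": " w[n]
    }'
}

printf '%s\n' camera liken redwood incense | group_by_length
# prints:
# 5: liken
# 6: camera
# 7: redwood,incense
```

Words of the same length stay in the order they were read, because each one is appended to the end of its group.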

Scrutinizer · November 1, 2012, 9:02am

Or, instead of tac, expanding on ripat's suggestion:

awk '{l=length; if(l>m)m=l; w[l]=w[l] $0 RS} END{for(l=m;l>=1;l--) if(w[l])printf "%s",w[l]}' infile
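
Spelled out with comments, and with a switch for the smallest-to-largest order asked about earlier, the same idea might look like the sketch below. The wrapper name by_length and its asc/desc argument are my own assumptions for illustration; only the awk logic follows the one-liner.

```shell
#!/bin/sh
# Sketch only: by_length and its "asc"/"desc" argument are hypothetical.
# Reads lines on stdin; prints them grouped by length in the requested order.
by_length() {
    awk -v order="$1" '
    {
        l = length($0)          # length of the current line
        if (l > m) m = l        # m tracks the maximum length seen
        w[l] = w[l] $0 RS       # append the line plus a newline (RS) to its group
    }
    END {
        if (order == "asc") {
            # shortest first: count up from 1 to the maximum length
            for (l = 1; l <= m; l++) if (l in w) printf "%s", w[l]
        } else {
            # longest first: count down from the maximum length
            for (l = m; l >= 1; l--) if (l in w) printf "%s", w[l]
        }
    }'
}

printf '%s\n' liken camera excitatory redwood incense | by_length desc
```

Because the loop bounds are plain numbers, the result does not depend on the order in which awk happens to store the array.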

ripat · November 1, 2012, 9:09am

Nice trick for the reverse order!

Edit:
My solution above does not return the expected result on large dictionaries. The w array gets out of sequence. Updated version inspired by Scrutinizer's backwards loop:

awk '{le=length; if(le>m)m=le; w[le] = w[le] ? w[le]"\n"$0 : $0} END {for(i=m;i>=1;i--) if(i in w) print w[i]}' file

alister · November 1, 2012, 12:27pm

For your specific AWK implementation, the number of members in the array may affect the order in which the members are retrieved, but, more generally, the problem is that you were depending on behavior the standard leaves unspecified. A solution that gives the desired result with gawk could fail on nawk. A solution that works with version N of some awk implementation could fail on version N+1 of that same implementation. And in none of those cases would the implementation be violating the standard.

From Opengroup :: AWK:

The OP never stated their operating system. If it's not GNU/Linux, tac may not be available.

In my opinion, in post #2, bipinajith linked to the nicest solution. The only thing it needs is a cut to strip the length prefix again (undecorate):

awk '{print length "\t" $0}' | sort -n | cut -f2-

These days, 40,000 lines isn't very many. Any machine that can run the perl interpreter can make short work of such a file. Unless the sort pipeline will be executed many times in a tight loop, there's no point in sacrificing simplicity, readability, and maintainability for efficiency.
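
For the longest-first order the OP asked for, the same decorate/sort/undecorate pipeline can be sketched as below. The function name longest_first is made up for the example, and the second sort key (alphabetical within equal lengths, in the C locale) is my own choice to make ties predictable, not something the thread specifies.

```shell
#!/bin/sh
# Sketch only: longest_first is a hypothetical wrapper name.
# Reads one headword per line on stdin, prints the longest first.
longest_first() {
    tab=$(printf '\t')
    # 1. decorate: prefix every line with its length and a tab
    # 2. sort: numeric, reversed on field 1; plain byte order on the rest for ties
    # 3. undecorate: drop the length field again
    awk '{ print length($0) "\t" $0 }' |
        LC_ALL=C sort -t "$tab" -k1,1nr -k2 |
        cut -f2-
}

printf '%s\n' liken camera excitatory redwood incense | longest_first
```

Dropping the r from -k1,1nr gives the shortest-first order instead.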

Sounds like your C program is buggy. Perhaps you should post your C program to the programming forum for help.

Perl and AWK (gawk, mawk, nawk, busybox awk) are C programs themselves. And since they're general purpose interpreters, your specialized C program should not be outperformed.

Regards,
Alister