Script to sort large file with frequency

gimley · June 27, 2012, 11:34pm

Hello,
I have a very large file of around 2 million records which has the following structure:

I have used the standard awk program to sort:

# wordfreq.awk --- print list of word frequencies
{
    # remove punctuation
    #gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}

and a Perl program I found on the net:

my %seen=();
while(<>)
{
    chomp;
    foreach my $word ( grep /\w/, split )
    {
       # $word =~ s/[. ,]*$//; # strip off punctuation, etc.
        $seen{$word}++;
    }
}

use Data::Dumper;
$Data::Dumper::Terse = 1;
print Dumper \%seen;

Both work beautifully on small files of around fifty thousand lines, but when I run them on the very large file they run out of memory.
I am working on a Windows Vista machine and have even tried increasing the paging file size to around 8 MB, but to no avail.
I believe there is a setting in Perl where a variable can be set to 99999 to allow very large files to be processed. I tried adding that to the Perl program, but I still get an out-of-memory error.
Could anybody provide me with a solution that lets the program run on a very large file of around 9 MB?
Many thanks.

spacebar · June 28, 2012, 1:36am

Check out this Perl module:
Sort::External - search.cpan.org

binlib · June 28, 2012, 9:11am

You can either split the file (by the first character of each line, for example) and process each piece separately, or leave the file whole and make multiple passes over it in your awk/Perl script.

Scrutinizer · June 28, 2012, 10:36am

Try:

sort -t# -k1,1 file | uniq -c

---edit---
OK : Vista... hmmm
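
For per-word counts like the awk and Perl scripts produce, the same sort/uniq idea might look like the sketch below. It assumes Unix-style tools are installed (on Vista, for example through Cygwin) and that words are separated by whitespace; the function name word_freq is just an illustration. The attraction is that sort typically spills to temporary files when the data does not fit in memory, so nothing has to hold every distinct word in one hash.

```shell
# Hypothetical helper: word frequencies for file "$1" without an in-memory hash.
word_freq() {
    tr -s '[:space:]' '\n' < "$1" |   # put each word on its own line
        grep -v '^$' |                # drop the empty line that leading blanks can leave
        LC_ALL=C sort |               # bring identical words together (byte order)
        uniq -c |                     # count each run of identical lines
        awk '{ printf "%s\t%d\n", $2, $1 }'   # reorder to word<TAB>count
}
```

Usage would be word_freq bigfile > counts.txt. The result is sorted by word; piping it through sort -t"$(printf '\t')" -k2,2nr would order it by frequency instead.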