Finding the most common entry in a column

Hi,

I have a file with 3 comma-separated columns and about 5000 lines. What I want to do is find the most common value in column 3 using awk or a shell script or whatever works! I'm totally stuck on how to do this.

e.g.

value1,value2,bob
value1,value2,bob
value1,value2,bob
value1,value2,dave
value1,value2,james

Clearly in the above example the most popular value in column 3 is "bob", but how would I write a script to work this out?

Many thanks

nawk -f don.awk myFile

don.awk:

BEGIN {
  FS=","                                 # fields are comma separated
}
{
  a[$3]++                                # count each value in column 3
  if (a[$3] > comV) {                    # new leader found
    comN = $3
    comV = a[$3]
  }
}
END {
  printf("Most Common Name: [%s] = [%d]\n", comN, comV)
}

Hi,
This one should also work for you. Actually, there is a performance aspect here, since your file has thousands of lines, so different logic will perform differently.

To be honest, I only know how to get the result; I have no idea how to write a high-performance version, so you'd better ask an expert for help.

Here comes my code:

awk 'BEGIN {
  FS=","
  n=0
}
{
  sum[$3]++                # count each value in column 3
  if (sum[$3] > n) {
    n = sum[$3]            # highest count seen so far
    m = $3                 # value holding that count
  }
}
END {
  print m
}' filename

Hi.

So you're willing to accept a (more or less) random result of any of the winners if there is a tie among two or more names? ... cheers, drl

It would be sufficient to turn comN/m into an array.
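
For instance, a minimal sketch of that idea (reusing the comma-separated input and the myFile name from the earlier post) could keep the running maximum and then report every name that ties it:

awk 'BEGIN { FS="," }
{
  count[$3]++                          # tally each value in column 3
  if (count[$3] > max) max = count[$3] # track the highest count seen
}
END {
  for (name in count)                  # report every name tied for the maximum
    if (count[name] == max)
      printf("Most Common Name: [%s] = [%d]\n", name, max)
}' myFile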

Thanks guys,

I got both of the above to work but my CPU usage hit 100% lol! Any ideas on either making this more efficient or limiting the amount of CPU that this awk script can hog?

Thanks again
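
If the worry is mainly about starving other processes rather than total runtime, one simple option (just a sketch, reusing the don.awk invocation from the earlier post) is to run the script at a lower scheduling priority with nice:

nice -n 19 nawk -f don.awk myFile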

Hi, can you check this?

awk -F\, '{print $NF}' file|sort -u|xargs -i ksh -c 'echo "{} \c";grep -wc ",{}$" file'|sort -r -k 2,2|head -1|awk '{print $1}'

Hi,

I tried it on this test file:

1,2,bob
1,2,bob
1,2,bob
1,2,jay
1,2,tim

and it returned tim...

Regards

#! /opt/third-party/bin/perl

# Count occurrences of the third comma-separated field.
open(FILE, "<", "a2") or die "Cannot open a2: $!";

while(<FILE>) {
  chomp;
  my @arr = split(/,/);
  $fileHash{$arr[2]}++;        # tally column 3
}

close(FILE);

# Find the key with the highest count.
foreach my $k ( keys %fileHash ) {
  my $tmp = $fileHash{$k};
  if( $cnt < $tmp ) {
    $cnt = $tmp;
    $val = $k;
  }
}
print "$val : $cnt\n";

exit 0;
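
Assuming the script is saved as, say, count.pl (hypothetical name) and the hard-coded "a2" path points at your data, it can be run directly; with the sample data from the original question it would print the winner and its count:

% perl count.pl
bob : 3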

Hi.

With standard commands:

#!/usr/bin/env sh

# @(#) s1       Demonstrate determination of maximum string occurrence.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash cut sort uniq sed

echo

FILE=${1-data1}

echo
echo " Input file:"
cat $FILE

echo
echo " Results from pipeline ( extract, sort, count, isolate ):"

cut -d, -f3 $FILE |
sort |
uniq -c |
sort -nr |
sed -n -e '1s/^ *[0-9][0-9]* *//p;q'

exit 0

Producing:

% ./s1

(Versions displayed with local utility "version")
GNU bash 2.05b.0
cut (coreutils) 5.2.1
sort (coreutils) 5.2.1
uniq (coreutils) 5.2.1
GNU sed version 4.1.2


 Input file:
value1,value2,bob
value1,value2,bob
value1,value2,bob
value1,value2,dave
value1,value2,james

 Results from pipeline ( extract, sort, count, isolate ):
bob

See man pages for details ... cheers, drl
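
For reference (this is only an illustrative sketch, not captured output), the intermediate counting stage of that pipeline on the sample data would look roughly like the following; the exact spacing and the order of tied names can vary between sort implementations. The final sed command then strips the leading count from the first line, leaving just the name:

% cut -d, -f3 data1 | sort | uniq -c | sort -nr
      3 bob
      1 james
      1 dave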

drl that's awesome!

I processed a file with 188,216 lines in about 3 seconds!

Thanks very much

regards

Hi, Donkey25.

Yes, the standard utilities are generally quite fast; glad it worked out ... cheers, drl

Oops, small mistake:

> cat lis
1,2,bob
1,2,bob
1,2,bob
1,2,jay
1,2,tim
>awk -F\, '{print $NF}' lis|sort -u|xargs -i ksh -c 'echo "{} \c";grep -wc ".*,{}$" lis'|sort -r -k 2,2|head -1|awk '{print $1}'
bob