Finding the most common entry in a column

Hi,

I have a file with 3 comma-separated columns and about 5000 lines. What I want to do is find the most common value in column 3 using awk or a shell script or whatever works! I'm totally stuck on how to do this.

e.g.

value1,value2,bob
value1,value2,bob
value1,value2,bob
value1,value2,dave
value1,value2,james

Clearly in the above example the most popular value in column 3 is "bob", but how would I write a script to work this out?

Many thanks

nawk -f don.awk myFile

don.awk:

BEGIN {
  FS=","                                 # fields are comma separated
}
{
  a[$3]++                                # count each value in column 3
  if (a[$3] > comV) {                    # new leader found
    comN = $3
    comV = a[$3]
  }
}
END {
  printf("Most Common Name: [%s] = [%d]\n", comN, comV)
}

Hi,
This one should also work for you. Actually, there is a performance aspect here, since your file has thousands of lines, so different logic will perform differently.

To be honest, I only know how to get the result; I have no idea how to write a high-performance version, so you'd better ask an expert for help.

Here comes my code:

awk 'BEGIN {
  FS=","
  n=0
}
{
  sum[$3]++                # count each value in column 3
  if (sum[$3] > n) {
    n = sum[$3]            # highest count seen so far
    m = $3                 # value holding that count
  }
}
END {
  print m
}' filename

Hi.

So you're willing to accept a (more or less) random result of any of the winners if there is a tie among two or more names? ... cheers, drl

It would be sufficient to turn comN/m into an array.
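
For instance, a minimal sketch of that idea (reusing the comma-separated input and the myFile name from the earlier post) could keep the running maximum and then report every name that ties it:

awk 'BEGIN { FS="," }
{
  count[$3]++                          # tally each value in column 3
  if (count[$3] > max) max = count[$3] # track the highest count seen
}
END {
  for (name in count)                  # report every name tied for the maximum
    if (count[name] == max)
      printf("Most Common Name: [%s] = [%d]\n", name, max)
}' myFile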

Thanks guys,

I got both of the above to work but my CPU usage hit 100% lol! Any ideas on either making this more efficient or limiting the amount of CPU that this awk script can hog?

Thanks again
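
If the worry is mainly about starving other processes rather than total runtime, one simple option (just a sketch, reusing the don.awk invocation from the earlier post) is to run the script at a lower scheduling priority with nice:

nice -n 19 nawk -f don.awk myFile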

Hi, can you check this?

awk -F\, '{print $NF}' file|sort -u|xargs -i ksh -c 'echo "{} \c";grep -wc ",{}$" file'|sort -r -k 2,2|head -1|awk '{print $1}'

Hi,

I tried it on this test file:

1,2,bob
1,2,bob
1,2,bob
1,2,jay
1,2,tim

and it returned tim...

Regards

#! /opt/third-party/bin/perl

# Count occurrences of the third comma-separated field.
open(FILE, "<", "a2") or die "Cannot open a2: $!";

while(<FILE>) {
  chomp;
  my @arr = split(/,/);
  $fileHash{$arr[2]}++;        # tally column 3
}

close(FILE);

# Find the key with the highest count.
foreach my $k ( keys %fileHash ) {
  my $tmp = $fileHash{$k};
  if( $cnt < $tmp ) {
    $cnt = $tmp;
    $val = $k;
  }
}
print "$val : $cnt\n";

exit 0;
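
Assuming the script is saved as, say, count.pl (hypothetical name) and the hard-coded "a2" path points at your data, it can be run directly; with the sample data from the original question it would print the winner and its count:

% perl count.pl
bob : 3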

Hi.

With standard commands:

#!/usr/bin/env sh

# @(#) s1       Demonstrate determination of maximum string occurrence.

set -o nounset
echo

debug=":"
debug="echo"

## Use local command version for the commands in this demonstration.

echo "(Versions displayed with local utility \"version\")"
version >/dev/null 2>&1 && version bash cut sort uniq sed

echo

FILE=${1-data1}

echo
echo " Input file:"
cat $FILE

echo
echo " Results from pipeline ( extract, sort, count, isolate ):"

cut -d, -f3 $FILE |
sort |
uniq -c |
sort -nr |
sed -n -e '1s/^ *[0-9][0-9]* *//p;q'

exit 0

Producing:

% ./s1

(Versions displayed with local utility "version")
GNU bash 2.05b.0
cut (coreutils) 5.2.1
sort (coreutils) 5.2.1
uniq (coreutils) 5.2.1
GNU sed version 4.1.2


 Input file:
value1,value2,bob
value1,value2,bob
value1,value2,bob
value1,value2,dave
value1,value2,james

 Results from pipeline ( extract, sort, count, isolate ):
bob

See man pages for details ... cheers, drl
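
For reference (this is only an illustrative sketch, not captured output), the intermediate counting stage of that pipeline on the sample data would look roughly like the following; the exact spacing and the order of tied names can vary between sort implementations. The final sed command then strips the leading count from the first line, leaving just the name:

% cut -d, -f3 data1 | sort | uniq -c | sort -nr
      3 bob
      1 james
      1 dave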

drl that's awesome!

I processed a file with 188,216 lines in about 3 seconds!

Thanks very much

regards

Hi, Donkey25.

Yes, the standard utilities are generally quite fast; glad it worked out ... cheers, drl

Oops, small mistake:

> cat lis
1,2,bob
1,2,bob
1,2,bob
1,2,jay
1,2,tim
>awk -F\, '{print $NF}' lis|sort -u|xargs -i ksh -c 'echo "{} \c";grep -wc ".*,{}$" lis'|sort -r -k 2,2|head -1|awk '{print $1}'
bob