Finding most repeated entry in a column and giving the count

necro98 · July 25, 2012, 10:50pm

Could you please help me find the most repeated entry in the 2nd column and print its count?

Here is an input file:

1, This , is a forum
2, This ,  is a forum
1, There , is a forum
2, This ,  is not right

Here the most repeated entry is "This" and its count is 3.

The output should show the count, followed by all lines containing that word, like this:

This = 3
 
1, This , is a forum
2, This ,  is a forum
2, This ,  is not right

PikK45 · July 25, 2012, 10:56pm

Did you try googling or searching through this forum?

agama · July 25, 2012, 11:02pm

Given the unknown input file size, this is one case where I think awk followed by grep is appropriate:

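#!/usr/bin/env ksh
# Count each value of field 2 (whitespace-separated, so $2 is the word after "1,")
# and remember the value with the highest count so far.
what=$( awk ' { c[$2]++; if( c[$2] > c[max] ) max = $2; } END { printf( "%s = %d\n", max, c[max] ); }'  input-file )
printf "%s\n\n" "$what"
# Print every line containing the winning word (the part of "$what" before the first space).
grep -- "${what%% *}" input-file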

This should also work in bash if you prefer.
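
A possible variant of the same idea, as a rough sketch only: grep matches the word anywhere on the line, so a line with "This" in the 3rd column would also be printed. Reading the file twice with awk keeps the comparison on the 2nd column. The function name most_repeated is made up for illustration, and the field separator assumes commas with optional spaces around them, as in the sample input. On a tie, the value that reaches the top count first wins.

```shell
#!/bin/sh
# most_repeated FILE: hypothetical helper, not from the thread.
# Pass 1 (NR == FNR) counts each second-column value; pass 2 prints the
# lines whose second column equals the most frequent value.
most_repeated() {
    awk -F' *, *' '
        NR == FNR {                          # first pass over the file
            count[$2]++
            if (count[$2] > count[best]) best = $2
            next
        }
        FNR == 1 { printf "%s = %d\n\n", best, count[best] }   # summary, printed once
        $2 == best                           # second pass: print matching lines
    ' "$1" "$1"
}
```

Called as most_repeated input-file, it prints the summary line, a blank line, and then the matching lines in their original order.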

summer_cherry · July 25, 2012, 11:27pm

perl

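open my $fh, "<", "a" or die "cannot open a: $!";
my %hash;
while (<$fh>) {
    chomp;
    my @tmp = split(",", $_);
    (my $key = $tmp[1]) =~ s/^\s+|\s+$//g;      # trim spaces around the 2nd column
    $hash{$key}->{'CNT'}++;
    $hash{$key}->{'CONTENT'} .= $_ . "\n";      # collect the lines for this key
}
close $fh;
# Sort keys by count, numerically (<=>, not cmp), highest first.
my $key = (sort {$hash{$b}->{'CNT'} <=> $hash{$a}->{'CNT'}} keys %hash)[0];
print $key, " = ", $hash{$key}->{'CNT'}, "\n\n";
print $hash{$key}->{'CONTENT'};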

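awk:

awk -F"," '{
    key=$2
    gsub(/^ +| +$/,"",key)              # trim spaces around the 2nd column
    cnt[key]++
    content[key]=content[key] $0 "\n"   # collect the lines for this key
}
END{
    for(i in cnt){
        if(ind=="" || cnt[i]>max){      # first key seen, or a new highest count
            ind=i
            max=cnt[i]
        }
    }
    print ind" = "max"\n"
    printf "%s", content[ind]
}' a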

necro98 · July 26, 2012, 2:31am

Thanks very much for this, it worked.

In addition, some of the lines in the same file contain a c: field followed by a value.
Here the value is 0:

1,00: This , is a good script c:0

I want to output the three lines with the highest values for c:. For example, given these lines:

1,00: This , is a nice script c:9999
1,00: This , is a good script c:9998
1,00: This , is a cool script c:9000
1,00: This , is a fun script c:12

the output should be:

1,00: This , is a nice script c:9999
1,00: This , is a good script c:9998
1,00: This , is a cool script c:9000

---------- Post updated at 01:31 AM ---------- Previous update was at 12:30 AM ----------

Hi summer, could you please help with the above?
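
One way the top-3 request might be approached, offered as a sketch and not a tested answer from the thread: pull out the number after c:, sort on it numerically in descending order, and keep the first three lines. This assumes the c: value is a non-negative integer at the very end of the line with no trailing spaces. Lines without such a field are skipped. The function name top_c and its optional second argument are hypothetical.

```shell
#!/bin/sh
# top_c FILE [N]: hypothetical helper, not from the thread.
# Prints the N lines (default 3) with the highest value after "c:".
top_c() {
    # Keep only lines ending in c:<digits> and prefix each with that number
    # and a tab, then sort on the prefix (numeric, descending), keep the
    # first N lines, and strip the prefix again.
    awk 'match($0, /c:[0-9]+$/) {
             print substr($0, RSTART + 2, RLENGTH - 2) "\t" $0
         }' "$1" |
    sort -t "$(printf '\t')" -k1,1nr |
    head -n "${2:-3}" |
    cut -f2-
}
```

Called as top_c a it prints the three highest lines; top_c a 5 would print five.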