Choosing between repeated entries based on a column field

Sanchari · September 21, 2013, 12:41pm

Hello, I have an input file:

LOC_Os04g01890\LOC_Os05g17604    0.051307       
LOC_Os04g01890\LOC_Os05g17604    0.150977       
LOC_Os04g01890\LOC_Os05g17604    0.306231      
LOC_Os04g01890\LOC_Os06g33100    0.168037       
LOC_Os04g01890\LOC_Os06g33100    0.236293       
LOC_Os04g01890\LOC_Os07g03590    0.109948      
LOC_Os04g01890\LOC_Os07g03590    0.12325

For each repeated entry in Column 1, I want to keep only the line with the largest value in Column 2.
Desired output:

LOC_Os04g01890\LOC_Os05g17604    0.306231
LOC_Os04g01890\LOC_Os06g33100    0.236293
LOC_Os04g01890\LOC_Os07g03590    0.12325

Is there a shell command that can do this? Thanks.

Yoda · September 21, 2013, 12:57pm

Using awk (the order of the output lines is not guaranteed):

awk '!($1 in A) || $2+0 > A[$1]+0 {A[$1]=$2} END{for(k in A) print k,A[k]}' file
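If the keys should come out in the order they first appear in the input, one possible variation is sketched below. It is only a sketch: the function name max_by_key is made up for illustration, and the sample lines are the ones from the question. It tests for an unseen key with `in`, so zero or negative values in Column 2 are handled the same way as positive ones.

```shell
# max_by_key: hypothetical helper name, not from the thread.
# For each distinct value in column 1, print the largest column-2 value,
# keeping the keys in order of first appearance.
max_by_key() {
    awk '
        NF < 2 { next }                                        # skip blank or short lines
        !($1 in max) { order[++n] = $1; max[$1] = $2; next }   # first time this key is seen
        $2 + 0 > max[$1] + 0 { max[$1] = $2 }                  # larger value for a known key
        END { for (i = 1; i <= n; i++) print order[i], max[order[i]] }
    ' "$@"
}

# Usage with the sample data from the question (read from standard input):
printf '%s\n' \
    'LOC_Os04g01890\LOC_Os05g17604    0.051307' \
    'LOC_Os04g01890\LOC_Os05g17604    0.150977' \
    'LOC_Os04g01890\LOC_Os05g17604    0.306231' \
    'LOC_Os04g01890\LOC_Os06g33100    0.168037' \
    'LOC_Os04g01890\LOC_Os06g33100    0.236293' \
    'LOC_Os04g01890\LOC_Os07g03590    0.109948' \
    'LOC_Os04g01890\LOC_Os07g03590    0.12325' \
    | max_by_key
# Prints:
# LOC_Os04g01890\LOC_Os05g17604 0.306231
# LOC_Os04g01890\LOC_Os06g33100 0.236293
# LOC_Os04g01890\LOC_Os07g03590 0.12325
```

With a file, the call would be max_by_key file. Note that the output is separated by a single space, not by the original column spacing.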

elixir_sinari · September 21, 2013, 1:10pm

If the order of the output does not matter:

perl -lane '$max{$F[0]} = $F[1] unless exists $max{$F[0]}; $max{$F[0]} = $F[1] if $F[1] > $max{$F[0]};
END{ print "$_ $max{$_}" for keys %max}' file

disedorgue · September 21, 2013, 1:39pm

Hi,
With sort:

sort -k2,2 -rn file | sort -k1,1 -u

Regards.
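The sort pipeline works because the second sort keeps the first line of each run of equal keys, and after the first pass that line is the one with the largest value. A slightly more explicit sketch of the same idea follows. It assumes a sort that supports the -s (stable) option, such as GNU or BSD sort, and the function name max_by_key_sort is made up for illustration.

```shell
# max_by_key_sort: hypothetical helper name, not from the thread.
# Assumes a sort with -s (stable), e.g. GNU or BSD sort; -s is not in POSIX.
max_by_key_sort() {
    # 1) Order all lines by column 2, numerically, largest first.
    # 2) Stable sort on column 1 only, keeping the first line of each key,
    #    which after step 1 is the line with the largest column-2 value.
    # LC_ALL=C fixes the decimal point to "." and compares keys bytewise.
    LC_ALL=C sort -k2,2nr "$@" | LC_ALL=C sort -s -u -k1,1
}

# Usage with four of the sample lines from the question:
printf '%s\n' \
    'LOC_Os04g01890\LOC_Os06g33100    0.168037' \
    'LOC_Os04g01890\LOC_Os06g33100    0.236293' \
    'LOC_Os04g01890\LOC_Os07g03590    0.109948' \
    'LOC_Os04g01890\LOC_Os07g03590    0.12325' \
    | max_by_key_sort
# Prints:
# LOC_Os04g01890\LOC_Os06g33100    0.236293
# LOC_Os04g01890\LOC_Os07g03590    0.12325
```

Unlike the awk and perl answers, this keeps each selected line exactly as it appears in the input, and the result is sorted by Column 1.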