Hello:
I want to print out the entire row with max value in column 3 based on column 2. Input file is millions rows. test.dat:
Contig1 lcl|1DL 111 155 265 27
Contig2 lcl|1DS 100 73 172 100
Contig3 lcl|1DL 140 698 837 140
Contig3 lcl|6DS 107 1488 1594 1
Contig5 lcl|6DL 193 59 251 374
Contig5 lcl|4DS 119 1 119 119
Contig5 lcl|6DL 107 145 251 596
Contig6 lcl|6DS 153 90 242 674
Contig7 lcl|4DL 103 913 1015 6590
Contig7 lcl|6DL 107 1016 1122 1152
Contig8 lcl|6DS 291 2700 2990 291
Contig8 lcl|4DS 279 2594 2872 279
Contig8 lcl|6DS 244 3711 3954 1
Contig8 lcl|6DS 159 3796 3954 1
Contig8 lcl|6DL 194 3237 3430 194
Contig8 lcl|1DS 109 4069 4177 269
I first tried:
awk '{if(! ($2 in a)) a[$2]=$3; else if($3 > a[$2]) a[$2]=$3; max[$2]=$0} END {for (i in max) print i, a}' test.dat
and the output is:
lcl|4DL 103
lcl|4DS 279
lcl|6DL 194
lcl|6DS 291
lcl|1DL 140
lcl|1DS 109
As I want to print out the whole row of the max values of each item, then I tried:
awk '{if(! ($2 in a)) a[$2]=$3; else if($3 > a[$2]) a[$2]=$3; max[$2]=$0} END {for (i in max) print max}' test.dat
and the output is:
Contig7 lcl|4DL 103 913 1015 6590
Contig8 lcl|4DS 279 2594 2872 279
Contig8 lcl|6DL 194 3237 3430 194
Contig8 lcl|6DS 159 3796 3954 1
Contig3 lcl|1DL 140 698 837 140
Contig8 lcl|1DS 109 4069 4177 269
Obviously I had something wrong with the second script. I am very nervous with the second script for millions of rows, but could not figure it out myself. Thanks in advance!
YT