awk, max value, array, row

Hello:
I want to print out the entire row with max value in column 3 based on column 2. Input file is millions rows. test.dat:

Contig1 lcl|1DL 111     155     265     27
Contig2 lcl|1DS 100     73      172     100
Contig3 lcl|1DL 140     698     837     140
Contig3 lcl|6DS 107     1488    1594    1
Contig5 lcl|6DL 193     59      251     374
Contig5 lcl|4DS 119     1       119     119
Contig5 lcl|6DL 107     145     251     596
Contig6 lcl|6DS 153     90      242     674
Contig7 lcl|4DL 103     913     1015    6590
Contig7 lcl|6DL 107     1016    1122    1152
Contig8 lcl|6DS 291     2700    2990    291
Contig8 lcl|4DS 279     2594    2872    279
Contig8 lcl|6DS 244     3711    3954    1
Contig8 lcl|6DS 159     3796    3954    1
Contig8 lcl|6DL 194     3237    3430    194
Contig8 lcl|1DS 109     4069    4177    269

I first tried:

awk '{if(! ($2 in a)) a[$2]=$3; else if($3 > a[$2]) a[$2]=$3; max[$2]=$0} END {for (i in max) print i, a}' test.dat 

and the output is:

lcl|4DL 103
lcl|4DS 279
lcl|6DL 194
lcl|6DS 291
lcl|1DL 140
lcl|1DS 109

As I want to print out the whole row of the max values of each item, then I tried:

awk '{if(! ($2 in a)) a[$2]=$3; else if($3 > a[$2]) a[$2]=$3; max[$2]=$0} END {for (i in max) print  max}' test.dat

and the output is:

Contig7 lcl|4DL 103     913     1015    6590
Contig8 lcl|4DS 279     2594    2872    279
Contig8 lcl|6DL 194     3237    3430    194
Contig8 lcl|6DS 159     3796    3954    1
Contig3 lcl|1DL 140     698     837     140
Contig8 lcl|1DS 109     4069    4177    269

Obviously I had something wrong with the second script. I am very nervous with the second script for millions of rows, but could not figure it out myself. Thanks in advance!
YT

The max assignment is outside the if-else statement, so it always occurs.

Regards,
Alister

---------- Post updated at 02:59 PM ---------- Previous update was at 02:55 PM ----------

Also, you don't really need to bother with checking for the existence of an array member in awk. If it does not exist, it's treated as a zero or an empty string (depends on the context). Your script could be simplified a bit:

awk '$3 > a[$2] {a[$2]=$3; max[$2]=$0} END {for (i in max) print  max}' test.dat

Regards,
Alister

1 Like