Print a row with the max number in a column

coppuca · February 13, 2020, 1:46pm

Hello,

I have this table:

chr1_16857_17742         -        chr1    17369   17436   "ENST00000619216.1";    "MIR6859-1";    -        67
chr1_16857_17742         -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        885
chr1_16857_18061        -        chr1    17369   17436   "ENST00000619216.1";    "MIR6859-1";    -        67
chr1_16857_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        1204
chr1_16857_18061        -        chr1    17369   17436   "ENST00000619216.1";    "MIR6859-1";    -        67
chr1_16857_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        1204
chr1_17232_18061        -        chr1    17369   17436   "ENST00000619216.1";    "MIR6859-1";    -        67
chr1_17232_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        829
chr1_17914_24891        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        6977
  chr1_18267_29570        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        11303

where I need based on the first column go through all the instances with the same pattern and print out only the rows where the last column has MAX value.

Desired output:

chr1_16857_17742         -       chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -       885
chr1_16857_18061        -       chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -       1204
chr1_17232_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        829
chr1_17914_24891        -       chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -       6977
chr1_18267_29570        -       chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -       11303

I tried this:

awk ' ! ( $1 in A_max ) { A_max[$1] = $NF } {A_max[$1] = ( A_max[$1] > $NF ? A_max[$1] : $NF )} END { for ( k in A_max ) print A_max[k], k }'

but I'm stuck on how to print out the whole row.

Would appreciate any help!

rdrtx1 · February 13, 2020, 1:58pm

try:

sort -k9 -n -r file | awk '!a[$1]++'

In output shown above, what happened to the max line for chr1_17232_18061?

coppuca · February 13, 2020, 2:10pm

Thank you! It works The chr1_17232_18061 was an accident.
Could you please explain how does

awk '!a[$1]++'

work? It collapses the rows with the same pattern in given column?

rdrtx1 · February 13, 2020, 2:20pm

It prints the line occurrence of column 1 ($1). a[$1]++ fails for the first occurrence increment since the array value for that key does not exist. The ! means not true so the expression is evaluated to true . The default action for true in awk is to print the line. In other terms: if the previous value of column 1 in array "a" cannot be incremented then print the line. The line could have been written as: '{if (! a[$1]++) print $0}'

RudiC · February 13, 2020, 4:46pm

Try as well

sort -k1,1 -k9nr  file | uniq -w16
chr1_16857_17742        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        885
chr1_16857_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        1204
chr1_17232_18061        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        829
chr1_17914_24891        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        6977
chr1_18267_29570        -        chr1    14404   29570   "ENST00000488147.1";    "WASH7P";       -        11303

MadeInGermany · February 14, 2020, 4:17am

Regarding your initial attempt, it is quite close.
First, let me make a simplified version:

awk '! ($1 in A_Max) || $1 > A_Max[$1] { A_Max[$1]=$NF } END { for (a in A_Max) print A_Max[a],a }' file

Now, the final step is to store the lines in another array, along with the A_max[]

awk '! ($1 in A_Max) || $1 > A_Max[$1] { A_Max[$1]=$NF; L_Max[$1]=$0 } END { for (a in A_Max) print L_Max[a] }' file

The two arrays take some memory; the key field $1 is stored in both arrays.
It would be possible to only use a line array ...