Find lines with matching column 1 value, retain only the one with highest value in column 2

pathunkathunk · March 5, 2013, 8:53pm

I have a file like:

I would like to find lines with duplicate values in column 1 and retain only one of them, based on two conditions: 1) keep the line with the highest value in column 3; 2) if the column 3 values are equal, retain the line with the highest value in column 4.

Desired output:

I was able to find duplicate lines:

awk 'NR==FNR{a[$1]++;next;}{ if (a[$1] > 1)print;}' file1 file1

But I can't figure out how to go about filtering based on the criteria I just described.

rdrtx1 · March 5, 2013, 9:56pm

try:

awk '
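# On first sight of a key, seed both maxima below the current values so this line wins.
!a[$1]++ {h3[$1]=$3-1; h4[$1]=$4-1}

{
  if ($3 > h3[$1]) {
     # New best column 3: store column 4 as well so later ties compare against this line.
     h3[$1]=$3; h4[$1]=$4; ol[$1]=$0;
  } else if ($3 == h3[$1]) {
     if ($4 > h4[$1]) {
        h4[$1]=$4; ol[$1]=$0
     }
  }
}

# Print the winning line for each key (for-in order is unspecified).
END { for (i in ol) print ol[i]}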
' infile
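
To check the two tie-break rules on something small, here is a minimal sketch of the same single-pass idea run on made-up data. The sample records and the `best_per_key` helper name are assumptions for illustration, not taken from the original file.

```shell
# Made-up sample data (hypothetical IDs and scores, not from the thread).
sample='id1 x 90.0 10
id1 y 95.0 5
id2 x 80.0 7
id2 y 80.0 9
id3 x 50.0 1
id4 a 10.0 1
id4 b 20.0 5
id4 c 20.0 3'

# best_per_key: hypothetical helper; reads records on stdin and prints,
# for each column-1 key, the record with the highest column 3,
# using column 4 to break ties.
best_per_key() {
  awk '
    # Take this record if the key is new, or if it beats the stored one.
    !($1 in best) || $3+0 > h3[$1] || ($3+0 == h3[$1] && $4+0 > h4[$1]) {
      h3[$1] = $3+0   # keep both maxima in step with the stored record
      h4[$1] = $4+0
      best[$1] = $0
    }
    # for-in order is unspecified, so sort afterwards if order matters.
    END { for (k in best) print best[k] }
  '
}

printf '%s\n' "$sample" | best_per_key | sort
```

The `id4` group is the interesting case: the third record ties on column 3 but has a lower column 4 than the second, so the second record should survive.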

elixir_sinari · March 5, 2013, 9:58pm

Assuming that the order of the output records matters:

awk 'FNR==NR{
 if($1 in a)
 {
  split(a[$1],preva)
  if(($3+0 > preva[3]+0) || (($3+0 == preva[3]+0) && ($4+0 > preva[4]+0)))
   a[$1]=$0
  next
 }
 a[$1]=$0
 next
} !b[$1]++{ print a[$1]}' file file
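
If the input comes from a pipe and cannot be read twice, the first-appearance order can also be kept in a single pass by remembering the keys in an index array. This is only a sketch under that assumption; `best_in_order` and the sample records are hypothetical names and data.

```shell
# best_in_order: hypothetical helper; same selection rule as in the thread
# (highest column 3, then highest column 4), but in a single pass over stdin.
best_in_order() {
  awk '
    # Record each key the first time it appears, to replay the order later.
    !($1 in best) { order[++n] = $1 }
    !($1 in best) || $3+0 > h3[$1] || ($3+0 == h3[$1] && $4+0 > h4[$1]) {
      h3[$1] = $3+0; h4[$1] = $4+0; best[$1] = $0
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  '
}

# Made-up input with keys deliberately out of alphabetical order.
printf '%s\n' \
  'zeta a 1.0 1' \
  'alpha b 5.0 2' \
  'zeta c 3.0 0' \
  'alpha d 5.0 9' \
  'mid e 2.0 2' | best_in_order
```

The trade-off against the two-pass version is memory: both keep one record per key, but this one also keeps the list of keys.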

anbu23 · March 6, 2013, 1:09am

$ sort -k1,1 -k3,3nr -k4,4nr file | awk '!arr[$1]++'
s_48806 comp48806_c0_seq1 100.0 86 3285 0 2838 2838 2838 -1
s_48825 comp48825_c1_seq1 100.0 60 2793 0 1683 1683 1683 -1
s_48827 comp48827_c0_seq5 100.0 40 5431 0 2147 2147 2147 -1
s_48831 comp48831_c0_seq1 73.1 50 2040 237 773 1058 1040 -1
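
For anyone who wants to try the sort-based approach on a small case first, here is a hedged sketch with each sort key limited to its own field and the locale pinned so that "." is read as the decimal point. The `dedupe_by_sort` name and the sample records are made up for illustration.

```shell
# dedupe_by_sort: hypothetical helper wrapping the sort + awk pipeline.
dedupe_by_sort() {
  # -k3,3nr and -k4,4nr end each key at its own field;
  # LC_ALL=C makes "." the decimal point whatever the user locale is.
  LC_ALL=C sort -k1,1 -k3,3nr -k4,4nr |
    awk '!seen[$1]++'   # print only the first record for each key
}

# Made-up input: id1 differs in column 3, id2 ties there and differs in column 4.
printf '%s\n' \
  'id2 x 80.0 7' \
  'id1 x 90.0 10' \
  'id2 y 80.0 9' \
  'id1 y 95.0 5' \
  'id3 x 50.0 1' | dedupe_by_sort
```

Note that the result comes out ordered by column 1 rather than in the original input order, which may or may not matter for the task.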