Choosing between repeated entries based on the "absolute values" of a column

Sanchari · October 7, 2013, 4:15pm

Hello, I was looking for a way to select between the repeated entries (column1) based on the values of absolute values of column 3 (larger value). For example if the same gene id has FC value -2 and 1, I should get the output as -2. Kindly help.

GeneID          Description    FC 
LOC_Os12g44390     Os.8335.1   -0.00377            
LOC_Os12g44390     Os.8335.1   -0.00877       
LOC_Os12g44390     Os.8335.2   -0.02457     
LOC_Os12g44330     Os.17376.2   -0.00111       
LOC_Os12g44330     Os.17376.2   -0.00716       
LOC_Os12g44320     Os.55718.1   -0.13515       
LOC_Os12g44320     Os.55718.1   -0.07858       
LOC_Os12g44320     OsAffx.20117.1   0.005586

Desired output:

LOC_Os12g44390     Os.8335.2   -0.02457
LOC_Os12g44330     Os.17376.2   -0.00716    
LOC_Os12g44320     Os.55718.1   -0.13515

bartus11 · October 7, 2013, 4:47pm

Put this into "script.awk":

function abs(x){
  return ((x < 0.0) ? -x : x)
}
NR>1{
  if (max[$1]<abs($3)){
    max[$1]=abs($3)
    a[$1]=$2
    b[$1]=$3
  }
}
END{
  for (i in a){
    print i"\t"a"\t"b
  }
}

Then run:

awk -f script.awk file

disedorgue · October 7, 2013, 5:17pm

Another way with sort:

$ sed -n '2,$p' file | sort -n -k1,1 -k3,3 | sort -ur -k1,1

sed to suppress first line (column name).
Regards.