UNIX command to select the best edge values from a network file

Sanchari · March 10, 2020, 6:16pm

I have tab-delimited data representing an undirected network. Among the duplicated edges, I want to keep the edge with the highest absolute log value.
I have written some code in Python, but it is taking a lot of time. I would be grateful if someone could help me with an awk command. Kindly note that the network is undirected, i.e. A--B and B--A are duplicate edges. My original file has a large number of columns; I have given simplified test data below.

Test data

     Gene1    Gene2    Log
    AT1G01020    AT1G01010    1.682708
    AT1G01020    AT1G01010    -1.90043
    AT1G01020    AT1G01010    -1.832192
    AT1G01070    AT1G01060    -0.591932
    AT1G01070    AT1G01060    -1.204241
    AT1G01073    AT1G01070    0.790549
    AT1G01060    AT1G01070    1.214972

Expected Output

    AT1G01020    AT1G01010    -1.90043
    AT1G01070    AT1G01060    1.214972
    AT1G01073    AT1G01070    0.790549

gene_table=file1.readlines() # In the real file, j[12]=Gene1, j[13]=Gene2 and j[27]=log value
lfc=[]
for j in gene_table:
    j=j.split("\t")
    j[12]=j[12].strip()
    j[13]=j[13].strip()
    lfc=[]
    int_list=[]
    lfc.append(float(j[27]))
    int_list.append(j[0])
    dict_int={}
    for k in gene_table:
        k=k.split("\t")
        k[12]=k[12].strip()
        k[13]=k[13].strip()
        if (j[0]!=k[0]) and ((j[12]==k[12] and j[13]==k[13]) or (j[12]==k[13] and j[13]==k[12])):
            lfc.append(float(k[27]))
    dict_int=dict(zip(int_list, lfc))
    x=max(lfc, key=abs)
    #print x
    listOfKeys = [key  for (key, value) in dict_int.items() if value == x]
    print listOfKeys

nezabudka · March 11, 2020, 9:40am

Hi, @Sanchari
Could you check whether there is an error in your expected output?

sanchari:

Test data

   Gene1    Gene2    Log
   AT1G01020    AT1G01010    1.682708
   AT1G01020    AT1G01010    -1.90043
   AT1G01020    AT1G01010    -1.832192
   AT1G01070    AT1G01060    -0.591932
   AT1G01070    AT1G01060    -1.204241
   AT1G01073    AT1G01070    0.790549
   AT1G01060    AT1G01070    1.214972

Expected Output

   AT1G01020    AT1G01010    -1.90043
   AT1G01070    AT1G01060    1.214972
   AT1G01073    AT1G01070    0.790549

If edges that occur only once should be displayed as well,
then the result should be

AT1G01070 AT1G01060 -1.204241
AT1G01060 AT1G01070 1.214972
AT1G01020 AT1G01010 -1.90043
AT1G01073 AT1G01070 0.790549

and if not:

AT1G01070 AT1G01060 -1.204241
AT1G01020 AT1G01010 -1.90043

Would a solution using the 'awk' tool suit you?

--- Post updated at 17:40 ---

uniq -Dw 26 file |
awk '
NR==1 {next}
{if(abs(A[$1 FS $2]) < abs($3)) A[$1 FS $2] = $3}
END {for(i in A) print i, A[i]}
function abs(x) { return (x<0) ? x*-1 : x }'

awk '
NR==1 {next}
{if(abs(A[$1 FS $2]) < abs($3)) A[$1 FS $2] = $3}
END {for(i in A) print i, A[i]}
function abs(x) { return (x<0) ? x*-1 : x }' file

vgersh99 · March 11, 2020, 9:51am

How about this (a bit verbose)?
Run it as awk -f san.awk myInputFile, where san.awk is:

BEGIN {
  FS=OFS="\t"
  i1=1
  i2=2
  v=3
}
function abs(x)    { return x < 0 ? -x : x }

FNR>1 {
   idx=($i1 > $i2)? $i1 OFS $i2 : $i2 OFS $i1
   if (abs(a[idx])<abs($v))
      a[idx]=$v
}
END {
  for (i in a)
    print i,a[i]
}

results in:

AT1G01070       AT1G01060       1.214972
AT1G01020       AT1G01010       -1.90043
AT1G01073       AT1G01070       0.790549
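
For anyone who would rather stay in Python, the same single-pass idea can be sketched as below. This is only an illustration under stated assumptions, not a drop-in replacement for the original script: the function name `best_edges` and its parameters are invented here, the input is assumed to be tab-delimited with one header line, and only the two gene IDs and the value are kept. The key puts the larger ID first, mirroring `idx` in san.awk, so A--B and B--A share one dictionary entry and the file is read once instead of once per row.

```python
import csv

def best_edges(lines, col1=0, col2=1, col_val=2, has_header=True):
    """Map each undirected edge to its value with the largest absolute size.

    `lines` is any iterable of tab-delimited text lines (an open file works).
    Column indices are 0-based; all names here are hypothetical.
    """
    reader = csv.reader(lines, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header line
    best = {}
    for row in reader:
        if len(row) <= max(col1, col2, col_val):
            continue  # skip blank or short lines
        a, b = row[col1].strip(), row[col2].strip()
        key = (a, b) if a >= b else (b, a)  # larger ID first, like idx in san.awk
        value = float(row[col_val])
        if key not in best or abs(value) > abs(best[key]):
            best[key] = value
    return best

sample = [
    "Gene1\tGene2\tLog",
    "AT1G01020\tAT1G01010\t1.682708",
    "AT1G01020\tAT1G01010\t-1.90043",
    "AT1G01020\tAT1G01010\t-1.832192",
    "AT1G01070\tAT1G01060\t-0.591932",
    "AT1G01070\tAT1G01060\t-1.204241",
    "AT1G01073\tAT1G01070\t0.790549",
    "AT1G01060\tAT1G01070\t1.214972",
]

for (gene_a, gene_b), value in best_edges(sample).items():
    print(gene_a, gene_b, value, sep="\t")
```

For the real file, something like `best_edges(open("myInputFile", newline=""), col1=12, col2=13, col_val=27)` should apply, going by the column positions given in the question's code comment. One difference from the awk script: an edge whose only value is exactly 0 is still stored here, whereas the awk comparison against an unset array element would skip it.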