I have tab-delimited data representing an undirected network. Where an edge is duplicated, I want to keep only the row with the highest absolute log value.
I have written code in Python, but it is taking a lot of time. I would be grateful if someone could help me with an awk command. Please note that the network is undirected, i.e. A--B and B--A are duplicate edges. My original file has a large number of columns; I have given simplified test data below.
Test data
Gene1 Gene2 Log
AT1G01020 AT1G01010 1.682708
AT1G01020 AT1G01010 -1.90043
AT1G01020 AT1G01010 -1.832192
AT1G01070 AT1G01060 -0.591932
AT1G01070 AT1G01060 -1.204241
AT1G01073 AT1G01070 0.790549
AT1G01060 AT1G01070 1.214972
Expected Output
AT1G01020 AT1G01010 -1.90043
AT1G01070 AT1G01060 1.214972
AT1G01073 AT1G01070 0.790549
gene_table=file1.readlines() # In the real file, j[12]=Gene1, j[13]=Gene2 and j[27]=log value
lfc=[]
for j in gene_table:
    j=j.split("\t")
    j[12]=j[12].strip()
    j[13]=j[13].strip()
    lfc=[] # log values of every row describing the same edge as row j
    int_list=[]
    lfc.append(float(j[27]))
    int_list.append(j[0])
    dict_int={}
    # compare row j against every other row (this nested loop is the slow part)
    for k in gene_table:
        k=k.split("\t")
        k[12]=k[12].strip()
        k[13]=k[13].strip()
        # same edge in either orientation: A--B or B--A
        if (j[0]!=k[0]) and ((j[12]==k[12] and j[13]==k[13]) or (j[12]==k[13] and j[13]==k[12])):
            lfc.append(float(k[27]))
    dict_int=dict(zip(int_list, lfc))
    x=max(lfc, key=abs)
    #print x
    # row j is reported only if its own log value has the largest absolute value
    listOfKeys = [key for (key, value) in dict_int.items() if value == x]
    print(listOfKeys)
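
For reference, the selection I am after could probably be done in a single pass with a dictionary keyed on the unordered gene pair, instead of the nested loop. This is only a rough sketch on the simplified three-column test data: the function name best_edges and the column arguments are made up for illustration, the header line is assumed to be skipped already, and I am assuming the output should keep the orientation of the first row seen for each pair (as in the expected output above).

```python
def best_edges(lines, col_a=0, col_b=1, col_log=2):
    """Keep, for each undirected gene pair, the log value with the largest |value|.

    Hypothetical helper: column indices default to the simplified test data;
    for the real file they would be 12, 13 and 27.
    """
    best = {}  # frozenset({gene_a, gene_b}) -> (gene_a, gene_b, log_value)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        a = fields[col_a].strip()
        b = fields[col_b].strip()
        value = float(fields[col_log])
        key = frozenset((a, b))  # A--B and B--A map to the same key
        if key not in best:
            best[key] = (a, b, value)
        elif abs(value) > abs(best[key][2]):
            # Keep the orientation of the first row seen; only update the value.
            first_a, first_b, _ = best[key]
            best[key] = (first_a, first_b, value)
    # Dictionaries preserve insertion order in Python 3.7+.
    return list(best.values())


# The test data from above, without the header line.
rows = [
    "AT1G01020\tAT1G01010\t1.682708",
    "AT1G01020\tAT1G01010\t-1.90043",
    "AT1G01020\tAT1G01010\t-1.832192",
    "AT1G01070\tAT1G01060\t-0.591932",
    "AT1G01070\tAT1G01060\t-1.204241",
    "AT1G01073\tAT1G01070\t0.790549",
    "AT1G01060\tAT1G01070\t1.214972",
]

for gene_a, gene_b, log_value in best_edges(rows):
    print("\t".join((gene_a, gene_b, str(log_value))))
```

Each row is looked at once, so the work grows linearly with the number of rows rather than quadratically. Ties in absolute value keep the first value seen.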