Hello,
My input file looks like this
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Individual1 Individual2 Individual3 Individual4 Individual5 Individual6
22 10000 ID1 A G 0 PASS GT 0|1 0|1 0|1 1|1 1|1 1|1
22 10001 ID2 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10002 ID3 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10003 ID4 G A 0 PASS GT 1|0 1|0 1|0 1|0 1|1 1|1
22 10004 ID5 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10005 ID6 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10006 ID7 A G 0 PASS GT 0|1 0|1 0|1 1|1 1|1 1|1
22 10007 ID8 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10008 ID9 C T 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10009 ID10 C T 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10010 ID11 A G 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10011 ID12 C T 0 PASS GT 0|1 0|1 0|1 1|0 1|1 1|1
22 10012 ID13 C A 0 PASS GT 1|0 1|0 1|0 1|0 1|1 1|1
22 10013 ID14 T G 0 PASS GT 1|0 1|0 1|0 1|0 1|1 1|1
22 10014 ID15 G A 0 PASS GT 1|0 1|0 1|0 1|0 1|1 1|1
I need to calculate the average fraction of 'ID' that are identical, averaged across all possible pairings (Individual1,Individual2), (Individual1,Individual3), (Individual1,Individual4),
The formula to calculate this is as below
Ratio(Ind1,Ind2;IDk)= (a|b, c|d)= [(a,c) + (a,d) + (b,c) + (b,d)]/4
For example for for ID1, the values for (Individual1,Individual2) are (0|1,0|1)
so the ratio will be calculated as such
Ratio(Individual1,Individual2;ID1)=(0|1,0|1)=[(0,0) + (0,1) + (1,0) + (1,1)]/4
=(0+1+1+2)/4
=1
....
For ID15
Ratio(Individual1,Individual2;ID1)=(1|0,1|0)=[(1,1) + (1,0) + (0,1) + (0,0)]/4
=2+1+1+0 /4
=1
This is averaged over all ID, and reported for all pairs of individuals. In this file, I have over 2000 IDs and several individuals and am currently doing it in excel. Is this possible to code this formula in bash or python at all to accept a text tab delimited file ?
The output I am looking for is such
ID-Pair Avg-ratio-across-all-IDs
Individual 1-Individual2 1.00
Indidivual1-Indidivual6 0.5