Suppose I have a file called
file1.bim
1 rs1 0 0 G A
1 rs3 0 1 A C
2 rs8 0 0 G A
2 rs2 0 0 T C
3 rs10 0 0 0 T
3 rs11 0 0 T 0
(This is an N*6 table, where N is arbitrary; in this case N = 6. The 2nd column is the SNP name, and the 5th and 6th columns are the genotype data (the two alleles), where 0 means missing information.)
There is another file called
file1.ped
id1 id1 G A A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA
- This ped file is an M*(N*2+2) table, where M is the number of individuals and N is the number of SNPs.
- The first two columns are the ID number, and the first and second columns are identical.
- The 3rd and 4th columns correspond to the first SNP (rs1) in file1.bim, the 5th and 6th columns correspond to the next SNP (rs3), and so forth. Each pair of columns corresponds to one SNP, in the order the SNPs are listed in the bim file.
- So the dimensions of the ped file are (number of individuals) * (number of SNPs * 2 + 2 ID columns).
So what I would like to do first is this.
For each SNP in the bim file, look at its pair of alleles: I want to recode the first allele as 0 and the second allele as 1. If the first and second alleles are the same, both will be 0. If an allele is given as 0, it will be recoded as NA.
For instance, for the first SNP (rs1), G A is recorded, so G will be recoded as 0 and A will be recoded as 1.
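As an illustration only (not the Unix solution asked for), the mapping rule above could be sketched in Python as follows. The function name `allele_codes` is hypothetical, and the sketch assumes the bim columns are whitespace-separated, as in the example.

```python
def allele_codes(bim_line):
    """Return (snp_name, {allele: code}) for one row of the bim file.

    The first allele maps to 0 and the second to 1. An allele written
    as '0' is missing and gets no entry, so it is later treated as NA.
    """
    fields = bim_line.split()
    snp, first, second = fields[1], fields[4], fields[5]
    codes = {}
    if first != "0":
        codes[first] = 0
    # If both alleles are identical, the allele keeps code 0.
    if second != "0" and second not in codes:
        codes[second] = 1
    return snp, codes


print(allele_codes("1 rs1 0 0 G A"))
print(allele_codes("3 rs10 0 0 0 T"))
```

Note that for rs10 (`0 T`) the T allele still gets code 1, because it sits in the second position; this matches the recoded table further down.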
Then we apply this mapping to the ped file.
Keep in mind that the 3rd and 4th columns correspond to the first SNP in the bim file, the 5th and 6th columns to the second SNP, and so forth.
For the first SNP, where G is recoded as 0 and A as 1,
id1 id1 G A A C G G T C T T NA T --> id1 id1 0 1 A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T --> id3 id3 0 0 A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA --> id5 id5 0 0 A A G G T T NA NA T NA
Then we repeat this process for the rest of the SNPs, which gives
id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA
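A minimal Python sketch of this recoding step, for illustration only. The name `recode_row` is hypothetical, the per-SNP dictionaries are typed in by hand from the example file1.bim, and NA is represented as `None`.

```python
def recode_row(ped_line, snp_codes):
    """Recode one ped row; returns (ids, codes) with None standing for NA.

    snp_codes holds one {allele: code} dict per SNP, in bim order.
    """
    fields = ped_line.split()
    ids, alleles = fields[:2], fields[2:]
    codes = []
    for i, mapping in enumerate(snp_codes):
        # Columns 2*i and 2*i+1 (after the two IDs) belong to SNP i.
        for allele in alleles[2 * i: 2 * i + 2]:
            codes.append(mapping.get(allele))  # NA or unknown -> None
    return ids, codes


# Mappings derived by hand from the example file1.bim.
snp_codes = [
    {"G": 0, "A": 1},  # rs1
    {"A": 0, "C": 1},  # rs3
    {"G": 0, "A": 1},  # rs8
    {"T": 0, "C": 1},  # rs2
    {"T": 1},          # rs10 (first allele missing)
    {"T": 0},          # rs11 (second allele missing)
]

print(recode_row("id1 id1 G A A C G G T C T T NA T", snp_codes))
```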
The next step is to add each pair of columns together:
0 1 --> 1
1 0 --> 1
0 0 --> 0
1 1 --> 2
NA 1 --> 1
1 NA --> 1
NA 0 --> 0
0 NA --> 0
NA NA --> NA
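The addition rule can be sketched in Python like this, again only as an illustration. The names `add_pair` and `dosages` are hypothetical, and NA is represented as `None`. A single NA counts as 0, and the sum is NA only when both values are NA.

```python
def add_pair(a, b):
    """Add two recoded alleles; None stands for NA."""
    if a is None and b is None:
        return None
    # A single NA contributes nothing to the sum.
    return (a or 0) + (b or 0)


def dosages(codes):
    """Collapse a flat list of recoded alleles into one sum per SNP."""
    return [add_pair(codes[i], codes[i + 1]) for i in range(0, len(codes), 2)]


print(dosages([0, 1, 0, 1, 0, 0, 0, 1, 1, 1, None, 0]))
```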
The final output will be
id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0 --> id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0 --> id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA --> id5 id5 0 0 0 0 NA 0
(an M*(N+2) table: one row per individual, two ID columns plus one column per SNP)
So the ultimate output that I want is
final.txt
id1 id1 1 1 0 1 2 0
id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 NA 0
I have written a script for this in R, but I am having trouble writing one in Unix.
I appreciate your help in advance!