recoding/ converting numbers

Suppose

file1.bim

1 rs1 0 0 G A
1 rs3 0 1 A C
2 rs8 0 0 G A
2 rs2 0 0 T C
3 rs10 0 0 0 T
3 rs11 0 0 T 0      

(N*6 table, where N is arbitary,in this case 6, where 2nd column is the name of SNP, and the 5th,6th are genotype data, where 0 means missing information)
There is another file called

file1.ped

id1 id1 G A A C G G T C T T NA T
id3 id3 G G A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA
  • this ped file is M*(N*2+2) table, where M is the number of individuals, and N is the number of SNPs.
  • First two columns are ID number, where first column and second column are identical
  • 3,4th column correspond to the first SNP (rs1) in file1.bim file. and 5,6th column coresspond to the next SNP (rs3) in file1.bim file and so forth. Each two columns correspond to each SNP in the order of SNPs listed in the bim file.
  • So dimension of ped file will be (individuals)(#of SNPS2+2 columns of ids)

So what I would like to do first is this.
Look at the a pair of alleles expressed for the each SNP (rs1) in the bim file, I want to consider the first allele as 0, and second allele as 1. If first allele and second allele are the same, they both will be 0. If any allele is expressed as 0, it will be recoded as NA.
For instance, for the first SNP, G A are recorded. so G will be recoded as 0, and A will be recoded as 1.

Then, we apply this knowledge in ped file.
Keep in mind that first 3,4th columns correspond to the first SNP in bim file, and 5,6th columns to the second SNP, and so forth.
For the first SNP, where G is expressed as 0, and A is 1,

id1 id1 G A A C G G T C T T NA T  ->  id1 id1 0 0 A C G G T C T T NA T 
id3 id3 G G A A A G T C T T T T         id3 id3 0 0 A A A G T C T T T T
id5 id5 G G A A G G T T NA NA T NA    id5 id5 0 0 A A G G T T NA NA T NA

then we proceed this process for the rest of the SNP, then we would have

id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA
 

then the next step is to add each two columns together.

0 1 --> 1
1 0 --> 1
0 0 --> 0
NA 1 -->1
1 NA -->1
NA 0 --> 0
0 NA -->0
NA NA -->NA

the final output will be

id1 id1 0 1 0 1 0 0 0 1 1 1 NA 0 -->  id1 id1 1 1 0 1 2 0
id3 id3 0 0 0 0 1 0 0 1 1 1 0 0          id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 0 0 0 0 NA NA 0 NA    id5 id5 0 0 0 0 NA 0

N*(M+2) table

So the ultimate output that I want is

final.txt

id1 id1 1 1 0 1 2 0
id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 NA 0

I have written a script for R, but I have trouble writing one in unix.
I appreciate your help in advance!

Should the highlight value below be 0 1 (not 0 0)?

id1 id1 G A A C G G T C T T NA T  ->  id1 id1 0 0 A C G G T C T T NA T

On line 2 why does T T go to 1 1 for 2nd last allele and 0 0 on last allele?

id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0
1 Like

try this

awk 'BEGIN{i=1; while (getline < "bimfile")
		{
			if(match($5,$6)) 
				{	
					ind=i":"$5;
					a[ind]=0;
					ind=i":"$6;
					a[ind]=0;
					i++
				}
			else
				{
					if ($5==0 || $6==0)
						{
							if($5==0)
							{
								ind=i":NA";
								a[ind]="NA";
								ind=i":"$6;
								a[ind]=1;
								i++
							};

							if($6==0)
							{
								ind=i":"$5;
								a[ind]=0;
								ind=i":NA";
								a[ind]="NA";
								i++
							
							}
						}
					else
						{	
							ind=i":"$5;
							a[ind]=0;
							ind=i":"$6;
							a[ind]=1;
							i++
						}


				}
		}
}
{
	x=1;
	printf $1" "$2" ";
	for(j=3;j<=NF;j=j+2)
		{
			ind=substr(x,1)":"$j;
			ind1=substr(x,1)":"$(j+1);
			x++;
			$j=a[ind];
			$(j+1)=a[ind1];
			if($j~/NA/&&$(j+1)~/NA/)
				{
					printf "NA "
				}
			else
				{
					printf a[ind]+a[ind1]" "
				}
		};
	printf "\n"}' pedfile
1 Like

This is not working for me...

---------- Post updated at 09:24 AM ---------- Previous update was at 09:24 AM ----------

You are right, that's my mistake

You specify 0 NA -->NA but use 0 NA -->0 .

After someone takes the time to write some code for you, the least you can do is provide some useful feedback.

Regards,
Alister

Beacuse 5th SNP has 0 T alleles. T is the second allele expressed for that SNP, so T will be expressed as 1 in the ped file. Hence T T will be 1 1 in the ped file.

On line 2 why does T T go to 1 1 for 2nd last allele and 0 0 on last allele?

id3 id3 G G A A A G T C T T T T -> id3 id3 0 0 0 0 1 0 0 1 1 1 0 0

[/quote]

what the problem you are facing with my code
let me know.
i will try to rectify it.
for the input you have given its working.

I get errors like

Display all 509 possibilities? (y or n)
> A/&&$(j+1)~/NA/)
>
Display all 509 possibilities? (y or n)
> tf "NA "
>
Display all 509 possibilities? (y or n)
> tf a[ind]+a[ind1]" "
>
Display all 509 possibilities? (y or n)
> tf "\n"}' ${data}/ped.txt > ${data}/111.txt
awk: cmd. line:2: d]=0;
awk: cmd. line:2:  ^ syntax error
awk: cmd. line:4: d]=0;
awk: cmd. line:4:  ^ syntax error
awk: cmd. line:5: $6==0)
awk: cmd. line:5:      ^ syntax error
awk: cmd. line:7: d]="NA";
awk: cmd. line:7:  ^ syntax error
awk: cmd. line:9: d]=1;
awk: cmd. line:9:  ^ syntax error
awk: cmd. line:11: d]=0;
awk: cmd. line:11:  ^ syntax error
awk: cmd. line:13: d]="NA";
awk: cmd. line:13:  ^ syntax error
awk: cmd. line:15: d]=0;
awk: cmd. line:15:  ^ syntax error
awk: cmd. line:17: d]=1;
awk: cmd. line:17:  ^ syntax error
awk: cmd. line:22: d];
awk: cmd. line:22:  ^ syntax error
awk: cmd. line:23: d1];
awk: cmd. line:23:   ^ syntax error
awk: cmd. line:24: A/&&$(j+1)~/NA/)
awk: cmd. line:24:   ^ syntax error
awk: cmd. line:24: A/&&$(j+1)~/NA/)
awk: cmd. line:24:                ^ syntax error

Are you using the code on same set of data you have provided or to other data.
if you are using another set of data then please provide me that data.

As i am getting the output as

id1 id1 1 1 0 1 2 0
id3 id3 0 0 1 1 2 0
id5 id5 0 0 0 0 NA 0

for this code on data given by you.
I had tested on HP-UX

Try putting it in a file instead of pasting it into a terminal. It looks like it's trying to expand tabs in the text you're pasting.