I don't have time to test this thoroughly, so try it and let me know of any bugs:
CODE
nawk '
( FNR == 1 ){
    f++
    header = $0
    next
}
( f == 2 ){ printf("%s\n", header) ; f++ }
## Now we process each record for CC GG etc and apply our rules to them
( f == 3 ) {
    for( fi = 2; fi <= NF; fi++ ){
        gsub(/00/, ".", $fi)
        gsub(/A[CGT]|C[GT]|GT/, "0", $fi)
        gsub(/AA/, "-1", $fi)
        gsub(/TT/, "1", $fi)
        ## When min = -1 and max = 0, then both CC and GG = 1;
        ## When min = 0 and max = 1, then both CC and GG = 1;
        ## When both the min and max = 0, then CC = -1 and GG = 1;
        ## When min = -1 and max = 1 NO RULE DEFINED
        if( $fi == "CC" || $fi == "GG" ){
            if( cls[fn, 0] ) { min = 0 ; max = 0 }
            if( cls[fn, -1] )
                min = -1
            if( cls[fn, 1] )
                max = 1
            if( ( min == 0 ) && ( max == 0 ) ){
                if( $fi == "CC" )
                    $fi = -1
                else
                    $fi = 1
            }
            if( ( min == -1 ) && ( max == 0 ) )
                $fi = 1
            if( ( min == 0 ) && ( max == 1 ) )
                $fi = 1
        }
    }
    print $0
}
( f == 1 ){ ## First pass of file
    for( i = 2; i <= NF; i++ ){
        cls[NF, $i]++
    }
}
' infile infile
INPUT
I have gone back to the original input file as the one you list has "00" and "1" changed to "." in the first column.
cat infile
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
OUTPUT
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0
PS: To put code between code tags, highlight the code and then click the # symbol on the toolbar just above the text box.
Good luck
---------- Post updated at 03:51 PM ---------- Previous update was at 07:17 AM ----------
I have had time to look at this in a little more detail and can see it needed a fix.
I still can't get the output you require, but I am unsure whether that is because your example output is flawed, so I need you to look at the output below and tell me whether it is wrong.
I wrote the code to do the processing you want, but I added in danmero's code without really understanding whether it does what you want.
Here is the code with the fix:
nawk '
( FNR == 1 ){
    f++
    header = $0
    next
}
( f == 2 ){ printf("%s\n", header) ; f++ }
## Now we process each record for CC GG etc and apply our rules to them
( f == 3 ) {
    tmp = $1
    gsub(/00/, ".")
    gsub(/A[CGT]|C[GT]|GT/, "0")
    gsub(/AA/, "-1")
    gsub(/TT/, "1")
    $1 = tmp
    for( fi = 2; fi <= NF; fi++ ){
        ## When min = -1 and max = 0, then both CC and GG = 1;
        ## When min = 0 and max = 1, then both CC and GG = 1;
        ## When both the min and max = 0, then CC = -1 and GG = 1;
        ## When min = -1 and max = 1 NO RULE DEFINED
        if( $fi == "CC" || $fi == "GG" ){
            if( cls[fn, 0] ) { min = 0 ; max = 0 }
            if( cls[fn, -1] )
                min = -1
            if( cls[fn, 1] )
                max = 1
            if( ( min == 0 ) && ( max == 0 ) ){
                if( $fi == "CC" )
                    $fi = -1
                else
                    $fi = 1
            }
            if( ( min == -1 ) && ( max == 0 ) )
                $fi = 1
            if( ( min == 0 ) && ( max == 1 ) )
                $fi = 1
        }
    }
    print $0
}
( f == 1 ){ ## First pass of file
    for( i = 2; i <= NF; i++ ){
        cls[NF, $i]++
    }
}
' infile infile
Here is the input file:
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
Here is the output:
ID 1 2 3 4 5 6 7 8
83845676 0 0 0 0 -1 -1 -1 -1
83846900 -1 -1 1 1 -1 0 -1 1
83847041 -1 . 0 0 . 0 0 0
83847004 0 -1 1 1 -1 0 0 0
83847085 0 -1 0 0 0 0 0 0
83847118 . -1 1 1 . 1 -1 0
83847162 1 -1 1 0 0 0 0 0
83847165 -1 -1 . 1 -1 0 1 0
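One possible reason every CC comes out as -1 and every GG as 1: `fn` is never assigned anywhere in the script, so the `cls[fn, ...]` lookups are always empty and `min` and `max` are never set. An unset awk variable compares equal to 0, so the `min == 0 && max == 0` branch is taken each time. A minimal check of that behaviour, assuming plain `awk` instead of `nawk` (`check_unset` is just a name made up for this test):

```shell
check_unset() {
    awk 'BEGIN {
        # fn, min and max are deliberately never assigned, as in the script
        if (cls[fn, 0]) { min = 0; max = 0 }
        if ((min == 0) && (max == 0))
            print "min and max both compare equal to 0"
        else
            print "min and max differ from 0"
    }'
}

check_unset
```

The first pass also keys `cls` on `NF` (always 9 for this file) and on the raw genotype (`AG`, `CC`, ...), so even with `fn` set, the lookups for 0, -1 and 1 would probably still find nothing.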
The code I filched off danmero was based on your earlier spec:
Hello again,
Again, I apologize for the confusion. I made a mistake in the first post: the letters should be recoded to -1, 0, 1.
This is the tricky part. I need to recode the letters on a per column, alphabetical order basis.
There are several different combinations that can occur within a column:
AA, AC, CC = -1, 0, 1
AA, AG, GG = -1, 0, 1
AA, AT, TT = -1, 0, 1
CC, CG, GG = -1, 0, 1
CC, CT, TT = -1, 0, 1
GG, GT, TT = -1, 0, 1
Therefore anything with a mixed data point (AC, AG, AT, CG, CT, GT) will ALWAYS = 0, AA will ALWAYS = -1, and TT will ALWAYS = 1.
The problem comes when recoding CC and GG. As you can see, in some rows CC will come first in the alphabet and will be recoded as -1 (when the combo is CC, CG, GG). However, in some columns CC does not come first in the alphabet and will be coded as 1 (when the combo is AA, AC, CC).
The same problem occurs with GG. Is there any solution to this issue? I hope I explained it better this time!!
I don't understand this: you start by talking of columns and end talking of rows, so I am just assuming danmero understood you and posted code that did what you want.
Let me know if this output is correct or not.
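In case the quoted spec really does mean "per column", here is a rough sketch of how I would read it. This is not danmero's code and not the script above. It assumes plain `awk`, and `recode`, `lo` and `hi` are names invented for the example. The first pass remembers the lowest and highest allele letter seen in each column; the second pass maps a homozygous pair of the lowest letter to -1, of the highest letter to 1, and any mixed pair to 0. The spec does not say what to do when a column holds only CC or only GG, so those are left unchanged here.

```shell
# Sample data copied from the input file in this post.
cat > infile <<'EOF'
ID 1 2 3 4 5 6 7 8
83845676 AG AC AT GT CC AA CC CC
83846900 AA AA TT GG CC AG CC TT
83847041 AA 00 AT GT 00 AG CG CT
83847004 AG AA TT TT CC AG CG CT
83847085 AG CC AT GT CG AG CG CT
83847118 00 AA TT GG 00 GG CC CT
83847162 GG AA TT GT CG AG CG CT
83847165 AA AA 00 GG CC AG GG CT
EOF

# recode FILE: hypothetical helper, reads FILE twice.
recode() {
    awk '
    NR == FNR {                      # first pass: lowest/highest letter per column
        if (FNR > 1)
            for (i = 2; i <= NF; i++) {
                if ($i == "00") continue
                for (k = 1; k <= 2; k++) {
                    c = substr($i, k, 1)
                    if (!(i in lo) || c < lo[i]) lo[i] = c
                    if (!(i in hi) || c > hi[i]) hi[i] = c
                }
            }
        next
    }
    FNR == 1 { print; next }         # second pass: header goes out unchanged
    {
        for (i = 2; i <= NF; i++) {
            a = substr($i, 1, 1); b = substr($i, 2, 1)
            if ($i == "00")                     $i = "."
            else if (a != b)                    $i = 0
            else if (a == lo[i] && a != hi[i])  $i = -1
            else if (a == hi[i] && a != lo[i])  $i = 1
            else if (a == "A")                  $i = -1
            else if (a == "T")                  $i = 1
            # otherwise: CC or GG alone in its column, left unchanged
        }
        print
    }' "$1" "$1"
}

recode infile
```

With the input above, the row for 83847085 comes out as `0 1 0 0 0 0 0 0`: column 2 contains AA, AC and CC, so CC is the highest pair there and becomes 1. Likewise GG in column 4 becomes -1, because that column only contains GG, GT and TT.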
Cheers