Hello,
Please help me create the following report.
I have a matrix (file):
- S1 S2 S3 S4
M1 AA AA TT -
M2 AG AG AA GG
M3 GG TT - -
a first lookup table (lookup1):
M3 chr7 4.456
M1 chr7 28.9
M2 chr8 129.678
a second lookup table (lookup2):
S1 GGHBBGG/DEDD(@DCCD)
S2 GGHBBGG/DEDD(@DCCD)//B-
S3 GGHBBGG/DEDD(@HH?)//B1@NNN
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1
For each row I want to count each nucleotide (A, T, C and G) and compute a few variables, defined as:
total = NF-1 (number of fields in the row minus the name column)
missing = total number of "-"
mono = total number of (AA + GG + CC + TT)
mix = total number of (AT + AC + AG + CT + CG + GT)
m = 2nd highest count among (A,T,C,G) / total count of (A,T,C,G)
data = total - missing
Example calculation of m for M1: there are 4 As and 2 Ts, and the other counts are 0.
T is the 2nd highest (2) and the A,T,G,C total is 6, so m for M1 = 2 / 6 = 0.33
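A minimal sketch of the m calculation for row M1, under two assumptions: every genotype is either two letters or "-", and when two nucleotides tie for the highest count, the 2nd highest equals that same count (this matches m = 0.5 for M2 in the report). The variable names (n, hi, second, sum) are only illustrative.

```shell
# Row M1 from the matrix: the name, then one genotype per column.
m1=$(printf 'M1 AA AA TT -\n' | awk '
{
    split("", n)                          # clear the per-row nucleotide counts
    for (i = 2; i <= NF; i++) {
        if ($i == "-") continue           # a missing call adds no nucleotides
        n[substr($i, 1, 1)]++             # first letter of the genotype
        n[substr($i, 2, 1)]++             # second letter of the genotype
    }
    hi = second = sum = 0
    for (k in n) {                        # track the highest and 2nd highest counts
        sum += n[k]
        if (n[k] > hi) { second = hi; hi = n[k] }
        else if (n[k] > second) second = n[k]
    }
    printf "%s A=%d T=%d G=%d C=%d m=%.2f\n", $1, n["A"], n["T"], n["G"], n["C"], (sum ? second / sum : 0)
}')
echo "$m1"
```

For M1 this prints A=4 T=2 G=0 C=0 and m=0.33, the same numbers as the example calculation.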
Here is what my report should look like.
I can't seem to line up the table for some reason; it is space delimited.
NAME M1 M2 M3
CHR chr7 chr8 chr7
POS 28.9 129.678 4.456
A 4 4 0
T 2 0 2
G 0 4 2
C 0 0 0
- 1 0 2
m 0.33 0.5 0.5
data 3 4 2
mono 3 2 2
mixed 0 2 0
total 4 4 4
S1 GGHBBGG/DEDD(@DCCD) AA AG GG
S2 GGHBBGG/DEDD(@DCCD)//B- AA AG TT
S3 GGHBBGG/DEDD(@HH?)//B1@NNN TT AA -
S4 GGHBBGG/DEDD(@DCCDH)#-BCF1 - GG -
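One possible way to line up a space-delimited table like this is a two-pass awk: the first pass records the widest value in each column, and the second pads every field to that width. This is only a sketch; report_sample.txt is a hypothetical file name and holds a shortened copy of the rows above.

```shell
# Hypothetical sample file with a few rows of the report.
cat > report_sample.txt <<'EOF'
NAME M1 M2 M3
CHR chr7 chr8 chr7
POS 28.9 129.678 4.456
m 0.33 0.5 0.5
EOF

# The file is read twice: pass 1 (NR==FNR) measures column widths,
# pass 2 left-justifies each field and separates columns by two spaces.
aligned=$(awk '
NR == FNR { for (i = 1; i <= NF; i++) if (length($i) > w[i]) w[i] = length($i); next }
{
    line = ""
    for (i = 1; i <= NF; i++) line = line sprintf("%-" w[i] "s%s", $i, (i < NF ? "  " : ""))
    sub(/ +$/, "", line)                  # drop padding after the last field
    print line
}' report_sample.txt report_sample.txt)
printf '%s\n' "$aligned"
```

The S1 to S4 rows carry one extra field (the description), so in the full report their genotypes would land one column to the right of the M1 to M3 values unless that field is handled separately.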
I tried this, please help
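awk '
NR == FNR { loc[$1] = $2 FS $3; next }      # lookup1: name -> "chr pos"
FNR == 1 {                                  # header row: label the new columns
    $1 = "NAME CHR POS A T G C - m data mono mixed total"
    print; next
}
{
    split("", n); mono = mix = missing = 0  # reset the counters for every row
    for (i = 2; i <= NF; i++) {
        if ($i == "-") { missing++; continue }
        x = substr($i, 1, 1); y = substr($i, 2, 1)
        n[x]++; n[y]++                      # AA adds 2 to A, AG adds 1 to A and 1 to G
        if (x == y) mono++; else mix++
    }
    hi = second = sum = 0                   # m = 2nd highest count / sum of counts
    for (k in n) {
        sum += n[k]
        if (n[k] > hi) { second = hi; hi = n[k] }
        else if (n[k] > second) second = n[k]
    }
    total = NF - 1
    stats = sprintf("%d %d %d %d %d %.2g %d %d %d %d", n["A"], n["T"], n["G"], n["C"],
                    missing, (sum ? second / sum : 0), total - missing, mono, mix, total)
    $1 = $1 FS loc[$1] FS stats             # name, chr, pos, then the statistics
    print
}' lookup1 file |
awk '
{ for (i = 1; i <= NF; i++) a[NR, i] = $i } # store every cell
NF > p { p = NF }                           # remember the widest row
END {                                       # print each column as a row
    for (j = 1; j <= p; j++) {
        str = a[1, j]
        for (i = 2; i <= NR; i++) str = str " " a[i, j]
        print str
    }
}' |
awk 'NR == FNR { l[$1] = $2; next } $1 in l { $1 = $1 FS l[$1] } 1' lookup2 - > final_report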