Hi all, I need help.
I have an input text file (input.txt) like this:
21 GTGCAACACCGTCTTGAGAGG 50
21 GACCGAGACAGAATGAAAATC 73
21 CGGGTCTGTAGTAGCAAACGC 108
21 CGAAAAATGAACCCCTTTATC 220
21 CGTGATCCTGTTGAAGGGTCG 259
Now I need to count how many times A, T, G and C occur at each character position in column 2. In this case column 2 is always 21 characters long, but the length can vary.
The output (output.txt) needs to be:
A 0 1 1 1 3 3 1 2 0 3 1 1 2 1 1 2 3 2 3 0 0
T 0 0 1 0 1 1 1 1 2 0 1 2 0 1 0 1 1 1 1 2 0
G 2 3 2 2 1 0 1 1 1 1 3 0 1 1 1 2 1 2 0 2 2
C 3 0 1 2 0 1 2 1 2 1 0 1 2 1 2 0 0 0 1 1 3
I can do this in Excel, but my file is way bigger than Excel can handle.
Thanks!
awk -f thie.awk myFile
where thie.awk is:
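BEGIN {
    # Symbols to report; override with: awk -v chars="A C G T" -f thie.awk myFile
    if (!chars) chars="A T G C"
    nchars=split(chars, charsA, FS)
}
{
    # Count each character of column 2 by its position
    width=length($2)
    for(i=1;i<=width;i++)
        arr[substr($2,i,1),i]++
}
END {
    # One output row per symbol; width is the length of the last line read
    for(i=1;i<=nchars;i++) {
        printf("%s", charsA[i])
        for(j=1;j<=width;j++)
            printf("%s%d%s", OFS, arr[charsA[i],j], (j==width)?ORS:"")
    }
}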
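One thing to watch: in the script above, width ends up as the length of column 2 on the last line read, so if the sequences in one file differ in length, positions beyond that length are not printed. Here is a minimal sketch of a variant that remembers the longest sequence instead. The wrapper name count_bases and the variable names sym, cnt and len are only illustrative and not part of the script above, and it assumes space-separated output is what you want.

```shell
# count_bases FILE [SYMBOLS]
# Hypothetical wrapper (name is an assumption, not from the thread).
# Prints one row per symbol with its count at each position of column 2.
count_bases() {
    awk -v chars="${2:-A T G C}" '
    BEGIN { n = split(chars, sym, " ") }
    {
        len = length($2)
        if (len > width) width = len      # remember the longest sequence seen
        for (i = 1; i <= len; i++)
            cnt[substr($2, i, 1), i]++    # tally character by position
    }
    END {
        for (i = 1; i <= n; i++) {
            printf "%s", sym[i]
            for (j = 1; j <= width; j++)
                printf " %d", cnt[sym[i], j]   # unset counts print as 0
            printf "\n"
        }
    }' "$1"
}
```

Usage would be something like count_bases input.txt > output.txt. Shorter sequences simply contribute nothing to the positions they do not reach.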
Hi vgersh99,
You solved my problem.
I tested your code and compared it to my Excel count on a file of 800K rows. Both gave the same output.
I really appreciate your help.