Count specific characters at specific column positions

thienxho · December 4, 2012, 12:06pm

Hi all, I need help.

I have an input text file (input.txt) like this:

21	GTGCAACACCGTCTTGAGAGG	50
21	GACCGAGACAGAATGAAAATC	73
21	CGGGTCTGTAGTAGCAAACGC	108
21	CGAAAAATGAACCCCTTTATC	220
21	CGTGATCCTGTTGAAGGGTCG	259

I need to count how many times each of A, T, G and C occurs at each character position in column 2. In this example the sequences are always 21 characters long, but the length can vary.

The output (output.txt) needs to be:

A	0	1	1	1	3	3	1	2	0	3	1	1	2	1	1	2	3	2	3	0	0
T	0	1	1	0	1	1	1	1	2	0	1	3	0	2	1	1	1	1	1	2	0
G	2	3	2	2	1	0	1	1	1	1	3	0	1	1	1	2	1	2	0	2	2
C	3	0	1	2	0	1	2	1	2	1	0	1	2	1	2	0	0	0	1	1	3

I can do this in Excel, but my file is way bigger than Excel can handle.

Thanks!

vgersh99 · December 4, 2012, 12:33pm

awk -f thie.awk myFile
where thie.awk is:

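BEGIN {
  # characters to count; override with -v chars="..."
  if (!chars) chars="A T G C"
  nchars=split(chars, charsA, FS)
}
{
  # count each character of column 2 by its position
  width=length($2)
  for(i=1;i<=width;i++)
   arr[substr($2,i,1),i]++
}
END {
  # one output row per character, one column per position
  for(i=1;i<=nchars;i++) {
    printf("%s", charsA[i])
    for(j=1;j<=width;j++)
      printf("%s%d%s", OFS, arr[charsA[i],j], (j==width)?ORS:"")
  }
}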
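
For reference, here is a minimal sketch of how the same idea could be run end to end on the sample from the question. It makes two assumptions that are not in the script above: the output should be tab-separated like the requested output.txt (hence -v OFS='\t'), and sequences may differ in length, so it keeps the longest length seen instead of the length of the last line. The file names input.txt and output.txt are taken from the question.

```shell
# Recreate the sample input from the question (tab-separated columns).
printf '21\t%s\t%s\n' \
  GTGCAACACCGTCTTGAGAGG 50 \
  GACCGAGACAGAATGAAAATC 73 \
  CGGGTCTGTAGTAGCAAACGC 108 \
  CGAAAAATGAACCCCTTTATC 220 \
  CGTGATCCTGTTGAAGGGTCG 259 > input.txt

# Count each character per position of column 2; write tab-separated rows.
awk -v OFS='\t' '
BEGIN {
  if (!chars) chars = "A T G C"         # override with -v chars="..."
  nchars = split(chars, charsA, " ")
}
{
  n = length($2)
  if (n > width) width = n              # assumption: keep the longest length seen
  for (i = 1; i <= n; i++)
    arr[substr($2, i, 1), i]++
}
END {
  for (i = 1; i <= nchars; i++) {
    printf("%s", charsA[i])
    for (j = 1; j <= width; j++)
      printf("%s%d", OFS, arr[charsA[i], j])   # unseen combinations print as 0
    printf("%s", ORS)
  }
}' input.txt > output.txt
```

With 5 input rows, the four counts in each column add up to 5 as long as the sequences contain only A, T, G and C. To count other characters as well, something like -v chars="A T G C N" should work.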

thienxho · December 4, 2012, 12:58pm

Hi vgersh99,

You solved my problem.

I tested your code on a file of 800K rows and compared the result with my Excel count. Both gave the same output.

I really appreciate your help.