Counting characters at each position

amits22 · February 8, 2013, 6:18am

Hi all, here's a question from a newbie.

I have data like this: a set of short DNA sequences, one per line.

GAATCCGGAAACAGCAACTTCAAANCA
GTNATTCGGGCCAAACTGTCGAA
TTNGGCAACTGTTAGAGCTCATGCGACA
CCTGCTAAACGAGTTCGAGTTGAANGA
TTNCGGAAGTGGTCGCTGGCACGG
ACNTGCATGTACGGAGTGACGAAACC

I usually have to count the frequency of each character in the whole data set, which I do with

awk -F "" '{ for ( i=1; i<=NF; i++) freq[$i]++} END {for (a in freq) print a, freq[a]}'

Now I am almost clueless when I need to count the frequency of characters at each position. I'll try to show what I mean with a subset of the data below:

GAATCCGGAAACAGCAACTTCAAANCA
GTNATTCGGGCCAAACTGTCGAA
TTNGGCAACTGTTAGAGCTCATGCGACA
CCTGCTAAACGAGTTCGAGTTGAANGA
TTNCGGAAGTGGTCGCTGGCACGG
         
1st position
G = 1
T = 2
C = 1
A = 1
2nd position
T = 3
C = 2
and so on

Any ideas or help would be most appreciated. Please tell me if I am not stating the problem clearly.

Thank you,

Amit

RudiC · February 8, 2013, 6:50am

Try this as a starting point:

$ awk -F "" '     {for ( i=1; i<=NF;  i++) {freq[$i,i]++; Base[$i]} if (NF > max) max = NF}
             END  {for ( i=1; i<=max; i++)
                    {for (a in Base) print "Pos: ", i, ", Base: ", a, ", Freq: ", freq[a,i]}}
            ' file
Pos:  1 , Base:  A , Freq:  1
Pos:  1 , Base:  C , Freq:  1
Pos:  1 , Base:  G , Freq:  2
Pos:  1 , Base:  N , Freq:  
Pos:  1 , Base:  T , Freq:  2
Pos:  2 , Base:  A , Freq:  1
Pos:  2 , Base:  C , Freq:  2
Pos:  2 , Base:  G , Freq:  
Pos:  2 , Base:  N , Freq:  
Pos:  2 , Base:  T , Freq:  3
.
.
.
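
In the output above, the Freq column is empty wherever a base never occurs at a position, because an unset awk array element prints as an empty string. The following is a minimal sketch of a variant that prints 0 instead, not a replacement for the answer above. It also uses substr() rather than an empty field separator, since splitting on an empty FS is not defined by POSIX (though gawk and mawk support it). The function name count_by_position and the file name seqs.txt are made up for this illustration.

```shell
# Sample data from the first post, written to a scratch file (the name is arbitrary).
cat > seqs.txt <<'EOF'
GAATCCGGAAACAGCAACTTCAAANCA
GTNATTCGGGCCAAACTGTCGAA
TTNGGCAACTGTTAGAGCTCATGCGACA
CCTGCTAAACGAGTTCGAGTTGAANGA
TTNCGGAAGTGGTCGCTGGCACGG
ACNTGCATGTACGGAGTGACGAAACC
EOF

# Hypothetical helper: per-position base counts, with 0 for absent bases.
count_by_position() {
    awk '
    {
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)   # character at position i
            freq[c, i]++           # count per (base, position) pair
            Base[c]                # a bare reference is enough to record the base
        }
        if (n > max) max = n       # longest line seen so far
    }
    END {
        for (i = 1; i <= max; i++)
            for (a in Base)        # iteration order of "in" is unspecified
                print "Pos:", i, "Base:", a, "Freq:", freq[a, i] + 0   # + 0 turns unset into 0
    }' "$@"
}

count_by_position seqs.txt
```

Because the order of for (a in Base) is unspecified, the bases within one position may come out in a different order than shown above; pipe the result through sort if a fixed order matters.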

amits22 · February 8, 2013, 10:24am

Thank you so much, RudiC. I didn't know about this trick:

{freq[$i,i]++; Base[$i]}

I understand it takes up your time, but could I ask you to explain the part above a bit?

Best,

Amit

RudiC · February 8, 2013, 10:33am

Actually, it's not a trick but more a detour born out of sheer despair. While awk (at least the one I use, mawk) does accept if ( (i,j) in freq ), it does not allow for ( (i,j) in freq ). That's why I introduced the second array, just to keep hold of the base characters.
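
For readers wondering how else to walk such an array: awk stores freq[a,i] under a single string key, the two subscripts joined by the built-in SUBSEP variable. A plain for (key in freq) loop therefore still works, and split() can take the key apart again. The sketch below illustrates that alternative under a few assumptions: count_pairs is a made-up name, the three-line input is a toy example, and substr() is swapped in for the empty field separator to stay within POSIX.

```shell
# Hypothetical helper: one array only, keys split back into base and position.
count_pairs() {
    awk '
    {
        n = length($0)
        for (i = 1; i <= n; i++)
            freq[substr($0, i, 1), i]++   # key is stored as base SUBSEP position
    }
    END {
        for (key in freq) {
            split(key, part, SUBSEP)      # part[1] = base, part[2] = position
            print part[2], part[1], freq[key]
        }
    }' "$@" | sort -k1,1n -k2,2           # sort by position, then by base
}

printf 'GAT\nGTN\nTTN\n' | count_pairs
```

For the toy input this lists each position, base, and count, starting with "1 G 2" and "1 T 1". Unlike the two-array version, only combinations that actually occur are printed.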

Scrutinizer · February 8, 2013, 11:18am

Alternatively you could try:

awk '{for(i=1; i<=NF; i++) A[i OFS $i]++} END{for(i in A) print i, A[i]}' FS= file | sort -n

amits22 · February 8, 2013, 11:53am

Thank you, very clever and concise. Sorry, I could not work out whether this

A[i OFS $i]++

creates another array.

Scrutinizer · February 8, 2013, 12:03pm

You're welcome. There is only a single array. The statement adds 1 to the array element whose single index consists of the position number and the base character, separated by OFS (the output field separator), which defaults to a single space. So, for example, A["3 N"]++ and A["28 A"]++.
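
To make the single-index point concrete, here is a small sketch that builds such a key by hand and compares it with the comma form A[i, base], which joins its subscripts with SUBSEP rather than OFS. The function name show_keys and the variable names inside it are hypothetical and chosen only for this illustration.

```shell
# Hypothetical demo: the index is one plain string, built by concatenation.
show_keys() {
    awk 'BEGIN {
        i = 3; base = "N"
        k = i OFS base                 # concatenation gives the string "3 N"
        A[k]++; A[k]++                 # same element incremented twice
        print "key: [" k "] count: " A[k]

        # The comma form would look for "3" SUBSEP "N", which is a different key.
        if ((i, base) in A)
            print "comma form finds the element"
        else
            print "comma form uses a different key"

        n = 0
        for (x in A) n++               # count the elements actually stored
        print "elements: " n
    }'
}

show_keys
```

The first line printed is "key: [3 N] count: 2", and the array holds exactly one element, so no second array is involved.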