Alphabet counting

Lucky_Ali · January 26, 2012, 9:24pm

I have a text file in the following format

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAAA
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT

Each line has 32 characters and contains only two of the letters A, T, G and C, each possibly occurring several times (lowercase a, t, g, c may also appear). Some lines may also contain a '-'. I would like to count the occurrences of each letter in a line and output the least frequent letter together with its position number(s) in the line:

CCCCCGCCCCCCCCCCcCCCCCCCCCCCCCCC G 7
AAAATAAAAAAAAAAAaAAAAAAAAAAAAAAA T 5
TGTTTTTTTTTTTTGGtTTTTTTTTTTTTTTT G 2 15 16
TTTT-TTTTTTTTTCTtTTTTTTTTTTTTTTT C 15

Please let me know the best way to do it using awk.
Thanks

agama · January 26, 2012, 10:27pm

Have a go with this:

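awk '
    {
        # split the line into single characters (an empty separator does
        # this in gawk and mawk; it is not guaranteed by POSIX)
        n = split( $0, a, "" );
        for( i = 1; i <= n; i++ )
        {
            count[a[i]]++;                                  # occurrences of each character
            pos[a[i]] = sprintf( "%s%d ", pos[a[i]], i );   # positions where it was seen
        }

        # pick the uppercase letter with the fewest occurrences
        min = "";
        for( x in count )
        {
            if( match( x, "[ACGT]" ) && (min == "" || count[x] < count[min] ) )
                min = x;
        }

        print $0, min, pos[min];

        # reset the per-line arrays before the next line
        delete count;
        delete pos;
    }
' input-file >output-file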
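
If your awk does not split on an empty separator, a variant along these lines might work instead. This is only a sketch: the function name rarest_positions is made up for illustration. It assumes that lowercase letters and '-' should simply be ignored (as the uppercase-only test above does) and that positions are counted from 1. It walks the line with substr() and clears the arrays with split("", arr), which should be available in any POSIX awk.

```shell
# rarest_positions: hypothetical helper name, not part of the original post.
# Prints each input line followed by its least frequent uppercase letter
# (A, C, G or T) and the 1-based positions where that letter occurs.
rarest_positions() {
    awk '
    {
        split( "", count );     # portable way to empty an array
        split( "", pos );

        n = length( $0 );
        for( i = 1; i <= n; i++ )
        {
            ch = substr( $0, i, 1 );
            if( ch !~ /[ACGT]/ )        # assumption: skip lowercase and "-"
                continue;
            count[ch]++;
            pos[ch] = pos[ch] " " i;    # each position is stored with a leading space
        }

        min = "";
        for( x in count )
        {
            if( min == "" || count[x] < count[min] )
                min = x;
        }

        print $0, min pos[min];
    }
    ' "$@"
}
```

It would be called as rarest_positions input-file >output-file. If two letters occur equally often, which one is reported depends on the traversal order of the awk in use, so a tie-break rule would have to be added if that case can happen in your data.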