Scripting help to count words in lines

Giorgio_C · November 10, 2011, 9:32am

Hi everybody,

I have this biological problem to solve:

>Id.1
ACGTACANNNNNNNNNNNACGTGCNNNNNNNACTGTGGT
>Id.2
ACGGGT
>Id.3
ACGTNNNNNNNNNNNNACTGGGGG
>Id.4
ACGTGCGNNNNNNNNGGTCANNNNNNNNCGTGCAAANNNNN
........
....

These are nucleotide sequences containing "NNNN..." blocks, always of the same length but in different positions (we have about 300.000 different >Id sequences). A "NNNN..." block may occur once, twice, three times, or at most four times (or not at all). My question is:

Is there any way to count how many >Id sequences contain one "NNNN..." block, and how many contain 2, 3, 4, or 0, out of the total 300.000?

I mean something that would end up looking like this, for example, from 300.000 >Ids:

100.000 have one "NNNN..."
200.000 have two "NNNN..."
50.000 have three "NNNN..."
30.000 have four "NNNN..."
20.000 don't have any "NNNN..."

The lines are always of the same type:

>Id..
letters...... with or without a variable number of "NNNN..." blocks.

(Within a "NNNN..." block the number of Ns is always the same; the blocks are adaptors interspersed in the lines among the other letters A, C, G, T.)

I hope I have been clear and that someone can help me...

Please...!!!

vgersh99 · November 10, 2011, 9:40am

nawk -F'[N]+' '/^[^>]/{a[NF-1]++}END{for(i in a) print a[i] " have " i " NNs"}' myFile

Giorgio_C · November 10, 2011, 9:46am

Amazing !!! The best answer....you're great !!! Thank you very much for your help.............It works !!!

bartus11 · November 10, 2011, 9:49am

Try this for a start:

perl -ne '$count{s/N+//g || 0}++ if /^[^>]/;END{for $i (keys %count){print "$count{$i} have $i NNNNN...\n";}}' file

Giorgio_C · November 10, 2011, 9:59am

Thanks Bartus !!! The same result and speed !!! Very good in perl !!!
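
For anyone who wants the logic of the two one-liners spelled out, here is a minimal Python sketch of the same idea: skip the ">Id" header lines, count the runs of N in each sequence line, and tally how many sequences have each count. It is only an illustration, not something posted in this thread. The function name count_n_blocks and the sample list are made up for the example, and it assumes each sequence sits on a single line, as in the data shown in the question.

```python
import re
from collections import Counter


def count_n_blocks(lines):
    """Map 'number of N blocks in a sequence' -> 'number of sequences'.

    Hypothetical helper for illustration; assumes one sequence per line.
    """
    tally = Counter()
    for line in lines:
        line = line.strip()
        # Skip blank lines and the ">Id" header lines.
        if not line or line.startswith(">"):
            continue
        # Each maximal run of N characters counts as one block.
        tally[len(re.findall(r"N+", line))] += 1
    return tally


# The four example records from the question.
sample = [
    ">Id.1", "ACGTACANNNNNNNNNNNACGTGCNNNNNNNACTGTGGT",
    ">Id.2", "ACGGGT",
    ">Id.3", "ACGTNNNNNNNNNNNNACTGGGGG",
    ">Id.4", "ACGTGCGNNNNNNNNGGTCANNNNNNNNCGTGCAAANNNNN",
]

for blocks, n_seqs in sorted(count_n_blocks(sample).items()):
    print(f"{n_seqs} have {blocks} N blocks")
```

To run it on a real file, something like `with open("myFile") as fh: count_n_blocks(fh)` should work, since the function only iterates over lines.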