Scripting help to count word occurrences in lines

Hi everybody,

I have this biological problem to solve:

> Id.1

These are nucleotide sequences with some "NNNN..." blocks, always of the same length but in different positions (we have about 300.000 different >Id sequences). The "NNNN..." block may occur once, twice, three times, or at most four times (or not at all). My question is:

Is there any way to count how many >Id sequences contain one "NNNN..." block, and how many contain 2, 3, 4, or 0, out of the total 300.000?

I mean something that in the end would give, for example, from 300.000 >Ids:

100.000 have one "NNNN..."
200.000 have two "NNNN..."
50.000 have three "NNNN..."
30.000 have four "NNNN..."
20.000 don't have any "NNNN..."

The lines always have the same format:

letters...... with or without a variable number of "NNNN..." blocks.

(In the "NNNN..." block the number of Ns is always the same; they are adaptors, found in the lines among the other letters A, C, G, T.)

I hope I have been clear and that someone can help me...

Please...!!! :slight_smile:

nawk -F'[N]+' '/^[^>]/{a[NF-1]++}END{for(i in a) print a[i] " have " i " NNs"}' myFile
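
To see why this works: with a run of Ns as the field separator, a sequence line containing k blocks splits into k+1 fields, so NF-1 is the number of blocks on that line. Below is a small sketch of the same idea that also prints each group's share of the total. The function name count_n_blocks and the four sample records are made up for illustration, and it assumes every sequence sits on a single line after its >Id header.

```shell
#!/bin/sh
# Sketch only: count_n_blocks is a hypothetical helper name, and the
# sample records below are invented, not taken from the real data.
count_n_blocks() {
    # $1: file with ">Id" header lines, each followed by one sequence line
    awk -F'[N]+' '
        /^[^>]/ { blocks[NF - 1]++; total++ }   # NF-1 = number of N runs on this line
        END {
            for (k in blocks)
                printf "%d have %d N-blocks (%.1f%%)\n", blocks[k], k, 100 * blocks[k] / total
        }
    ' "$1" | sort -k3,3n                        # order the groups by number of blocks
}

tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>Id.1
ACGTNNNNACGT
>Id.2
ACGTACGT
>Id.3
ACNNNNGTNNNNAC
>Id.4
GGNNNNTT
EOF
count_n_blocks "$tmp"
rm -f "$tmp"
```

On the four sample records this reports one sequence with 0 blocks, two with 1 block and one with 2 blocks. Note that any run of one or more Ns counts as a single block, so two adaptors sitting directly next to each other would be counted once.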

Amazing !!! The best, great !!! Thank you very much for your help... It works !!!

Try this for a start:

perl -ne '$count{s/N+//g || 0}++ if /^[^>]/;END{for $i (keys %count){print "$count{$i} have $i NNNNN...\n";}}' file

Thanks Bartus !!! The same result and speed !!! Very good in perl !!!