Hi everybody,
I have this biological problem to solve:
>Id.1
ACGTACANNNNNNNNNNNACGTGCNNNNNNNACTGTGGT
>Id.2
ACGGGT
>Id.3
ACGTNNNNNNNNNNNNACTGGGGG
>Id.4
ACGTGCGNNNNNNNNGGTCANNNNNNNNCGTGCAAANNNNN
........
....
These are nucleotide sequences containing "NNNN..." blocks, always of the same length but at different positions (we have about 300.000 different >Id sequences). A "NNNN..." block may occur once, twice, three times, or at most four times (or not at all). My question is:
Is there any way to count how many >Id entries contain one "NNNN..." block, and how many contain 2, 3, 4 or 0, out of the total 300.000?
I mean something that in the end would give, for example, from 300.000 >Ids:
100.000 have one "NNNN..."
200.000 have two "NNNN..."
50.000 have three "NNNN..."
30.000 have four "NNNN..."
20.000 don't have any "NNNN..."
The lines always have the same format:
>Id..
letters... with or without a variable number of "NNNN..." blocks.
(Within a "NNNN..." block the number of Ns is always the same in every line; the blocks are adaptors located among the other letters A, C, G, T.)
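To make the goal concrete, here is a rough sketch in Python of the kind of counting I have in mind. It is only an illustration under some assumptions: a "block" is taken to be any unbroken run of N characters (so two adaptors sitting directly next to each other would be counted as one), a sequence may be wrapped over several lines after its >Id header, and the function name count_n_blocks is just something made up for this example.

```python
import re
from collections import Counter

# Assumption: a "block" is any maximal run of consecutive N characters.
N_RUN = re.compile(r"N+")


def count_n_blocks(lines):
    """Map 'number of N blocks in a record' to 'number of records'."""
    histogram = Counter()
    parts = None  # stays None until the first '>' header is seen

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if parts is not None:
                histogram[len(N_RUN.findall("".join(parts)))] += 1
            parts = []
        elif parts is not None:
            # Sequence line: may be one of several lines for this record.
            parts.append(line.upper())

    # Close the last record in the input.
    if parts is not None:
        histogram[len(N_RUN.findall("".join(parts)))] += 1
    return histogram


# The four example records from this post.
example = [
    ">Id.1", "ACGTACANNNNNNNNNNNACGTGCNNNNNNNACTGTGGT",
    ">Id.2", "ACGGGT",
    ">Id.3", "ACGTNNNNNNNNNNNNACTGGGGG",
    ">Id.4", "ACGTGCGNNNNNNNNGGTCANNNNNNNNCGTGCAAANNNNN",
]
counts = count_n_blocks(example)
for blocks in sorted(counts):
    print(f"{counts[blocks]} sequence(s) have {blocks} N block(s)")
```

For the real data the same function could presumably be given an open file object instead of the list, since it only iterates over lines. If the exact adaptor length is known, the pattern could be changed to match exactly that many Ns, which would also separate adjacent adaptors.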
I hope I have been clear and that someone can help me...
Please...!!!