In the attached file, if I count the occurrences of "aaaaaaaaaaaa" in either Notepad++ or Excel, I get 118,456.
However, when I run either
grep -o aaaaaaaaaaaa 12A.txt | wc -w
or
awk '{
for (i=1;i<=NF;i++)
if ( $i == "aaaaaaaaaaaa")
c++
}
END{
print c}' 12A.txt
I get 116,441. I'm not sure which is right, or whether there is a better way. Thank you :).
The search string (here aaaaaaaaaaaa) varies, but the format of the input file being counted is always the same.
Example of the file:
>hg19_refGene_NM_000016 range=chr1:76190032-76229363 5'pad=0 3'pad=0 strand=+ repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
>hg19_refGene_NM_000028 range=chr1:100316045-100389579 5'pad=0 3'pad=0 strand=+ repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
>hg19_refGene_NM_000029 range=chr1:230838272-230850336 5'pad=0 3'pad=0 strand=- repeatMasking=none
>hg19_refGene_NM_000036 range=chr1:115215720-115238239 5'pad=0 3'pad=0 strand=- repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
Edit:
There are 2,015 AAAAAAAAAAAA
and 116,441 aaaaaaaaaaaa,
for a total of 118,456. The awk and grep commands are case-sensitive, whereas Notepad++ and Excel are not.
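The split between the two cases can be checked from the shell. This is only a minimal sketch: the small sample file is a made-up stand-in for 12A.txt, and it assumes a grep that supports -o and -i (GNU and BSD grep both do).

```shell
# Hypothetical stand-in for 12A.txt: two lowercase hits and one uppercase hit.
sample=$(mktemp)
printf '%s\n' '>header_1' 'aaaaaaaaaaaa' 'AAAAAAAAAAAA' '>header_2' 'aaaaaaaaaaaa' > "$sample"

# Case-sensitive counts, one per case (tr strips the padding some wc versions add).
lower=$(grep -o 'aaaaaaaaaaaa' "$sample" | wc -l | tr -d ' ')
upper=$(grep -o 'AAAAAAAAAAAA' "$sample" | wc -l | tr -d ' ')

# Case-insensitive count: -i makes grep match both cases in one pass.
total=$(grep -oi 'aaaaaaaaaaaa' "$sample" | wc -l | tr -d ' ')

echo "lower=$lower upper=$upper total=$total"
rm -f "$sample"
```

On the real file, the lowercase and uppercase counts should add up to the case-insensitive total, which would match what Notepad++ and Excel report.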
I use:
perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{12})/gi' sequences.txt > 12A.txt
to make the file that is counted. Since that command is case-insensitive, I guess I need a count that ignores case as well. Maybe I can pipe (|) the output
straight into a count? Thank you.