Different counts between programs and commands

cmccabe · January 18, 2016, 2:46pm

In the attached file, if I count "aaaaaaaaaaaa" in either Notepad++ or Excel, I get 118,456.

However, when I do either

 grep -o aaaaaaaaaaaa 12A.txt | wc -w

or

awk '{ 
     for (i=1;i<=NF;i++)
         if ( $i == "aaaaaaaaaaaa")
         c++
     }
END{
print c}' 12A.txt

I get 116,441. I'm not sure which is right, or if there is a better way? Thank you :).

The search string varies (aaaaaaaaaaaa) but the input format (file to count) is always the same.

example of file

>hg19_refGene_NM_000016 range=chr1:76190032-76229363 5'pad=0 3'pad=0 strand=+ repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
>hg19_refGene_NM_000028 range=chr1:100316045-100389579 5'pad=0 3'pad=0 strand=+ repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa
>hg19_refGene_NM_000029 range=chr1:230838272-230850336 5'pad=0 3'pad=0 strand=- repeatMasking=none
>hg19_refGene_NM_000036 range=chr1:115215720-115238239 5'pad=0 3'pad=0 strand=- repeatMasking=none
aaaaaaaaaaaa
aaaaaaaaaaaa
aaaaaaaaaaaa

edit:

There are 2,015 AAAAAAAAAAAA
and 116,441 aaaaaaaaaaaa

for a total of 118,456. The awk and grep commands are case-sensitive, whereas the programs are not.

I use:

perl -076 -nE 'chomp; s/(.+)// && say qq{>$1}; s/\s//g; say $1 while /(a{12})/gi' sequences.txt > 12A.txt

to make the file that is counted. Since that match is case-insensitive, I guess I need a command that counts regardless of case. Maybe I can pipe (|) the output into a count? Thank you.

disedorgue · January 18, 2016, 2:56pm

Hi,
Excel and Notepad++ are case-insensitive by default, and your file contains 'AAAAAAAAAAAA'.

Regards.

RavinderSingh13 · January 18, 2016, 3:02pm

Hello cmccabe,

You were close with your awk script. Could you please add IGNORECASE and set it to 1 (true) as follows? I hope this helps you get the exact count.

awk 'BEGIN {
	IGNORECASE = 1    # GNU awk only: string comparisons ignore case
     }
     { 
     for (i=1;i<=NF;i++)
         if ( $i == "aaaaaaaaaaaa")
         c++
     }
END{
print c}' 12A.txt

Note that IGNORECASE only works in GNU awk; if you don't have that, you could do the following instead.

awk '
     { 
     for (i=1;i<=NF;i++)
         if ( tolower($i) == "aaaaaaaaaaaa")
         c++
     }
END{
print c}' 12A.txt
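
The same case-insensitive count can probably also be done with grep alone, by adding -i to the original command. Below is a minimal sketch; the sample file and the variable names (sample, count_all, count_lower, count_upper) are made up for illustration, and -o is assumed to be available (it is in GNU and BSD grep, but it is not required by POSIX).

```shell
# Build a small sample file in the same layout as 12A.txt (made-up data).
sample=$(mktemp)
printf '%s\n' \
  '>hg19_refGene_NM_000016 range=chr1:76190032-76229363' \
  'aaaaaaaaaaaa' \
  'AAAAAAAAAAAA' \
  'aaaaaaaaaaaa' > "$sample"

# -o prints every match on its own line and -i ignores case,
# so counting lines with wc -l gives the number of matches.
# tr strips the padding that some wc implementations add.
count_all=$(grep -o -i 'aaaaaaaaaaaa' "$sample" | wc -l | tr -d ' ')

# Case-sensitive counts, to see where a difference comes from.
count_lower=$(grep -o 'aaaaaaaaaaaa' "$sample" | wc -l | tr -d ' ')
count_upper=$(grep -o 'AAAAAAAAAAAA' "$sample" | wc -l | tr -d ' ')

echo "total=$count_all lower=$count_lower upper=$count_upper"
rm -f "$sample"
```

On the real file this would be grep -o -i aaaaaaaaaaaa 12A.txt | wc -l, and the lower-case and upper-case counts should add up to the total.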

Thanks,
R. Singh

cmccabe · January 18, 2016, 3:15pm

Thank you both :).