Counts not matching in file

cmccabe · November 12, 2015, 2:59pm

I cannot figure out why test.bed has 56,548 lines, which I expected to all be unique entries, yet perl and awk see only 56,543 unique values, and that number is what my analysis sees as well. What happened to the 5 missing? Thank you :).

The file is attached as well.

cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ wc -l test.bed
56548 test.bed

cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ perl -nae '$seen{$F[3]}++;
    END{
        print "There are ", scalar keys %seen, " unique fourth fields\n";
    }' test.bed
There are 56543 unique fourth fields

cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ awk '$4!=d{c++;d=$4}END{print c}' test.bed
56543

cjcox · November 12, 2015, 3:16pm

I sorted your file on the 4th column and saved it: sort -k4,4 test.bed >test.bed.sorted
Then I ran my solution minus the wc -l and saved that: sort -u -k4,4 test.bed >test.bed.uniq
Here are the diffs between the two files:

4748d4747
< chr11	47270217	47270425	chr11:47270217-47270425	unknown-1062|gc=64.9
4970d4968
< chr11	5248271	5248449	chr11:5248271-5248449	HBB-283|gc=55.1
24883d24880
< chr19	13010118	13010237	chr19:13010118-13010237	SYCE2-864|gc=47.9
33027d33023
< chr22	38153605	38154160	chr22:38153605-38154160	TRIOBP-610|gc=68.6
54957d54952
< chrX	33357316	33359011	chrX:33357316-33359011	DMD-581|gc=33.7

---------- Post updated at 02:16 PM ---------- Previous update was at 02:13 PM ----------

adding just for clarity... so take 'chr11:47270217-47270425' and fgrep that string in the original test.bed file.

$ fgrep 'chr11:47270217-47270425' test.bed
chr11	47270217	47270425	chr11:47270217-47270425	ACP2-1062|gc=64.9
chr11	47270217	47270425	chr11:47270217-47270425	unknown-1062|gc=64.9

Feel free to do the same with the other values and you'll see that they are not unique either.
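
If it helps, here is a minimal sketch for listing every repeated 4th-column value in one pass instead of grepping them one at a time. It assumes the fields are tab-separated (which is what cut splits on by default). The temporary sample file and the variable names (sample, total, unique, unique_awk, dups) are only for illustration; the three lines in it are copied from the output above.

```shell
# Small sample built from three lines quoted in this thread (tab-separated).
sample=$(mktemp)
printf 'chr11\t47270217\t47270425\tchr11:47270217-47270425\tACP2-1062|gc=64.9\n' > "$sample"
printf 'chr11\t47270217\t47270425\tchr11:47270217-47270425\tunknown-1062|gc=64.9\n' >> "$sample"
printf 'chr11\t5248271\t5248449\tchr11:5248271-5248449\tHBB-283|gc=55.1\n' >> "$sample"

# Total number of lines, which is what wc -l reports.
total=$(wc -l < "$sample" | tr -d ' ')

# Number of distinct values in the 4th field.
unique=$(cut -f4 "$sample" | sort -u | wc -l | tr -d ' ')

# Same count with awk, using an array so it does not depend on line order.
unique_awk=$(awk -F'\t' '!seen[$4]++ { c++ } END { print c }' "$sample")

# 4th-field values that occur more than once.
dups=$(cut -f4 "$sample" | sort | uniq -d)

echo "lines=$total unique=$unique unique_awk=$unique_awk"
echo "duplicated: $dups"
rm -f "$sample"
```

On the real file the equivalent would be: cut -f4 test.bed | sort | uniq -d
Going by the diff above, that should print the five region names that appear twice. The difference between the line count and the unique count is just the number of extra lines sharing a 4th field with another line.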

cmccabe · November 12, 2015, 3:56pm

Thank you