cmccabe
November 12, 2015, 2:59pm
1
I cannot figure out why there are 56,548 unique entries in test.bed, yet perl and awk see only 56,543, and that number is what my analysis sees as well. What happened to the 5 missing? Thank you :).
The file is attached as well.
cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ wc -l test.bed
56548 test.bed
cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ perl -nae '$seen{$F[3]}++;
END{
print "There are ", scalar keys %seen, " unique fourth fields\n";
}' test.bed
There are 56543 unique fourth fields
cmccabe@DTV-A5211QLM:~/Desktop/NGS/bed/bedtools$ awk '$4!=d{c++;d=$4}END{print c}' test.bed
56543
cjcox
November 12, 2015, 3:16pm
2
I sorted your file on the 4th column and saved it: sort -k4,4 test.bed >test.bed.sorted
Then I ran my solution minus the wc -l and saved that: sort -u -k4,4 test.bed >test.bed.uniq
Here are the diffs between the two files:
4748d4747
< chr11 47270217 47270425 chr11:47270217-47270425 unknown-1062|gc=64.9
4970d4968
< chr11 5248271 5248449 chr11:5248271-5248449 HBB-283|gc=55.1
24883d24880
< chr19 13010118 13010237 chr19:13010118-13010237 SYCE2-864|gc=47.9
33027d33023
< chr22 38153605 38154160 chr22:38153605-38154160 TRIOBP-610|gc=68.6
54957d54952
< chrX 33357316 33359011 chrX:33357316-33359011 DMD-581|gc=33.7
---------- Post updated at 02:16 PM ---------- Previous update was at 02:13 PM ----------
Adding just for clarity: take 'chr11:47270217-47270425' and fgrep that string in the original test.bed file.
$ fgrep 'chr11:47270217-47270425' test.bed
chr11 47270217 47270425 chr11:47270217-47270425 ACP2-1062|gc=64.9
chr11 47270217 47270425 chr11:47270217-47270425 unknown-1062|gc=64.9
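Feel free to do the same with the other values and you'll see that they are not unique. In other words, wc -l counts all 56,548 lines, while the perl and awk commands count the 56,543 distinct fourth fields; the 5 "missing" entries are lines whose fourth field repeats.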
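If you would rather list every repeated fourth field in one pass instead of running fgrep on each value, something along these lines should work. This is only a sketch: it builds a tiny stand-in file from three lines quoted in this thread (the temp file and the variable names bed, dups and counts are made up for the example), and it assumes the fields are whitespace-separated, which is what awk splits on by default.

```shell
# Hypothetical sample built from lines quoted in this thread; point the awk
# commands at the real test.bed instead.
bed=$(mktemp)
cat > "$bed" <<'EOF'
chr11 47270217 47270425 chr11:47270217-47270425 ACP2-1062|gc=64.9
chr11 47270217 47270425 chr11:47270217-47270425 unknown-1062|gc=64.9
chr11 5248271 5248449 chr11:5248271-5248449 HBB-283|gc=55.1
EOF

# Print a fourth-field value once, at the moment it is seen for the second time.
dups=$(awk 'seen[$4]++ == 1 { print $4 }' "$bed")
echo "$dups"      # chr11:47270217-47270425

# Total lines and distinct fourth fields, side by side.
counts=$(awk '!seen[$4]++ { n++ } END { print NR, n + 0 }' "$bed")
echo "$counts"    # 3 2

rm -f "$bed"
```

On the real file the second command should report the line total next to the distinct count, so the gap between the two numbers is the number of extra lines. Unlike comparing each line with the previous one, the seen[] array does not need the input to be sorted.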