Count of unique lines in field 4

cmccabe · November 11, 2015, 5:42pm

When I use the awk below to count the unique values in $4 of the input, it seems to work: the answer is 3, because $4 has only 3 distinct values across all the entries. However, when I use the same command on the actual data I get 56,536, and I know the answer should be 56,548. Is there a better way to count the unique values? Thank you :).

input

chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    1    20
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    2    20
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    3    22
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    4    22
chr1    955543    955763    chr1:955543-955763    AGRN-6|gc=75    5    22
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    1    186
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    2    201
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    3    201
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    271    176
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    272    175
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    273    175
chr1    957571    957852    chr1:957571-957852    AGRN-7|gc=61.2    274    175
chr1    970621    970740    chr1:970621-970740    AGRN-8|gc=57.1    46    280
chr1    970621    970740    chr1:970621-970740    AGRN-8|gc=57.1    47    280

mjf · November 11, 2015, 5:49pm

How about including your code so we can review and help you out?

cjcox · November 11, 2015, 5:54pm

If the data is in test.txt (and if not using awk is OK):

sort -u -k4,4 test.txt | wc -l
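A variation on the same idea, as a rough sketch: pull out field 4 first, then de-duplicate and count. Setting LC_ALL=C is an assumption on my part (it was not in the command above); it forces a plain byte-by-byte comparison so that locale collation rules cannot influence which keys are treated as equal. The sample rows are shortened from the input posted above.

```shell
# Sample data: three distinct values in field 4 (shortened from the posted input).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 20
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 186
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1 46 280
EOF

# Same approach as above, with bytewise comparison forced.
by_key=$(LC_ALL=C sort -u -k4,4 "$tmp" | wc -l | tr -d ' ')

# Extract field 4, de-duplicate the bare values, and count them.
by_field=$(awk '{print $4}' "$tmp" | LC_ALL=C sort -u | wc -l | tr -d ' ')

echo "sort -k4,4: $by_key, field only: $by_field"
rm -f "$tmp"
```

If the two numbers ever disagree on real data, that would point at how field 4 is being delimited rather than at the data itself.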

cmccabe · November 11, 2015, 6:01pm

I left the office and forgot the awk, but will post it tomorrow. Thank you, and I apologize.

Aia · November 11, 2015, 11:13pm

If it is of any help:

perl -nae '$seen{$F[3]}++;
    END{for $k (sort keys %seen){
            print "$k: $seen{$k} time(s)\n";
        }
        print "There are ", scalar keys %seen, " unique fourth fields\n";
    }' cmccabe.file

chr1:955543-955763: 5 time(s)
chr1:957571-957852: 7 time(s)
chr1:970621-970740: 2 time(s)
There are 3 unique fourth fields

or just:

perl -nae '$seen{$F[3]}++;
    END{
        print "There are ", scalar keys %seen, " unique fourth fields\n";
    }' cmccabe.file

There are 3 unique fourth fields
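For anyone who would rather stay in awk, here is a rough equivalent of the perl above, offered as a sketch rather than a drop-in replacement. The array name seen mirrors the perl hash; the sample rows are shortened from the posted input. Note that awk's for (k in seen) visits keys in an unspecified order, so the per-value lines may not come out sorted.

```shell
# Sample data: three distinct values in field 4 (shortened from the posted input).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 20
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 186
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1 46 280
EOF

report=$(awk '
    { seen[$4]++ }                      # tally each value of field 4
    END {
        for (k in seen) {               # key order is unspecified in awk
            n++
            print k ": " seen[k] " time(s)"
        }
        print "There are " n " unique fourth fields"
    }' "$tmp")

echo "$report"
rm -f "$tmp"
```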

Don_Cragun · November 12, 2015, 12:34am

Or, if the file is unsorted:

awk '!($4 in d){c++;d[$4]}END{print c}' file

Or, if the file is presorted (as in the sample provided):

awk '$4!=d{c++;d=$4}END{print c}' file

As always, if anyone wants to try these on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
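To illustrate why the two commands are not interchangeable, here is a small sketch on made-up data (the values A and B are placeholders, not from the real file). The array version counts distinct values in any order; the previous-value version counts runs of equal values, so it over-counts when equal values are not adjacent.

```shell
tmp=$(mktemp)
# Field 4 takes two distinct values, A and B, but the A lines are not adjacent.
printf '%s\n' \
    'x 1 2 A' \
    'x 1 2 B' \
    'x 1 2 A' > "$tmp"

# Array-based count: correct for any line order.
any_order=$(awk '!($4 in d){c++;d[$4]}END{print c}' "$tmp")

# Previous-value comparison: counts runs of equal values, so it needs sorted input.
runs=$(awk '$4!=d{c++;d=$4}END{print c}' "$tmp")

echo "array-based: $any_order, previous-value: $runs"
rm -f "$tmp"
```

On this input the array-based count is 2 and the previous-value count is 3.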

cmccabe · November 12, 2015, 10:59am

The perl and awk commands both produced the same result, 56,543, so that means 5 IDs are not there. I was going to pull out the unique entries in that field, but the command below seems to print the entire line. Thank you :).

awk 'a !~ $4; {a=$4}' Input.txt

Desired output

chr1:955543-955763 
chr1:957571-957852 
chr1:970621-970740

Thank you :).
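One possible way to see which IDs are missing, sketched under a big assumption: that a list of the IDs expected to be present exists somewhere, one per line. The file name expected_ids.txt and the fourth ID in it are made up for illustration. comm needs both inputs sorted the same way, hence the LC_ALL=C sort on each.

```shell
dir=$(mktemp -d)

# Hypothetical list of IDs that should be present (the last one is invented).
printf '%s\n' \
    'chr1:955543-955763' \
    'chr1:957571-957852' \
    'chr1:970621-970740' \
    'chr1:976045-976260' > "$dir/expected_ids.txt"

# Stand-in for Input.txt, shortened from the posted sample.
cat > "$dir/Input.txt" <<'EOF'
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 186
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1 46 280
EOF

# Unique values actually present in field 4.
awk '{print $4}' "$dir/Input.txt" | LC_ALL=C sort -u > "$dir/found.txt"
LC_ALL=C sort -u "$dir/expected_ids.txt" > "$dir/expected.sorted"

# -23 keeps only the lines that appear in the first file and not the second.
missing=$(LC_ALL=C comm -23 "$dir/expected.sorted" "$dir/found.txt")

echo "$missing"
rm -rf "$dir"
```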

Don_Cragun · November 12, 2015, 12:18pm

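You could lose lines using !~ instead of !=, because !~ is a regular-expression match rather than a string comparison. The whole line is printed because a pattern with no action prints the entire record. Try: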

awk 'a != $4{a=$4;print a}' Input.txt
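A contrived sketch of how !~ can drop a line (the two IDs are invented so that one is a substring of the other; the !~ variant here prints $4 instead of the whole line so the two outputs are comparable). With !~, the second line is skipped because the previous value chr1:100-2000 contains a match for the pattern chr1:100-200; with != the two strings are simply different.

```shell
tmp=$(mktemp)
# Two different field-4 values; the second is a substring of the first.
printf '%s\n' \
    'chr1 100 2000 chr1:100-2000' \
    'chr1 100 200 chr1:100-200' > "$tmp"

# Regex match against the previous value: the second line is lost.
with_regex=$(awk 'a !~ $4 {print $4} {a=$4}' "$tmp" | wc -l | tr -d ' ')

# String comparison against the previous value: both values are printed.
with_string=$(awk 'a != $4{a=$4;print a}' "$tmp" | wc -l | tr -d ' ')

echo "!~ printed $with_regex line(s), != printed $with_string line(s)"
rm -f "$tmp"
```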

cmccabe · November 12, 2015, 12:46pm

Thank you :).