When I use the awk below to count the unique values in $4, it seems to work for the sample input. The answer is 3 because $4 takes only 3 distinct values across all the entries. However, when I run the same command on the actual data I get 56,536, and I know the answer should be 56,548. Is there a better way to count the unique values in that column? Thank you :).
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 20
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 22
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 4 22
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 5 22
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 1 186
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 2 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 3 201
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 271 176
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 272 175
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 273 175
chr1 957571 957852 chr1:957571-957852 AGRN-7|gc=61.2 274 175
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1 46 280
chr1 970621 970740 chr1:970621-970740 AGRN-8|gc=57.1 47 280
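
For reference, here is a minimal sketch of two independent ways to count the distinct values in one column, assuming the fields are separated by whitespace as in the sample above. The function names count_unique and count_unique_sort are hypothetical helpers made up for illustration, and this is not necessarily the command used in the question.

```shell
# Hypothetical helper: count distinct values in one column using awk only.
# $1 = column number, $2 = input file (whitespace-separated fields assumed).
count_unique() {
    # seen[] records each value; n is incremented only the first time a value appears.
    awk -v col="$1" '!seen[$col]++ { n++ } END { print n + 0 }' "$2"
}

# Hypothetical cross-check: extract the column, deduplicate with sort -u, count lines.
count_unique_sort() {
    # tr strips the padding that some wc implementations add around the number.
    awk -v col="$1" '{ print $col }' "$2" | sort -u | wc -l | tr -d ' '
}
```

Running count_unique 4 input.txt and count_unique_sort 4 input.txt on the same file should print the same number. If they agree with each other but not with the expected total, the expected total or the field layout of the real data is probably worth a second look.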