Return a list of unique values of a column from a CSV file

phoeberunner · October 12, 2009, 11:17pm

Hi all,

I have a huge CSV file with data in the following format:

[HEADER]
Num SNPs, 549997
Total SNPs,555352
Num Samples, 157
[Data]
SNP, SampleID, Allele1, Allele2
A001,AB1,A,A
A002,AB1,A,A
A003,AB1,A,A
...
...
...

I would like to write out a list of the unique SNPs (column 1). Could you let me know how to do this with a UNIX command? Do I need to convert the CSV file to a text file first?

Thank you for your attention!

phoebe

daptal · October 12, 2009, 11:39pm

$ cat abc.csv
SNP, SampleID, Allele1, Allele2
A001,AB1,A,A
A002,AB1,A,A
A003,AB1,A,A

Substitute your CSV file name for abc.csv.

$ cat abc.csv  | cut -f1 -d , | uniq
SNP
A001
A002
A003

HTH,
PL

phoeberunner · October 13, 2009, 11:39am

Hi,

I get the correct number of unique "SampleID" values, but not of "SNP" values. I wonder why it didn't work for "SNP" (column 1).

I used
$ cat abc.csv | cut -f1 -d , | uniq
to get the list of unique "SNP" values, and

$ cat abc.csv | cut -f2 -d , | uniq
to get the list of unique "SampleID" values.

I have a total of 86,349,539 rows in the CSV file. It is supposed to have 549,997 unique SNPs, but the command returned 86,349,539, which is the same as the total number of rows in the file.

Again, I get the correct number of unique SampleIDs, which is 167.

Thanks a bunch!

vidyadhar85 · October 13, 2009, 11:46am

Is your CSV file sorted on that column?
If not, uniq won't work on it: uniq only removes duplicate lines that are adjacent to each other. Please read the man page of uniq.
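
For illustration, here is a minimal sketch of one way to get the list without depending on duplicates being adjacent: sort the column and drop duplicates in one step with sort -u. It assumes the data rows begin after the "[Data]" line and a single column-header line, as in the format posted above. The sample file and the function name unique_col are made up for the example. The SampleID count probably came out right only because rows for the same sample happen to sit next to each other in the file.

```shell
# Tiny sample file (hypothetical) in which SNP values repeat but are NOT adjacent.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[HEADER]
Num SNPs, 3
[Data]
SNP, SampleID, Allele1, Allele2
A001,AB1,A,A
A002,AB1,A,A
A003,AB1,A,A
A001,AB2,A,A
A002,AB2,A,A
A003,AB2,A,A
EOF

# unique_col FILE N: print the distinct values of column N, sorted.
# Assumption: everything up to and including "[Data]", plus the one
# column-header line after it, is not data and must be skipped.
unique_col() {
    sed '1,/^\[Data\]/d' "$1" |  # drop the [HEADER] block through "[Data]"
        tail -n +2 |             # drop the column-header line
        cut -d, -f"$2" |         # keep only the requested column
        sort -u                  # sort, then keep one copy of each value
}

unique_col "$sample" 1           # list of distinct SNPs
unique_col "$sample" 1 | wc -l   # number of distinct SNPs
```

On the real file you would call unique_col with your own file name. Sorting 86 million rows takes a while, but sort spills to temporary files on disk, so it should not need the whole column in memory.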