Select unique names while removing the duplicates from a column

Hi,
I have a file with 2 columns:

ENSG00000003137,ENST00000001146
ENSG00000003137,ENST00000412253
ENSG00000003402,ENST00000309955
ENSG00000003402,ENST00000443227
ENSG00000003402,ENST00000341222

and I want to retain only the first entry for each value in the first column, ignoring the rest. The output should look like this:

ENSG00000003137,ENST00000001146
ENSG00000003402,ENST00000309955

I have tried awk '!a[$1$2]++', but it does not work.
Kindly help.

Of course, with the -F flag:

awk -F, '!a[$1]++' file

I think you need to specify the field separator as a comma.

Owner@Owner-PC ~
$ awk -F, '!a[$1]++' filename
ENSG00000003137,ENST00000001146
ENSG00000003402,ENST00000309955


Owner@Owner-PC ~
$ awk  '!a[$1]++' filename
ENSG00000003137,ENST00000001146
ENSG00000003137,ENST00000412253
ENSG00000003402,ENST00000309955
ENSG00000003402,ENST00000443227
ENSG00000003402,ENST00000341222

I used the sample data above.

$ sort -t"," -k1,1 -u file
ENSG00000003137,ENST00000001146
ENSG00000003402,ENST00000309955
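One caveat worth noting: sort emits the surviving lines in key order, while the awk approach preserves the order in which keys first appear in the file; on this sample the two happen to coincide. A quick sketch of the difference (demo.csv is an illustrative file name; the sort behavior shown is GNU sort's, which keeps the first input line of each equal-key run because -u disables the last-resort comparison):

```shell
# Build a sample whose keys are NOT already in sorted order
printf 'ENSG2,X\nENSG1,Y\nENSG2,Z\n' > demo.csv

# awk keeps the first occurrence of each key, in input order
awk -F, '!a[$1]++' demo.csv
# -> ENSG2,X
#    ENSG1,Y

# sort -u also keeps one line per key, but emits them in sorted key order
sort -t, -k1,1 -u demo.csv
# -> ENSG1,Y
#    ENSG2,X
```

So if the original line order matters, prefer the awk one-liner.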
awk -F, 'a[$1]++==0' filename

is quick and dirty because it stores an unnecessary integer count for each key.
A fuller, more efficient version is

awk -F, '!($1 in a) { a[$1]; print }' filename

You can condense that further into an implicit print:

awk -F, '!(($1 in a) || a[$1])' filename

or

awk -F, '!($1 in a) && !a[$1]' filename
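For completeness, all of the awk variants above give the same two lines on the sample data; a quick self-contained check (demo.csv is an illustrative file name):

```shell
# Recreate the sample input from the original post
printf '%s\n' \
  'ENSG00000003137,ENST00000001146' \
  'ENSG00000003137,ENST00000412253' \
  'ENSG00000003402,ENST00000309955' \
  'ENSG00000003402,ENST00000443227' \
  'ENSG00000003402,ENST00000341222' > demo.csv

# All three one-liners keep only the first line seen per key
awk -F, '!a[$1]++' demo.csv
awk -F, '!($1 in a) { a[$1]; print }' demo.csv
awk -F, '!(($1 in a) || a[$1])' demo.csv
# each prints:
# ENSG00000003137,ENST00000001146
# ENSG00000003402,ENST00000309955
```

In the last variant, referencing a[$1] on the right of || both tests the (empty) value and creates the key as a side effect, so the next line with the same key fails the `$1 in a` test.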