Remove duplicate

samrat_dutta · May 23, 2013, 2:15pm

Hi,
I have a pipe-separated file, repo.psv, from which I need to remove duplicates based on the 1st column only. Can anyone help with a Unix script?

Input:

15277105||Common Stick|ESHR||Common Stock|CYRO AB
15277105||Common Stick|ESHR||Common Stock|CYRO AB
16111278||Common Stick|ESHR||Common Stock|STANDARD REGISTER CO
39693766||Common Stick|ESHR||Common Stock|HS AG

Output should be :

15277105||Common Stick|ESHR||Common Stock|CYRO AB
16111278||Common Stick|ESHR||Common Stock|STANDARD REGISTER CO
39693766||Common Stick|ESHR||Common Stock|HS AG

Thanks

RudiC · May 23, 2013, 2:22pm

Try

$ awk '!L[$1]++' file
15277105||Common Stick|ESHR||Common Stock|CYRO AB
16111278||Common Stick|ESHR||Common Stock|STANDARD REGISTER CO
39693766||Common Stick|ESHR||Common Stock|HS AG

You may need to redefine the field separator (man awk) if you want to use this on other files.

samrat_dutta · May 23, 2013, 2:33pm

We use a Solaris server. I ran the command below from the command line, but it produced two messages:

awk '!L[$1]++' xyz.psv

message:

awk: syntax error near line 1
awk: bailing out near line 1

vgersh99 · May 23, 2013, 3:31pm

use nawk instead of awk

samrat_dutta · May 23, 2013, 3:42pm

Thanks, nawk worked.
But what is the difference between awk and nawk? We use a Solaris server.

Jotne · May 24, 2013, 1:57am

The solution presented here, awk '!L[$1]++', is wrong.

The OP asked for:
a pipe-separated file
duplicates removed based on the 1st column only
The solution above uses whitespace as the field separator, not the pipe |, so $1 becomes
15277105||Common instead of the correct 15277105

Correct solution (setting field separator to pipe):

awk -F\| '!L[$1]++'
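To see why the separator matters, here is a minimal sketch. The sample rows are made up for illustration (the second row is not from the original file), and a temporary file stands in for repo.psv. With the default separator the key is everything up to the first blank, so two rows with the same ID but different text after it both survive. On Solaris, substitute nawk for awk.

```shell
# Hypothetical sample: rows 1 and 2 share the ID in column 1 but differ afterwards.
tmp=$(mktemp)
printf '%s\n' \
  '15277105||Common Stick|ESHR||Common Stock|CYRO AB' \
  '15277105||Preferred Stick|ESHR||Preferred Stock|CYRO AB' \
  '16111278||Common Stick|ESHR||Common Stock|STANDARD REGISTER CO' > "$tmp"

# Default separator (blanks): $1 is "15277105||Common" or "15277105||Preferred".
default_count=$(awk '!L[$1]++' "$tmp" | wc -l | tr -d ' ')

# Pipe separator: $1 is just the ID, so the second row counts as a duplicate.
pipe_count=$(awk -F'|' '!L[$1]++' "$tmp" | wc -l | tr -d ' ')

echo "default FS kept $default_count lines, pipe FS kept $pipe_count lines"
rm -f "$tmp"
```

The OP's sample happened to work without -F only because every row has the same text between the ID and the first blank.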

samrat_dutta · May 24, 2013, 12:00pm

Thanks Pal. It worked as well.

MadeInGermany · May 24, 2013, 6:52pm

On Solaris, /bin/awk is oawk, which does not allow the ! operator on a number. Use a comparison with 0 instead:

awk -F\| '0==L[$1]++'

samrat_dutta · June 7, 2013, 11:50am

Hi,
My .psv file is getting bigger as more columns are added. I remove duplicates based on the last column, which I currently know is at position 176. The column name is 'auditid'. Is there a way to find the column number of this field and use it as the array index?

nawk -F\| '!L[$176]++' file

RudiC · June 7, 2013, 1:00pm

Assuming the column name is a field in row 1, try

awk -F\|  'NR==1 {for (i=1; i<=NF; i++) if ($i==COLNAME){COL=i; break}} !L[$COL]++' COLNAME="auditid" file

You can use the column name as a constant in the comparison or, as here, pass it in as a variable.
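A slightly more defensive variant of the same idea, as a sketch only. The three-column sample file and its values are hypothetical, and the guard is my addition: if the name is not found, COL stays unset and $COL then means $0, so the one-liner would quietly deduplicate on the whole line instead. This version exits with status 1 in that case and always prints the header row. Use nawk on Solaris.

```shell
# Hypothetical sample with a header row; the real file has many more columns.
tmp=$(mktemp)
printf '%s\n' \
  'id|name|auditid' \
  '1|alpha|A100' \
  '2|beta|A100' \
  '3|gamma|A200' > "$tmp"

result=$(awk -F'|' -v COLNAME='auditid' '
  NR == 1 {
    # Find the position of the named column in the header row.
    for (i = 1; i <= NF; i++)
      if ($i == COLNAME) { COL = i; break }
    if (!COL) exit 1      # name not found: stop rather than key on $0
    print                 # keep the header
    next
  }
  !L[$COL]++              # print only the first row seen for each key
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

For the sample above this keeps the header plus the rows for A100 (first occurrence) and A200.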

juzz4fun · June 7, 2013, 1:49pm

Just wondering... what is the logic behind using L[$1]++ ?

RudiC · June 7, 2013, 3:21pm

L is just an arbitrary name for an array; call it what you like: Joe, Mimi, or L (short for logical). $1 (the first field of each row/line) is the index into that array, and the indexed element is incremented by ++ after its value has been read. Any value except 0 (or empty, which is equivalent) makes the reference true, so its inversion (by !) is false. As the default action is print, the entire command reads: get the array element for index $1. If it does not exist (= first occurrence of this index), invert to true and print. If it does (and has a value), invert to false and don't print. Either way, increment it so that later occurrences are suppressed.
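The same logic written out long-hand may make it easier to follow. This is only an illustrative sketch: the input letters and the variable name old are made up, and the script prints a trace instead of the lines themselves.

```shell
# Feed five one-field lines and show, per line, the counter value
# before the increment and whether '!L[$1]++' would print the line.
trace=$(printf '%s\n' a b a c b | awk '{
  old = L[$1] + 0     # value before the increment; 0 for a key not seen yet
  L[$1]++             # count this occurrence
  printf "%s old=%d keep=%s\n", $1, old, (old == 0 ? "yes" : "no")
}')

printf '%s\n' "$trace"
```

Only the lines with old=0 are kept, which is exactly the first occurrence of each key.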

juzz4fun · June 7, 2013, 3:36pm

Thank you, RudiC, for the explanation. It was really helpful.