Removing duplicate records in a file based on a single column

Hi,

I want to remove duplicate records based on column 1, including the first occurrence. For example:

Input file (filer.txt):
-------------

1,3000,5000
1,4000,6000
2,4000,600
2,5000,700
3,60000,4000
4,7000,7777
5,999,8888

Expected output:
----------------

3,60000,4000
4,7000,7777
5,999,8888

Is it possible to achieve this with an awk command?

I tried the awk command below. It works, but I don't want to give the file name (filer.txt) twice in the command; I am only allowed to give the file name once.

awk -F"," 'NR == FNR { cnt[$1]++ } NR != FNR { if (cnt[$1] == 1) print $0 }' filer.txt filer.txt

Please suggest how I can achieve this.

Thanks in advance

Use the unique option of the sort command.
Sort the file with the unique option, then diff the original against the sorted output, and use that diff to remove the duplicated records from the sorted output.
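
A rough sketch of that idea (a sketch only; it assumes GNU sort/diff/grep, uses the made-up intermediate files first.txt and dupkeys.txt, and sorts the output by column 1):

sort -t, -k1,1 -u filer.txt > first.txt       # keep one record per column-1 value
sort -t, -k1,1 filer.txt | diff - first.txt |
  sed -n 's/^< \([^,]*\),.*/^\1,/p' | sort -u > dupkeys.txt   # keys seen more than once, turned into ^key, patterns
grep -v -f dupkeys.txt first.txt              # drop every record whose key repeated

Here diff only ever reports deleted lines, because the sort -u output is a subset of the fully sorted file, and those deleted lines are exactly the extra records of the duplicated keys.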

Thanks for the reply, jgt, but I am only allowed to use an awk or sed command. Can someone suggest how exactly I can code it in a single command line?

Who makes up these rules, and why????

Got a solution using a single-line command. Thanks, problem resolved.
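
For anyone finding this later, one common single-pass awk variant (not necessarily the command that was used here) buffers the whole file and decides what to print at the end:

awk -F',' '{ cnt[$1]++; key[NR] = $1; line[NR] = $0 }
           END { for (i = 1; i <= NR; i++) if (cnt[key[i]] == 1) print line[i] }' filer.txt

It reads the file only once, at the cost of holding every line in memory, which is fine for small files.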

Hi,

One solution using 'sed' (note that it relies on the records already being grouped by the first field):

$ cat infile
1,3000,5000
1,4000,6000
2,4000,600
2,5000,700
3,60000,4000
4,7000,7777
5,999,8888
$ sed -ne '$! { /\n/! N; } ; :a ; $! { /^\([0-9]*\),.*\n\1[^\n]\+$/ { N; ba; }; } ; s/^\([0-9]*\),.*\n\1// ; tb ; P ; D ; :b ; D' infile
3,60000,4000
4,7000,7777
5,999,8888
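
In case the one-liner is hard to read, here is the same logic as a commented script file (a sketch only; it relies on GNU sed, assumes the records are already grouped by the first field, and nodups.sed is just a made-up name):

$ cat nodups.sed
# if this is not the last line, make sure the pattern space holds two lines
$! { /\n/! N; }
:a
# keep appending lines while the newest one starts with the same first field as the first line in the group
$! { /^\([0-9]*\),.*\n\1[^\n]\+$/ { N; ba; }; }
# if the first field repeated, delete the group up to the key of its last member
s/^\([0-9]*\),.*\n\1//
# on success, branch to :b, where D discards what is left of that member and restarts
tb
# no repeat: print the single record
P
# then drop it and restart on whatever follows it in the pattern space
D
:b
D
$ sed -nf nodups.sed infile

This should give the same output as the one-liner above.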

Regards,
Birei