Remove duplicate records

svenkatareddy · February 26, 2010, 10:45am

I want to remove records based on duplicate keys. If two or more records share the same combination of key fields, all of them should be removed; none of them should appear in the output, not even once.

file abc.txt

ABC;123;XYB;HELLO;
ABC;123;HKL;HELLO;
CDE;123;LLKJ;HELLO;
ABC;123;LSDK;HELLO;
CDF;344;SLK;TEST

The key fields are the 1st, 2nd and 4th.

Should return only

CDE;123;LLKJ;HELLO;
CDF;344;SLK;TEST

Can you give me a command for this?

joeyg · February 26, 2010, 10:49am

Please clarify on your 3rd record/line.

svenkatareddy · February 26, 2010, 11:39am

Corrected the fields

dennis.jacob · February 26, 2010, 12:33pm

One approach in awk...

awk -F";" '{_s=$1" "$2" "$4; A[_s]++; B[_s]=$0; } END { for (i in A) { if (A[i]==1) print B[i]; }}' file

alister · February 26, 2010, 2:00pm

This approach may yield false matches if the fields in question can contain a space and are not required to be of equal length. From the sample data, we see that the 4th field varies in length. Perhaps a space awaits as well.

False match example:

1;2 ;3;4 --> _s="1 2  4"
1;2;3; 4 --> _s="1 2  4"

Just in case, best to use the same delimiter as was used to split the input:

_s=$1";"$2";"$4

Alternatively, you can set SUBSEP (which determines what AWK will use as an internal separator for "multidimensional" array subscripts) to ";" which allows you to safely use A[$1,$2,$4].

Regards,
Alister
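
For illustration, the SUBSEP suggestion above could be combined with the earlier awk one-liner roughly as follows. This is only a sketch under assumptions: the function name unique_by_key and the array names count and last are made up here, and the sample data is copied from the first post into a temporary file. The output is piped through sort because awk does not guarantee any order for "for (k in array)".

```shell
# Sample data from the first post, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'ABC;123;XYB;HELLO;' \
  'ABC;123;HKL;HELLO;' \
  'CDE;123;LLKJ;HELLO;' \
  'ABC;123;LSDK;HELLO;' \
  'CDF;344;SLK;TEST' > "$sample"

# unique_by_key FILE (hypothetical helper name):
# print the records whose key (fields 1, 2 and 4) occurs exactly once.
unique_by_key() {
  awk -F';' '
    BEGIN { SUBSEP = ";" }        # join subscripts with the input delimiter
    {
      count[$1,$2,$4]++           # how often this key was seen
      last[$1,$2,$4] = $0         # remember the record for this key
    }
    END {
      for (k in count)            # iteration order is unspecified
        if (count[k] == 1) print last[k]
    }' "$1"
}

unique_by_key "$sample" | sort
```

Because ";" is the field separator, it cannot occur inside a field, so two different key combinations cannot collapse into the same subscript.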

joeyg · February 26, 2010, 2:17pm

>cut -d";" -f1,2,4 <scottn.txt | sort | uniq -u | gawk 'BEGIN{FS=OFS=";"} {print $1,$2,".*",$3}' >match.txt
>grep -f match.txt <scottn.txt

alister · February 26, 2010, 5:08pm

Interesting approach. Here's a different take on it:

sort -t\; -k1,1 -k2,2 -k4,4 scottn.txt | sed 's/[^;]*;/.*;/3' | uniq -u > match.txt
grep -f match.txt scottn.txt

Alister

svenkatareddy · March 3, 2010, 8:46am

>cut -d";" -f1,2,4 <scottn.txt | sort | uniq -u | gawk 'BEGIN{FS=OFS=";"} {print $1,$2,".*",$3}' >match.txt
>grep -f match.txt <scottn.txt

sort -t\; -k1,1 -k2,2 -k4,4 scottn.txt | sed 's/[^;]*;/.*;/3' | uniq -u > match.txt
grep -f match.txt scottn.txt

Both of these fail when the fields contain $, ( or ) characters.
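
One possible way around this is to avoid regular expressions altogether and let awk compare the key fields as plain strings, reading the file twice. The following is a sketch, not something posted in this thread: the function name keep_unique_keys is hypothetical, and the sample records are invented here to include $, ( and ) in the key fields.

```shell
# Invented sample data with regex metacharacters in the key fields.
sample=$(mktemp)
printf '%s\n' \
  'A$1;(x);one;HELLO;' \
  'A$1;(x);two;HELLO;' \
  'B.*;(y);three;HELLO;' \
  'C;123;four;TEST' > "$sample"

# keep_unique_keys FILE (hypothetical helper name):
# the file is passed to awk twice; the first pass counts each key,
# the second pass prints only records whose key was counted once.
keep_unique_keys() {
  awk -F';' '
    { key = $1 FS $2 FS $4 }            # plain string key, no pattern matching
    NR == FNR { count[key]++; next }    # first pass: count keys
    count[key] == 1                     # second pass: print unique records
  ' "$1" "$1"
}

keep_unique_keys "$sample"
```

Since the keys are only used as array subscripts, characters such as $ ( ) . * have no special meaning, and the surviving records are printed in their original order. The input has to be a regular file for this to work, because it is read twice.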