Removing duplicates from a file

Hi Experts,

Please check the following new requirement. I have data like the following in a file.

FILE_HEADER
01cbbfde7898410| 3477945| home| 1
01cbc275d2c122| 3478234| WORK| 1
01cbbe4362743da| 3496386| Rich Spare| 1
01cbc275d2c122| 3478234| WORK| 1

This is a pipe-separated file with columns 2 and 3 as the key columns. The file should be split into the following output files:

1) All records other than the duplicates

FILE_HEADER
01cbbfde7898410| 3477945| home| 1
01cbbe4362743da| 3496386| Rich Spare| 1

2) The duplicate key file

3478234| WORK

Any thoughts on this?

Note: the 'FILE_HEADER' line should be present in the first output file.

Hi Tinu,

Please try the command below:

sort -t "|" +1 -3 test_file |uniq -u

or

sed '1d' test_file | sort -t "|" -k2,3 | uniq -u    (this removes the header line first)
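
If you also need the duplicate-key file (requirement 2), a rough sketch along the same lines, assuming the duplicates are always exact copies of the whole record, would be:

sed '1d' test_file | sort -t "|" -k2,3 | uniq -d | cut -d "|" -f2,3

Here uniq -d keeps one copy of each repeated record and cut pulls out the two key columns.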

:wink:
~jimmy

Here is a better solution for separating the unique and duplicate records in a file:

sed '1d' "$FILE1" | sort -t "|" -k2,3 > temp1     # drop the header, sort on the key columns
awk -F"|" '{a++; b[a]=$2$3; c[a]=$0}
END {for(i=1; i<a; ++i) if(b[i+1]==b[i]) print c[i] "\n" c[i+1]}' temp1 | uniq > temp2   # temp2 = the duplicated records
cat temp1 temp2 > temp3
sort temp3 | uniq -u > temp4                      # duplicated records now repeat in temp3, so uniq -u drops them
echo "$HEADER" > "$FILE1"                         # put the header back
cat temp4 >> "$FILE1"
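
Note that $HEADER has to be captured before the script overwrites $FILE1, for example by putting this at the top:

HEADER=$(head -1 "$FILE1")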

Another one with awk:

awk -F\| 'END {
  for (i = 1; ++i <= NR;) {
    split(d[i], t)
    if (c[t[2], t[3]] > 1) {
      # key seen more than once: write it to the dups file, once per key
      if (!s[t[2], t[3]]++)
        print t[2], t[3] > dups
      }
    else
      print d[i] > uniq
    }
  }
NR == 1 {
  # the header line goes to the unique-records file only
  print > uniq
  next
  }
{
  # count each key and remember the full record for the END pass
  c[$2, $3]++; d[NR] = $0
  }' OFS=\| dups=dups.txt uniq=uniq.txt infile
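
With the sample data above, the two output files should end up looking roughly like this (the leading blanks come from the spaces after each '|' in the input):

$ cat uniq.txt
FILE_HEADER
01cbbfde7898410| 3477945| home| 1
01cbbe4362743da| 3496386| Rich Spare| 1

$ cat dups.txt
 3478234| WORK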