How to remove duplicate records without sorting

Can anyone give me a command to delete duplicate records without sorting?

Suppose the records are like below:

345,bcd,789
123,abc,456
234,abc,456
712,bcd,789

The output should be:
345,bcd,789
123,abc,456

The key for the records is the 2nd and 3rd fields. Fields are separated by a comma (,).

Try:

awk '!($2$3 in a){a[$2$3]++;print}' FS=',' file
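
A small caveat with the $2$3 form: plain concatenation can make two different field pairs collide, whereas the comma form joins the subscripts with awk's SUBSEP character and keeps them distinct. A made-up two-line example:

```shell
# "ab" "c" and "a" "bc" both concatenate to "abc", so the $2$3 key
# wrongly treats line 2 as a duplicate; the $2,$3 key keeps both lines
printf '1,ab,c\n2,a,bc\n' | awk '!a[$2$3]++' FS=,
printf '1,ab,c\n2,a,bc\n' | awk '!a[$2,$3]++' FS=,
```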

for i in `cat filename | cut -d "," -f2 | uniq`
do
sed -n '/'${i}'/{p;q;}' filename
done

Or:

awk '!x[$2,$3]++' FS="," file

Use nawk or /usr/xpg4/bin/awk on Solaris.
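
For reference, running that one-liner on the sample data from the original post should give exactly the requested output:

```shell
printf '%s\n' '345,bcd,789' '123,abc,456' '234,abc,456' '712,bcd,789' |
  awk '!x[$2,$3]++' FS=,
# 345,bcd,789
# 123,abc,456
```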

You can also use the 'uniq' command to remove duplicate records, which is very easy to use. Check out its man page.

You need a sorted input for uniq ....
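
A tiny made-up example of why: uniq only collapses adjacent duplicate lines.

```shell
printf 'b\na\nb\n' | uniq          # prints b, a, b -- nothing collapsed
printf 'b\na\nb\n' | sort | uniq   # prints a, b -- duplicates gone, order gone
```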

Yes, I agree with you that uniq needs sorted input.

So check the original post.

awk '!x[$2,$3]++' FS="," file

This has been a true-grit workhorse one-liner.

I have used it extensively.

Does anyone know if it can be instructed to work within a range of values?
I suspect it wouldn't be a one-liner. :frowning:

For example:

My file contents are all numbers (a mixture of integers and floating point), where field one is a unique point number, field two is an X or Easting coordinate, field three is a Y or Northing coordinate, and field four is an elevation:

1,2.1,3.1,1.1
2,2.2,3.2,2.2
3,2.3,3.3,3.3
4,3.4,3.4,4.4
5,3.5,3.5,5.5
6,3.6,3.6,6.6
7,4.7,4.7,7.1
8,4.8,4.8,8.8
9,4.9,4.9,9.9

I would like to process the file via fields two and three and have the result be:

1,2.1,3.1,1.1
3,2.3,3.3,3.3
4,3.4,3.4,4.4
6,3.6,3.6,6.6
7,4.7,4.7,7.1
9,4.9,4.9,9.9

So I think what I am asking is, that field 2 and 3 be considered duplicates if they are in the range "$2-0.1 to $2+0.1" and "$3-0.1 to $3+0.1".
That's the best way I have of describing it.

Thanks in advance,
Kenny.
:slight_smile:

If the input is ordered as the one you posted,
and if I understand correctly, you could use something like this:
(use nawk or /usr/xpg4/bin/awk on Solaris)

awk -F, '
(x-0.1 >= $2 || $2 <= x+0.1) && (y-0.1 >= $3 || $3 <= y+0.1) { next }
{ x = $2; y = $3 }
1' input
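
Wired up as a self-contained run against the sample coordinates from the post above, it should reproduce the requested output:

```shell
printf '%s\n' '1,2.1,3.1,1.1' '2,2.2,3.2,2.2' '3,2.3,3.3,3.3' \
  '4,3.4,3.4,4.4' '5,3.5,3.5,5.5' '6,3.6,3.6,6.6' \
  '7,4.7,4.7,7.1' '8,4.8,4.8,8.8' '9,4.9,4.9,9.9' |
awk -F, '
(x-0.1 >= $2 || $2 <= x+0.1) && (y-0.1 >= $3 || $3 <= y+0.1) { next }
{ x = $2; y = $3 }
1'
# 1,2.1,3.1,1.1
# 3,2.3,3.3,3.3
# 4,3.4,3.4,4.4
# 6,3.6,3.6,6.6
# 7,4.7,4.7,7.1
# 9,4.9,4.9,9.9
```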

Thank you radoulov.

Your code works well when the data is in a sorted list.

Example: using your code with the range values -1.1 and +1.1 on file:

1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2
2-a,1000002.2,500002.2,102.2
2-b,1000002.2,500002.2,102.3
3-a,1000003.3,500003.3,103.3
3-b,1000003.3,500003.3,103.4
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
6-a,1000006.6,500006.6,106.6
6-b,1000006.6,500006.6,106.7
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
8-a,1000008.8,500008.8,108.8
8-b,1000008.8,500008.8,108.9
9-a,1000009.9,500009.9,109.9
9-b,1000009.9,500009.9,110.0
10-a,1000010.0,500010.0,110.0
10-b,1000010.0,500010.0,110.1

I get the following result:

1-a,1000001.1,500001.1,101.1
3-a,1000003.3,500003.3,103.3
5-a,1000005.5,500005.5,105.5
7-a,1000007.7,500007.7,107.7
9-a,1000009.9,500009.9,109.9

This is the expected result.

However, if I jumble up the records like this:

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

I only get the first line as output.
10-a,1000010.0,500010.0,110.0

Is there code that will work on an unsorted list?

My data sets are almost always listed in a random order.

Thank you again,
Kenny.

Do you want to preserve the order?
Otherwise you could sort the input first:

sort -t, -k2n,3n inputfile |
  awk -F, '
  (x-1.1 >= $2 || $2 <= x+1.1) && (y-1.1 >= $3 || $3 <= y+1.1) { next }
  { x = $2; y = $3 }
1'
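
If the original order matters, one common trick (only a sketch, untested against your full data) is decorate-sort-undecorate: number the lines, sort and deduplicate, then restore the numbering order and strip the numbers. Note that it keeps the sorted-order representatives, not necessarily the record that appears first in the file, and since x/y start out empty it assumes the coordinates are well away from zero:

```shell
awk '{ print NR "," $0 }' inputfile |     # decorate with the line number
  sort -t, -k3n,3 -k4n,4 |                # key fields are now $3 and $4
  awk -F, '
  (x-1.1 <= $3 && $3 <= x+1.1) && (y-1.1 <= $4 && $4 <= y+1.1) { next }
  { x = $3; y = $4 }
  1' |
  sort -t, -k1n,1 |                       # restore the original order
  cut -d, -f2-                            # strip the line number
```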

Yes I would like to preserve the order.

Is there code to process an unsorted list?

I would assume that the programming logic would then have to be:

  1. Keep the first record.
  2. Compare all remaining records to it, testing for duplicates in fields $2 and $3 [within the user defined range].
  3. Move down one record and repeat until you reach the last record.
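
A rough awk sketch of those three steps (only a sketch, not a polished solution; it compares the raw numbers, so exact boundary values can be thrown off by floating-point representation):

```shell
# keep the first unclaimed record, sweep out everything within +/- v of it
# in fields 2 and 3, then move on to the next unclaimed record
awk -F, -v v=1.1 '
{ rec[NR] = $0; x[NR] = $2; y[NR] = $3 }
END {
  for (i = 1; i <= NR; i++)
    if (!(i in dup)) {
      print rec[i]                        # the keeper
      for (j = i + 1; j <= NR; j++)
        if (!(j in dup) &&
            x[i] - v <= x[j] && x[j] <= x[i] + v &&
            y[i] - v <= y[j] && y[j] <= y[i] + v)
          dup[j] = 1                      # claimed as a duplicate
    }
}' InputFile
```

This prints only the keepers; the duplicates could be written to a second file from inside the inner loop in the same way.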

(previous post removed)

Could you please post the desired output from this input?

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Hi radoulov,

I think this might be more than just a one-liner, eh?

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Not sure what the output would be exactly.

But the objective would be to create two output files.

OutputFile1 would be the remainder records (i.e. the duplicates removed).
OutputFile2 would be the duplicates.

I apologize for the long-winded description, but I am not a programmer.

#####

The objective would be achieved by,
taking fields 2 and 3 from the first record

10-a,1000010.0,500010.0,110.0

and comparing them to fields 2 and 3 from the remainder of the records,
within the range $2-1.1 to $2+1.1 and range $3-1.1 to $3+1.1,
to determine which of the remaining records are to be considered duplicates.

I think the duplicates would be:

9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9

because their $2 and $3 fields satisfy the ranges.

So with record

10-a,1000010.0,500010.0,110.0

marked as a keeper and written to OutputFile1,

and with records

9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9

identified as duplicates and written to OutputFile2.

These 4 records would now be removed from processing by
possibly creating a new temporary file/list with these 4 records removed.

This would be the new temporary file/list with the 4 records removed.

2-b,1000002.2,500002.2,102.3
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Now the "new first record" would be:

2-b,1000002.2,500002.2,102.3

and gets written to OutputFile1.

Compare fields 2 and 3
to fields 2 and 3 from the remainder of the records
in the temporary list,
within the range $2-1.1 to $2+1.1 and range $3-1.1 to $3+1.1,
to determine which of the remaining records are to be considered duplicates.

I think the duplicates would be:

3-b,1000003.3,500003.3,103.4
3-a,1000003.3,500003.3,103.3
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

and get written to OutputFile2

The new temporary file/list with these 6 records removed would be:

8-a,1000008.8,500008.8,108.8
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

8-a,1000008.8,500008.8,108.8
becomes the "new first record"
and gets written to OutputFile1

8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8

are the duplicates and get written to OutputFile2

The new temporary file with these 4 records removed would be:

6-b,1000006.6,500006.6,106.7
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

6-b,1000006.6,500006.6,106.7
becomes the "new first record"
and gets written to OutputFile1

6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6

are the duplicates and get written to OutputFile2

The new temporary file with these 4 records removed would be:

4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

4-a,1000004.4,500004.4,104.4
becomes the "new first record"
and gets written to OutputFile1

4-b,1000004.4,500004.4,104.5

is the duplicate and gets written to OutputFile2

Done
Whew!

Best Regards,
Kenny.

Ok,
not a one-liner :slight_smile:

It's all about casting to number :slight_smile:
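
In awk, two string operands compare as strings, so rounding both sides through sprintf("%.2f", ...) does two jobs here: it irons out floating-point noise like 1000002.1999999999, and it yields fixed-width strings that compare consistently for coordinates of this magnitude. The classic string-versus-number gotcha:

```shell
awk 'BEGIN { print ("10" < "9"), (10 < 9) }'
# 1 0 -- "10" sorts before "9" as a string, but not as a number
```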

awk -F, '{ _[NR] = $0 }
END {
  m = "%.2f"
  for (i=1; i<=NR; i++) {
    if (_[i]) {
      print _[i] > "out1"
      split(_[i], tt)
      delete _[i]
      for (j=1; j<=NR; j++) {
        if (_[j]) {
          split(_[j], t)
          if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
              (sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
            print _[j] > "out2"
            delete _[j]
          }
        }
      }
    }
  }
}' v=1.1 file

Can you explain this?

sed -n '/'${i}'/{p;q;}' filename

Pls don't hijack other people threads - start a new thread.

Hi radoulov,

I am not having any success (yet).

But I know that with your expertise (and patience) I will eventually find the solution. And I appreciate the time that you are taking on this puzzle.

First, I probably should come clean and confess that I am not on Unix.
My operating system is Microsoft Windows XP, so I am hoping that I won't be cast out of the forum :smiley:

I have some Unix-for-Windows utilities, mainly awk, awk95, mawk and gawk.

Which of these flavours would your code work with?

I have found that things like ' and " are handled differently between Unix and Microsoft Windows. So maybe that is why I get error messages when using your code.

One other question though. If you put the awk commands in a file (to be used with awk -f), does the v=1.1 go inside the awk command file or outside on the command line?

Examples of the error messages are below. This is after putting your awk commands in a file, with the v=1.1 inside the file. Note that I did not edit any character of the awk command string.
The reason I put the awk commands in a file is:
I edited them into one long command line and used it in my DOS command window, but the operating system said it couldn't find my input file. There were no errors from mawk.

=====

awk -f AwkCommandFile -F, InputFile

'{ _[NR] = $0 }
^
awk ERROR temp2.awk line 1: syntax error

=====

awk95 -f AwkCommandFile -F, InputFile

awk95: syntax error at source line 1 source file AwkCommandFile
context is
>>> ' <<<
awk95: bailing out at source line 21

=====

gawk -f AwkCommandFile -F, InputFile

gawk: AwkCommandFile:1: '{ _[NR] = $0 }
gawk: AwkCommandFile:1: ^ invalid char ''' in expression

=====

mawk -f AwkCommandFile -F, InputFile

mawk: 1: unexpected character '''
mawk: 21: unexpected character '''

=====

The contents of the AwkCommandFile was:

'{ _[NR] = $0 }
END {
m = "%.2f"
for (i=1; i<=NR; i++) {
if (_[i]) {
print _[i] > "out1"
split(_[i], tt)
delete _[i]
for (j=1; j<=NR; j++) {
if (_[j]) {
split(_[j], t)
if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
(sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
print _[j] > "out2"
delete _[j]
}
}
}
}
}
}' v=1.1

Again,
Many Thanks,
Kenny.

Put this into AwkCommandFile:

BEGIN { FS="," }
{ _[NR] = $0 }
END {
  m = "%.2f"
  for (i=1; i<=NR; i++) {
    if (_) {
    print _ > "out1"
    split(_, tt)
    delete _
    for (j=1; j<=NR; j++) {
      if (_[j]) {
      split(_[j], t)
      if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && \
sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
      (sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && \
sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
        print _[j] > "out2"
        delete _[j]
      }
     }
    }
   }
  }
 }

Run it like this:

awk -f AwkCommandFile v=1.1 InputFile

I put the v variable outside the script for convenience (you can run the script with different values without modifying the code).
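
One detail worth knowing (a generic illustration, not specific to this script): a var=value operand is only processed when awk reaches it on the command line, so the variable is still empty inside a BEGIN block; the -v option assigns it before the program starts.

```shell
awk 'BEGIN { print "[" v "]" }' v=1.1 /dev/null   # prints [] -- v not yet set
awk -v v=1.1 'BEGIN { print "[" v "]" }'          # prints [1.1]
```

Since this script does all its work in END, either form works here; awk -v v=1.1 -f AwkCommandFile InputFile is the equivalent -v invocation.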