How to remove duplicate records without sorting

Can anyone give me a command to delete duplicate records without sorting?

Suppose the records are like below:

345,bcd,789
123,abc,456
234,abc,456
712,bcd,789

The output should be:
345,bcd,789
123,abc,456

The key for the records is the 2nd and 3rd fields. Fields are separated by a comma (,).

Try:

awk '!($2$3 in a){a[$2$3]++;print}' FS=',' file
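
A small caveat with the $2$3 form: plain concatenation can make two different field pairs collide, whereas the comma form joins the subscripts with awk's SUBSEP character and keeps them distinct. A made-up two-line example:

```shell
# "ab" "c" and "a" "bc" both concatenate to "abc", so the $2$3 key
# wrongly treats line 2 as a duplicate; the $2,$3 key keeps both lines
printf '1,ab,c\n2,a,bc\n' | awk '!a[$2$3]++' FS=,
printf '1,ab,c\n2,a,bc\n' | awk '!a[$2,$3]++' FS=,
```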

for i in `cat filename | cut -d "," -f2 | uniq`
do
sed -n '/'${i}'/{p;q;}' filename
done

Or:

awk '!x[$2,$3]++' FS="," file

Use nawk or /usr/xpg4/bin/awk on Solaris.
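
For reference, running that one-liner on the sample data from the original post should give exactly the requested output:

```shell
printf '%s\n' '345,bcd,789' '123,abc,456' '234,abc,456' '712,bcd,789' |
  awk '!x[$2,$3]++' FS=,
# 345,bcd,789
# 123,abc,456
```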

You can also use the 'uniq' command to remove duplicate records, which is very easy to use. Check out its man page.

You need a sorted input for uniq ....
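
A tiny made-up example of why: uniq only collapses adjacent duplicate lines.

```shell
printf 'b\na\nb\n' | uniq          # prints b, a, b -- nothing collapsed
printf 'b\na\nb\n' | sort | uniq   # prints a, b -- duplicates gone, order gone
```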

Yes, I agree with you that uniq needs sorted input.

So check the original post.

awk '!x[$2,$3]++' FS="," file

This has been a true-grit workhorse one-liner.

I have used it extensively.

Does anyone know if it can be instructed to work within a range of values?
I suspect it wouldn't be a one-liner. :frowning:

For example:

My file contents are all numbers (a mixture of integers and floating point), where field one is a unique point number, field two is an X or Easting coordinate, field three is a Y or Northing coordinate, and field four is an elevation:

1,2.1,3.1,1.1
2,2.2,3.2,2.2
3,2.3,3.3,3.3
4,3.4,3.4,4.4
5,3.5,3.5,5.5
6,3.6,3.6,6.6
7,4.7,4.7,7.1
8,4.8,4.8,8.8
9,4.9,4.9,9.9

I would like to process the file via fields two and three and have the result be:

1,2.1,3.1,1.1
3,2.3,3.3,3.3
4,3.4,3.4,4.4
6,3.6,3.6,6.6
7,4.7,4.7,7.1
9,4.9,4.9,9.9

So I think what I am asking is, that field 2 and 3 be considered duplicates if they are in the range "$2-0.1 to $2+0.1" and "$3-0.1 to $3+0.1".
That's the best way I have of describing it.

Thanks in advance,
Kenny.
:slight_smile:

If the input is ordered as the one you posted,
and if I understand correctly, you could use something like this:
(use nawk or /usr/xpg4/bin/awk on Solaris)

awk -F, '
(x-0.1 >= $2 || $2 <= x+0.1) && (y-0.1 >= $3 || $3 <= y+0.1) { next }
{ x = $2; y = $3 }
1' input
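
Wired up as a self-contained run against the sample coordinates from the post above, it should reproduce the requested output:

```shell
printf '%s\n' '1,2.1,3.1,1.1' '2,2.2,3.2,2.2' '3,2.3,3.3,3.3' \
  '4,3.4,3.4,4.4' '5,3.5,3.5,5.5' '6,3.6,3.6,6.6' \
  '7,4.7,4.7,7.1' '8,4.8,4.8,8.8' '9,4.9,4.9,9.9' |
awk -F, '
(x-0.1 >= $2 || $2 <= x+0.1) && (y-0.1 >= $3 || $3 <= y+0.1) { next }
{ x = $2; y = $3 }
1'
# 1,2.1,3.1,1.1
# 3,2.3,3.3,3.3
# 4,3.4,3.4,4.4
# 6,3.6,3.6,6.6
# 7,4.7,4.7,7.1
# 9,4.9,4.9,9.9
```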

Thank you radoulov.

Your code works well when the data is in a sorted list.

Example: using your code with the range values -1.1 and +1.1 on file:

1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2
2-a,1000002.2,500002.2,102.2
2-b,1000002.2,500002.2,102.3
3-a,1000003.3,500003.3,103.3
3-b,1000003.3,500003.3,103.4
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
6-a,1000006.6,500006.6,106.6
6-b,1000006.6,500006.6,106.7
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
8-a,1000008.8,500008.8,108.8
8-b,1000008.8,500008.8,108.9
9-a,1000009.9,500009.9,109.9
9-b,1000009.9,500009.9,110.0
10-a,1000010.0,500010.0,110.0
10-b,1000010.0,500010.0,110.1

I get the following result:

1-a,1000001.1,500001.1,101.1
3-a,1000003.3,500003.3,103.3
5-a,1000005.5,500005.5,105.5
7-a,1000007.7,500007.7,107.7
9-a,1000009.9,500009.9,109.9

This is the expected result.

However, if I jumble up the records like this:

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

I only get the first line as output.
10-a,1000010.0,500010.0,110.0

Is there code that will work on an unsorted list?

My data sets are almost always listed in a random order.

Thank you again,
Kenny.

Do you want to preserve the order?
Otherwise you could sort the input first:

sort -t, -k2n,3n inputfile |
  awk -F, '
  (x-1.1 >= $2 || $2 <= x+1.1) && (y-1.1 >= $3 || $3 <= y+1.1) { next }
  { x = $2; y = $3 }
1'
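
If the original order matters, one common trick (only a sketch, untested against your full data) is decorate-sort-undecorate: number the lines, sort and deduplicate, then restore the numbering order and strip the numbers. Note that it keeps the sorted-order representatives, not necessarily the record that appears first in the file, and since x/y start out empty it assumes the coordinates are well away from zero:

```shell
awk '{ print NR "," $0 }' inputfile |     # decorate with the line number
  sort -t, -k3n,3 -k4n,4 |                # key fields are now $3 and $4
  awk -F, '
  (x-1.1 <= $3 && $3 <= x+1.1) && (y-1.1 <= $4 && $4 <= y+1.1) { next }
  { x = $3; y = $4 }
  1' |
  sort -t, -k1n,1 |                       # restore the original order
  cut -d, -f2-                            # strip the line number
```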

Yes I would like to preserve the order.

Is there code to process an unsorted list?

I would assume that the programming logic would then have to be:

  1. Keep the first record.
  2. Compare all remaining records to it, testing for duplicates in fields $2 and $3 [within the user defined range].
  3. Move down one record and repeat until you reach the last record.
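
A rough awk sketch of those three steps (only a sketch, not a polished solution; it compares the raw numbers, so exact boundary values can be thrown off by floating-point representation):

```shell
# keep the first unclaimed record, sweep out everything within +/- v of it
# in fields 2 and 3, then move on to the next unclaimed record
awk -F, -v v=1.1 '
{ rec[NR] = $0; x[NR] = $2; y[NR] = $3 }
END {
  for (i = 1; i <= NR; i++)
    if (!(i in dup)) {
      print rec[i]                        # the keeper
      for (j = i + 1; j <= NR; j++)
        if (!(j in dup) &&
            x[i] - v <= x[j] && x[j] <= x[i] + v &&
            y[i] - v <= y[j] && y[j] <= y[i] + v)
          dup[j] = 1                      # claimed as a duplicate
    }
}' InputFile
```

This prints only the keepers; the duplicates could be written to a second file from inside the inner loop in the same way.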

(previous post removed)

Could you please post the desired output from this input?

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Hi radoulov,

I think this might be more than just a one-liner, eh?

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Not sure what the output would be exactly.

But the objective would be to create two output files.

OutputFile1 would be the remainder records (i.e. the duplicates removed).
OutputFile2 would be the duplicates.

I apologize for the long-winded description, but I am not a programmer.

#####

The objective would be achieved by,
taking fields 2 and 3 from the first record

10-a,1000010.0,500010.0,110.0

and comparing them to fields 2 and 3 from the remainder of the records,
within the range $2-1.1 to $2+1.1 and range $3-1.1 to $3+1.1,
to determine which of the remaining records are to be considered duplicates.

I think the duplicates would be:

9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9

because their $2 and $3 fields satisfy the ranges.

So with record

10-a,1000010.0,500010.0,110.0

marked as a keeper and written to OutputFile1,

and with records

9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9

identified as duplicates and written to OutputFile2.

These 4 records would now be removed from processing by
possibly creating a new temporary file/list with these 4 records removed.

This would be the new temporary file/list with the 4 records removed.

2-b,1000002.2,500002.2,102.3
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Now the "new first record" would be:

2-b,1000002.2,500002.2,102.3

and gets written to OutputFile1.

Compare fields 2 and 3
to fields 2 and 3 from the remainder of the records
in the temporary list,
within the range $2-1.1 to $2+1.1 and range $3-1.1 to $3+1.1,
to determine which of the remaining records are to be considered duplicates.

I think the duplicates would be:

3-b,1000003.3,500003.3,103.4
3-a,1000003.3,500003.3,103.3
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

and get written to OutputFile2

The new temporary file/list with these 6 records removed would be:

8-a,1000008.8,500008.8,108.8
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

8-a,1000008.8,500008.8,108.8
becomes the "new first record"
and gets written to OutputFile1

8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8

are the duplicates and get written to OutputFile2

The new temporary file with these 4 records removed would be:

6-b,1000006.6,500006.6,106.7
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

6-b,1000006.6,500006.6,106.7
becomes the "new first record"
and gets written to OutputFile1

6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6

are the duplicates and get written to OutputFile2

The new temporary file with these 4 records removed would be:

4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5

4-a,1000004.4,500004.4,104.4
becomes the "new first record"
and gets written to OutputFile1

4-b,1000004.4,500004.4,104.5

is the duplicate and gets written to OutputFile2

Done
Whew!

Best Regards,
Kenny.

Ok,
not a one-liner :slight_smile:

It's all about casting to number :slight_smile:
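
In awk, two string operands compare as strings, so rounding both sides through sprintf("%.2f", ...) does two jobs here: it irons out floating-point noise like 1000002.1999999999, and it yields fixed-width strings that compare consistently for coordinates of this magnitude. The classic string-versus-number gotcha:

```shell
awk 'BEGIN { print ("10" < "9"), (10 < 9) }'
# 1 0 -- "10" sorts before "9" as a string, but not as a number
```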

awk -F, '{ _[NR] = $0 }
END {
  m = "%.2f"
  for (i=1; i<=NR; i++) {
    if (_[i]) {
      print _[i] > "out1"
      split(_[i], tt)
      delete _[i]
      for (j=1; j<=NR; j++) {
        if (_[j]) {
          split(_[j], t)
          if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
              (sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
            print _[j] > "out2"
            delete _[j]
          }
        }
      }
    }
  }
}' v=1.1 file

Can you explain this?

sed -n '/'${i}'/{p;q;}' filename

Pls don't hijack other people threads - start a new thread.

Hi radoulov,

I am not having any success (yet).

But I know that with your expertise (and patience) I will eventually find the solution. And I appreciate the time that you are taking on this puzzle.

First, I probably should come clean and confess that I am not on Unix.
My operating system is Microsoft Windows XP, so I am hoping that I won't be cast out of the forum :smiley:

I have some Unix-for-Windows utilities, mainly awk, awk95, mawk and gawk.

Which of these flavours would your code work with?

I have found that things like ' and " are handled differently between Unix and Microsoft Windows. So maybe that is why I get error messages when using your code.

One other question though. If you put the awk commands in a file (to be used with awk -f), does the v=1.1 go inside the awk command file or outside on the command line?

Examples of the error messages are below. This is after putting your awk commands in a file, with the v=1.1 inside the file. Note that I did not edit any character of the awk command string.
The reason I put the awk commands in a file is:
I edited them into one long command line and used it in my DOS command window, but the operating system said it couldn't find my input file. There were no errors from mawk.

=====

awk -f AwkCommandFile -F, InputFile

'{ _[NR] = $0 }
^
awk ERROR temp2.awk line 1: syntax error

=====

awk95 -f AwkCommandFile -F, InputFile

awk95: syntax error at source line 1 source file AwkCommandFile
context is
>>> ' <<<
awk95: bailing out at source line 21

=====

gawk -f AwkCommandFile -F, InputFile

gawk: AwkCommandFile:1: '{ _[NR] = $0 }
gawk: AwkCommandFile:1: ^ invalid char ''' in expression

=====

mawk -f AwkCommandFile -F, InputFile

mawk: 1: unexpected character '''
mawk: 21: unexpected character '''

=====

The contents of the AwkCommandFile was:

'{ _[NR] = $0 }
END {
m = "%.2f"
for (i=1; i<=NR; i++) {
if (_[i]) {
print _[i] > "out1"
split(_[i], tt)
delete _[i]
for (j=1; j<=NR; j++) {
if (_[j]) {
split(_[j], t)
if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
(sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
print _[j] > "out2"
delete _[j]
}
}
}
}
}
}' v=1.1

Again,
Many Thanks,
Kenny.

Put this into AwkCommandFile:

BEGIN { FS="," }
{ _[NR] = $0 }
END {
  m = "%.2f"
  for (i=1; i<=NR; i++) {
    if (_) {
    print _ > "out1"
    split(_, tt)
    delete _
    for (j=1; j<=NR; j++) {
      if (_[j]) {
      split(_[j], t)
      if ((sprintf(m, t[2] - v) <= sprintf(m, tt[2]) && \
sprintf(m, tt[2]) <= sprintf(m, t[2] + v)) && \
      (sprintf(m, t[3] - v) <= sprintf(m, tt[3]) && \
sprintf(m, tt[3]) <= sprintf(m, t[3] + v))) {
        print _[j] > "out2"
        delete _[j]
      }
     }
    }
   }
  }
 }

Run it like this:

awk -f AwkCommandFile v=1.1 InputFile

I put the v variable outside the script for convenience (you can run the script with different values without modifying the code).
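
One detail worth knowing (a generic illustration, not specific to this script): a var=value operand is only processed when awk reaches it on the command line, so the variable is still empty inside a BEGIN block; the -v option assigns it before the program starts.

```shell
awk 'BEGIN { print "[" v "]" }' v=1.1 /dev/null   # prints [] -- v not yet set
awk -v v=1.1 'BEGIN { print "[" v "]" }'          # prints [1.1]
```

Since this script does all its work in END, either form works here; awk -v v=1.1 -f AwkCommandFile InputFile is the equivalent -v invocation.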