Removing specific records from a file when the key is duplicated

Hello

I have been trying to remove a row from a file when it has the same first three columns as another row. I have tried lots of different combinations of suggestions from this forum but can't get it exactly right.

What I have is:

900 - 1000 = 0
900 - 1000 =  2562
1000 - 1100 = 0
1000 - 1100 =  931
1100 - 1200 = 0
1100 - 1200 =  469
1200 - 1300 = 0
1300 - 1400 = 0
1300 - 1400 =  175
1400 - 1500 = 0
1400 - 1500 =  112 

What I want is:

900 - 1000 =  2562
1000 - 1100 =  931
1100 - 1200 =  469
1200 - 1300 = 0
1300 - 1400 =  175
1400 - 1500 =  112 

Any help would be greatly appreciated
:confused:

Let's give it a try.

awk '{a[$1]=$0; next}END{for (i in a) {print a[i]}}' filename | sort -n
  1. Thanks for the quick reply.
  2. Your recommendation works with a larger amount of data, and now I have a large batch of data that I need to parse, but I can handle that.
  3. I owe you a tasty beverage if you are in my neck of the woods :slight_smile:

It works great but I don't get it. :confused:

To me it looks like you save each record in an array, using $1 as the index. That's OK, I understand. Then you decide to jump to the next record. Why?

And then it magically works: everything is saved in the array and printed at the end.

I can't see the light in this one. Could you explain it a little bit?

Since it is the first three columns, technically that would need to be:

awk '{a[$1,$2,$3]=$0; next} .....

Now I understand.

Because there is always a first occurrence that equals 0, which does not count, and it has the same array index, the second value overwrites the first. So the second value for a given pattern is always the one that is saved, when there is a second value with the same pattern.

This time the key was paying attention to the index and to how awk stores values in the array.

Thanks.

There is no test for 0. It is not necessarily the second line with a given value for the 1st three fields that is saved in the array; it is the last line with a given value for the 1st three fields that is saved. If there is one line with 900 , - , and 1000 as the 1st three fields on the line, respectively, a[$1, $2, $3] 's value (or in this case a["900", "-", "1000"] 's value) will be that entire line. If there is more than one line with 900 , - , and 1000 as the 1st three fields on the line, respectively, a[$1, $2, $3] 's value will be the last line starting with those three values.
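The "last line wins" behaviour can be checked with a small sketch. The function name dedup_last is made up for illustration, and the sample lines are a subset of the data in the question; the awk body is the three-field variant of the one-liner above, without the redundant next.

```shell
# Hypothetical helper: read records on stdin, keep the LAST line seen
# for each combination of the first three fields, then sort numerically.
dedup_last() {
    awk '{ a[$1,$2,$3] = $0 }              # a later line overwrites an earlier one
         END { for (k in a) print a[k] }' | sort -n
}

# Sample records taken from the question.
printf '%s\n' \
    '900 - 1000 = 0' \
    '900 - 1000 =  2562' \
    '1000 - 1100 = 0' \
    '1000 - 1100 =  931' \
    '1200 - 1300 = 0' | dedup_last
# prints:
# 900 - 1000 =  2562
# 1000 - 1100 =  931
# 1200 - 1300 = 0
```

The 1200 - 1300 line survives with its 0 only because no later line shares its key, not because awk looks at the value.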

When processing an array with:

for(i in a)

the elements are processed in an unspecified order (not necessarily the order in which they were found in the input file). This is why aia piped the output through sort -n: it prints the output in the same order as the (numerically sorted) input file.
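If the input were not already sorted, sort -n would reorder it. One possible alternative, sketched here under the assumption that the original input order should be kept, is to record each key the first time it appears and print in that order. The names dedup_keep_order, last, and order are hypothetical.

```shell
# Hypothetical helper: keep the last line per three-field key,
# but print the keys in the order they first appeared in the input.
dedup_keep_order() {
    awk '
        {
            key = $1 SUBSEP $2 SUBSEP $3            # same key that a[$1,$2,$3] would build
            if (!(key in last)) order[++n] = key    # remember first appearance
            last[key] = $0                          # later lines overwrite earlier ones
        }
        END { for (i = 1; i <= n; i++) print last[order[i]] }
    '
}

# Deliberately unsorted sample input.
printf '%s\n' \
    '1300 - 1400 = 0' \
    '900 - 1000 = 0' \
    '1300 - 1400 =  175' \
    '900 - 1000 =  2562' | dedup_keep_order
# prints:
# 1300 - 1400 =  175
# 900 - 1000 =  2562
```

This avoids relying on the traversal order of for (i in a) at the cost of a second array.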


Thanks Don. Yes, the last line with a given value is the one that is saved.

Thank you. :b: