Removing specific records from files when duplicate key

tinytimmay · May 20, 2014, 11:23pm

Hello

I have been trying to remove a row from a file when it has the same first three columns as another row. I have tried lots of different combinations of suggestions from this forum but can't get it exactly right.

What I have is:

900 - 1000 = 0
900 - 1000 =  2562
1000 - 1100 = 0
1000 - 1100 =  931
1100 - 1200 = 0
1100 - 1200 =  469
1200 - 1300 = 0
1300 - 1400 = 0
1300 - 1400 =  175
1400 - 1500 = 0
1400 - 1500 =  112

What I want is:

900 - 1000 =  2562
1000 - 1100 =  931
1100 - 1200 =  469
1200 - 1300 = 0
1300 - 1400 =  175
1400 - 1500 =  112

Any help would be greatly appreciated

Aia · May 20, 2014, 11:49pm

Let's give it a try.

awk '{a[$1]=$0; next}END{for (i in a) {print a[i]}}' filename | sort -n

tinytimmay · May 21, 2014, 7:24am

Thanks for the quick reply.
Your recommendation works with a larger amount of data, and now I have a large bunch of data that I need to parse, but I can handle that.
I owe you a tasty beverage if you are ever in my neck of the woods.

Kibou · May 21, 2014, 2:28pm

It works great but I don't get it.

To me it looks like you save each record in an array, using $1 as the index. That's OK, I understand. Then you decide to jump to the next record. Why?

And then it magically works: everything is saved in the array and printed at the end.

I can't see the light in this one. Could you explain it a little bit?

Scrutinizer · May 21, 2014, 2:33pm

Since it is the first three columns, technically that would need to be:

awk '{a[$1,$2,$3]=$0; next} .....
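
For reference, here is a minimal sketch of a complete command using the three-field key, run against the sample data from the first post. The temporary file and the variable names input and result are only for illustration. The next from the one-liner above is left out: with a single pattern-action rule there is nothing after it to skip, so it makes no difference to the output.

```shell
# Sample input copied from the first post, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
900 - 1000 = 0
900 - 1000 =  2562
1000 - 1100 = 0
1000 - 1100 =  931
1100 - 1200 = 0
1100 - 1200 =  469
1200 - 1300 = 0
1300 - 1400 = 0
1300 - 1400 =  175
1400 - 1500 = 0
1400 - 1500 =  112
EOF

# Key the array on the first three fields. A later line with the same key
# overwrites the earlier one, so only the last line per key survives.
# END prints the survivors; sort -n puts them back in numeric order.
result=$(awk '{a[$1,$2,$3]=$0} END{for (k in a) print a[k]}' "$input" | sort -n)
printf '%s\n' "$result"
rm -f "$input"
```

With this input the result should match the "what I want" listing in the first post.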

Kibou · May 21, 2014, 5:41pm

Now I understand.

Because the first occurrence, which equals 0 and does not count, has the same array index as the second one, the second value overwrites the first. So the second value for a given pattern is always the one that is saved, if there is a second value with that pattern.

This time the key was paying attention to the index and to how awk stores values in the array.

Thanks.

Don_Cragun · May 21, 2014, 11:32pm

There is no test for 0. It is not necessarily the second line with a given value for the first three fields that is saved in the array; it is the last line with a given value for the first three fields that is saved. If there is one line with 900, -, and 1000 as the first three fields on the line, respectively, the value of a[$1, $2, $3] (or in this case a["900", "-", "1000"]) will be that entire line. If there is more than one line with 900, -, and 1000 as the first three fields on the line, respectively, the value of a[$1, $2, $3] will be the last line starting with those three values.
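
A quick way to see this is the small sketch below, which feeds the two 900 lines from the sample in the opposite order (the reordering is made up for illustration, as is the variable name result). The 0 line now comes last, so it is the one that is kept.

```shell
# Same key, but the line ending in 0 is now the last one read.
result=$(printf '%s\n' '900 - 1000 =  2562' '900 - 1000 = 0' |
    awk '{a[$1,$2,$3]=$0} END{for (k in a) print a[k]}')
printf '%s\n' "$result"
```

This prints 900 - 1000 = 0, which shows that the command keeps the last line per key rather than the non-zero one.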

When processing an array with:

for(i in a)

the elements are processed in an unspecified order (not necessarily the order in which they were found in the input file). This is why Aia used sort -n to print the output in the same order as the (sorted) input file.
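
If the input file is not already sorted, sort -n would change the order of the lines. One possible alternative, sketched here on the assumption that the wanted order is the order in which each key first appears, is to record the keys in a second array and loop over that in END. The names order, n, k and result are illustrative, and the input lines are a rearranged subset of the sample data.

```shell
result=$(printf '%s\n' \
    '1300 - 1400 = 0' \
    '900 - 1000 = 0' \
    '1300 - 1400 =  175' \
    '900 - 1000 =  2562' \
    '1200 - 1300 = 0' |
awk '
    {
        k = $1 SUBSEP $2 SUBSEP $3      # same key as a[$1,$2,$3]
        if (!(k in a)) order[++n] = k   # remember each key the first time it is seen
        a[k] = $0                       # the last line for a key still wins
    }
    END {
        # Walk the keys in first-seen order instead of using for (k in a).
        for (i = 1; i <= n; i++) print a[order[i]]
    }
')
printf '%s\n' "$result"
```

The output keeps the 1300 line first, then the 900 line, then the 1200 line, each with the last value seen for that key.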

Kibou · May 22, 2014, 3:04am

Thanks Don. Yes, the last line with a given value is the one that is saved.

Thank you.