Removing Duplicate Rows in a file

Hello

I have a file with contents like this...

Part1 Field2 Field3 Field4 (line1)
Part2 Field2 Field3 Field4 (line2)
Part3 Field2 Field3 Field4 (line3)
Part1 Field2 Field3 Field4 (line4)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)
...

The lines are appended throughout the day by various programs, so the file is in timestamp order. At the end of the day, I want to remove the older entries, since they are superseded. In the example above, I want to get rid of lines 1, 2 and 4, because there are more recent rows for those parts. Any empty rows left behind by the deletion should be removed too.

The expected result:

Part3 Field2 Field3 Field4 (line3)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)

Any help will be greatly appreciated.

I assume the (line N) markers were added for demonstration and are not in the real file?
If so, awk can do it:

awk '
 {s[$0]=NR}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}
' file

For big files the END section should sort on the line numbers instead of scanning the whole array once per input line. With perl it becomes:

perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file
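A sort is not strictly needed either. As a minimal sketch (the array names `last` and `line` and the temporary sample file are my own, not from the posts above), the END section can invert the array so that it is indexed by line number, and then walk the line numbers once:

```shell
# Hypothetical sample data: whole-line duplicates, as in the example file.
tmp=$(mktemp)
printf '%s\n' \
  'Part1 Field2 Field3 Field4' \
  'Part2 Field2 Field3 Field4' \
  'Part3 Field2 Field3 Field4' \
  'Part1 Field2 Field3 Field4' > "$tmp"

result=$(awk '
  { last[$0] = NR }                      # last line number seen for each distinct line
  END {
    for (k in last) line[last[k]] = k    # invert: line number -> line text
    for (i = 1; i <= NR; i++)            # walk the line numbers in input order
      if (i in line) print line[i]       # only the last occurrences have an entry
  }
' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This does one pass over the distinct lines and one pass over the line numbers, so the work grows linearly with the file size rather than with lines times distinct lines.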

Yes, the line numbers at the end were added for demonstration purposes.


I tried it, but it just returned the original values.

It works with this file:

Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4

OK, I see: it works only when the entire line is duplicated.

Is there any way to check only the first column rather than the entire row?

Thank you so much for sharing your experience and expertise.

Use s[$1] instead of s[$0] in awk.


s[$1] only stores the key (column 1) and its last line number, so one also needs to store the row itself.
Either every row, indexed by line number:

awk '
 {s[$1]=NR; row[NR]=$0}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[i]}
' file

Or only the latest row for each key:

awk '
 {s[$1]=NR; row[$1]=$0}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[j]}
' file

I wonder which one consumes less memory?
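On the memory question: the variant with row[NR] keeps every input line in memory, while the variant with row[$1] keeps one row per distinct key. If even that is too much, a two-pass approach stores no row text at all. The following is only a sketch under some assumptions: the input is a regular file that can be read twice (not a pipe), the key is column 1, and the array name `last` and the sample data are made up for illustration. It also drops empty lines, as requested in the first post.

```shell
# Hypothetical sample data, including an empty line that should be dropped.
tmp=$(mktemp)
printf '%s\n' \
  'Part1 A' \
  'Part2 B' \
  '' \
  'Part3 C' \
  'Part1 D' \
  'Part2 E' > "$tmp"

latest=$(awk '
  # Pass 1 (NR == FNR only while reading the first copy of the file):
  # remember the last line number of each key, ignoring empty lines.
  NR == FNR { if (NF) last[$1] = FNR; next }
  # Pass 2: print a non-empty line only if it is the last one for its key.
  NF && last[$1] == FNR
' "$tmp" "$tmp")

printf '%s\n' "$latest"
rm -f "$tmp"
```

The file name is given twice so that awk reads it twice. The output keeps the original order because the second pass prints lines as it meets them.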


Thank you very much. It works perfectly.