Removing Duplicate Rows in a file

ekbaazigar · August 21, 2014, 3:26pm

Hello

I have a file with contents like this...

Part1 Field2 Field3 Field4 (line1)
Part2 Field2 Field3 Field4 (line2)
Part3 Field2 Field3 Field4 (line3)
Part1 Field2 Field3 Field4 (line4)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)
...

The lines are appended throughout the day by various programs, so the file is in timestamp order. At the end of the day, I want to remove the older rows, since they are superseded. In the example above, I want to get rid of lines 1, 2, and 4, because there are more recent rows for those parts. I also want to delete any empty rows left behind when a row is removed. The desired result is:

Part3 Field2 Field3 Field4 (line3)
Part4 Field2 Field3 Field4 (line5)
Part5 Field2 Field3 Field4 (line6)
Part2 Field2 Field3 Field4 (line7)
Part1 Field2 Field3 Field4 (line8)

Any help will be greatly appreciated.

MadeInGermany · August 21, 2014, 3:47pm

I assume the (lineN) markers were added for demonstration and are not in the real file?
Then, with awk:

awk '
 {s[$0]=NR}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print j}
' file

For big files, the END section should sort on the line numbers rather than use the nested loop. With perl it becomes:

perl -ne '
 $s{$_}=++$i;
 END {print sort {$s{$a}<=>$s{$b}} keys %s}
' file
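
A possible awk variant that avoids both the nested loop and the sort, offered as a sketch rather than a tested replacement: it remembers the line number of the latest copy of each line, forgets the superseded copy as soon as a duplicate arrives, and then prints the survivors in line-number order. The array name row and the wrapper function keep_last_line are made up for this illustration.

```shell
# keep_last_line is a hypothetical wrapper name used only for this sketch.
# It reads the named file(s), or standard input when no file is given.
keep_last_line() {
  awk '
    ($0 in s) { delete row[s[$0]] }   # an older copy exists: forget it
    { s[$0] = NR; row[NR] = $0 }      # remember this line under its line number
    END { for (i = 1; i <= NR; i++) if (i in row) print row[i] }
  ' "$@"
}
```

Called as keep_last_line file, it makes one pass over the input plus one loop over the line numbers, at the cost of holding one copy of each distinct line in memory.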

ekbaazigar · August 21, 2014, 6:10pm

Yes, the line numbers at the end were added for demonstration purposes.

I tried it, but it just returned the original values.

MadeInGermany · August 21, 2014, 6:44pm

It works with this file:

Part1 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part3 Field2 Field3 Field4
Part1 Field2 Field3 Field4
Part4 Field2 Field3 Field4
Part5 Field2 Field3 Field4
Part2 Field2 Field3 Field4
Part1 Field2 Field3 Field4

ekbaazigar · August 21, 2014, 7:51pm

OK, I see: it works only when the entire line is duplicated.

Is there any way to check only the first column rather than the entire row?

Thank you so much for sharing your experience and expertise.

RudiC · August 22, 2014, 4:42am

Use s[$1] instead of s[$0] in awk.

MadeInGermany · August 22, 2014, 4:51am

s[$1] stores only the key (column 1), so one also needs to store the rest of the row,
or simply the entire row:

awk '
 {s[$1]=NR; row[NR]=$0}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[i]}
' file

Or

awk '
 {s[$1]=NR; row[$1]=$0}
 END {for (i=1;i<=NR;i++) for (j in s) if (i==s[j]) print row[j]}
' file

I wonder which one consumes less memory?
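
Going only by what the arrays hold, the first variant keeps every row of the file in row[], while the second keeps one row per distinct key, so the second should need less memory when keys repeat often. If memory matters, a two-pass approach might be worth trying. The sketch below assumes the input is a regular file that can be read twice (not a pipe) and that the key is the first whitespace-separated field. It stores only the last line number per key, never the rows themselves, and it also skips empty lines. The names keep_last_by_key and last are made up for this illustration.

```shell
# keep_last_by_key is a hypothetical wrapper name used only for this sketch.
# Assumption: $1 names a regular file, because awk reads it twice.
keep_last_by_key() {
  awk '
    NR == FNR { if (NF) last[$1] = FNR; next }   # pass 1: last line number per key
    NF && last[$1] == FNR                        # pass 2: print only that line
  ' "$1" "$1"
}
```

Called as keep_last_by_key file, it prints the surviving rows in their original order. The trade-off is reading the file twice in exchange for not holding any row text in memory.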

ekbaazigar · August 25, 2014, 5:56pm

Thank you very much. It works perfectly.