Removing duplicates from delimited file based on 2 columns

kevinprood · August 12, 2014, 10:33pm

Hi guys,Got a bit of a bind I'm in. I'm looking to remove duplicates from a pipe delimited file, but do so based on 2 columns. Sounds easy enough, but here's the kicker...
Column #1 is a simple ID, which is used to identify the duplicate.
Once dups are identified, I need to only keep the one with the latest date, which is column #4, in mm/dd/yyyy format. Of course, rows that don't have dup's would remain as-is.
Example input.txt :

9300617000372|Skittles|Candy|5/1/2013|12
4381472200131|M&Ms|Chocolate|9/20/2013|39
9414789515104|Jif|Peanut Butter|11/8/2013|14
4381472200131|Reese's|Peanut Butter|5/20/2014|61
4381472200131|Reese's|Chocolate|2/20/2014|36

In that scenario, the output would be rows 1, 3, and 4, since rows 2 and 5 are duplicates based on the ID and are older than the one in row 4 based on date.
The other kicker is...
The file I'm doing this with is 400,000 rows. So, I need the method to be extremely efficient and as quick as possible. I can't afford for this to take hours.

This is running on a Windows machine with GnuWin utils, as one last note.
I am definitely not enough of an expert to make this work, especially efficiently, so I'm hoping someone can help. Many thanks in advance.

SriniShoo · August 13, 2014, 12:29am

awk -F '|' '{split($4, d, "/"); dt=d[3] d[1] d[2]
  if($1 in a) {
    if(b[$1] < dt) {b[$1] = dt; a[$1] = $0}}
  else {a[$1] = $0; b[$1] = dt}}
END {for(x in a) {print a[x]}}' file

Don_Cragun · August 13, 2014, 4:37am

Although it looks like this will work for the small sample given, with the date formats being used (with no leading zeros on the month and day fields), this won't work reliably. I think you need to change:

dt=d[3] d[1] d[2]

to:

dt=sprintf("%d%02d%02d", d[3], d[1], d[2])