sort and semi-duplicate row - keep latest only

LisaS · January 8, 2009, 2:53am

I have a pipe-delimited file. The key is field 2 and the date is field 5 (this is a simplified example; my real file is more complicated, of course, but the KEY and DATE positions are accurate).
There can be several rows for the same key with different dates.
In that case I need to keep only the row with the latest date.
Example data:
W|AAA|DD|D|20080101
W|BBB|CC|C|20080101
W|AAA|BB|B|20080201
W|CCC|DD|D|20080701
W|CCC|EE|E|20080801
W|AAA|DD|D|20081231

I would want to see:
W|AAA|DD|D|20081231
W|BBB|CC|C|20080101
W|CCC|EE|E|20080801

I want to use sort for this but am open to other options. I'm guessing awk could be involved, but I'm bad at writing awk.
I've searched but didn't find anything that seemed to match.
Ideas?

edit to add a little more info:
It could be that I have rows with same key and same date. In that case, I'd prefer to take the one that is last in the file (because of differences in the other fields of the rows and how the file gets built) -- but if that is not possible I understand.
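
For reference, the requirement as stated (latest date per key, and on equal dates the row that appears last in the file) can be sketched in a single awk pass. This is only a sketch under assumptions: the function name keep_latest is made up for illustration, dates are assumed to be fixed-width YYYYMMDD so they compare correctly, and the output keeps keys in the order they first appear rather than sorted.

```shell
# Hypothetical helper: per key (field 2), keep the row with the latest
# date (field 5); when dates are equal, the later row in the file wins.
keep_latest() {
  awk -F'|' '
    !($2 in best) { order[++n] = $2 }     # remember the order keys first appear in
    !($2 in best) || $5 >= date[$2] {     # ">=" lets a later tie replace an earlier row
      date[$2] = $5
      best[$2] = $0
    }
    END { for (i = 1; i <= n; i++) print best[order[i]] }
  ' "$@"
}

# Run against the example data from the post.
keep_latest <<'EOF'
W|AAA|DD|D|20080101
W|BBB|CC|C|20080101
W|AAA|BB|B|20080201
W|CCC|DD|D|20080701
W|CCC|EE|E|20080801
W|AAA|DD|D|20081231
EOF
```

Since the whole file is read once and only one row per key is held in memory, this should scale to a large file as long as the number of distinct keys is manageable.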

LisaS · January 8, 2009, 4:06am

I made something that is working on my small test file (the real file has too much data to really hand check) - I'd appreciate if someone could take a look and critique/agree/tell me I'm all wet.

sort -t"|" -k2,2 -kr5,5 <myrawfile.txt | sort -ru -k2,2 | sort -t"|" -k2,2

(I know the 3rd sort might be over the top, but it gets them back in key sequence)
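
A sort-based variant that does not depend on which of several equal rows sort -u decides to keep might look like the sketch below. It tags each row with its line number, sorts by key, then date descending, then line number descending, keeps the first row per key, and strips the tag again. The function name latest_sorted is hypothetical. Note that the documented key syntax puts modifiers after the field number (-k5,5r rather than -kr5,5), and that every sort in a pipeline needs its own -t.

```shell
# Hypothetical helper: decorate with line numbers, sort, pick, undecorate.
# After decoration the key is field 3 and the date is field 6.
latest_sorted() {
  awk '{ print NR "|" $0 }' "$@" |
    LC_ALL=C sort -t'|' -k3,3 -k6,6nr -k1,1nr |   # key, newest date first, last line first
    awk -F'|' '!seen[$3]++' |                     # first row per key is the one to keep
    cut -d'|' -f2-                                # drop the line-number prefix
}

# Run against the example data from the first post.
latest_sorted <<'EOF'
W|AAA|DD|D|20080101
W|BBB|CC|C|20080101
W|AAA|BB|B|20080201
W|CCC|DD|D|20080701
W|CCC|EE|E|20080801
W|AAA|DD|D|20081231
EOF
```

The result comes out in key sequence, so no final sort is needed.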

summer_cherry · January 8, 2009, 4:25am

#! /usr/bin/perl
open FH,"a.txt";
while(<FH>){
	chomp;
	@arr=split("[|]",$_,3);
	$hash{$arr[1]}=$_;
}
close FH;
map {print $hash{$_},"\n"} sort keys %hash;

nawk '{
	split($0,arr,"|")
	_[arr[2]]=$0
}
END{
	for(i in _)
		print _[i]
}' a.txt | sort -t"|" -k2

jaduks · January 8, 2009, 4:33am

Not sure if this is what you are looking for:

$ awk '!x[$1,$2]++' FS="|" li.txt
W|AAA|DD|D|20080101
W|BBB|CC|C|20080101
W|CCC|DD|D|20080701

$ sort -t"|" -n -rk5 li.txt |  awk '!x[$1,$2]++' FS="|"
W|AAA|DD|D|20081231
W|CCC|EE|E|20080801
W|BBB|CC|C|20080101

radoulov · January 8, 2009, 4:35am

Something like this:

sort -t\| -k5nr infile|sort -t\| -uk2.1,2.3

With AWK (nawk or /usr/xpg4/bin/awk on Solaris):

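awk -F\| '
# Keep the row with the highest last field (the date) seen for each key ($2).
$NF > m[$2] {
  m[$2] = $NF
  r[$2] = $0
}
# for (k in r) visits the keys in an unspecified order.
END {
  for (k in r)
    print r[k]
}' file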

Notice that the output will not be ordered by the second field.
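
If key order matters, one option is to pipe the awk output through sort on the second field. A minimal sketch follows; the function name latest_by_key_sorted is made up for illustration, and the awk part is the same logic as above, so on equal dates the first row seen for a key is the one kept.

```shell
# Hypothetical wrapper: same awk logic as above, then sorted by key (field 2).
latest_by_key_sorted() {
  awk -F'|' '
    $NF > m[$2] { m[$2] = $NF; r[$2] = $0 }   # strictly newer date replaces the stored row
    END { for (k in r) print r[k] }           # unordered, hence the sort below
  ' "$@" | LC_ALL=C sort -t'|' -k2,2
}

# Run against the example data from the first post.
latest_by_key_sorted <<'EOF'
W|AAA|DD|D|20080101
W|BBB|CC|C|20080101
W|AAA|BB|B|20080201
W|CCC|DD|D|20080701
W|CCC|EE|E|20080801
W|AAA|DD|D|20081231
EOF
```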