sort and summarize

Hi Guys,

I have a file in UNIX with duplicates, I have use sort command as below to delete duplicates based on the KEY positions/columns but now I do not want to "delete" duplicates but summarize by KEY numeric columns.

REALLY NEED HELP... URGENT!!!

Thanks in advance.

sort -k 1.1,1.92 -u file > outfile

I don't think sort does that natively...

If you can provide an example input and an example output showing what you want done, it's probably scriptable.

Here is the example:

1288M99G14 ALA201001+00000000.000+00000005.000
1288M99G14 ALA201001+00000000.000+00000005.000
1288M99G14 ALB201001+00000005.000+00000000.000
1288M99G14 ALA201002+00000000.000+00000017.000
1288M99G14 ALB201001+00000017.000+00000000.000
1288M99G14 ALA201002+00000000.000+00000005.000

Output:

1288M99G14 ALA201001+00000000.000+00000010.000
1288M99G14 ALB201001+00000023.000+00000000.000
1288M99G14 ALA201002+00000000.000+00000023.000

So summarize by first 2 fields

Ah, so it's totalling them...

Sounds like a awk or perl solution would be the way to go.
You can then pipe the ouput through sort to get whatever order you want. You've already got the sort right (without the -u of course) so I'll focus on the totaling part...

As I'm not great with awk, I'll try perl, I'm sure one of the awk wizzes around here can offer up a solution for that :wink:

#!/bin/perl -w

while (<>) {
  ($name,$left,$right)=split(/\+/);
  $vals{$name}{"left"}+=$left;
  $vals{$name}{"right"}+=$right;
}

foreach $name (keys %vals) {
  printf "%s\+%012.3f\+%012.3f\n",${name},$vals{$name}{'left'},$vals{$name}{'right'};
}

filename=$1
sort $filename|
awk ' BEGIN {FS="+"; prev_key1=""; prev_key2=0; prev_key2=0; first=1; }
{
# print "asdfdafsdfsdfasf|"prev_key1 "|"$1
if($1==prev_key1)
{
prev_key2 += $2;
prev_key3 += $3;
}
else
{
if(!first)
printf("%20.20s+%08.3f+%08.3f\n",prev_key1,prev_key2,prev_key3);
else first=0;
prev_key1 = $1;
prev_key2 = $2;
prev_key3 = $3;
}
}
END {printf("%20.20s+%08.3f+%08.3f\n",prev_key1,prev_key2,prev_key3);}'

The sort needs to be smarter, the OP was not sorting by the first element (but they have that bit working so I just left it out of the solution :slight_smile: )

Another Awk/sort try:

awk '{
	x[$1] += $2
	y[$1] += $3
} END {
	for (e in x)
		printf "%s+%012.3f+%012.3f\n",
		e, x[e], y[e]
}' FS="+" filename|sort -t " " -k2.4n

Use nawk or /usr/xpg4/bin/awk on Solaris
(or just write the printf statement on one line :)).