How to average the third column for each value of the first column?

PMFCG · November 12, 2015, 4:14pm

So I have a large amount of comma delimited data that looks like this:

30.498001,-87.881412,0.024958
30.498001,-87.881412,0.035684
30.498001,-87.881412,,0.026
34.758781,-87.650562,0.034292
34.758781,-87.650562,0.029458
32.498567,-86.136587,0.045458
32.498567,-86.136587,0.036292
32.498567,-86.136587,0.024125
32.498567,-86.136587,0.01775

I need to change this data such that the third column is averaged for each different value of the first column. So I would have something like this:

30.498001,-87.881412, average
34.758781,-87.650562, average
32.498567,-86.136587, average

So for each unique value in the first column the third column is averaged. (only the first column matters because there are no two rows in my data with the same first column but different second columns0

I have tried numerous awk and Perl solutions to similar problems online. I am stumped.

If someone could point me in the right direction that would be incredible.

Don_Cragun · November 12, 2015, 4:28pm

Please show us what you have tried. It helps us figure out what problems you are having and the level of your expertise.

Please also tell us what operating system and shell you are using.

RudiC · November 12, 2015, 4:37pm

Is the second column value always equal to the first column value?

wisecracker · November 12, 2015, 4:46pm

I am missing something here.

30.498001,-87.881412,,0.026

There is no third column!

Do we assume that this NULL is the value zero?
Otherwise how do you want us to average a none existant value?

PMFCG · November 12, 2015, 4:55pm

The first two column values are lat/long values. They are for specific locations, so considering the first column as an id is fine, I think.

I am using UNIX on a mac (not sure if that's what you needed to know). I am VERY inexperienced, only been working with UNIX for a few months.

So far I have tried switching this code around in the terminal. It is for averaging rows with the same first column.
$ awk '
NR>1{
arr[$1] += $2
count[$1] += 1
}
END{
for (a in arr) {
print "id avg " a " = " arr[a] / count[a]
}
}
' FILE

I have tried to switch the $2 to $3 and switched the order so that it would be averaging columns not rows, but I don't think this is even the right approach. No matter what I try I get "0" as the output for my average...

---------- Post updated at 05:55 PM ---------- Previous update was at 05:49 PM ----------

30.498001,-87.881412,0.026

1,2,3

Don_Cragun · November 12, 2015, 4:58pm

Assuming that the data is already in sorted order (as in your example), you could try:

awk '
BEGIN {	FS = OFS = ","
}
function print_av() {
	if(NR <= 1)
		return
	print sum / cnt
	sum = cnt = 0
}
$1 != last {
	print_av()
	printf("%s%s%s%s", $1, OFS, $2, OFS)
	last = $1
}
{	sum += $3
	cnt++
}
END {	print_av()
}' file

But, note as wisecracker said, you need to clean up your input data so that each line has three fields. It would probably be better to use the latitude and the longitude as in:

awk '
BEGIN {	FS = OFS = ","
}
function print_av() {
	if(NR <= 1)
		return
	print sum / cnt
	sum = cnt = 0
}
$1 FS $2 != last {
	print_av()
	printf("%s%s%s%s", $1, OFS, $2, OFS)
	last = $1 FS $2
}
{	sum += $3
	cnt++
}
END {	print_av()
}' file

RudiC · November 12, 2015, 5:19pm

You weren't too far off ... compare this to what you posted, and you'll find your mistakes:

awk '
        {arr[$1] += $3
         count[$1] ++
        }
END     {for (a in arr) {print "id avg " a " = " arr[a] / count[a]
                        }
        }
' FS="," file
id avg 32.498567 = 0.0309062
id avg 30.498001 = 0.020214
id avg 34.758781 = 0.031875

Aia · November 12, 2015, 9:59pm

rudic:

You weren't too far off ... compare this to what you posted, and you'll find your mistakes:

awk '
   {arr[$1] += $3
   count[$1] ++
   }
END     {for (a in arr) {print "id avg " a " = " arr[a] / count[a]
   }
   }
' FS="," file
id avg 32.498567 = 0.0309062
id avg 30.498001 = 0.020214 <--- it should be close to 0.0288807
id avg 34.758781 = 0.031875

Only that the input is not consistent and line 3 has an extra field, producing the wrong average.

Perhaps,

awk '
        {arr[$1] += $NF
         count[$1] ++
        }
END     {for (a in arr) {print "id avg " a " = " arr[a] / count[a]
                        }
        }
' FS="," pmfcg.file
id avg 32.498567 = 0.0309062
id avg 30.498001 = 0.0288807
id avg 34.758781 = 0.031875

A Perl alternative:

perl -nle '
    /^([\d\.-]+,[\d\.-]+).*?([\d\.-]+)$/ and @{$u{$1}}[0] += $2 and @{$u{$1}}[1]++;
    END{for $k (keys %u){print "$k,", @{$u{$k}}[0]/@{$u{$k}}[1]}}
' pmfcg.file

output:

34.758781,-87.650562,0.031875
32.498567,-86.136587,0.03090625
30.498001,-87.881412,0.0288806666666667

PMFCG · November 18, 2015, 4:05pm

Ok, so I tried this

awk '
        {arr[$1] += $3
         count[$1] ++
        }
END     {for (a in arr) {print "id avg � a " = " arr[a] / count[a]
                        }
        }
' FS="," file

This is exactly what I need! Thanks so much for the pointers!

However, I also need to print it in the following format:

Lat,Long,avg

As of now, it gives me this:

id avg 32.498567 = 0.0309062
id avg 30.498001 = 0.0288807
id avg 34.758781 = 0.031875

How could I ammend the above code to give me the LONG too? Otherwise I would have to manually pair the LAT with the LONg.... yuck...

---------- Post updated at 05:05 PM ---------- Previous update was at 04:59 PM ----------

Alright, Aia's Perl solution did exactly what I needed to do. Thanks!!!!!