Average across multiple columns - awk

theflamingmoe · November 20, 2016, 7:33pm

Hi forum members,

I'm trying to get an average of multiple columns in a csv file using awk. A small example of my input data is as follows:

cu,u3o8,au,ag
-9,20,-9,3.6
0.005,30,-9,-9
0.005,50,10,3.44
0.021,-9,8,3.35

The following code seems to do most of what I want

gawk -F"," 'NR==1;NR>1 {for (i=1;i<=4;i++){if($i>0)a+=$i}}END{ for (i=1;i<=4;i++){printf a/(NR-1)","};printf "\n"}' OFS=, input.csv > average.csv

However, the issue is i'm using the $i>o condition to filter out the -9 values and then dividing by NR to calculate the average, which gives the wrong answer. For the Cu column, ignoring -9 values, I should get an average of 0.0103 but instead I get 0.0077.
I tried using a counter as follows:

gawk -F"," 'NR==1;NR>1 {for (i=1;i<=4;i++){if($i>0)a+=$i}; ++count[$i]}END{ for (i=1;i<=4;i++){printf a/count[$i]","};printf "\n"}' OFS=, input.csv > average.csv

I can't seem to get it to work. Any help would be greatly appreciated.

jlliagre · November 20, 2016, 7:44pm

A couple of issues corrected:

gawk -F"," 'NR>1 {for (i=1;i<=4;i++){if($i>0){a+=$i;++count}}}END{ for (i=1;i<=4;i++){printf a/count","};printf "\n"}' input.csv > average.csv

theflamingmoe · November 20, 2016, 8:05pm

Thanks jlliagre, that worked perfectly. I have one final question. Say I have the same data, but added two groups, rd and ru. How would I get the average of the groups?

cu,u3o8,au,ag,group
-9,20,-9,3.6,ru
0.005,30,-9,-9,ru
0.005,50,10,3.44,rd
0.021,-9,8,3.35,rd

Desired output:

0.005,25,2,3.6,ru
0.013,50,9,3.39,rd

Thanks in advance,

Don_Cragun · November 20, 2016, 9:06pm

Since jillagre was kind enough to fix your earlier code to do what you said you wanted to do, why don't you show us what you have learned and try to add your new requirements to the code you now have. If you get stuck, we'll try to help; but we are here to help you learn how to write your own code. We are not here to act as your unpaid programming staff when your requirements change.

Will there always be two groups ( rd and ru )? Or do you want code that will work for any number of groups with names consisting of any strings?

Will all lines for a given group be adjacent (as in your example), or will data for various groups be intermixed?

theflamingmoe · November 20, 2016, 9:35pm

Fair enough Don

My various attempts were something along the line of adding a

for (group in a)

as part of the code. For example:

gawk -F"," 'NR>1 {for (i=1;i<=4;i++){if($i>0){a+=$i;++count;++group[$5]}}} END{for (g in group) {for (i=1;i<=4;i++){printf a/count","};printf "\n"}}' input.csv > average.csv

I just used "rd" and "ru" as an example. I would like to match any string in column 5.

Don_Cragun · November 20, 2016, 11:05pm

You didn't answer the question: "Will all lines for a given group be adjacent (as in your example), or will data for various groups be intermixed?" so I tried to write code that would work either way. Note that the order of the groups in the output from this is random. It uses two dimensional arrays to keep track of the sums and counts for each column and group. (If all lines for a group are adjacent, you could still use one-dimensional arrays and dump the data for each group when a new group is found.)

Given that both entries in the 3rd column for group ru are -9 , I don't understand why you believe the output for that column for that group should be 2 ; the following code produces 0 in this case:

gawk '
BEGIN {	FS = OFS = ","
}
NR > 1 {for(i = 1; i <= 4; i++)
		if($i > 0) {
			s[i, $5] += $i
			++count[i, $5]
			group[$5]
		}
}
END {	for(g in group) {
		for(i = 1; i <= 4; i++)
			printf("%g%s", count[i, g] ? s[i, g] / count[i, g] : 0,
			    OFS)
		print g
	}
}' input.csv > average.csv

Using awk on macOS 10.12.1, I get the output:

0.005,25,0,3.6,ru
0.013,50,9,3.395,rd

in average.csv with the sample input.csv data you provided in post #3.

theflamingmoe · November 21, 2016, 12:09am

Thanks for your help Don and sorry for not answering all your questions. The code works perfectly and matches my requirements. Regarding the "group ru are -9" in your post, it was a typo error. I had been working on this all morning and trying different input data. Thanks again for all your help