Average across multiple columns with group by

Hi experts,

I want to compute averages for multiple columns, starting at column $7 and running through NF, grouped by ($1-$5). Please help.

For just the 7th column, I can do:

awk '
    NR>1{
        arr[$1,$2,$3,$4,$5]   += $7
        count[$1,$2,$3,$4,$5] += 1
    }
    END{
        for (a in arr) {
            print a, arr[a]/count[a]
        }
    }
' FILE

You can use a for loop in this manner:

 
 sum=0; for(I = 7; I <= NF; I++) { sum += $I; }
 

where $I would be the values of your 7th, 8th, etc. fields.
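For context, here is a minimal sketch (mine, not part of the thread; it assumes numeric data in fields 7 through NF and the same FILE placeholder as above) of what that loop computes when dropped into the original program, namely one total per input line:

awk '
NR > 1 {
    sum = 0
    for (I = 7; I <= NF; I++)
        sum += $I                      # adds fields 7..NF of the current row
    print $1, $2, $3, $4, $5, sum      # one sum per input line
}' FILE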

Wouldn't that add all the elements in the same row? I need the average for each of the columns separately.

Sorry, I misunderstood your request.

Please show us some sample input, show us the output you expect from that sample input, tell us what OS and shell you're using, and tell us something about the bounds we should expect on the files you'll be processing (e.g., maximum number of fields, ranges of values in those fields, number of lines, whether the number of fields within an input file is constant, etc.).

Does the code you showed us give you the results you want for column 7 in your input file?


Sorry about not providing a sample input. I'm using Cygwin. The data values range from 0 to 100,000 and should be output to 2 decimal places.
The data is 83,000 lines, not very big.

Yes, the code calculates the average correctly, but only for the 7th column. Also, I should populate arr as arr[$1" "$2" "$3" "$4" "$5] to get all the key fields delimited in the output.
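For reference, a tiny illustration (mine, not from the thread) of why the key needs an explicit delimiter when it is printed back out: a comma-separated subscript joins the fields with SUBSEP (the unprintable "\034" by default), while concatenating with " " gives a readable key.

awk 'BEGIN {
    arr["a1", "b1"] = 1                          # key is "a1" SUBSEP "b1"
    arr2["a1" " " "b1"] = 1                      # key is the literal string "a1 b1"
    for (k in arr)  print "comma key: [" k "]"   # fields glued by \034
    for (k in arr2) print "space key: [" k "]"   # prints: space key: [a1 b1]
}'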

The sample input has 4 data columns; I have many more in the original data, starting at column 7 through NF ($22).

  a1 b1 c1 d1 e1 12 13 14 15
  a1 b1 c1 d1 e1 14 15 16 17
  a1 b1 c1 d1 e1 13 14 15 16
  a2 b1 c1 d1 e1 112 113 114 115
  a2 b1 c1 d1 e1 114 115 116 117
  a2 b1 c1 d1 e1 113 114 115 116

Output should be

  a1 b1 c1 d1 e1 13 14 15 16
  a2 b1 c1 d1 e1 113 114 115 116

The code you showed us in your 1st post in this thread skips data in the 1st line of your file (which I assumed was intended to skip over a header line). But I don't see any headers in this sample. Is there a header, or not? If there is a header, should it be copied to the output?

Is the number of fields constant in an input file, or can it vary from line to line?

It looks like there is a leading space in your sample input and output. Is a leading space required in your output?

Do you want 2 decimal places in all computed output fields, or do you want values to be printed without decimal places (as in your sample output) in cases where the computed result is an integral value?
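On that decimal-places question, here is a tiny illustration (mine, not from the thread) of the difference between the two formatting choices in awk's printf:

awk 'BEGIN {
    printf "%.2f %.2f\n", 13, 13.6667   # always two decimals: 13.00 13.67
    printf "%g %g\n",     13, 13.6667   # drops them when integral: 13 13.6667
}'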

You say you want to calculate averages for fields 7 through NF, but your sample output also averages field 6. Is field 6 supposed to be ignored in calculations and removed from the output, or is field 6 to be averaged as well as fields 7 through NF?

I want to edit the sample input slightly to accommodate NA missing values, and remove the leading spaces. For a finite number of columns you can of course use arr1, arr2, arr3, etc., just as you have used arr (see the sketch after the expected output below).

 
a1 b1 c1 d1 e1 12 13 14 15
a1 b1 c1 d1 e1 14 15 16 17
a1 b1 c1 d1 e1 NA 14 15 16
a2 b1 c1 d1 e1 112 113 114 115
a2 b1 c1 d1 e1 114 115 116 117
a2 b1 c1 d1 e1 113 NA 115 116

Output should be


a1 b1 c1 d1 e1 13 14 15 16
a2 b1 c1 d1 e1 113 114 115 116
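Here is a minimal sketch (mine, not from the thread) of that one-array-per-column idea for this sample. It assumes exactly 4 data columns in fields 6-9 and at least one numeric value per group and column, and the file name is a placeholder; NA fields are skipped in both the sum and the count:

awk '
{
    k = $1 " " $2 " " $3 " " $4 " " $5              # group key from fields 1-5
    keys[k]
    if ($6 != "NA") { arr1[k] += $6; cnt1[k]++ }
    if ($7 != "NA") { arr2[k] += $7; cnt2[k]++ }
    if ($8 != "NA") { arr3[k] += $8; cnt3[k]++ }
    if ($9 != "NA") { arr4[k] += $9; cnt4[k]++ }
}
END {
    for (k in keys)
        print k, arr1[k]/cnt1[k], arr2[k]/cnt2[k], arr3[k]/cnt3[k], arr4[k]/cnt4[k]
}' file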

I adjusted your original code to support N counts/accumulators (a pair for each column). I use the isnum function from Wikipedia.

 
awk '
# isnum: true if x looks like a number (an "NA" field fails this test)
function isnum(x){
        return(x == x + 0);
}
NR > 1 {
        a = $1 $2 $3 $4 $5;                     # group key built from fields 1-5
        keys[a] = 1;
        for(I = 7; I <= NF; I++) {
                b = I a;                        # per-column key: column number + group key
                if(isnum($I)) {
                        arr[b]   += $I;         # sum of the numeric values seen
                        count[b] += 1;          # how many numeric values were summed
                }
        }
}
END{
        for (key in keys) {
                printf "%-16s ", key;
                for(I = 7; I <= NF; I++) {
                        b = I key;
                        printf "%8.2f ", arr[b] / count[b];
                }
                printf "\n";
        }
}
'
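As a side note (mine, not part of the original post), the x == x + 0 test in isnum can be checked in isolation:

awk '
function isnum(x) { return (x == x + 0) }
BEGIN {
    # a numeric value passes, strings such as "NA" or "abc" do not
    print isnum(12), isnum("NA"), isnum("abc")   # prints: 1 0 0
}'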
 

The input file:

 
 HEADER
a1 b1 c1 d1 e1 f1  1  2  4  5
a1 b1 c1 d1 e1 f1 NA  2  6  7
a1 b1 c1 d1 e1 f1  1  2  5  6
a2 b1 c1 d1 e1 f1 12 13 14 15
a2 b1 c1 d1 e1 f1 14 15 16 17
a2 b1 c1 d1 e1 f1 13 14 15 16 
 

Results

 
a2b1c1d1e1          13.00    14.00    15.00    16.00 
a1b1c1d1e1           1.00     2.00     5.00     6.00 
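As a quick check of my own arithmetic (not part of the original post): the 1.00 in the a1 row is (1 + 1) / 2, because the NA in the second data line is excluded from both the sum and the count.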
 

You could also try something like this:

awk '
NR == 1 {
	# Note that this copies in the input header to the output, but the
	# output will NOT include field 6 from the input.
	print
	next
}
{	# Gather keys:
	key[k = $1 FS $2 FS $3 FS $4 FS $5]
	# Accumulate counts and data from non-"NA" data fields.
	for(i = 7; i <= NF; i++) {
		if($i != "NA") {
			data[k, i] += $i
			cnt[k, i]++
		}
	}
}
END {	for(k in key) {
		printf("%s ", k)
		for(i = 7; i <= NF; i++)
			if(cnt[k, i])
				printf("%.2f%s", data[k, i] / cnt[k, i],
					(i == NF) ? "\n" : " ")
			else	printf("NA%s", (i == NF) ? "\n" : " ")
	}
}' file

With the following sample input:

This is a header line
a1 b1 c1 d1 e1 12 13 14 15
a1 b1 c1 d1 e1 14 15 16 17
a1 b1 c1 d1 e1 13 NA 15 16
a2 b1 c1 d1 e1 112 113 114 115
a2 b1 c1 d1 e1 114 115 116 117
a2 b1 c1 d1 e1 113 114 115 NA
a3 b2 c1 d2 e3 110 111 112 NA
a3 b2 c1 d2 e3 110 111 113 NA
a3 b2 c1 d2 e3 110 112 113 NA

the above code produces the output:

This is a header line
a2 b1 c1 d1 e1 114.00 115.00 116.00
a1 b1 c1 d1 e1 14.00 15.00 16.00
a3 b2 c1 d2 e3 111.33 112.67 NA
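To spell out one of those numbers (my arithmetic, not from the post): 111.33 for the a3 group is (111 + 111 + 112) / 3, and the trailing NA appears because every field-9 value in that group is NA, so cnt[k, 9] stays 0.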

Note that with this input, the code migurus suggested will produce output similar to the following:

a2b1c1d1e1         114.00   115.00   116.00 
a1b1c1d1e1          14.00    15.00    16.00 
a3b2c1d2e3         111.33   112.67 awk: division by zero
 input record number 10, file file
 source line number 23

If someone wants to try the above awk script on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
