Awk- Pivot Table Averages

Hi everyone,

Has anyone figured out yet how to do pivot table averages using AWK. I didn't see anything with regards to doing averages.

For example, suppose you have the following table with various individuals and their scores in round1 and round2:

SAMPLE    SCORE1    SCORE2
British    15    19.5
British    7    9.1
British    8    10.4
British    11    14.3
German    6    7.8
German    7    9.1
Italian    10    13
Italian    3    3.9
Italian    19    24.7
Italian    9    11.7
Italian    6    7.8

The objective is to do a pivot table using Awk to calculate the average scores by country such as shown in the output below.

POPULATION    AVG SCORE1    AVG SCORE2
British    10.25    13.33
German    6.50    8.45
Italian    9.40    12.22

Any ideas how to do this using Awk?

awk '
NR > 1 {aa[$1]; a[$1]++; b[$1,1]+=$2; b[$1,2]+=$3;}
END {
   printf("%-10s\t%-10s\t%-10s\n", "POPULATION", "AVG SCORE1", "AVG SCORE2");
   for (i in aa) printf("%-10s\t%10.2f\t%10.2f\n", i, b[i,1]/a, b[i,2]/a);
}' datafile
1 Like

When looking at the aa[$1]; a[$1]++; ,
the first array is not needed because its information (the index) is also present in the second array.
So can be simply a[$1]++; , and the reference changes to for (i in a)

1 Like

Awesome rdrtx!! good work. Any idea why there is an extra line inserted in the output with 0 results. Here is the output based on your script:

POPULATION    AVG SCORE1    AVG SCORE2
                    0.00          0.00
Italian             9.40         12.22
German              6.50          8.45
British            10.25         13.32

Also, MadeInGermany your are spot on.

maybe blank lines in input file. try updating NR line to:

NR > 1 && NF
1 Like

That did the trick :slight_smile:

The following variant adapts to the number of columns, using a custom field width %*s (denotes an additional argument for the field width)

awk '
(NR>1 && NF>0) {
  if (NF>nf) nf=NF
  CNT[$1]++
  for (i=2; i<=NF; i++)
    SUM[$1,i]+=$i
}
END {
  colh="  AVG SCORE"
  out="POPULATION"
  lcolh=length(colh)
  lout=length(out)
  for (i=2; i<=nf; i++)
    out=(out colh i)
  print out
  for (c in CNT) {
    out=sprintf("%-*s", lout, c)
    for (i=2; i<=nf; i++)
      out=sprintf("%s %*.2f", out, lcolh, SUM[c,i]/CNT[c])
    print out
  }
}
' datafile
1 Like