Awk- Pivot Table Averages

Geneanalyst · January 21, 2018, 1:56pm

Hi everyone,

Has anyone figured out yet how to do pivot table averages using AWK. I didn't see anything with regards to doing averages.

For example, suppose you have the following table with various individuals and their scores in round1 and round2:

SAMPLE    SCORE1    SCORE2
British    15    19.5
British    7    9.1
British    8    10.4
British    11    14.3
German    6    7.8
German    7    9.1
Italian    10    13
Italian    3    3.9
Italian    19    24.7
Italian    9    11.7
Italian    6    7.8

The objective is to do a pivot table using Awk to calculate the average scores by country such as shown in the output below.

POPULATION    AVG SCORE1    AVG SCORE2
British    10.25    13.33
German    6.50    8.45
Italian    9.40    12.22

Any ideas how to do this using Awk?

rdrtx1 · January 21, 2018, 2:12pm

awk '
NR > 1 {aa[$1]; a[$1]++; b[$1,1]+=$2; b[$1,2]+=$3;}
END {
   printf("%-10s\t%-10s\t%-10s\n", "POPULATION", "AVG SCORE1", "AVG SCORE2");
   for (i in aa) printf("%-10s\t%10.2f\t%10.2f\n", i, b[i,1]/a, b[i,2]/a);
}' datafile

MadeInGermany · January 21, 2018, 5:18pm

When looking at the aa[$1]; a[$1]++; ,
the first array is not needed because its information (the index) is also present in the second array.
So can be simply a[$1]++; , and the reference changes to for (i in a)

Geneanalyst · January 21, 2018, 6:57pm

rdrtx1:

awk '
NR > 1 {aa[$1]; a[$1]++; b[$1,1]+=$2; b[$1,2]+=$3;}
END {
   printf("%-10s\t%-10s\t%-10s\n", "POPULATION", "AVG SCORE1", "AVG SCORE2");
   for (i in aa) printf("%-10s\t%10.2f\t%10.2f\n", i, b[i,1]/a, b[i,2]/a);
}' datafile

Awesome rdrtx!! good work. Any idea why there is an extra line inserted in the output with 0 results. Here is the output based on your script:

POPULATION    AVG SCORE1    AVG SCORE2
                    0.00          0.00
Italian             9.40         12.22
German              6.50          8.45
British            10.25         13.32

Also, MadeInGermany your are spot on.

rdrtx1 · January 21, 2018, 7:16pm

maybe blank lines in input file. try updating NR line to:

NR > 1 && NF

Geneanalyst · January 21, 2018, 7:36pm

That did the trick

MadeInGermany · January 22, 2018, 4:47am

The following variant adapts to the number of columns, using a custom field width %*s (denotes an additional argument for the field width)

awk '
(NR>1 && NF>0) {
  if (NF>nf) nf=NF
  CNT[$1]++
  for (i=2; i<=NF; i++)
    SUM[$1,i]+=$i
}
END {
  colh="  AVG SCORE"
  out="POPULATION"
  lcolh=length(colh)
  lout=length(out)
  for (i=2; i<=nf; i++)
    out=(out colh i)
  print out
  for (c in CNT) {
    out=sprintf("%-*s", lout, c)
    for (i=2; i<=nf; i++)
      out=sprintf("%s %*.2f", out, lcolh, SUM[c,i]/CNT[c])
    print out
  }
}
' datafile