NR==1 {
    print
    next
}
# print average of each column per year
# then, reset column sums and number of lines
function print_sum() {
    printf "%s", prev
    # needs GNU awk, for length of array
    for (i=2; i < length(sum) + 2; i++) {
        printf "%s%s", FS, sum[i]/nlines
        sum[i] = 0
    }
    printf "%s", ORS
    nlines = 0
}
# print average when $1 changes, but not the first time
# also, on end of script
NR>2 && prev!=$1 { print_sum() }
END { print_sum() }
# for every line with the same $1, sum column values, increment number of lines
{
    prev=$1
    nlines++
    for (i=2; i <= NF; i++) {
        sum[i]+=$i
    }
}
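The script's own comment notes that it needs GNU awk for length() on an array. As a rough sketch of one way around that, the variant below remembers the widest row seen in a variable instead. The names group_avg, emit_avg and ncols are made up for this illustration, and it assumes rows with the same first field are adjacent, as in the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): per-group column
# averages without length(array). Assumes equal $1 values are adjacent.
group_avg() {
    awk '
    NR == 1 { print; next }                  # pass the header line through
    # emit one row: group key, then the mean of each summed column
    function emit_avg(   i) {
        printf "%s", prev
        for (i = 2; i <= ncols; i++) {
            printf "%s%s", OFS, sum[i] / nlines
            sum[i] = 0
        }
        printf "%s", ORS
        nlines = 0
    }
    NR > 2 && $1 != prev { emit_avg() }      # key changed: flush previous group
    {
        prev = $1
        nlines++
        if (NF > ncols) ncols = NF           # remember the widest row seen
        for (i = 2; i <= NF; i++) sum[i] += $i
    }
    END { if (nlines) emit_avg() }           # flush the last group, if any
    '
}

# Small made-up sample: g1 averages to 3 and 7, g2 has a single row.
printf '%s\n' 'name s1 s2' 'g1 2 8' 'g1 4 6' 'g2 10 1' | group_avg
```

On that sample it prints the header followed by "g1 3 7" and "g2 10 1".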
The datamash utility makes the median and other statistical calculations fairly easy. Aside from the scaffolding code, the operative line is the datamash call:
#!/usr/bin/env bash
# @(#) s1 Demonstrate statistical calculations, median, datamash.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C datamash
FILE=${1-data1}
E=expected-output.txt
pl " Input data file $FILE:"
cat $FILE
pl " Expected output:"
cat $E
pl " Results (adjusted for visual with code align):"
datamash -H -g1 median 2 median 3 median 4 median 5 < $FILE |
align |
tee f1
pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2
pl " Some details for datamash:"
dixf datamash
exit $?
producing:
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.7 (jessie)
bash GNU bash 4.3.30
datamash (GNU datamash) 1.0.6
-----
Input data file data1:
name s1 s2 s3 s4
g1 2 8 6 5
g1 5 7 9 9
g1 6 7 8 9
g2 8 8 8 8
g2 7 7 7 7
g2 10 10 10 10
g3 3 12 1 24
g3 5 5 24 48
g3 12 3 12 12
g3 2 3 3 3
-----
Expected output:
name s1 s2 s3 s4
g1 5 7 8 9
g2 7 7 7 7
g3 4 4 7.5 18
-----
Results:
GroupBy(name) median(s1) median(s2) median(s3) median(s4)
g1 5 7 8 9
g2 8 8 8 8
g3 4 4 7.5 18
-----
Verify results if possible:
-----
Comparison of 4 created lines with 4 lines of desired results:
f1 expected-output.txt differ: char 1, line 1
Failed -- files f1 and expected-output.txt not identical -- detailed comparison follows.
1c1
< name s1 s2 s3 s4
---
> GroupBy(name) median(s1) median(s2) median(s3) median(s4)
3c3
< g2 7 7 7 7
---
> g2 8 8 8 8
Results cannot be verified.
-----
Some details for datamash:
datamash command-line calculations (man)
Path : /usr/bin/datamash
Version : 1.0.6
Type : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help : probably available with -h,--help
Repo : Debian 8.7 (jessie)
Home : https://savannah.gnu.org/projects/datamash/ (pm)
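The results and the expected output disagree on the header line and on group g2. The header difference is just how datamash -H labels its output columns. For g2, every column holds the values 8, 7 and 10, whose median is 8, so I would trust datamash over the expected 7, but you can redo the calculations to verify your answer. I tried again after sorting the file, and after interchanging the lines for g2, and got the same result.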
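To redo the calculation independently of datamash, something along these lines could be used. It is only a sketch: group_median and emit_median are hypothetical names, it handles one column per call, and it assumes whitespace-separated fields with a single header line. It sorts by group and then numerically by the chosen column, so the median of each group is the middle value, or the mean of the two middle values.

```shell
#!/bin/sh
# Hypothetical cross-check (not datamash): median of one column per group.
# Usage: group_median FIELD_NUMBER < file   (the first line is a header)
group_median() {
    col=$1
    # drop the header, then sort by group and numerically by the column
    tail -n +2 | sort -k1,1 -k"$col,${col}n" |
    awk -v c="$col" '
    # v[1..n] holds the sorted values of the current group
    function emit_median(   m) {
        m = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
        print key, m
    }
    NR > 1 && $1 != key { emit_median(); n = 0 }   # group changed
    { key = $1; v[++n] = $c }
    END { if (n) emit_median() }
    '
}

# Column 2 (s1) of the data1 file shown above.
group_median 2 <<'EOF'
name s1 s2 s3 s4
g1 2 8 6 5
g1 5 7 9 9
g1 6 7 8 9
g2 8 8 8 8
g2 7 7 7 7
g2 10 10 10 10
g3 3 12 1 24
g3 5 5 24 48
g3 12 3 12 12
g3 2 3 3 3
EOF
```

For s1 this prints "g1 5", "g2 8" and "g3 4", which matches the datamash column.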