How to print median values of a matrix - awk?

quincyjones · June 22, 2017, 10:51am

I use the following script to print the per-group mean of each column. How could I extend it to print medians instead? Thanks.

name	s1	s2	s3	s4
g1	2	8	6	5
g1	5	7	9	9
g1	6	7	8	9
g2	8	8	8	8
g2	7	7	7	7
g2	10	10	10	10
g3	3	12	1	24
g3	5	5	24	48
g3	12	3	12	12
g3	2	3	3	3

desired output

name	s1	s2	s3	s4
g1	5	7	8	9
g2	7	7	7	7
g3	4	4	7.5	18

script - mean

NR==1 {
    print
    next
}
    # print the average of each column for the current group ($1),
    #  then reset the column sums and the number of lines
function print_sum() {
    printf "%s", prev
    # needs GNU awk, for length of array
    for (i=2; i < length(sum) + 2; i++) {
            printf "%s%s", FS, sum[i]/nlines
            sum[i] = 0
    }
    printf "%s", ORS
    nlines = 0
}
    # print average when $1 changes, but not the first time
    # also, on end of script
NR>2 && prev!=$1 { print_sum() }
END              { print_sum() }
    # for every line with the same $1, sum column values, increment number of lines
{
    prev=$1;
    nlines++
    for (i=2; i <= NF; i++) {
            sum[i]+=$i
    }
}

drl · June 22, 2017, 11:32am

Hi.

The datamash utility makes the median and other statistical calculations fairly easy. Aside from the scaffolding code, the operative line is the datamash call:

#!/usr/bin/env bash

# @(#) s1       Demonstrate statistical calculations, median, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C datamash

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Expected output:"
cat $E

pl " Results (adjusted for visual with code align):"
datamash -H -g1 median 2 median 3 median 4 median 5 < "$FILE" |
align |
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

pl " Some details for datamash:"
dixf datamash

exit $?

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
datamash (GNU datamash) 1.0.6

-----
 Input data file data1:
name    s1      s2      s3      s4
g1      2       8       6       5
g1      5       7       9       9
g1      6       7       8       9
g2      8       8       8       8
g2      7       7       7       7
g2      10      10      10      10
g3      3       12      1       24
g3      5       5       24      48
g3      12      3       12      12
g3      2       3       3       3

-----
 Expected output:
name    s1      s2      s3      s4
g1      5       7       8       9
g2      7       7       7       7
g3      4       4       7.5     18

-----
 Results:
GroupBy(name)   median(s1)      median(s2)      median(s3)      median(s4)
g1              5               7               8               9
g2              8               8               8               8
g3              4               4               7.5             18

-----
 Verify results if possible:

-----
 Comparison of 4 created lines with 4 lines of desired results:
f1 expected-output.txt differ: char 1, line 1
 Failed -- files f1 and expected-output.txt not identical -- detailed comparison follows.
1c1
< name  s1      s2      s3      s4
---
> GroupBy(name) median(s1)      median(s2)      median(s3)      median(s4)
3c3
< g2    7       7       7       7
---
> g2            8               8               8               8

 Results cannot be verified.

-----
 Some details for datamash:
datamash        command-line calculations (man)
Path    : /usr/bin/datamash
Version : 1.0.6
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h,--help
Repo    : Debian 8.7 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)

There is a disagreement about the headers and about group g2. I would tend to trust datamash, but you can redo the calculations to verify your answer. I ran it again after sorting the file, and again after interchanging the lines for g2, and got the same result.

Best wishes ... cheers, drl
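
The g2 disagreement can be checked by hand: the three g2 values in every column are 8, 7 and 10, so the middle value after sorting is 8. A minimal sketch of that check with standard tools follows; the helper name median_of is made up for illustration and is not part of either script in this thread.

```shell
#!/bin/sh
# Median of the numbers given as arguments (median_of is a hypothetical helper).
median_of() {
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }                      # values arrive already sorted
        END {
            if (NR == 0) exit 1
            mid = int(NR / 2)
            if (NR % 2) print v[mid + 1]                    # odd count: middle value
            else        print (v[mid] + v[mid + 1]) / 2     # even count: mean of the two middle values
        }'
}

median_of 8 7 10       # g2, any column
median_of 3 5 12 2     # g3, column s1
median_of 1 24 12 3    # g3, column s3
```

The three calls print 8, 4 and 7.5, which agrees with the datamash results for g2 and g3.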

RudiC · June 23, 2017, 4:54am

Try this for medians:

awk -F"\t" '
NR == 1
NR > 1          {for (i=2; i<=NF; i++) print $1, i, $i | "sort -k1,2 -k3bn > TMP"
                }

function PRMED()        {printf TFS "%s", MEDIAN
                         TFS = OFS
                         PRV2 = $2
                         CNT = 0
                        }

END             {close ("sort -k1,2 -k3bn > TMP")       # wait until sort has finished writing TMP
                 while (1 == getline < "TMP")   {if ($1 != PRV1)        {PRMED()
                                                                         printf TRS "%s", $1
                                                                         TRS = ORS
                                                                         PRV1 = $1
                                                                        }
                                                 if ($2 != PRV2)        {PRMED()
                                                                        }
                                                 M[++CNT] = $3
                                                 CH       = int (CNT / 2)
                                                 MEDIAN   = CNT%2?M[CH+1]:(M[CH]+M[CH+1])/2
                                                }
                }
END             {PRMED()
                 printf ORS
                }
' OFS="\t" file
name	s1	s2	s3	s4
g1	5	7	8	9
g2	8	8	8	8
g3	4	4	7.5	18
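
For comparison, the same medians can be computed in a single awk process without a temporary file or an external sort. This is only a sketch under stated assumptions: tab-separated input, one header line, and data small enough to hold in memory. The function names group_medians and median are invented for this example.

```shell
#!/bin/sh
# Per-group column medians in one awk process (hypothetical helper names).
# Rows of a group do not have to be adjacent; groups print in first-seen order.
group_medians() {
    awk -F '\t' -v OFS='\t' '
    # median of a[1..n]; sorts a in place with a simple insertion sort
    function median(a, n,    i, j, t) {
        for (i = 2; i <= n; i++) {
            t = a[i]
            for (j = i - 1; j >= 1 && a[j] + 0 > t + 0; j--) a[j + 1] = a[j]
            a[j + 1] = t
        }
        return (n % 2) ? a[(n + 1) / 2] : (a[n / 2] + a[n / 2 + 1]) / 2
    }
    NR == 1 { print; next }                         # pass the header through
    {
        if (!($1 in cnt)) order[++ngroups] = $1     # remember first-seen order
        n = ++cnt[$1]
        for (c = 2; c <= NF; c++) val[$1, c, n] = $c
        if (NF > ncols) ncols = NF
    }
    END {
        for (g = 1; g <= ngroups; g++) {
            name = order[g]
            line = name
            for (c = 2; c <= ncols; c++) {
                for (k = 1; k <= cnt[name]; k++) tmp[k] = val[name, c, k]
                line = line OFS median(tmp, cnt[name])
            }
            print line
        }
    }' "$@"
}

# Usage with the g2 and g3 rows from the question
printf '%s\t%s\t%s\t%s\t%s\n' \
    name s1 s2 s3 s4 \
    g2 8 8 8 8 \
    g2 7 7 7 7 \
    g2 10 10 10 10 \
    g3 3 12 1 24 \
    g3 5 5 24 48 \
    g3 12 3 12 12 \
    g3 2 3 3 3 | group_medians
```

The insertion sort keeps the sketch portable across awk implementations; with GNU awk, asort() could replace it.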

quincyjones · June 23, 2017, 6:43am

I am afraid there seems to be a small bug somewhere. Sometimes I get different outputs from the same input, and sometimes just the header.

RudiC · June 23, 2017, 9:02am

I'm afraid I can't help without sample data that reproduces the errors. Different outputs from identical input are highly improbable, btw ...
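
Intermittent or header-only output is a plausible symptom of reading a file before sort has finished writing it: awk runs a command opened with print | "cmd" concurrently and only waits for it on close() or at exit. This is a guess about the cause, not something confirmed in the thread. A self-contained sketch of the write, close, then read pattern, where sorted_via_pipe and the mktemp scratch file are hypothetical names used only here:

```shell
#!/bin/sh
# Write numbers through a sort pipe, then read the sorted file back in END.
# sorted_via_pipe and the scratch file are made up for this demonstration.
sorted_via_pipe() {
    tmpfile=$(mktemp) || return 1
    awk -v tmp="$tmpfile" '
    BEGIN { cmd = "sort -n > \"" tmp "\"" }
    { print $1 | cmd }
    END {
        close(cmd)      # wait for sort to exit; the file is complete after this
        while ((getline line < tmp) > 0)
            out = out (out == "" ? "" : " ") line
        print out
    }'
    rm -f "$tmpfile"
}

printf '10\n8\n7\n' | sorted_via_pipe
```

With the close() call in place the read always sees the full sorted file, so the example prints 7 8 10. Without it, the getline loop may run while the file is still empty or only partly written.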

quincyjones · June 23, 2017, 10:10am

No worries, and thank you for the help. I figured this out in R in an easier way.

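library(dplyr)
# read the matrix; the first line holds the column names
a <- read.table("input", header=TRUE)
# median of every sample column within each group
# (summarise_all stands in for the deprecated summarise_each/funs pair)
b <- a %>%
  group_by(name) %>%
  summarise_all(median, na.rm=TRUE)
# plain tab-separated output without quotes or row numbers
write.table(b, file="output", sep="\t", quote=FALSE, row.names=FALSE)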