Compute average based on field values

I'm looking for a way to average the value in field 14 (when field 2 is equal to 2016) together with the values in fields 3 and 4 (when field 2 is equal to 2017).

Any help is appreciated.

001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83

(40.15 + 42.04 + 44.02) / 3

Expected output

001001  42.07

After six and a half years as a member and with more than 130 posts, some of which deal with average handling / calculation, one would presume you have at least some idea of an approach. So - any attempts from your side?


Using the example that RudiC linked to - I know how to get the average of fields 3 and 4, but I'm not sure how to include field 14 in the calculation. The script below is not clean, but it appears to work.

 awk 'NR == FNR {if ($2==2017) {SUM[$1] += ($3 + $4)/2; CNT[$1]++}; next} {print $0, SUM[$1]/CNT[$1]}' OFS="\t" OFMT="%.2f" file file
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15   43.03
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83   43.03
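
For readability, here is the same one-liner spread over several lines with comments (a sketch only; the logic should be unchanged):

awk '
NR == FNR {                          # first pass over the file
    if ($2 == 2017) {
        SUM[$1] += ($3 + $4) / 2     # per-key average of fields 3 and 4 on 2017 lines
        CNT[$1]++
    }
    next
}
{ print $0, SUM[$1] / CNT[$1] }      # second pass: append that average to every line
' OFS="\t" OFMT="%.2f" file file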

Does it? For your 2017 line, yes, but not for your specified task across the entire data file. To implement that, you need to tell us:

  • are the records sorted in ascending / descending order?
  • are there more / other lines, on top of 2016 or 2017 lines, like 2015 or 2018? Should those be ignored?

Try

awk '$2==2016 {SUM[$1]+=$14; CNT[$1]++} $2==2017 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}' file

If there are no other data / years to be found, this can be simplified. The order of data is not conserved.
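
If the original order of the field 1 keys matters, one option (just a sketch, remembering the order in which keys first appear) would be

awk '$2==2016    {SUM[$1] += $14; CNT[$1]++}
     $2==2017    {SUM[$1] += $3 + $4; CNT[$1] += 2}
     !SEEN[$1]++ {ORD[++N] = $1}                        # note first appearance of each field 1 key
     END         {for (i=1; i<=N; i++)
                      if (CNT[ORD[i]]) print ORD[i], SUM[ORD[i]]/CNT[ORD[i]]}' file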

EDIT: IF there's 2016 & 2017 only, try

awk '
        {SUM[$1] += ($2==2016 ? $14 : $3+$4)    # 2016 line: field 14; 2017 line: fields 3 + 4
         CNT[$1] += $2-2015                     # adds 1 for a 2016 line, 2 for a 2017 line
        }

END     {for (s in SUM) print s, SUM[s]/CNT[s]
        }

' file
001001 42.07

The records are sorted by field 2 then by field 1.

001001  2015    32.97   30.74   48.52   56.88   60.67   69.22   72.41   69.48   64.44   54.18   49.64   47.21
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
001001  2018    30.78   -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99
001002  1900    39.87   34.00   49.69   56.97   63.68   71.04   73.20   73.63   71.76   54.52   47.71   40.05
001002  1901    40.68   43.45   49.51   60.91   69.08   71.44   73.69   73.94   67.93   57.72   52.77   41.38
001002  1902    38.23   47.23   58.32   56.21   60.51   72.48   73.54   72.19   67.95   60.04   49.51   43.66

Your script

awk '$2==2016 {SUM[$1]+=$14; CNT[$1]++} $2==2017 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}'

works for a 3-value computation ((2016 field 14 + 2017 fields 3 and 4) / 3)

Output

001001 42.07

but once you add more values, the calculation is not correct.

For instance

awk '$2>=2015 {SUM[$1]+=$14; CNT[$1]++} $2<=2018 {SUM[$1]+=$3+$4; CNT[$1]+=2} END {for (s in SUM) print s, SUM[s]/CNT[s]}'

Output

001001 35.2234

I could not replicate the end result in Excel. Perhaps it's the ordering/sorting? There are additional lines on top of 2015, so that could be an issue. They should not be ignored.

They are not. It's field 1, then field 2.

How can that be "not correct"? You didn't specify what to do for field 2 values other than 2016 and 2017, so "ignore" was assumed. With your NEW sample data, the proposal given yields

001001 42.07
- exactly what was requested.

With your modified code, several fields will be counted more than once, falsifying the average.
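
For instance, the 2016 record satisfies both $2>=2015 and $2<=2018, so both action blocks run on it: its field 14 and its fields 3 and 4 all enter the same sum, and CNT grows by 3 for that single line. A quick diagnostic (a sketch, nothing more) makes the overlap visible:

awk '{print $1, $2, "matches", ($2>=2015) + ($2<=2018), "of the two conditions"}' file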

You are correct. The data is sorted by field 1, then field 2.

Can the code be modified to work across multiple years, for instance for values between 2015 and 2018? Much like the script you helped me with here. It seems like you would just divide by the number of fields examined.

Certainly. IF you define HOW values need to be treated.

For simplicity's sake, I originally wanted to average the output for 2016 and 2017, i.e. (40.15 + 42.04 + 44.02) / 3.
I thought I would be able to simply expand it out to cover additional lines. There are hundreds prior to 2018. Say, for instance, for 2015 through 2018 the output would be something like:

001001  2015
001001  2016  39.54
001001  2017  42.07
001001  2018  37.87

Note that there would not be an average for 2015, since we are starting with field 14 and ending with fields 3 and 4 of the following line.

Input

001001  2015    32.97   30.74   48.52   56.88   60.67   69.22   72.41   69.48   64.44   54.18   49.64   47.21
001001  2016    33.22   38.19   48.07   51.75   59.77   67.68   70.86   72.21   66.92   53.67   42.31   40.15
001001  2017    42.04   44.02   47.01   54.81   57.97   67.21   70.88   70.11   63.86   55.80   42.96   37.83
001001  2018    30.78   45.00  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99  -99.99

How about describing the problem correctly and entirely from the beginning? That could have saved you and me quite some time.
For your new specification try

awk 'NR > 1 {print $2, (LV + $3 + $4) / 3} {LV = $14}' file
2016 39.54
2017 42.07
2018 37.87
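
If the output should also carry field 1, and if the full file holds several field 1 keys (001001, 001002, ...), a slightly extended sketch (assuming values must not be mixed across different field 1 keys) would be

awk 'NR > 1 && $1 == ID {print $1, $2, (LV + $3 + $4) / 3}  # no previous line exists for the very first record
     {ID = $1; LV = $14}                                    # remember key and field 14 for the next line
    ' file

which, for the four 001001 lines above, prints

001001 2016 39.54
001001 2017 42.07
001001 2018 37.87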

Thanks. This makes sense aside from the "NR > 1" part. Does that make it start 1 line above?