Extract means in bash

Thomthom · August 2, 2022, 8:25pm

Hello,

I would like to compute the average of the values (column 2) for each group (column 1),
then display the name of the fruit (column 3) whose value is closest to the group mean.
I tried to simplify the problem.

group.txt

#Group Value Fruit
  1   8   Orange
  1   6.5   Banana
  1   6.2   Apple 
  1   12   Apricot
  1   7   Blackberry 
 
  2   4   Apple
  2   6   Banana
  2   6   Apricot
  2   3   Blackberry

(8 + 6.5 + 6.2 + 12 + 7) / 5 = 7.94
The fruit closest to the mean is Orange for group 1.
(4 + 6 + 6 + 3) / 4 = 4.75
The fruit closest to the mean is Apple for group 2.

I'm scripting a bit in bash but now I don't know where to start. Does anyone have an idea? ' -_-

vgersh99 · August 2, 2022, 8:51pm

welcome to the community, @Thomthom !
Could you start by mentioning your OS, pls.
Also, if you have gawk installed, pls mention its version: gawk --version.

I'd start with gawk as it has the associative array capability builtin and also has the basic math functions that you might need.

Start with calculating avg for each group.

Take the following as a starting point for calculating avg for each group for your sample file:
awk -f thom.awk myInputFile where thom.awk is:

FNR > 1 && NF {
   groupSum[$1]+=$2
   groupCnt[$1]++
}
END {
   for( i in groupSum)
       printf("%s [%.2f]\n", i,groupSum[i]/groupCnt[i])
}

yields:

1 [7.94]
2 [4.75]

Hopefully this is NOT homework...

Thomthom · August 2, 2022, 9:02pm

Thank you very much for your answer!

gawk, awk, sed, grep, or bash are all OK.

But do you know how to write: The fruit closest to the mean is Orange for group 1.

My gawk version : GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

vgersh99 · August 2, 2022, 9:04pm

I do, but I'll let you experiment...

Thomthom · August 2, 2022, 9:07pm

I think that each value should be compared to the average and the one with the smallest difference taken, but the formatting is very complicated for me.

vgersh99 · August 2, 2022, 9:16pm

That's right. You'll need to add another array (possibly) indexed by a group and a fruit with the value of "Value" for each cell. And then find "the closest" to the avg.
You can subtract "value" from the "avg" and take the absolute value of the difference (as it can be negative) to find the closest.
Look at the sample starting point code I've provided and try to enhance it based on the above algorithm.
Let us know how it goes and where/if you get stuck.
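
A minimal sketch of the steps above, assuming the input looks like group.txt (header on line 1, whitespace-separated Group, Value, Fruit columns). The function name closest_to_mean and the arrays order, value and fruit are illustrative names, not from the thread. It extends the starting-point script by remembering every row, then picking the smallest absolute difference per group. On an exact tie it keeps the first row in file order.

```shell
#!/usr/bin/env bash
# closest_to_mean FILE  (hypothetical helper name, used only for this sketch)
closest_to_mean() {
  awk '
    # Skip the header line and blank lines, as in the starting-point script.
    FNR > 1 && NF {
      g = $1
      if (!(g in groupCnt)) order[++ng] = g   # remember groups in file order
      groupSum[g] += $2
      n = ++groupCnt[g]
      value[g, n] = $2                        # keep each row for the second step
      fruit[g, n] = $3
    }
    END {
      for (k = 1; k <= ng; k++) {
        g = order[k]
        avg = groupSum[g] / groupCnt[g]
        best = 0
        for (i = 1; i <= groupCnt[g]; i++) {
          diff = value[g, i] - avg
          if (diff < 0) diff = -diff          # absolute difference
          if (best == 0 || diff < bestDiff) { best = i; bestDiff = diff }
        }
        printf("The fruit closest to the mean is %s for group %s.\n", fruit[g, best], g)
      }
    }
  ' "$1"
}
```

Called as closest_to_mean group.txt on the sample file, this should print Orange for group 1 (difference 0.06 from 7.94) and Apple for group 2 (difference 0.75 from 4.75).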

Once again: is this homework?
No further assistance will be provided unless the above is clarified.

munkeHoller · August 2, 2022, 9:16pm

What if there are multiple matches or no matches?

Thomthom · August 2, 2022, 9:37pm

No, it's not homework. I work in the pharmaceutical industry. I used simple numbers here, but the real values have three decimal places.

@munkeHoller: normally this should not happen; if it does, take the first one in alphabetical order.
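
One possible way to apply that tie rule, sketched as a two-pass awk script that reads the file twice (sums first, comparison second). The name closest_with_ties and the tolerance eps are assumptions made for this sketch, not something from the thread. The tolerance treats two differences as equal when they are within 1e-9 of each other, which seems reasonable for three-decimal data. The name comparison is a plain case-sensitive string comparison, so uppercase names sort before lowercase ones in the usual byte ordering.

```shell
#!/usr/bin/env bash
# closest_with_ties FILE  (hypothetical helper name, used only for this sketch)
closest_with_ties() {
  awk -v eps=1e-9 '
    FNR == 1 || !NF { next }                       # skip header and blank lines in both passes
    NR == FNR { sum[$1] += $2; cnt[$1]++; next }   # pass 1: per-group sum and count
    {                                              # pass 2: compare each row to its group mean
      g = $1
      d = $2 - sum[g] / cnt[g]
      if (d < 0) d = -d
      if (!(g in best)) {
        order[++ng] = g; best[g] = $3; bestDiff[g] = d
      } else if (d < bestDiff[g] - eps) {
        best[g] = $3; bestDiff[g] = d              # strictly closer
      } else if (d <= bestDiff[g] + eps && ($3 "") < (best[g] "")) {
        best[g] = $3                               # tie: keep the alphabetically first name
      }
    }
    END {
      for (k = 1; k <= ng; k++)
        printf("The fruit closest to the mean is %s for group %s.\n", best[order[k]], order[k])
    }
  ' "$1" "$1"
}
```

The file name is passed twice on purpose: NR == FNR is only true while awk reads the first copy, so nothing has to be stored per row.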

vgersh99 · August 2, 2022, 10:18pm

Awesome.
Give it a whirl and we'll try to help you if/when you get stuck!

Thomthom · August 2, 2022, 10:21pm

" You'll need to add another array (possibly) indexed by a group and a fruit with the value of "Value" for each cell. And then find "the closest" to the avg. "

I can't work out how to turn that into code at all.

DiscourseDuck · August 2, 2022, 11:43pm

Homework / school work.