Averaging Data From Multiple Columns, Using Header if Possible

mikey11415 · January 4, 2011, 10:07am

Hi,

I have a file with multiple tab delimited columns and I would like to have the average of each column:

Iteration	Tree No	Lh	HMean	
1000	1	-78.834717	-78.834717	
1100	1	-77.991031	-78.624046	
1200	1	-79.416055	-78.761861	
1300	1	-79.280494	-78.968099	
1400	1	-82.846275	-80.808696	
1500	1	-81.167562	-81.306918	
1600	1	-81.217691	-81.290207	
1700	1	-80.269179	-81.241736

I have been trimming off the header row (using emacs and a cut, please let me know if there is a one liner for that!!!) and then just manually using awk to get the average of each column individually, e.g.:

awk '{sum+=$3} END { print "Average = ",sum/NR}' FILENAME

BUT...I wonder if there is a way to
a) get the average of each of the columns in one go and
b) use the headers to create a report that was something like:

Iteration	Tree No	Lh	HMean	
AVGC1	AVGC2	AVGC3	AVGC4

or in a single column format is fine, too:

Iteration
AVGC1
Tree No
AVGC2
...etc

In my final file, I will have many many more columns, and the number of columns will be different in the different files. Also, there will be no missing data cells. The file will be complete. Thanks for any suggestions!!!

All the best,

Mikey

m.d.ludwig · January 4, 2011, 10:23am

Is the "tree no" field significant? Assuming it is not:

awk '
   BEGIN { FS = "\t"; }
   NR == 1 { print; next; }
  { for (i = 1; i <= NF; i++) { sum += $i; } }
  END {
    printf "\t";
    for (i = 1; i <= NF; i++) { printf "\t%.4g", sum / (NR - 1); }
    print "";
  }
' inputfile...

bartus11 · January 4, 2011, 10:27am

Try:

awk 'NR==1;NR>1{for (i=1;i<=NF;i++){a+=$i}}END{for (i=1;i<=NF;i++){printf a/(NR-1)" "};printf "\n"}' file

mikey11415 · January 4, 2011, 11:06am

Hi, Thanks!
Well, the first two columns, Iteration and Tree No are actually both not significant to me. But, I figured that whatever code would be easiest if it could be applied to all columns. Thanks for the code, but on the little fake data file I posted above, the output is

No
-80.1279
Lh
-79.9795

That does not take the average from the first two columns (which is fine), but it seems to take the 3rd and 4th word from the headers, not the third and fourth tab delim column, which would be Lh and HMean. Any suggestions for fine tuning?

Mikey

---------- Post updated at 10:40 AM ---------- Previous update was at 10:36 AM ----------

Oh, that is amazing. I only need one sort of modification, if it is possible: can the numbers be tab delimited in the output? I suppose that I could split the report into two files, replace the spaces in the average line with tabs and then cat. But is there a more elegant way?

and THANKS!

---------- Post updated at 11:06 AM ---------- Previous update was at 10:40 AM ----------

OMG
I did it myself and added the TABS!
I am a self taught unix genius!
But seriously, THANK you for your help!

Here is my addition to your fab code:

awk 'NR==1;NR>1{for (i=1;i<=NF;i++){a+=$i}}END{for (i=1;i<=NF;i++){printf a/(NR-1)"\t"};printf "\n"}'

Chubler_XL · January 4, 2011, 5:07pm

You can probably work it out yourself (now your a self taught unix genius, lol), but below in red are are the changes required to skip the first two columns:

awk 'NR==1;NR>1{for (i=3;i<=NF;i++){a+=$i}}END{ printf "\t\t"; for (i=3;i<=NF;i++){printf a/(NR-1)"\t"};printf "\n"}'

[/quote]