If we look very closely at the desired output shown in post #1 in this thread and do a LOT of reading between the lines, the awk
program with the printf
piped through ex
helper script shown in post #3 is even further off the mark than RudiC noticed.
Like RudiC's code, the code shown in post #3 sums and prints every input field in the output. But the sample input contains 23 fields and the sample (desired) output only contains 18 fields??? So reading between the lines and assuming that the header for the desired 1st output field was supposed to be AVG_POP1
instead of VG_POP1
, we might guess that what is really wanted is to only calculate averages of the input fields that have headers that start with POP
. If we make that assumption and also assume that the heading for the output field should have the input field header with the string AVG_
prepended to it AND assume that the average should only count the data lines being averaged (as was done in RudiC's code but was not done in the code in post #3), we could come closer to getting what seems to be the desired output.
But, then we also note that the code in post #3 uses a <tab> character as the input field separator and there are no <tab>s in the sample input provided (so the code in post #3 only produces one output field).
So, if we modify the sample input data to be <tab> separated and assume that the desired output should also be <tab> separated instead of separating output fields with a single <space> character and not even including a <newline> terminator to the single partial line of output produced by the script in post #3 AND not just as sequences of <space>s as shown in the given in the desired output shown in post #1, we might try something more like:
awk '
BEGIN { #Set input and output field separators.
FS = OFS = "\t"
}
NR == 1 {
# Process input header line...
# Determine which input fields are to become output fields.
for(i = 1; i <= NF; i++)
if($i ~ /^POP/) {
# Add an entry to POP[] for each field number to be
# processed, and set "last" to the field number of the
# last field number to be processed.
POP[last = i]
}
# Print output file header.
for(i = 1; i <= NF; i++)
if(i in POP)
printf("AVG_%s%s", $i, (i == last) ? ORS : OFS)
}
NR > 1 {# Loop through each input field...
for(i = 1; i <= NF; i++)
# Is this an input field to be processed...
if(i in POP)
# Yes. Keep a running total of values in this column.
total += $i
}
END { # We have hit the EOF on the input...
# Calculate and print the results for the selected fields...
for(i = 1; i <= NF; i++)
if(i in POP)
printf("%9.2f%s", total / (NR - 1),
(i == last) ? ORS : OFS)
}' file
Using %9.2f
as the output format for the calculated values makes the field values line up with the output field headers. It isn't obvious whether or not this was required by the sample output provided in post #1. The above code produces the output:
Output produced by above code:
AVG_POP1 AVG_POP2 AVG_POP3 AVG_POP4 AVG_POP5 AVG_POP6 AVG_POP7 AVG_POP8 AVG_POP9 AVG_POP10 AVG_POP11 AVG_POP12 AVG_POP13 AVG_POP14 AVG_POP15 AVG_POP16 AVG_POP17 AVG_POP18
0.17 0.16 0.09 0.61 0.17 0.17 0.19 0.02 0.02 0.10 0.10 0.06 0.30 0.10 0.11 0.11 0.02 0.01
Output requested in post #1 (with columns aligned):
VG_POP1 AVG_POP2 AVG_POP3 AVG_POP4 AVG_POP5 AVG_POP6 AVG_POP7 AVG_POP8 AVG_POP9 AVG_POP10 AVG_POP11 AVG_POP12 AVG_POP13 AVG_POP14 AVG_POP15 AVG_POP16 AVG_POP17 AVG_POP18
0.17 0.16 0.09 0.61 0.17 0.16 0.19 0.02 0.02 0.10 0.10 0.06 0.30 0.11 0.11 0.11 0.02 0.01
Output produced by code in post #3:
7.600000 0.000000 0.000000 0.132000 0.126000 0.072000 0.492000 0.138000 0.132000 0.154000 0.018000 0.012000 0.000000 0.000000 0.082000 0.082000 0.046000 0.244000 0.084000 0.086000 0.088000 0.018000 0.008000
The values shown in red differ from the desired sample output (with fields aligned for comparison purposes) in post #1 by ±.01
for the AVG_POP6
and AVG_POP14
fields shown in red above; all of the other values exactly match the desired output shown on the last line of the above comparison.
If you strip out the values shown in orange in the last line of the output (corresponding to fields that have been deleted from the output in the other two sets of output shown), the values given by the code in post #3 are about 80% of the desired values (as I would expect since that code is calculating the average by dividing the sum of four numeric fields by five instead of dividing the sum of four numeric fields by four).
I hope this helps. It would certainly be a lot easier to come up with code like the above if the description of the problem matched the desired output a lot closer.