Hello. I'm just starting to learn awk, so bear with me. I have a large text file formatted like this, with everything in a single column:
ID001
value 1
value 2
value....n
ID002
value 1
value 2
value... n
I want to calculate the average of the values for each ID across the whole column. In total there are 25,000 IDs, and n ranges anywhere from 100 to 2,000.
Here is the beginning of the file I am trying to calculate averages from. This is just the first couple of entries; in total there are 25,000 BCxxxxxx entries.
You have floating-point numbers, so the regex needs to accommodate them:
awk '
# for the very FIRST line (NR==1) in a file, assign the entire record ($0) to a variable "id".
# then proceed to the next input line ("next")
NR==1{id=$0; next}
# if a line starts (^) with "BC" and is followed by one or more (+) numbers "[0-9]"
# output the value of a variable "id", followed by:
# if "n" is non-zero, divide "s" by "n"
# if "n" is 0, output string "NA"
# "s=n=0" - assign "0" to "s" and "n"
# assign a current record/line ($0) to variable "id"
/^BC[0-9]+$/{print id, (n) ? s/n : "NA"; s=n=0; id=$0}
# if a line starts with one or more (+) numbers ([0-9]) optionally followed
# by zero or more (*) numbers ([0-9]) or a dot (.)...
# calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
# increment the running counter for records associated with a current "id": n++
/^[0-9]+[.0-9]*$/{s+=$0; n++}
# at the END of processing the entire file, we still have the LAST "id" not printed out
# print the "id" value AND its average as described above.
END{print id, (n) ? s/n : "NA"}' data.txt
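To see what the value regex accepts, here is a small sketch that runs it over a few made-up lines (the sample inputs are only for illustration, they are not from your file). It labels each line as a value or as something else:

```shell
# Feed a few hypothetical lines through the same regex used for values above.
matches=$(printf '%s\n' 56 45.25 BC156041 -3.5 .5 1.2.3 |
awk '
# lines made of digits, optionally followed by more digits or dots
/^[0-9]+[.0-9]*$/ { print "value:", $0; next }
# everything else (ID lines, negative numbers, a leading dot)
                  { print "other:", $0 }
')
printf '%s\n' "$matches"
```

Whole numbers such as 56 and decimals such as 45.25 both count as values. Note the limits of this pattern: -3.5 and .5 are not matched, while 1.2.3 is matched even though it is not a valid number. If your data can contain negative values, the regex would need to be extended.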
Thanks, that appears to work. Could you explain the code in detail so I fully grasp it? I'm not sure I understand floating-point numbers. Also, if I then wanted to take the average for each BCxxxxxx ID and subtract that mean from each number, would that be equally difficult?
Thanks a million. This is all new to me... like learning Chinese...
Mean = average. The format would look like the original attachment, but each value would have the average for its set of data subtracted from it:
BC156041
56 subtract (avg all values for BC156041)
45 subtract (avg all values for BC156041)
n.. subtract (avg of n values for BC156041)
BC056472
12 subtract (avg all values for BC056472)
45 subtract (avg all values for BC056472)
n.. subtract (avg all values for BC056472)
etc etc
So the output looks identical to the input, except that the average for each data set has been subtracted from each original data value, i.e. the average for every ID is set to zero. By subtracting the mean for each ID we are zeroing the average.
awk '
# Pass the same file twice: the first pass (FNR==NR) computes the averages,
# the second pass (FNR!=NR) prints each value minus the average of its "id".
# for the very FIRST line of the first pass, assign the entire record ($0) to a variable "id".
# then proceed to the next input line ("next")
FNR==1 && NR==1{id=$0; next}
# first pass, or the first line of the second pass: a line that starts (^) with "BC"
# followed by one or more (+) digits "[0-9]" closes the previous "id":
# store its average in arr[id] (s/n if "n" is non-zero, otherwise the string "NA"),
# "s=n=0" - assign "0" to "s" and "n", then assign the current record/line ($0) to "id".
# The FNR==1 case stores the average of the LAST "id" of the first pass.
(FNR==NR || FNR==1) && /^BC[0-9]+$/ {arr[id]= (n) ? s/n : "NA"; s=n=0; id=$0}
# first pass: if a line starts with one or more (+) digits ([0-9]) optionally followed
# by zero or more (*) digits ([0-9]) or dots (.)...
# calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
# increment the running counter for records associated with the current "id": n++
FNR==NR && /^[0-9]+[.0-9]*$/{s+=$0; n++}
# second pass: print each ID line followed by its average
FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0, arr[id]}
# second pass: print each value followed by the value minus the average of its "id"
FNR!=NR && /^[0-9]+[.0-9]*$/{print $0, $0 - arr[id] }' data.txt data.txt
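If you would rather not read the file twice, one alternative (a sketch of a different approach, not the script above) is to buffer the values of the current ID and print them once the next ID line arrives. The function name emit_group and the array val are names I made up for this example, and the sample input reuses the illustrative values from the post. It assumes every ID line looks like BCxxxxxx and that the values for one ID fit comfortably in memory, which should hold for up to 2000 values per ID:

```shell
# Single pass: collect the values for one ID, then print them minus their mean.
centered=$(printf '%s\n' BC156041 56 45 BC056472 12 45 |
awk '
# Print the buffered group: the ID line, then each value minus the group mean.
function emit_group(   i, mean) {
    if (id == "") return            # nothing buffered yet
    print id
    if (n > 0) {
        mean = s / n
        for (i = 1; i <= n; i++) print val[i] - mean
    }
    s = n = 0                       # reset for the next ID
}
# a new ID line closes the previous group and starts a new one
/^BC[0-9]+$/      { emit_group(); id = $0; next }
# a value line: remember it and add it to the running sum
/^[0-9]+[.0-9]*$/ { val[++n] = $0; s += $0 }
# the last group is still buffered at end of input
END               { emit_group() }
')
printf '%s\n' "$centered"
```

For the sample input this prints BC156041 followed by 5.5 and -5.5, then BC056472 followed by -16.5 and 16.5, so the output has the same layout as the input and each group now averages to zero. To run it on a real file, replace the printf pipeline with the file name after the closing quote of the awk program.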