averaging column values with awk

johnmillsbro · January 25, 2009, 10:46pm

Hello. Im just starting to learn awk so hang in there with me...I have a large text file formatted as such everything is in a single column

ID001
value 1
value 2
value....n
ID002
value 1
value 2
value... n

I want to be able to calculate the average for values for each ID from the whole column (in total there are 25000 IDs) and the n= anywhere from 100 to 2000

thanks in advance.
John

rubin · January 26, 2009, 12:08am

Assuming the same file format as described above:

$ cat file
ID001
1
2
3
4
5
ID002
22
33
44
55
66
77
ID003
6
7
19
1

awk  'NR==1{id=$0; next} /^ID[0-9]+$/{print id, s/n; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, s/n}' file

Output:

ID001 3
ID002 49.5
ID003 8.25

johnmillsbro · January 26, 2009, 9:46am

Thanks Rubin..I think I left out some info though. Your script works great but not for my data...I think the reason is...

1.) My IDs are not linear (i.e. ID001, ID002..rather random ID005, ID001, ID999)

2.) My IDs are not ID001 rather BC followed by a random 6 digits such ..BC000601, BC015656, etc.

3.) All the data is stored in a file called data.txt
such the format is
BC001061
56.66
51.1
12.1223
68
..n
BC567123
1
15.6
12.111
..n
etc etc

Also I am using cygwin and presumably gnu awk

thanks again. I hope that helps
cheers
John

vgersh99 · January 26, 2009, 10:05am

awk  'NR==1{id=$0; next} /^BC[0-9]+$/{print id, s/n; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, s/n}' data.txt

johnmillsbro · January 26, 2009, 10:14am

yes...I tried it...but I get a fatal error; division by zero attempted.FNR=288

Line 288 happens to be the line with the ID for the second set of data in the column...
any ideas?
thanks again..in advance.
-J

vgersh99 · January 26, 2009, 10:37am

awk  'NR==1{id=$0; next} /^BC[0-9]+$/{print id, (n) ? s/n : "NA"; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, (n) ? s/n : "NA"}' data.txt

johnmillsbro · January 26, 2009, 10:46am

Thanks....that removed the error..but I still need to figure out how to get the average for each ID...

johnmillsbro · January 26, 2009, 10:48am

it gives an output for some of the IDs..but most recieve NA..?
thanks

otheus · January 26, 2009, 10:49am

Since they're not ordered linearly, you can pre-sort them, or use awk arrays:

awk '/^BC[0-9]+$/ {id=$1; next;} { sum[id]+=$0; count[id]++; } 
  END {   for (id in sum)  print id, sum[id]/count[id]; }'

johnmillsbro · January 26, 2009, 10:52am

even if I pre-sort them...the ID numbers are not continous. i.e. (BC100100, BC100199, etc)

-J

vgersh99 · January 26, 2009, 10:55am

please post a sample file

johnmillsbro · January 26, 2009, 11:06am

here is the beginning of the file i am attempting to calculate averages from. This is just the 1st couple of entries...in total there are 25,000 BCxxxxxx entries.

thanks

vgersh99 · January 26, 2009, 11:14am

you have floating point numbers - need to accommodate a regex for that:

awk  '
        # for the very FIRST line (NR==1) in a file, assign the entire record ($0) to a variable "id".
        # then proceed to the next input line ("next")
        NR==1{id=$0; next} 

        # if a line starts (^) with "BC" and is followed by one or more (+) numbers "[0-9]"
        # output the value of a variable "id", followed by:
        #     if "n" is non-zero, divide '"s" by "n"
        #     if "n" is 0, output string "NA"
        # "s=n=0" - assign "0" to "s" and "n"
        # assign a current record/line ($0) to variable "id"
        /^BC[0-9]+$/{print id, (n) ? s/n : "NA"; s=n=0; id=$0}

        # if a line starts with one or more (+) numbers ([0-9]) optionally followed
        # by zero or more (*) numbers ([0-9]) or a dot (.)...
        #    calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
        #    increment the running counter for records associated with a current "id": n++
        /^[0-9]+[.0-9]*$/{s+=$0; n++} 

# at the END of processing the entire file, we still have the LAST "id" no printed out
# print the "id" value AND its average as described above.
END{print id, (n) ? s/n : "NA"}' data.txt

johnmillsbro · January 26, 2009, 11:27am

thanks. that appears to work. Could you explain the code in detail so I fully grasp that? Im not sure if I understand floating numbers? Also. Now if I wanted then to take the average for each BCxxxxxx ID and subtract the mean from each number would that be equally difficult?
thanks a million. this is all new too me...like learning chinese...

-J

vgersh99 · January 26, 2009, 11:47am

I put comments in the last post with the code - hope it helps.
How do you define "mean" and what's the format of your desired output?

johnmillsbro · January 26, 2009, 11:51am

mean= average. The format would look like the original attachment. But each value would have the average for that set of data substracted out of each value
BC156041
56 subtract (avg all values for BC156041)
45 subtract (avg all values for BC156041)

johnmillsbro · January 26, 2009, 11:54am

mean= average. The format would look like the original attachment. But each value would have the average for that set of data substracted out of each value
BC156041
56 subtract (avg all values for BC156041)
45 subtract (avg all values for BC156041)
n.. subtract (avg of n values for BC156041)
BC056472
12 subtract (avg all values for BC056472)
45 subtract (avg all values for BC056472)
n.. subtract (avg all values for BC056472)
etc etc

so that the output looks identical to the input except the average for each data set has been subtacted from each original data value. i.e. the average for all IDs is being set to zero. by subtracting the mean for each ID we are zeroing the average...

are you following me?
thanks you are great help..

johnmillsbro · January 26, 2009, 11:56am

i.e
for input
BC111111
4
8
12
BC555555
2
4
6

output
BC111111
-4
0
4
BC555555
-2
0
2

vgersh99 · January 26, 2009, 12:20pm

assuming 'mean = value - average'....

I also put the 'average' for every 'BC' for the ease of validating - you can change

FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0, arr[id]}

TO

FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0}

in your final version.

awk -f john.awk data.txt data.txt

john.awk:

        # for the very FIRST line (NR==1) in a file, assign the entire record ($0) to a variable "id".
        # then proceed to the next input line ("next")
        FNR==1 && NR==1{id=$0; next} 

        # if a line starts (^) with "BC" and is followed by one or more (+) numbers "[0-9]"
        # output the value of a variable "id", followed by:
        #     if "n" is non-zero, divide '"s" by "n"
        #     if "n" is 0, output string "NA"
        # "s=n=0" - assign "0" to "s" and "n"
        # assign a current record/line ($0) to variable "id"
        (FNR==NR || FNR==1) && /^BC[0-9]+$/ {arr[id]= (n) ? s/n : "NA"; s=n=0; id=$0}

        # if a line starts with one or more (+) numbers ([0-9]) optionally followed
        # by zero or more (*) numbers ([0-9]) or a dot (.)...
        #    calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
        #    increment the running counter for records associated with a current "id": n++
        FNR==NR && /^[0-9]+[.0-9]*$/{s+=$0; n++}


        FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0, arr[id]}
        FNR!=NR && /^[0-9]+[.0-9]*$/{print $0, $0 - arr[id] }