averaging column values with awk

Hello. Im just starting to learn awk so hang in there with me...I have a large text file formatted as such everything is in a single column

ID001
value 1
value 2
value....n
ID002
value 1
value 2
value... n

I want to be able to calculate the average for values for each ID from the whole column (in total there are 25000 IDs) and the n= anywhere from 100 to 2000

thanks in advance.
John

Assuming the same file format as described above:

$ cat file
ID001
1
2
3
4
5
ID002
22
33
44
55
66
77
ID003
6
7
19
1
awk  'NR==1{id=$0; next} /^ID[0-9]+$/{print id, s/n; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, s/n}' file

Output:

ID001 3
ID002 49.5
ID003 8.25

Thanks Rubin..I think I left out some info though. Your script works great but not for my data...I think the reason is...

1.) My IDs are not linear (i.e. ID001, ID002..rather random ID005, ID001, ID999)

2.) My IDs are not ID001 rather BC followed by a random 6 digits such ..BC000601, BC015656, etc.

3.) All the data is stored in a file called data.txt
such the format is
BC001061
56.66
51.1
12.1223
68
..n
BC567123
1
15.6
12.111
..n
etc etc

Also I am using cygwin and presumably gnu awk

thanks again. I hope that helps
cheers
John

awk  'NR==1{id=$0; next} /^BC[0-9]+$/{print id, s/n; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, s/n}' data.txt

yes...I tried it...but I get a fatal error; division by zero attempted.FNR=288

Line 288 happens to be the line with the ID for the second set of data in the column...
any ideas?
thanks again..in advance.
-J

awk  'NR==1{id=$0; next} /^BC[0-9]+$/{print id, (n) ? s/n : "NA"; s=n=0; id=$0}
      /^[0-9]+$/{s+=$0; n++} END{print id, (n) ? s/n : "NA"}' data.txt

Thanks....that removed the error..but I still need to figure out how to get the average for each ID...

it gives an output for some of the IDs..but most recieve NA..?
thanks

Since they're not ordered linearly, you can pre-sort them, or use awk arrays:

awk '/^BC[0-9]+$/ {id=$1; next;} { sum[id]+=$0; count[id]++; } 
  END {   for (id in sum)  print id, sum[id]/count[id]; }'

even if I pre-sort them...the ID numbers are not continous. i.e. (BC100100, BC100199, etc)

-J

please post a sample file

here is the beginning of the file i am attempting to calculate averages from. This is just the 1st couple of entries...in total there are 25,000 BCxxxxxx entries.

thanks

you have floating point numbers - need to accommodate a regex for that:

awk  '
        # for the very FIRST line (NR==1) in a file, assign the entire record ($0) to a variable "id".
        # then proceed to the next input line ("next")
        NR==1{id=$0; next} 

        # if a line starts (^) with "BC" and is followed by one or more (+) numbers "[0-9]"
        # output the value of a variable "id", followed by:
        #     if "n" is non-zero, divide '"s" by "n"
        #     if "n" is 0, output string "NA"
        # "s=n=0" - assign "0" to "s" and "n"
        # assign a current record/line ($0) to variable "id"
        /^BC[0-9]+$/{print id, (n) ? s/n : "NA"; s=n=0; id=$0}

        # if a line starts with one or more (+) numbers ([0-9]) optionally followed
        # by zero or more (*) numbers ([0-9]) or a dot (.)...
        #    calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
        #    increment the running counter for records associated with a current "id": n++
        /^[0-9]+[.0-9]*$/{s+=$0; n++} 

# at the END of processing the entire file, we still have the LAST "id" no printed out
# print the "id" value AND its average as described above.
END{print id, (n) ? s/n : "NA"}' data.txt

thanks. that appears to work. Could you explain the code in detail so I fully grasp that? Im not sure if I understand floating numbers? Also. Now if I wanted then to take the average for each BCxxxxxx ID and subtract the mean from each number would that be equally difficult?
thanks a million. this is all new too me...like learning chinese...

-J

I put comments in the last post with the code - hope it helps.
How do you define "mean" and what's the format of your desired output?

mean= average. The format would look like the original attachment. But each value would have the average for that set of data substracted out of each value
BC156041
56 subtract (avg all values for BC156041)
45 subtract (avg all values for BC156041)

mean= average. The format would look like the original attachment. But each value would have the average for that set of data substracted out of each value
BC156041
56 subtract (avg all values for BC156041)
45 subtract (avg all values for BC156041)
n.. subtract (avg of n values for BC156041)
BC056472
12 subtract (avg all values for BC056472)
45 subtract (avg all values for BC056472)
n.. subtract (avg all values for BC056472)
etc etc

so that the output looks identical to the input except the average for each data set has been subtacted from each original data value. i.e. the average for all IDs is being set to zero. by subtracting the mean for each ID we are zeroing the average...

are you following me?
thanks you are great help..

i.e
for input
BC111111
4
8
12
BC555555
2
4
6

output
BC111111
-4
0
4
BC555555
-2
0
2

assuming 'mean = value - average'....

I also put the 'average' for every 'BC' for the ease of validating - you can change

FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0, arr[id]}

TO

FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0}

in your final version.

awk -f john.awk data.txt data.txt

john.awk:

        # for the very FIRST line (NR==1) in a file, assign the entire record ($0) to a variable "id".
        # then proceed to the next input line ("next")
        FNR==1 && NR==1{id=$0; next} 

        # if a line starts (^) with "BC" and is followed by one or more (+) numbers "[0-9]"
        # output the value of a variable "id", followed by:
        #     if "n" is non-zero, divide '"s" by "n"
        #     if "n" is 0, output string "NA"
        # "s=n=0" - assign "0" to "s" and "n"
        # assign a current record/line ($0) to variable "id"
        (FNR==NR || FNR==1) && /^BC[0-9]+$/ {arr[id]= (n) ? s/n : "NA"; s=n=0; id=$0}

        # if a line starts with one or more (+) numbers ([0-9]) optionally followed
        # by zero or more (*) numbers ([0-9]) or a dot (.)...
        #    calculate a sum (s) by adding the current record value ($0) to a running sum (s): s+=$0
        #    increment the running counter for records associated with a current "id": n++
        FNR==NR && /^[0-9]+[.0-9]*$/{s+=$0; n++}


        FNR!=NR && /^BC[0-9]+$/ {id=$0; print $0, arr[id]}
        FNR!=NR && /^[0-9]+[.0-9]*$/{print $0, $0 - arr[id] }