Hourly averaging using Awk

Hey all,

I have a set of 5-second data as shown below. I need to find an hourly average of this data.

date                         co2
25/06/2011 08:04:00        8.30
25/06/2011 08:04:05        8.31
25/06/2011 08:04:10        8.32
25/06/2011 08:04:15        8.33
25/06/2011 08:04:20        8.32
25/06/2011 08:04:25        8.31
25/06/2011 08:04:30        8.30
25/06/2011 08:04:35        8.29
25/06/2011 08:04:40        8.34
25/06/2011 08:04:45        8.43
25/06/2011 08:04:50        8.33
25/06/2011 08:04:55        8.32
25/06/2011 08:05:00        8.31
25/06/2011 08:05:05        8.32
25/06/2011 08:05:10        8.30
25/06/2011 08:05:15        8.29
25/06/2011 08:05:20        8.30
25/06/2011 08:05:25        8.31
25/06/2011 08:05:30        8.33
25/06/2011 08:05:35        8.32
25/06/2011 08:05:40        8.32
25/06/2011 08:05:45        8.35
25/06/2011 08:05:50        8.34
25/06/2011 08:05:55        8.36
25/06/2011 08:06:00        8.37
25/06/2011 08:06:05        8.34
25/06/2011 08:06:10        8.35
25/06/2011 08:06:15        8.35
25/06/2011 08:06:20        8.34
25/06/2011 08:06:25        8.38
25/06/2011 08:06:30        8.35
25/06/2011 08:06:35        8.36
25/06/2011 08:06:40        8.32
25/06/2011 08:06:45        8.31
25/06/2011 08:06:50        8.30
25/06/2011 08:06:55        8.32

So the above is just 3 minutes' worth of data! Is there a way to average the co2 values into hourly data?

The co2 data is in column 13 of my dataframe. I also have data in column 15 that I need to average hourly too!

Thanks a lot!

How about this?

 
awk '{h=substr($2,1,index($2,":")-1); a[h]+=$3} END{for (i in a) print "hour->", i, "average->", a[i]/720}' input_file
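
One thing to note: for (i in a) returns the hours in no particular order, so if you want them sorted you can just pipe the output through sort, e.g.:

awk '{h=substr($2,1,index($2,":")-1); a[h]+=$3} END{for (i in a) print "hour->", i, "average->", a[i]/720}' input_file | sort -k2,2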

Hey! This is my output:


hour-> 17 average-> 0
hour-> 08 average-> 0
hour->  average-> 0
hour-> 18 average-> 0
hour-> 09 average-> 0
hour-> 19 average-> 0
hour-> 00 average-> 0
hour-> 01 average-> 0
hour-> 10 average-> 0
hour-> 11 average-> 0
hour-> 20 average-> 0
hour-> 02 average-> 0
hour-> 12 average-> 0
hour-> 21 average-> 0
hour-> 03 average-> 0
hour-> 04 average-> 0
hour-> 22 average-> 0
hour-> 13 average-> 0
hour-> 23 average-> 0
hour-> 05 average-> 0
hour-> 14 average-> 0
hour-> 06 average-> 0
hour-> 15 average-> 0
hour-> 16 average-> 0
hour-> 07 average-> 0



Also, I need to edit the actual file. That is, the 5-second data needs to be deleted and replaced with the hourly averages, i.e.

date                         co2
25/06/2011 08:00:00        8.30
25/06/2011 09:00:00        8.31
25/06/2011 10:00:00        8.32
25/06/2011 11:00:00        8.33

etc.

Hello,

I'm not sure what went wrong; the same code I posted worked for me!

Try this:

 
awk '{h=substr($2,1,index($2,":")-1); a[h]+=$3} END{for (i in a) print "hour->", i, "average->", a[i]/720.0}' input_file

Tested on the sample that I have; I can't paste it here as it is too big. The result looked satisfactory.

 
awk '{h=$1" "substr($2,1,index($2,":")-1); a[h]+=$3} END{for (i in a) print i":00:00", a[i]/720.0}' input_file
01/01/2011 00:00:00 8.17293
01/01/2011 01:00:00 8.23
01/01/2011 02:00:00 8.31
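
If you want the file itself to end up containing only the hourly averages, one option (just a sketch, swap in your real filename) is to write the output to a temporary file and then move it over the original; the NR>1 skips the header line:

awk 'NR>1{h=$1" "substr($2,1,index($2,":")-1); a[h]+=$3} END{for (i in a) print i":00:00", a[i]/720.0}' input_file > input_file.tmp && mv input_file.tmp input_file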

awk '{split($2,a,":");b[$1 FS a[1]]+=$3;c[$1 FS a[1]]++}
    END{for (i in b) printf "%s:00:00\t%.2f\n", i, b[i]/c[i]}' infile
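
Since you said the values are actually in columns 13 and 15 rather than 3 (which might also explain the zero averages, as $3 then isn't the co2 value), a variant along these lines might work. It is untested against your real file and assumes whitespace-separated fields with the date in $1 and the time in $2; it counts the samples in each hour instead of assuming exactly 720, and NR>1 skips the header line:

awk 'NR>1{split($2,t,":"); k=$1 FS t[1]; s13[k]+=$13; s15[k]+=$15; n[k]++}
    END{for (i in n) printf "%s:00:00\t%.2f\t%.2f\n", i, s13[i]/n[i], s15[i]/n[i]}' infile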

@panyam this is the result the shell gave me:

09/06/2011 20:00:00 0
24/06/2011 13:00:00 0
21/06/2011 18:00:00 0
09/06/2011 21:00:00 0
24/06/2011 14:00:00 0
21/06/2011 19:00:00 0
09/06/2011 22:00:00 0
24/06/2011 15:00:00 0

(I know the dates are mixed up; I need to sort this out afterwards.)

@rdcwayx this is the result I got from your code:

09/06/2011 20:00:00     0.00
24/06/2011 13:00:00     0.00
21/06/2011 18:00:00     0.00
09/06/2011 21:00:00     0.00
24/06/2011 14:00:00     0.00
21/06/2011 19:00:00     0.00
09/06/2011 22:00:00     0.00

Also, the code has to actually edit the CSV file.

Thanks guys

Do you think it makes sense to sort out the dates/times before averaging the data? It does, doesn't it? Here's my code so far:

#!/bin/bash


# input every 'CF*nc.dat' file in 'picarro' folder and create combined 'june2011.dat' file
find /u/gd9629/private/Scripts/Gavin/picarro -type f -name "CF*nc.dat" -exec cat {} \; > /u/gd9629/private/Scripts/Gavin/Data/june2011.dat

# use 'june2011.dat' as input file to create .csv file
IN_all='/u/gd9629/private/Scripts/Gavin/Data/june2011.dat' 
	
# Output files
OUT_all='/u/gd9629/private/Scripts/Gavin/Awk/juneout.csv'		# the csv file to create for all data, 'juneout.csv', in the respective directory

# gawk script to create the csv file
GAWK='/u/gd9629/private/Scripts/Gavin/Format.csv.awk'

# creating headers for the columns
echo "date,alarm,species,solenoid,mpv,outlet,cavp,cavt,warmbox,etalon,dastemp,co2sync,co2,ch4sync,ch4,h2osync" > $OUT_all    # create clean OUT_all file with headers

#produce the OUT file from the IN file(s)
$GAWK $IN_all >> $OUT_all 


#removes the interspersed headers
awk '!/^\/\//' /u/gd9629/private/Scripts/Gavin/Awk/juneout.csv > /u/gd9629/private/Scripts/Gavin/Awk/juneout2.csv

The problem is that the data from all the .dat files in the folder I've specified gets read in in no discernible order (or so it seems). Is there a way of sorting it by date and time?

I suppose this would make averaging the data easier, as opposed to sorting it out afterwards.
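
Something like this is roughly what I had in mind for sorting chronologically before averaging (untested; it assumes each line starts with the date as dd/mm/yyyy followed by the time, as in the sample above, with zero-padded days and months, and that any header line is set aside first; 'infile' is just a placeholder):

# sort by year (chars 1-4 of the 3rd '/'-separated field), then month, then day, then the time that follows the year
sort -t'/' -k3.1,3.4 -k2,2 -k1,1 -k3.6 infile > infile.sorted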