I have a data file like the one below, where TIME is the second column:
DATE TIME FRAC_DAYS_SINCE_JAN1
2011-06-25 08:03:20.000 175.33564815
2011-06-25 08:03:25.000 175.33570602
2011-06-25 08:03:30.000 175.33576389
2011-06-25 08:03:35.000 175.33582176
There are about 20 or so columns. The file contains data every 5 seconds (as shown above), and I need a way to average it over every 30 minutes. Is there any way to do this?
Yes, of course. But why awk? This is easy in a scripting language with big integers: convert the third column to such integers and track the differences. The step between lines is 5787 (in units of 1e-8 days), so 30 minutes is 360 * 5787 = 2083320. These numbers do not fit in 32 bits, so big integers are needed (with floating-point numbers the results would not be exact). It's a snap with Ruby or Python. (Maybe with awk too, but I think that depends on the awk version and needs testing.)
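A minimal sketch of that scaling idea in awk (YOURFILE is a placeholder; note that awk stores numbers as double-precision floats, which represent integers exactly up to 2^53, so a value like 17533564815 is representable there):

# Scale FRAC_DAYS_SINCE_JAN1 ($3) to whole 1e-8 days and print the step
# between consecutive data lines: 5 seconds comes out as 5787.
awk 'NR > 2 { printf "%.0f\n", $3 * 100000000 - prev } NR > 1 { prev = $3 * 100000000 }' YOURFILE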
But even the best programmers need to test their simplest programs, and why should somebody write those tests for you? Post more rows (not just every 5 seconds, but a selection of different hours and maybe dates) and some columns with different, real data (what are they?), and you can get help.
I have written a script in awk that edits the raw data into a format that can be analysed by a pre-written program. However, with 5-second data the quantity read into the program is huge (and therefore takes a very long time to process). Averaging this 5-second data into 30-minute data would be most helpful.
$ cat script.pl
use warnings;
use strict;
## 30 min * 60 sec/min / 5 sec/line (+1 because both endpoints are counted).
my $lines_to_average = (30 * 60 / 5) + 1;
# $lines_to_average = 3; # Test with ten-second averages.
my $proc_lines = 0;
my ( $init_date, $end_date );
my $total_frac_days = 0;
while ( <> ) {
    ## Skip header.
    next if $. == 1;
    ## Remove last '\n'.
    chomp;
    ## Sum 'frac_days' until the averaging window is full.
    if ( $proc_lines < $lines_to_average ) {
        my ($date, $frac_day) = (split)[1,2];
        ++$proc_lines;
        $total_frac_days += $frac_day;
        $init_date = $date if $proc_lines == 1;
        if ( $proc_lines == $lines_to_average || eof() ) {
            $end_date = $date;
        } else {
            next;
        }
    }
    ## Divide by the lines actually read, so a short final
    ## block at end-of-file is averaged correctly.
    printf "%s-%s %.8f\n", $init_date, $end_date, $total_frac_days / $proc_lines;
    $proc_lines = $total_frac_days = 0;
}
I've made a test (changing the $lines_to_average variable) with ten-second averages, which results in an output of two lines in the format: initDate-endDate<space>average
Here is the script running:
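Assuming the data file is named YOURFILE as elsewhere in this thread, the invocation is simply:

$ perl script.pl YOURFILE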
The first part has the same values as the second, but they differ in time (by more than 30 minutes). The time tag shows the point where the difference between lines exceeds 30 minutes, or the time of the last line of the file. The test data in columns 2 and 3 of INPUTFILE are time tags too. You really have four time tags ($1$2, $3, $4, $5 of YOURFILE). You can use other columns for your tests.
The results should be the same, and indeed they are; they look right:
awk -vSCALE=100000000 -vDIF=2083320 -vprev=17533564815 '
{
    cur = cur + ($1*SCALE - prev)
    prev = $1*SCALE
    if (cur > DIF) {
        printf $1 " "
        for (i = 2; i <= NF; i++) {
            printf "%f ", a[i]/c; a[i] = 0   # format the output to your needs
        }
        printf "\n"
        c = 0; cur = 0
    }
    for (i = 2; i <= NF; i++) {
        a[i] += $i                           # accumulate each column separately
    }
    c++
}
END {
    printf $1 " "
    for (i = 2; i <= NF; i++)
        printf "%f ", a[i]/c
    print ""
}' INPUTFILE
176.33564815 4208.061806 1308989022.500000
176.33616898 4208.061806 1308989022.500000
But it just prints the averages to the shell. What would I have to do to replace the data above and get it into the desired output (as shown in my previous post)?
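One simple way is shell redirection: appending > OUTPUTFILE (a placeholder name) to the command sends the printed averages to that file instead of the terminal:

awk -vSCALE=100000000 -vDIF=2083320 -vprev=17533564815 '...' INPUTFILE > OUTPUTFILE

Here '...' stands for the same program text as above.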
For the averaging code you gave me, I'm unsure what the code actually does. Can you briefly run through what the parts of the script do?
#!/bin/gawk -f
# This script averages the data into 60-minute intervals (720 lines of 5-second data).
NR==1{
    gsub(" +","\t")     # retab the header
    print
}
NR>1&&(NR-1)%720{
    for (i=3;i<=NF;i++){
        a[i]+=$i        # accumulate each column separately
    }
}
NR>1&&!((NR-2)%720){
    t=$1"\t"$2"\t"      # remember the date/time that opens the block
}
NR>1&&!((NR-1)%720){
    printf t
    for (i=3;i<=NF;i++){
        printf "%.5e\t",a[i]/720
        a[i]=0
    }
    printf "\n"
}
When analyzing my code I noticed that it is not giving accurate output: the last line of each 720-line block was never added to the sums. This version should calculate the average properly:
NR==1{
    gsub(" +","\t")
    print
}
NR>1&&(NR-1)%720{
    for (i=3;i<=NF;i++){
        a[i]+=$i
    }
}
NR>1&&!((NR-2)%720){
    t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%720){
    for (i=3;i<=NF;i++){
        a[i]+=$i        # include the closing line of the block in the sums
    }
    printf t
    for (i=3;i<=NF;i++){
        printf "%.5e\t",a[i]/720
        a[i]=0
    }
    printf "\n"
}
In simple words, it goes through the whole file and checks line numbers:
for line 1 it prints the header;
for line 2 it saves the time into a variable;
for lines 2-720 it sums the columns into the corresponding array elements;
for line 721 it adds the last values of the range to the array elements, then prints the time saved at line 2 and the calculated averages.
The modulo operator causes these steps to be repeated for lines 722-1441, 1442-2161, and so on.
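To see the modulo bookkeeping in isolation, here is a quick sanity check (not from the thread) with 720 shrunk to 3, so each block spans three data lines after the header:

$ seq 1 10 | gawk 'NR>1 && !((NR-1)%3) { print "block ends at line " NR }'
block ends at line 4
block ends at line 7
block ends at line 10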
Hi gd9629
I recognise the data you're working with (measurements of atmospheric carbon dioxide, methane and water vapour); my day job is running instruments like yours.
A couple of comments, if I may:
It would be fairly standard procedure to average such data to fixed timestamps rather than have the intervals defined by the start of the data record, e.g. hourly averages from minute 0 to minute 59 (see the sketch at the end of this post). This simplifies comparing or combining different data sets and takes care of pesky issues such as periods of missing data.
You can reduce your computation time by ignoring the data which are either pointless or meaningless to average. PM me if you want to discuss further, as that discussion would be very much off-topic.
My own approach would be to reduce the data to one-minute averages, filtered against the diagnostic information in $6 to $15 (again, PM me if you want further clarification) and written to a daily or monthly file. You can then calculate averages and other stats for whatever periods you wish without having to revisit the rather large raw data files you started with (they must be at least 350 MB per day?). You can also reformat to whatever suits you; I'd go with space- or comma-delimited values and lose the exponents and a lot of the excess digits.
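A minimal sketch of the fixed-timestamp idea in gawk (not from this thread): it keys each line to its half-hour bucket using the DATE and TIME columns shown in the first post and, as a demonstration, averages only column 3; YOURFILE is a placeholder:

gawk 'NR > 1 {
    split($2, t, ":")                                  # TIME -> HH, MM, SS
    key = $1 " " t[1] ":" (t[2]+0 < 30 ? "00" : "30")  # floor to :00 or :30
    sum[key] += $3; n[key]++
}
END {
    for (k in sum)                                     # array order is unspecified,
        printf "%s %.8f\n", k, sum[k]/n[k]             # hence the sort below
}' YOURFILE | sort

Because the buckets are fixed clock intervals, a period of missing data simply yields buckets with a smaller count rather than shifting every subsequent average.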