Averaging help in awk

Hi all,

I have a data file like below, where Time is in the second column

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      
2011-06-25                08:03:20.000              175.33564815 
2011-06-25                08:03:25.000              175.33570602
2011-06-25                08:03:30.000              175.33576389 
2011-06-25                08:03:35.000              175.33582176

There are about 20 or so columns. The data file contains data every 5 seconds (as shown above), I need to find a way to average the data every 30 minutes. Is there any way to do this?

Thanks a lot

Yes, of course. But why awk? It's very easy for scripting languages with big integers - just convert the third row to such integers and follow differences. It's 2083320 for 30 minutes. This numbers are not fit in 32 bits so big ints (and with float numbers results wouldn't be exact). It's a snap with Ruby or Python. (And maybe with awk but I think it may depends on awk version and it needs testing).
But even best programmers need to test their simplest programs and why somebody should want to write those tests for you? Give more rows (not every 5 seconds but with selection of different hours and maybe dates) and some columns with different and real data (what are they?) and you can get help.

I have written a script in awk that edits the raw data into a format that can be analysed by a pre written program. However with 5 second data, the quantity that is read into the program is huge (and therefore takes a very long time). Averaging this 5 second data into 30 minute data will be most helpful.

A snippet of all the columns of the data is below

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  
2011-06-25                08:03:20.000              175.33564815              4208.055556               1308989000.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1033031768E+004         1.3998939700E+002         4.4999687195E+001         4.4999909923E+001         4.4982554694E+001         3.1751035912E+001         3.8940279934E+002         3.9515856123E+002         3.8618224512E+000         3.9114643916E+000         9.4040946797E-001         
2011-06-25                08:03:25.000              175.33570602              4208.056944               1308989005.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036267092E+004         1.3999910014E+002         4.4999724978E+001         4.5000055343E+001         4.4982432169E+001         3.1750000000E+001         3.8935193355E+002         3.9511128829E+002         3.8489583767E+000         3.8990622477E+000         9.4172965077E-001         
2011-06-25                08:03:30.000              175.33576389              4208.058333               1308989010.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035811523E+004         1.4000220783E+002         4.4999819829E+001         4.5000161391E+001         4.4982414246E+001         3.1795673077E+001         3.8938861262E+002         3.9514732622E+002         3.8595829527E+000         3.9062814635E+000         9.3441447742E-001         
2011-06-25                08:03:35.000              175.33582176              4208.059722               1308989015.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036211602E+004         1.4000074680E+002         4.4999892989E+001         4.5000205994E+001         4.4982414246E+001         3.1750000000E+001         3.8945211683E+002         3.9517328822E+002         3.8841252351E+000         3.9451982461E+000         9.3417578630E-001         
2011-06-25                08:03:40.000              175.33587963              4208.061111               1308989020.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036852113E+004         1.3999663742E+002         4.4999757503E+001         4.5000099772E+001         4.4982477978E+001         3.1793508287E+001         3.8936261353E+002         3.9511123990E+002         3.8395527261E+000         3.8931575825E+000         9.4752583244E-001         
2011-06-25                08:03:45.000              175.33593750              4208.062500               1308989025.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1032849098E+004         1.3998770158E+002         4.4999748230E+001         4.5000053406E+001         4.4982505798E+001         3.1750000000E+001         3.8931505434E+002         3.9512509568E+002         3.8351808367E+000         3.8599191928E+000         9.5367870751E-001         
2011-06-25                08:03:50.000              175.33599537              4208.063889               1308989030.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039831891E+004         1.4001021926E+002         4.5000107359E+001         4.4999785829E+001         4.4982363833E+001         3.1750000000E+001         3.8929289686E+002         3.9508579576E+002         3.8474802185E+000         3.8609535036E+000         9.4079656257E-001         
2011-06-25                08:03:55.000              175.33605324              4208.065278               1308989035.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1034918787E+004         1.3999495360E+002         4.5000032095E+001         4.4999813527E+001         4.4982444763E+001         3.1750000000E+001         3.8929209181E+002         3.9514661440E+002         3.8572828460E+000         3.9132254247E+000         9.5624024001E-001         
2011-06-25                08:04:00.000              175.33611111              4208.066667               1308989040.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035897944E+004         1.3999774824E+002         4.4999900818E+001         4.4999778748E+001         4.4982444763E+001         3.1750000000E+001         3.8899351438E+002         3.9493622327E+002         3.8126918495E+000         3.7581951076E+000         9.7192541150E-001         
2011-06-25                08:04:05.000              175.33616898              4208.068056               1308989045.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039568209E+004         1.4001117648E+002         4.5000114441E+001         4.4999682500E+001         4.4982360253E+001         3.1750000000E+001         3.8927314658E+002         3.9508938383E+002         3.8333155741E+000         3.8801732738E+000         9.5087600370E-001         

Hi,

Try next 'perl' program:

$ cat script.pl
use warnings;
use strict;

## 30min * 60min/sec / 5sec/line
my $lines_to_average = (30 * 60 / 5)+1;
# $lines_to_average = 3; #Test for ten seconds.

my $proc_lines = 0;
my ($init_date, $end_date, $total_frac_days);

while ( <> ) {
        ## Skip header.
        next if $. == 1;

        ## Remove last '\n'.
        chomp;

        ## Sum 'frac_days' while not reached to average time.
        if ( $proc_lines < $lines_to_average ) {
                my ($date,$frac_day) = (split)[1,2];
                ++$proc_lines;
                $total_frac_days += $frac_day;
                $init_date = $date if $proc_lines == 1;
                if ( $proc_lines == $lines_to_average || eof() ) {
                        $end_date = $date;
                } else {
                        next;
                }
        }

        printf "%s-%s %.8f\n", $init_date, $end_date, $total_frac_days / $lines_to_average;
        $proc_lines = $total_frac_days = 0;

}

I've made a test (changing the $lines_to_average variable) with an average of ten seconds, which results in an output of two lines: initDate-endDate<space>average
Here the script running:

$ cat infile
DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      
2011-06-25                08:03:20.000              175.33564815 
2011-06-25                08:03:25.000              175.33570602
2011-06-25                08:03:30.000              175.33576389 
2011-06-25                08:03:35.000              175.33582176
$ perl script.pl infile
08:03:20.000-08:03:30.000 175.33570602
08:03:35.000-08:03:35.000 58.44527392

UPDATE:
My script with your last example of input file and an average of ten seconds.

$ cat infile
DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  
2011-06-25                08:03:20.000              175.33564815              4208.055556               1308989000.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1033031768E+004         1.3998939700E+002         4.4999687195E+001         4.4999909923E+001         4.4982554694E+001         3.1751035912E+001         3.8940279934E+002         3.9515856123E+002         3.8618224512E+000         3.9114643916E+000         9.4040946797E-001         
2011-06-25                08:03:25.000              175.33570602              4208.056944               1308989005.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036267092E+004         1.3999910014E+002         4.4999724978E+001         4.5000055343E+001         4.4982432169E+001         3.1750000000E+001         3.8935193355E+002         3.9511128829E+002         3.8489583767E+000         3.8990622477E+000         9.4172965077E-001         
2011-06-25                08:03:30.000              175.33576389              4208.058333               1308989010.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035811523E+004         1.4000220783E+002         4.4999819829E+001         4.5000161391E+001         4.4982414246E+001         3.1795673077E+001         3.8938861262E+002         3.9514732622E+002         3.8595829527E+000         3.9062814635E+000         9.3441447742E-001         
2011-06-25                08:03:35.000              175.33582176              4208.059722               1308989015.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036211602E+004         1.4000074680E+002         4.4999892989E+001         4.5000205994E+001         4.4982414246E+001         3.1750000000E+001         3.8945211683E+002         3.9517328822E+002         3.8841252351E+000         3.9451982461E+000         9.3417578630E-001         
2011-06-25                08:03:40.000              175.33587963              4208.061111               1308989020.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036852113E+004         1.3999663742E+002         4.4999757503E+001         4.5000099772E+001         4.4982477978E+001         3.1793508287E+001         3.8936261353E+002         3.9511123990E+002         3.8395527261E+000         3.8931575825E+000         9.4752583244E-001         
2011-06-25                08:03:45.000              175.33593750              4208.062500               1308989025.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1032849098E+004         1.3998770158E+002         4.4999748230E+001         4.5000053406E+001         4.4982505798E+001         3.1750000000E+001         3.8931505434E+002         3.9512509568E+002         3.8351808367E+000         3.8599191928E+000         9.5367870751E-001         
2011-06-25                08:03:50.000              175.33599537              4208.063889               1308989030.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039831891E+004         1.4001021926E+002         4.5000107359E+001         4.4999785829E+001         4.4982363833E+001         3.1750000000E+001         3.8929289686E+002         3.9508579576E+002         3.8474802185E+000         3.8609535036E+000         9.4079656257E-001         
2011-06-25                08:03:55.000              175.33605324              4208.065278               1308989035.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1034918787E+004         1.3999495360E+002         4.5000032095E+001         4.4999813527E+001         4.4982444763E+001         3.1750000000E+001         3.8929209181E+002         3.9514661440E+002         3.8572828460E+000         3.9132254247E+000         9.5624024001E-001         
2011-06-25                08:04:00.000              175.33611111              4208.066667               1308989040.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035897944E+004         1.3999774824E+002         4.4999900818E+001         4.4999778748E+001         4.4982444763E+001         3.1750000000E+001         3.8899351438E+002         3.9493622327E+002         3.8126918495E+000         3.7581951076E+000         9.7192541150E-001         
2011-06-25                08:04:05.000              175.33616898              4208.068056               1308989045.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039568209E+004         1.4001117648E+002         4.5000114441E+001         4.4999682500E+001         4.4982360253E+001         3.1750000000E+001         3.8927314658E+002         3.9508938383E+002         3.8333155741E+000         3.8801732738E+000         9.5087600370E-001
$ perl script.pl infile
08:03:20.000-08:03:30.000 175.33570602
08:03:35.000-08:03:45.000 175.33587963
08:03:50.000-08:04:00.000 175.33605324
08:04:05.000-08:04:05.000 58.44538966

Adapt it to your needs.

But agree with yazu, give better examples and how you want your output, what have you tried to solve it, etc.

Regards,
Birei

And I see the task differently. I'm not sure I understand you right but I'm afraid it's not my fault. I changed your file for testing:

 awk '
NR != 1 { print $3, $4, $5; 
          $3++; printf "%.8f %.6f %.3f\n", $3, $4, $5}
' YOURFILE | sort > INPUTFILE 
cat INPUTFILE
175.33564815 4208.055556 1308989000.000
175.33570602 4208.056944 1308989005.000
175.33576389 4208.058333 1308989010.000
175.33582176 4208.059722 1308989015.000
175.33587963 4208.061111 1308989020.000
175.33593750 4208.062500 1308989025.000
175.33599537 4208.063889 1308989030.000
175.33605324 4208.065278 1308989035.000
175.33611111 4208.066667 1308989040.000
175.33616898 4208.068056 1308989045.000
176.33564815 4208.055556 1308989000.000
176.33570602 4208.056944 1308989005.000
176.33576389 4208.058333 1308989010.000
176.33582176 4208.059722 1308989015.000
176.33587963 4208.061111 1308989020.000
176.33593750 4208.062500 1308989025.000
176.33599537 4208.063889 1308989030.000
176.33605324 4208.065278 1308989035.000
176.33611111 4208.066667 1308989040.000
176.33616898 4208.068056 1308989045.000

The first part has the same values as the second, but they are different in time (more then 30 minutes). The time tag shows the time when the difference between lines becomes more then 30 minutes or the time of the last line of the file. Test data in INPUTFILE in columns 2 and 3 are time tags too. You are really have 4 ($1$2, $3, $4, $5 of YOURFILE) time tags. You can take for your tests another columns.
The results should be the same and they are the same and they look right:

awk -vSCALE=100000000 -vDIF=2083320 -vprev=17533564815 '
{
  cur = cur+($1*SCALE-prev)
  prev = $1*SCALE

  if (cur > DIF) {
    printf $1 " "
    for (i=2; i<=NF; i++) {
      printf "%f ", a/c; a = 0  # format output for your needs
    }
    printf "\n"
    c = 0; cur = 0
  }

  for (i=2; i<=NF; i++) {
    a += $i;
  }
  c++
}
END {
  printf $1 " "
  for (i=2; i<=NF; i++)
    printf "%f ", a/c
  print ""
}' INPUTFILE
176.33564815 4208.061806 1308989022.500000 
176.33616898 4208.061806 1308989022.500000 

I think you want an average for every 360 lines (30 minutes=3060 seconds = 3060/5 readings).

Pseudo code in awk.
BEGIN lines=0; total=0

REGULAR
lines++
If lines=360 then print total/360;initialise total.
total+=$2

END
print total/lines

I apologise if I've been ambiguous, I shall attempt to clarify the task.

I wish the output file to look like this

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  
2011-06-25                08:03:20.000              175.33564815              4208.055556               1308989000.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1033031768E+004         1.3998939700E+002         4.4999687195E+001         4.4999909923E+001         4.4982554694E+001         3.1751035912E+001         3.8940279934E+002         3.9515856123E+002         3.8618224512E+000         3.9114643916E+000         9.4040946797E-001         
2011-06-25                08:33:20.000              175.33570602              4208.056944               1308989005.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036267092E+004         1.3999910014E+002         4.4999724978E+001         4.5000055343E+001         4.4982432169E+001         3.1750000000E+001         3.8935193355E+002         3.9511128829E+002         3.8489583767E+000         3.8990622477E+000         9.4172965077E-001         
2011-06-25                09:03:20.000              175.33576389              4208.058333               1308989010.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035811523E+004         1.4000220783E+002         4.4999819829E+001         4.5000161391E+001         4.4982414246E+001         3.1795673077E+001         3.8938861262E+002         3.9514732622E+002         3.8595829527E+000         3.9062814635E+000         9.3441447742E-001         
2011-06-25                09:33:20.000              175.33582176              4208.059722               1308989015.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036211602E+004         1.4000074680E+002         4.4999892989E+001         4.5000205994E+001         4.4982414246E+001         3.1750000000E+001         3.8945211683E+002         3.9517328822E+002         3.8841252351E+000         3.9451982461E+000         9.3417578630E-001         
2011-06-25                10:03:20.000              175.33587963              4208.061111               1308989020.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036852113E+004         1.3999663742E+002         4.4999757503E+001         4.5000099772E+001         4.4982477978E+001         3.1793508287E+001         3.8936261353E+002         3.9511123990E+002         3.8395527261E+000         3.8931575825E+000         9.4752583244E-001         
2011-06-25                10:33:20.000              175.33593750              4208.062500               1308989025.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1032849098E+004         1.3998770158E+002         4.4999748230E+001         4.5000053406E+001         4.4982505798E+001         3.1750000000E+001         3.8931505434E+002         3.9512509568E+002         3.8351808367E+000         3.8599191928E+000         9.5367870751E-001         
2011-06-25                11:03:20.000              175.33599537              4208.063889               1308989030.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039831891E+004         1.4001021926E+002         4.5000107359E+001         4.4999785829E+001         4.4982363833E+001         3.1750000000E+001         3.8929289686E+002         3.9508579576E+002         3.8474802185E+000         3.8609535036E+000         9.4079656257E-001         
2011-06-25                11:33:20.000              175.33605324              4208.065278               1308989035.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1034918787E+004         1.3999495360E+002         4.5000032095E+001         4.4999813527E+001         4.4982444763E+001         3.1750000000E+001         3.8929209181E+002         3.9514661440E+002         3.8572828460E+000         3.9132254247E+000         9.5624024001E-001         
2011-06-25                12:03:20.000              175.33611111              4208.066667               1308989040.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035897944E+004         1.3999774824E+002         4.4999900818E+001         4.4999778748E+001         4.4982444763E+001         3.1750000000E+001         3.8899351438E+002         3.9493622327E+002         3.8126918495E+000         3.7581951076E+000         9.7192541150E-001         
2011-06-25                12:33:20.000              175.33616898              4208.068056               1308989045.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039568209E+004         1.4001117648E+002         4.5000114441E+001         4.4999682500E+001         4.4982360253E+001         3.1750000000E+001         3.8927314658E+002         3.9508938383E+002         3.8333155741E+000         3.8801732738E+000         9.5087600370E-001

Except I need all the columns values (FRAC_DAYS through to H2O_sync) to be averaged to 30 minutes aswell.

So essentially as ananthap put it, an average every 360 lines. I tried incorporating the code ananthap suggested into the script but it didn't work.

Is this clearer? I'll try to explain again if you need me to.

Thanks guys

---------- Post updated at 02:38 PM ---------- Previous update was at 10:33 AM ----------

Managed to find the 30 minute averages using this code

awk 'NR>1 -F ";" {a[int((NR-1)/720)]+=$17;b[int((NR-1)/720)]+=$19};END{print "CO2 CH4";for (i=0;i<NR/360;i++){print a/360,b/360}}' INPUTFILE

But it just prints the averages to the Shell. What would I have to do to replace the data above to get it into the desired output (as shown in my previous post).

Cheers

Put this code into "script.awk":

NR==1{
  gsub(" +","\t")
  print
}
NR>1&&(NR-1)%360{
  for (i=3;i<=NF;i++){
    a+=$i
  }
}
NR>1&&!((NR-2)%360){
  t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%360){
  printf t
  for (i=3;i<=NF;i++){
    printf "%.10e\t",a/360
    a=0
  }
  printf "\n"
}

Then run it like that:

awk -f script.awk input > output

It will change the format of some columns, and it will change the field delimiter to single TAB as keeping all the spaces intact is quite troublesome.

1 Like

Hey Bartus,

for the averaging code you gave me I'm unsure as to what the code actually does, can you just breifly run through what the parts of the script do?

#!/bin/gawk -f

# this script is used to average the data into 60 minute intervals


NR==1{
  gsub(" +","\t")
  print
}
NR>1&&(NR-1)%720{
  for (i=3;i<=NF;i++){
    a+=$i
  }
}
NR>1&&!((NR-2)%720){
  t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%720){
  printf t
  for (i=3;i<=NF;i++){
    printf "%.5e\t",a/720
    a=0
  }
  printf "\n"
}

When analyzing my code I noticed that it is not giving accurate output... This should calculate the average properly:

NR==1{
  gsub(" +","\t")
  print
}
NR>1&&(NR-1)%720{
  for (i=3;i<=NF;i++){
    a+=$i
  }
}
NR>1&&!((NR-2)%720){
  t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%720){
  for (i=3;i<=NF;i++){
    a+=$i
  }
  printf t
  for (i=3;i<=NF;i++){
    printf "%.5e\t",a/720
    a=0
  }
  printf "\n"
}

In simple words, it is going through whole file and checks line numbers:
for line 1 it is printing the header;
for line 2 it is saving time into variable
for lines 2-720 it is summing the columns in appropriate array elements;
for line number 721 it is adding the last value of the range to the array elements, then it prints the time saved at line 2 and prints calculated average value.
Using modulo operator causes this steps to be performed again for line 722-1440, 1441 etc.

Hi gd9629
I recognise the data you're working with (measurements of atmospheric carbon dioxide, methane and water vapour) - my day job is running instruments such as you have.

A couple of comments if I may;
It would be fairly standard procedure to average such data to a fixed timestamp rather than have it defined by the start of the data record eg hourly averages from minute 0 to minute 59. This simplifies comparison or combination of different data sets and takes care of pesky issues such as periods of missing data

You can reduce your computation time by ignoring the data which are either pointless or meaningless to average. PM me if you want to discuss further as that discussion would be very much off-topic

My own approach would be to reduce the data to a one minute averages, filtered vs the diagnostics information in $6 to $15 (again PM me if you want further clarification) and written to a daily or monthly file. You can then calculate averages and other stats for whatever periods you wish without having to revisit the rather large raw data files you started with (must be at least 350Mb per day?). You can also reformat to whatever suits you. I'd go with space or comma delimited and lose the exponents and a lot of excess digits

cheers