Standard deviation in awk

gd9629 · August 19, 2011, 5:48am

Hi all,

I need to find the standard deviation of each column of a dataset below for each hour. The data is given in 5 second intervals as shown below

DATE                      TIME                      FRAC_DAYS_SINCE_JAN1      FRAC_HRS_SINCE_JAN1       EPOCH_TIME                ALARM_STATUS              species                   solenoid_valves           MPVPosition               OutletValve               CavityPressure            CavityTemp                WarmBoxTemp               EtalonTemp                DasTemp                   CO2_sync                  CO2_dry_sync              CH4_sync                  CH4_dry_sync              H2O_sync                  
2011-06-25                08:03:20.000              175.33564815              4208.055556               1308989000.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1033031768E+004         1.3998939700E+002         4.4999687195E+001         4.4999909923E+001         4.4982554694E+001         3.1751035912E+001         3.8940279934E+002         3.9515856123E+002         3.8618224512E+000         3.9114643916E+000         9.4040946797E-001         
2011-06-25                08:03:25.000              175.33570602              4208.056944               1308989005.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036267092E+004         1.3999910014E+002         4.4999724978E+001         4.5000055343E+001         4.4982432169E+001         3.1750000000E+001         3.8935193355E+002         3.9511128829E+002         3.8489583767E+000         3.8990622477E+000         9.4172965077E-001         
2011-06-25                08:03:30.000              175.33576389              4208.058333               1308989010.000            0                         3.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1035811523E+004         1.4000220783E+002         4.4999819829E+001         4.5000161391E+001         4.4982414246E+001         3.1795673077E+001         3.8938861262E+002         3.9514732622E+002         3.8595829527E+000         3.9062814635E+000         9.3441447742E-001         
2011-06-25                08:03:35.000              175.33582176              4208.059722               1308989015.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036211602E+004         1.4000074680E+002         4.4999892989E+001         4.5000205994E+001         4.4982414246E+001         3.1750000000E+001         3.8945211683E+002         3.9517328822E+002         3.8841252351E+000         3.9451982461E+000         9.3417578630E-001         
2011-06-25                08:03:40.000              175.33587963              4208.061111               1308989020.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1036852113E+004         1.3999663742E+002         4.4999757503E+001         4.5000099772E+001         4.4982477978E+001         3.1793508287E+001         3.8936261353E+002         3.9511123990E+002         3.8395527261E+000         3.8931575825E+000         9.4752583244E-001         
2011-06-25                08:03:45.000              175.33593750              4208.062500               1308989025.000            0                         2.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1032849098E+004         1.3998770158E+002         4.4999748230E+001         4.5000053406E+001         4.4982505798E+001         3.1750000000E+001         3.8931505434E+002         3.9512509568E+002         3.8351808367E+000         3.8599191928E+000         9.5367870751E-001         
2011-06-25                08:03:50.000              175.33599537              4208.063889               1308989030.000            0                         1.0000000000E+000         0.0000000000E+000         0.0000000000E+000         3.1039831891E+004         1.4001021926E+002         4.5000107359E+001         4.4999785829E+001         4.4982363833E+001         3.1750000000E+001         3.8929289686E+002         3.9508579576E+002         3.8474802185E+000         3.8609535036E+000         9.4079656257E-001

As each line 5 seconds, each hour would represent 720 lines! Basically need to find the SD each 720 lines.

I've worked out how to find the SD of a column (column 6 here)

awk  'NR>2 {sum+=$6; array[NR]=$6} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);}print sqrt(sumsq/NR)}' INPUTFILE.dat

but instead of the whole column I need every hour (every 720 lines).

I've attached a file containing 3 hours worth of data if needed

Thanks a lot!

Roboticus · August 19, 2011, 8:19am

Give this a try:

#!/usr/bin/awk

function std_dev(data, count) {
    sum=0;
    for( x=1; x <= count; x++) {
        sum += data[x];
    }

    avg = sum/count;

    sumsq=0;
    for( x=1; x <= count; x++) {
        sumsq += (data[x] - avg)^2;
    }

    print sqrt(sumsq/count);
}

BEGIN {
    cnt = 0;
}

END {
    std_dev(array, cnt);
}

{
    array[cnt++] = $7;

    if (cnt == 720) {
        std_dev(array, cnt);
        cnt = 0;
    }
}

gd9629 · August 19, 2011, 10:03am

Hi Roboticus,

Thanks for the help. Am I right in thinking this will give me the hourly SD's for column 7? How would I get the hourly SD's of say columns 17 19 and 20 and place them next to each other?

Roboticus · August 19, 2011, 11:20pm

I'd create an array for each of the columns, something like:

{
   array7[cnt]=$7; array8[cnt]=$8; array9[cnt]=$9; cnt++;
   if (cnt==720) {
      print std_dev(array7,cnt) ", " std_dev(array8,cnt)
       ", " std_dev(array9,cnt);
   }
}

But you'd need to modify std_dev to return the value instead of printing it, update the END block, etc.

(I'd give it a go right now, but am having production issues. If you're still fighting it tomorrow, I can tune it up some.)

--roboticus

gd9629 · August 22, 2011, 11:19am

Hi Roboticus,

I've tried combining the codes you've given me into

function std_dev(data, count) {
    sum=0;
    for( x=1; x <= count; x++) {
        sum += data[x];
    }

    avg = sum/count;

    sumsq=0;
    for( x=1; x <= count; x++) {
        sumsq += (data[x] - avg)^2;
    }

    print sqrt(sumsq/count);
}


BEGIN {
    cnt = 0;
}


END {
    std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}

{
   array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
   if (cnt==720) {
      print std_dev(array17, cnt) " " std_dev(array19, cnt)
       " " std_dev(array20, cnt);
   }
	
}

Unfortunately this just gives me a few numbers, the meaning of which I'm unsure lol

How should I edit the script further to give the 3 separate columns of SD values?

Thanks!

---------- Post updated at 04:19 PM ---------- Previous update was at 09:58 AM ----------

Hmm I've tried this now

BEGIN {
    cnt = 0;
}


END {
    std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}

{
   array17[cnt++] = $17; array19[cnt++] = $19; array20[cnt++] = $20;
   
   if (cnt==720) {
    print  std_dev(array17, cnt) " " std_dev(array19, cnt)
       " " std_dev(array20, cnt); 
	   cnt = 0; 
	  
   }
	
}

and get this as my results

Any ideas?

bartus11 · August 22, 2011, 12:27pm

I've fixed your code a bit:

function std_dev(data, count) {
    sum=0;
    for( x=1; x <= count; x++) {
        sum += data[x];
    }
    avg = sum/count;
    sumsq=0;
    for( x=1; x <= count; x++) {
        sumsq += (data[x] - avg)^2;
    }
    return sqrt(sumsq/count);
}
BEGIN {
    cnt = 0;
}
END {
    std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
   array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
   if (cnt==720) {
      print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
      cnt=0;
   }
}

Run it as: awk -f script.awk testfile.txt
If you need times to be printed before those values, I might come up with something (in an hour or so).

gd9629 · August 23, 2011, 4:42am

Bartus I do believe you're a magician. If I want to combine two files

File 1

Site: Ridge Hill
yyyy mm dd hh         CO2                CH4         H20
2011 06 09 14  5.2977498477e+02 2.2533536482e+00 9.5975030698e-01
2011 06 09 15  4.6520528305e+02 2.0304031747e+00 8.5550947898e-01
2011 06 09 16  4.2844761381e+02 1.9940230591e+00 8.1996490325e-01
2011 06 09 17  4.2039731519e+02 2.1120403967e+00 8.8853008313e-01
2011 06 09 18  4.1383091367e+02 2.2051511063e+00 9.2270478393e-01
2011 06 09 19  4.0919424013e+02 2.2731077128e+00 9.3647077771e-01
2011 06 09 20  4.0612124764e+02 2.3469439802e+00 9.3830980979e-01
2011 06 09 21  4.0448877229e+02 2.4314198862e+00 9.3857043465e-01
2011 06 09 22  4.0341963267e+02 2.5037286957e+00 9.3017004739e-01
2011 06 09 23  4.0258560618e+02 2.5635443249e+00 9.2232603099e-01

File 2 (3 columns with all these values to 4.d.p)

71.0168 0.178157 0.0674305
24.332 0.087471 0.0448028
19.5077 0.082566 0.0407083
15.8686 0.0851615 0.0366714
15.507 0.0850516 0.0349926
15.3 0.0870619 0.0349696
15.1607 0.0904091 0.0350349
15.0886 0.0938745 0.0350621
15.0478 0.0954998 0.0348352
15.0147 0.0965944 0.0344979

Desired Output

Site: Ridge Hill
yyyy mm dd hh         CO2       	    CH4           		   H20
2011 06 09 14  5.2977498477e+02 71.0168	2.2533536482e+00 0.178157 9.5975030698e-01 0.0674305
2011 06 09 15  4.6520528305e+02 24.332	2.0304031747e+00 0.087471 8.5550947898e-01 0.0448028
2011 06 09 16  4.2844761381e+02 19.5077	1.9940230591e+00 0.082566 8.1996490325e-01 0.0407083
2011 06 09 17  4.2039731519e+02 15.8686	2.1120403967e+00 0.0851615 8.8853008313e-01 0.0366714
2011 06 09 18  4.1383091367e+02 15.507	2.2051511063e+00 0.0850516 9.2270478393e-01 0.0349926
2011 06 09 19  4.0919424013e+02 15.3 	2.2731077128e+00 0.0870619 9.3647077771e-01 0.0349696
2011 06 09 20  4.0612124764e+02 15.1607	2.3469439802e+00 0.0904091 9.3830980979e-01 0.0350349
2011 06 09 21  4.0448877229e+02 15.0886	2.4314198862e+00 0.0938745 9.3857043465e-01 0.0350621
2011 06 09 22  4.0341963267e+02 15.0478	2.5037286957e+00 0.0954998 9.3017004739e-01 0.0348352
2011 06 09 23  4.0258560618e+02 15.0147	2.5635443249e+00 0.0965944 9.2232603099e-01 0.0344979

Can I use the 'join' command? Or would it be better to use the 'FNR==NR' or perhaps something different?

Thanks

bartus11 · August 23, 2011, 5:50am

awk 'NR==FNR{a[NR+2]=$1;b[NR+2]=$2;c[NR+2]=$3;next}NR>2{$5=$5" "a[FNR];$6=$6" "b[FNR];$7=$7" "c[FNR]}1' file2 file1

gd9629 · August 23, 2011, 6:14am

My love knows no boundaries for you my friend. Just one final quick thing, for file2, the standard deviation code you gave me before, how would I amend it so that all values were to 4 decimal places?

bartus11 · August 23, 2011, 6:19am

function std_dev(data, count) {
    sum=0;
    for( x=1; x <= count; x++) {
        sum += data[x];
    }
    avg = sum/count;
    sumsq=0;
    for( x=1; x <= count; x++) {
        sumsq += (data[x] - avg)^2;
    }
    return sprintf ("%.4f", sqrt(sumsq/count));
}
BEGIN {
    cnt = 0;
}
END {
    std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
   array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
   if (cnt==720) {
      print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
      cnt=0;
   }
}

gd9629 · August 24, 2011, 12:09pm

I have an output like this now

Site: Ridge Hill
yyyy mm dd hh mm ____CO2________SD_______CH4_______SD_______H20_______SD__     
2011 06 09 15 38 4.65205e+02 71.0168 2.03040e+00 0.1782 8.55509e-01 0.0674
2011 06 09 16 38 4.28448e+02 24.3320 1.99402e+00 0.0875 8.19965e-01 0.0448
2011 06 09 17 38 4.20397e+02 19.5077 2.11204e+00 0.0826 8.88530e-01 0.0407
2011 06 09 18 38 4.13831e+02 15.8686 2.20515e+00 0.0852 9.22705e-01 0.0367
2011 06 09 19 38 4.09194e+02 15.5070 2.27311e+00 0.0851 9.36471e-01 0.0350

Problem is I want the data to start from the nearest whole hour (so starts at 14:00). With the averaging script and SD script is there a way of telling awk to start calculating from the nearest hour?

averaging script

NR==1{
  gsub(" +","\t")
  print
}
NR>1&&(NR-1)%720{
  for (i=3;i<=NF;i++){
    a+=$i
  }
}
NR>1&&!((NR-2)%720){
  t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%720){
  printf t
  for (i=3;i<=NF;i++){
    printf "%.5e\t",a/720
    a=0
  }
  printf "\n"
}

SD script

function std_dev(data, count) {
    sum=0;
    for( x=1; x <= count; x++) {
        sum += data[x];
    }
    avg = sum/count;
    sumsq=0;
    for( x=1; x <= count; x++) {
        sumsq += (data[x] - avg)^2;
    }
    return sprintf ("%.4f", sqrt(sumsq/count));
}


BEGIN {
    cnt = 0;
}
END {
    std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
   array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
   if (cnt==720) {
      print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
      cnt=0;
   }

Thanks

bartus11 · August 24, 2011, 12:40pm

I think you mean it should start averaging from 15:00:00, cause there is no data for 14:00 to 14:38 in your sample file. Anyway, try this on your file:

awk 'NR==1;$2~"15:00:00"{p=1}p' testfile.txt > testfile2.txt

Then just run previous code without any modifications on testfile2.txt.