gd9629
August 19, 2011, 5:48am
1
Hi all,
I need to find the standard deviation of each column of a dataset below for each hour. The data is given in 5 second intervals as shown below
DATE TIME FRAC_DAYS_SINCE_JAN1 FRAC_HRS_SINCE_JAN1 EPOCH_TIME ALARM_STATUS species solenoid_valves MPVPosition OutletValve CavityPressure CavityTemp WarmBoxTemp EtalonTemp DasTemp CO2_sync CO2_dry_sync CH4_sync CH4_dry_sync H2O_sync
2011-06-25 08:03:20.000 175.33564815 4208.055556 1308989000.000 0 1.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1033031768E+004 1.3998939700E+002 4.4999687195E+001 4.4999909923E+001 4.4982554694E+001 3.1751035912E+001 3.8940279934E+002 3.9515856123E+002 3.8618224512E+000 3.9114643916E+000 9.4040946797E-001
2011-06-25 08:03:25.000 175.33570602 4208.056944 1308989005.000 0 2.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1036267092E+004 1.3999910014E+002 4.4999724978E+001 4.5000055343E+001 4.4982432169E+001 3.1750000000E+001 3.8935193355E+002 3.9511128829E+002 3.8489583767E+000 3.8990622477E+000 9.4172965077E-001
2011-06-25 08:03:30.000 175.33576389 4208.058333 1308989010.000 0 3.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1035811523E+004 1.4000220783E+002 4.4999819829E+001 4.5000161391E+001 4.4982414246E+001 3.1795673077E+001 3.8938861262E+002 3.9514732622E+002 3.8595829527E+000 3.9062814635E+000 9.3441447742E-001
2011-06-25 08:03:35.000 175.33582176 4208.059722 1308989015.000 0 2.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1036211602E+004 1.4000074680E+002 4.4999892989E+001 4.5000205994E+001 4.4982414246E+001 3.1750000000E+001 3.8945211683E+002 3.9517328822E+002 3.8841252351E+000 3.9451982461E+000 9.3417578630E-001
2011-06-25 08:03:40.000 175.33587963 4208.061111 1308989020.000 0 1.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1036852113E+004 1.3999663742E+002 4.4999757503E+001 4.5000099772E+001 4.4982477978E+001 3.1793508287E+001 3.8936261353E+002 3.9511123990E+002 3.8395527261E+000 3.8931575825E+000 9.4752583244E-001
2011-06-25 08:03:45.000 175.33593750 4208.062500 1308989025.000 0 2.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1032849098E+004 1.3998770158E+002 4.4999748230E+001 4.5000053406E+001 4.4982505798E+001 3.1750000000E+001 3.8931505434E+002 3.9512509568E+002 3.8351808367E+000 3.8599191928E+000 9.5367870751E-001
2011-06-25 08:03:50.000 175.33599537 4208.063889 1308989030.000 0 1.0000000000E+000 0.0000000000E+000 0.0000000000E+000 3.1039831891E+004 1.4001021926E+002 4.5000107359E+001 4.4999785829E+001 4.4982363833E+001 3.1750000000E+001 3.8929289686E+002 3.9508579576E+002 3.8474802185E+000 3.8609535036E+000 9.4079656257E-001
As each line 5 seconds, each hour would represent 720 lines! Basically need to find the SD each 720 lines.
I've worked out how to find the SD of a column (column 6 here)
awk 'NR>2 {sum+=$6; array[NR]=$6} END {for(x=1;x<=NR;x++){sumsq+=((array[x]-(sum/NR))^2);}print sqrt(sumsq/NR)}' INPUTFILE.dat
but instead of the whole column I need every hour (every 720 lines).
I've attached a file containing 3 hours worth of data if needed
Thanks a lot!
Give this a try:
#!/usr/bin/awk
function std_dev(data, count) {
sum=0;
for( x=1; x <= count; x++) {
sum += data[x];
}
avg = sum/count;
sumsq=0;
for( x=1; x <= count; x++) {
sumsq += (data[x] - avg)^2;
}
print sqrt(sumsq/count);
}
BEGIN {
cnt = 0;
}
END {
std_dev(array, cnt);
}
{
array[cnt++] = $7;
if (cnt == 720) {
std_dev(array, cnt);
cnt = 0;
}
}
gd9629
August 19, 2011, 10:03am
3
Hi Roboticus,
Thanks for the help. Am I right in thinking this will give me the hourly SD's for column 7? How would I get the hourly SD's of say columns 17 19 and 20 and place them next to each other?
I'd create an array for each of the columns, something like:
{
array7[cnt]=$7; array8[cnt]=$8; array9[cnt]=$9; cnt++;
if (cnt==720) {
print std_dev(array7,cnt) ", " std_dev(array8,cnt)
", " std_dev(array9,cnt);
}
}
But you'd need to modify std_dev to return the value instead of printing it, update the END block, etc.
(I'd give it a go right now, but am having production issues. If you're still fighting it tomorrow, I can tune it up some.)
--roboticus
gd9629
August 22, 2011, 11:19am
5
Hi Roboticus,
I've tried combining the codes you've given me into
function std_dev(data, count) {
sum=0;
for( x=1; x <= count; x++) {
sum += data[x];
}
avg = sum/count;
sumsq=0;
for( x=1; x <= count; x++) {
sumsq += (data[x] - avg)^2;
}
print sqrt(sumsq/count);
}
BEGIN {
cnt = 0;
}
END {
std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
{
array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
if (cnt==720) {
print std_dev(array17, cnt) " " std_dev(array19, cnt)
" " std_dev(array20, cnt);
}
}
Unfortunately this just gives me a few numbers, the meaning of which I'm unsure lol
71.1077
0.178527
0.0675481
38.6887
0.825408
0.140382
How should I edit the script further to give the 3 separate columns of SD values?
Thanks!
---------- Post updated at 04:19 PM ---------- Previous update was at 09:58 AM ----------
Hmm I've tried this now
BEGIN {
cnt = 0;
}
END {
std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
{
array17[cnt++] = $17; array19[cnt++] = $19; array20[cnt++] = $20;
if (cnt==720) {
print std_dev(array17, cnt) " " std_dev(array19, cnt)
" " std_dev(array20, cnt);
cnt = 0;
}
}
and get this as my results
272.963
1.14211
0.477465
251.055
1.05022
0.446297
232.296
1.00244
0.43713
222.966
0.970351
0.412753
219.27
0.957796
0.404489
216.511
0.948255
Any ideas?
I've fixed your code a bit:
function std_dev(data, count) {
sum=0;
for( x=1; x <= count; x++) {
sum += data[x];
}
avg = sum/count;
sumsq=0;
for( x=1; x <= count; x++) {
sumsq += (data[x] - avg)^2;
}
return sqrt(sumsq/count);
}
BEGIN {
cnt = 0;
}
END {
std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
if (cnt==720) {
print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
cnt=0;
}
}
Run it as: awk -f script.awk testfile.txt
If you need times to be printed before those values, I might come up with something (in an hour or so).
1 Like
gd9629
August 23, 2011, 4:42am
7
Bartus I do believe you're a magician. If I want to combine two files
File 1
Site: Ridge Hill
yyyy mm dd hh CO2 CH4 H20
2011 06 09 14 5.2977498477e+02 2.2533536482e+00 9.5975030698e-01
2011 06 09 15 4.6520528305e+02 2.0304031747e+00 8.5550947898e-01
2011 06 09 16 4.2844761381e+02 1.9940230591e+00 8.1996490325e-01
2011 06 09 17 4.2039731519e+02 2.1120403967e+00 8.8853008313e-01
2011 06 09 18 4.1383091367e+02 2.2051511063e+00 9.2270478393e-01
2011 06 09 19 4.0919424013e+02 2.2731077128e+00 9.3647077771e-01
2011 06 09 20 4.0612124764e+02 2.3469439802e+00 9.3830980979e-01
2011 06 09 21 4.0448877229e+02 2.4314198862e+00 9.3857043465e-01
2011 06 09 22 4.0341963267e+02 2.5037286957e+00 9.3017004739e-01
2011 06 09 23 4.0258560618e+02 2.5635443249e+00 9.2232603099e-01
File 2 (3 columns with all these values to 4.d.p)
71.0168 0.178157 0.0674305
24.332 0.087471 0.0448028
19.5077 0.082566 0.0407083
15.8686 0.0851615 0.0366714
15.507 0.0850516 0.0349926
15.3 0.0870619 0.0349696
15.1607 0.0904091 0.0350349
15.0886 0.0938745 0.0350621
15.0478 0.0954998 0.0348352
15.0147 0.0965944 0.0344979
Desired Output
Site: Ridge Hill
yyyy mm dd hh CO2 CH4 H20
2011 06 09 14 5.2977498477e+02 71.0168 2.2533536482e+00 0.178157 9.5975030698e-01 0.0674305
2011 06 09 15 4.6520528305e+02 24.332 2.0304031747e+00 0.087471 8.5550947898e-01 0.0448028
2011 06 09 16 4.2844761381e+02 19.5077 1.9940230591e+00 0.082566 8.1996490325e-01 0.0407083
2011 06 09 17 4.2039731519e+02 15.8686 2.1120403967e+00 0.0851615 8.8853008313e-01 0.0366714
2011 06 09 18 4.1383091367e+02 15.507 2.2051511063e+00 0.0850516 9.2270478393e-01 0.0349926
2011 06 09 19 4.0919424013e+02 15.3 2.2731077128e+00 0.0870619 9.3647077771e-01 0.0349696
2011 06 09 20 4.0612124764e+02 15.1607 2.3469439802e+00 0.0904091 9.3830980979e-01 0.0350349
2011 06 09 21 4.0448877229e+02 15.0886 2.4314198862e+00 0.0938745 9.3857043465e-01 0.0350621
2011 06 09 22 4.0341963267e+02 15.0478 2.5037286957e+00 0.0954998 9.3017004739e-01 0.0348352
2011 06 09 23 4.0258560618e+02 15.0147 2.5635443249e+00 0.0965944 9.2232603099e-01 0.0344979
Can I use the 'join' command? Or would it be better to use the 'FNR==NR' or perhaps something different?
Thanks
awk 'NR==FNR{a[NR+2]=$1;b[NR+2]=$2;c[NR+2]=$3;next}NR>2{$5=$5" "a[FNR];$6=$6" "b[FNR];$7=$7" "c[FNR]}1' file2 file1
1 Like
gd9629
August 23, 2011, 6:14am
9
My love knows no boundaries for you my friend. Just one final quick thing, for file2, the standard deviation code you gave me before, how would I amend it so that all values were to 4 decimal places?
function std_dev(data, count) {
sum=0;
for( x=1; x <= count; x++) {
sum += data[x];
}
avg = sum/count;
sumsq=0;
for( x=1; x <= count; x++) {
sumsq += (data[x] - avg)^2;
}
return sprintf ("%.4f", sqrt(sumsq/count));
}
BEGIN {
cnt = 0;
}
END {
std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
if (cnt==720) {
print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
cnt=0;
}
}
1 Like
gd9629
August 24, 2011, 12:09pm
11
I have an output like this now
Site: Ridge Hill
yyyy mm dd hh mm ____CO2________SD_______CH4_______SD_______H20_______SD__
2011 06 09 15 38 4.65205e+02 71.0168 2.03040e+00 0.1782 8.55509e-01 0.0674
2011 06 09 16 38 4.28448e+02 24.3320 1.99402e+00 0.0875 8.19965e-01 0.0448
2011 06 09 17 38 4.20397e+02 19.5077 2.11204e+00 0.0826 8.88530e-01 0.0407
2011 06 09 18 38 4.13831e+02 15.8686 2.20515e+00 0.0852 9.22705e-01 0.0367
2011 06 09 19 38 4.09194e+02 15.5070 2.27311e+00 0.0851 9.36471e-01 0.0350
Problem is I want the data to start from the nearest whole hour (so starts at 14:00). With the averaging script and SD script is there a way of telling awk to start calculating from the nearest hour?
averaging script
NR==1{
gsub(" +","\t")
print
}
NR>1&&(NR-1)%720{
for (i=3;i<=NF;i++){
a+=$i
}
}
NR>1&&!((NR-2)%720){
t=$1"\t"$2"\t"
}
NR>1&&!((NR-1)%720){
printf t
for (i=3;i<=NF;i++){
printf "%.5e\t",a/720
a=0
}
printf "\n"
}
SD script
function std_dev(data, count) {
sum=0;
for( x=1; x <= count; x++) {
sum += data[x];
}
avg = sum/count;
sumsq=0;
for( x=1; x <= count; x++) {
sumsq += (data[x] - avg)^2;
}
return sprintf ("%.4f", sqrt(sumsq/count));
}
BEGIN {
cnt = 0;
}
END {
std_dev(array17, cnt); std_dev(array19, cnt); std_dev(array20, cnt);
}
NR>1{
array17[cnt]=$17; array19[cnt]=$19; array20[cnt]=$20; cnt++;
if (cnt==720) {
print std_dev(array17, cnt) " " std_dev(array19, cnt) " " std_dev(array20, cnt);
cnt=0;
}
Thanks
I think you mean it should start averaging from 15:00:00, cause there is no data for 14:00 to 14:38 in your sample file. Anyway, try this on your file:
awk 'NR==1;$2~"15:00:00"{p=1}p' testfile.txt > testfile2.txt
Then just run previous code without any modifications on testfile2.txt.
1 Like