AWK: RMSD script

chrisjorg · February 1, 2012, 11:10am

Here is my AWK script to find the root mean square deviation (RMSD) of a data set stored in a 51-column file. Each column is evaluated *relative* to the last column ($51).
How can I get AWK to detect the number of columns automatically and use the last one as the reference? In other words, is there a way to make the script less repetitive?

Currently I also rely on 51 variables: s_0, s_1, s_2, ..., s_50.
How could I write these as a single incrementing index too? I really want to make the script more logical.

BEGIN {s_0=0;n_0=0}
      {n_0++;s_0+=($51-$1)^2}
END {print sqrt(s_0/n_0)}

BEGIN {s_1=0;n_1=0}
      {n_1++;s_1+=($51-$2)^2}
END {print sqrt(s_1/n_1)}

BEGIN {s_2=0;n_2=0}
      {n_2++;s_2+=($51-$3)^2}
END {print sqrt(s_2/n_2)}

BEGIN {s_3=0;n_3=0}
      {n_3++;s_3+=($51-$4)^2}
END {print sqrt(s_3/n_3)}

BEGIN {s_4=0;n_4=0}
      {n_4++;s_4+=($51-$5)^2}
END {print sqrt(s_4/n_4)}

BEGIN {s_5=0;n_5=0}
      {n_5++;s_5+=($51-$6)^2}
END {print sqrt(s_5/n_5)}

BEGIN {s_6=0;n_6=0}
      {n_6++;s_6+=($51-$7)^2}
END {print sqrt(s_6/n_6)}

BEGIN {s_7=0;n_7=0}
      {n_7++;s_7+=($51-$8)^2}
END {print sqrt(s_7/n_7)}

BEGIN {s_8=0;n_8=0}
      {n_8++;s_8+=($51-$9)^2}
END {print sqrt(s_8/n_8)}

BEGIN {s_9=0;n_9=0}
      {n_9++;s_9+=($51-$10)^2}
END {print sqrt(s_9/n_9)}

BEGIN {s_10=0;n_10=0}
      {n_10++;s_10+=($51-$11)^2}
END {print sqrt(s_10/n_10)}

BEGIN {s_11=0;n_11=0}
      {n_11++;s_11+=($51-$12)^2}
END {print sqrt(s_11/n_11)}

BEGIN {s_12=0;n_12=0}
      {n_12++;s_12+=($51-$13)^2}
END {print sqrt(s_12/n_12)}

BEGIN {s_13=0;n_13=0}
      {n_13++;s_13+=($51-$14)^2}
END {print sqrt(s_13/n_13)}

BEGIN {s_14=0;n_14=0}
      {n_14++;s_14+=($51-$15)^2}
END {print sqrt(s_14/n_14)}

BEGIN {s_15=0;n_15=0}
      {n_15++;s_15+=($51-$16)^2}
END {print sqrt(s_15/n_15)}

BEGIN {s_16=0;n_16=0}
      {n_16++;s_16+=($51-$17)^2}
END {print sqrt(s_16/n_16)}

BEGIN {s_17=0;n_17=0}
      {n_17++;s_17+=($51-$18)^2}
END {print sqrt(s_17/n_17)}

BEGIN {s_18=0;n_18=0}
      {n_18++;s_18+=($51-$19)^2}
END {print sqrt(s_18/n_18)}

BEGIN {s_19=0;n_19=0}
      {n_19++;s_19+=($51-$20)^2}
END {print sqrt(s_19/n_19)}

BEGIN {s_20=0;n_20=0}
      {n_20++;s_20+=($51-$21)^2}
END {print sqrt(s_20/n_20)}

BEGIN {s_21=0;n_21=0}
      {n_21++;s_21+=($51-$22)^2}
END {print sqrt(s_21/n_21)}

.......
.......
BEGIN {s_49=0;n_49=0}
      {n_49++;s_49+=($51-$50)^2}
END {print sqrt(s_49/n_49)}

BEGIN {s_50=0;n_50=0}
      {n_50++;s_50+=($51-$51)^2}
END {print sqrt(s_50/n_50)}

The script outputs 51 RMSD values, one for each column.

jim_mcnamara · February 1, 2012, 11:33am

awk ' 
{
   for(i=1; i<NF; i++)  # from 1..50
   {  n=1               # same as n=0; n++
      arr=sqrt( ( ($51 - $i)^2 )/n)  # you could use xxx /1 as well
      print arr;
   }
 } '  inputfile

This does what you coded, but uses a loop over the fields and a single variable, arr, instead of a separate variable per column.

chrisjorg · February 1, 2012, 11:53am

Fantastic!
And how could I get the program to detect the number of the last column?

Would that be $NF instead of $51, in case my file has more or fewer columns?
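
On the NF question: in AWK, NF holds the number of fields in the current record, and $NF is the value of the last field, so $NF can stand in for $51 whatever the column count is. A minimal sketch, where the function name show_last_field and the sample numbers are made up for illustration:

```shell
# Hypothetical helper: for every input line, print the field count (NF)
# followed by the value of the last field ($NF).
show_last_field() {
    awk '{ print NF, $NF }'
}

# Two rows with different widths; $NF follows the last column in both.
printf '10 20 30\n1 2\n' | show_last_field
# prints:
# 3 30
# 2 2
```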

balajesuri · February 1, 2012, 12:05pm

Isn't RMSD = sqrt((sum((x - xavg)^2) / (n - 1)) / n)?

Input.

$ cat input
44 36 52 26 13 63 88 29 25 98 59 35 93 75 75 85 33 61 66 3 62 75 12 19 11 11 72 94 65 45 45 65 20 18 50 20 10 62 63 40 12 54 71 75 69 4 80 50 45 68 61
55 0 71 79 83 3 39 62 4 60 34 43 57 46 18 88 64 39 84 87 39 94 18 63 71 66 53 86 92 52 86 29 10 84 19 92 38 53 39 50 61 55 31 10 19 4 21 91 69 5 32

Perl.

#! /usr/bin/perl -w
use strict;

my (@num, $avg, $ss);

open I, "< input";
for (<I>) {
    @num = split / /;
    $avg = average (@num);
    for (@num) {
        $ss += (($_ - $avg) ** 2);
    }
    print sqrt (($ss / (@num - 1)) / @num); print "\n";
}
close I;

sub average {
    my $s;
    for (@_) { $s += $_ }
    return $s / @_;
}

Output.

$ ./test.pl
3.70366776863254
5.42950063304012
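
For comparison, the two quantities under discussion can be written out side by side. This is only a sketch, and the notation is assumed here rather than taken from the posts: x_{r,j} is the value in row r and column j, c is the reference column, N is the number of rows, and n is the number of values in one row. The first expression is what the original AWK script accumulates for column j. The second is what the Perl script computes for a single row, which is the standard error of that row's mean rather than a deviation from a reference column.

```latex
% RMSD of column j against the reference column c, over N rows
\mathrm{RMSD}_j = \sqrt{\frac{1}{N}\sum_{r=1}^{N}\left(x_{r,c} - x_{r,j}\right)^2}

% Quantity computed by the Perl script for one row of n values
\mathrm{SE} = \sqrt{\frac{1}{n}\cdot\frac{1}{n-1}\sum_{k=1}^{n}\left(x_k - \bar{x}\right)^2},
\qquad \bar{x} = \frac{1}{n}\sum_{k=1}^{n} x_k
```

Note also that, as posted, $ss is declared once outside the loop and never reset, so the sum of squares from the first input line carries over into the second.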

chrisjorg · February 1, 2012, 1:29pm

There is a problem with your script, though:

{
   for(i=1; i<NF; i++)  # from 1..50
   {  n=1               # same as n=0; n++
      arr=sqrt( ( ($51 - $i)^2 )/n)  # you could use xxx /1 as well
      print arr;
   }
 }

My old script would evaluate all the data in a column relative to $51 and output only a *single* RMSD value for that column, i.e. I was left with one RMSD value per column. Yours outputs thousands of lines of RMSDs, so I suspect we are not doing the same thing.
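
One way to get a single RMSD per column is to accumulate the squared differences in an array indexed by column number, and to print only in the END block once all rows have been read. The following is a sketch under assumptions rather than a drop-in replacement: the function name rmsd_per_column and the two-row sample are hypothetical, the reference is taken to be the last field of each row ($NF), and every row is assumed to have the same number of fields.

```shell
# Hypothetical helper: reads whitespace-separated rows on stdin and prints
# one RMSD per column, each measured against the last column of the same row.
rmsd_per_column() {
    awk '
    {
        # Add this row to the running sum of squared differences per column.
        for (i = 1; i <= NF; i++)
            s[i] += ($NF - $i)^2
        # Remember the widest row seen (assumed constant in practice).
        if (NF > ncols) ncols = NF
    }
    END {
        if (NR == 0) exit          # no input: avoid dividing by zero
        for (i = 1; i <= ncols; i++)
            print sqrt(s[i] / NR)  # NR is the number of rows read
    }'
}

# Made-up sample: 2 rows, 3 columns, last column is the reference.
printf '1 3 4\n7 5 4\n' | rmsd_per_column
# prints:
# 3
# 1
# 0
```

For a real file this could be called as rmsd_per_column < inputfile. The last value printed is always 0, because it compares the reference column with itself, just as the s_50 block in the original script does; changing the END loop condition to i < ncols would drop it.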