AWK script for standard deviation / root mean square deviation

chrisjorg · January 12, 2012, 11:40am

I have a file with say 50 columns, each containing a whole lot of data.

Each column contains data from a separate simulation, but each simulation is related to the data in the last (REFERENCE) column $50

I need to calculate the RMS deviation for each data line, i.e. column 1 relative to column 50, column 2 relative to column 50, etc. and my expected outcome with be a column 51 containing the RMSD for each line.

#!/usr/bin/awk

      BEGIN { s=0;n=0 }
                   { n++; s=s+(($n)^2-($50)^2) }
      END
      { print sqrt(s/50) }

But this is not helping, there is a syntax error,

any help is appreciated!

Corona688 · January 12, 2012, 11:54am

It shows what syntax error? :wall:

Please show a sample of input data and the output you'd want for it.

chrisjorg · January 12, 2012, 12:03pm

2.91187 4.25656 7.3225   ..... until column 50 
3.4187   2.67656 6.3225
3.54117 6.27656 4.3225
5.61187 6.27656 2.3225
....          ...           ....

The output should just be a column of numbers where each entry represents the calculated root-mean-square deviation (RMSD) of each entry of the 50 columns, i.e. (here I am just giving you random numbers)

RMSD
4.31185 
3.4185   
2.64115
4.71183 
....

The error on running the script as it is is

awk -f rmsd merge.pmf 
awk: syntax error at source line 6 source file rmsd
 context is
          END  >>> 
 <<< 
awk: bailing out at source line 7

Corona688 · January 12, 2012, 12:21pm

Given that awk script doesn't even have 7 lines, I'm not sure what to tell you about the syntax error, but if you want a standard deviation for each column, you're going to need to loop over each column, awk's behavior only loops over lines...

You also don't usually need to bother setting s=0 before you start and such. A blank variable + 1 gracefully becomes 1 the same way 0+1 becomes 1.

Also, since you need to subtract the average from every single number to calculate a standard deviation, you need all the numbers twice and might as well just store everything to calculate in an END{} block.

Working on something.

---------- Post updated at 11:21 AM ---------- Previous update was at 11:10 AM ----------

NF {    LINE++; MAX=NF; for(N=1; N<=NF; N++) A[LINE,N]=$N       }
END     {
        # Calculate sum for each column and row
        for(COL=1; COL<=MAX; COL++)
        for(ROW=1; ROW<=LINE; ROW++)
                AVG[COL]+=A[ROW,COL];

        # Turn the sums into averages
        for(COL=1; COL<=MAX; COL++) AVG[COL] /= LINE

        # Calculate deviations for each column and row
        for(COL=1; COL<=MAX; COL++)
        for(ROW=1; ROW<=LINE; ROW++)
                DEV[COL]+=(AVG[COL]-A[ROW,COL])^2

        # Divide sum by number of lines, then give the square root.
        for(COL=1; COL<=MAX; COL++)     print sqrt(DEV[COL]/LINE);
}

chrisjorg · January 12, 2012, 1:03pm

Thanks!

Yes, I want a standard deviation of all data relative to the last column of data which I suppose represents the 'average' or 'reference'.

That is a huge help with the script. Thanks a lot for now, I will try to see if it works.
It is true, I want to parse over each line in the columns,
so e.g. line 1 of column 1 relative to line 1 in column 50, line 2 in column 1 relative to live 2 in column 50, etc. etc. then line 1 in column 2 relative to line 1 in column 50, etc.....

---------- Post updated at 01:03 PM ---------- Previous update was at 12:37 PM ----------

Ok maybe I can clarify.

The last (50th) column already constitutes the average,
so I want to subtract each column of data to the 50th to
look at how much the data deviates from the reference column.

Corona688 · January 12, 2012, 1:40pm

Which average? If you have 49 different data columns, wouldn't you need 49 different averages?

What would help a whole lot more is a sample of your input data. And labels. And a sample of your output data.

chrisjorg · January 12, 2012, 3:29pm

Ok,

let me clarify.

I want to parse each *line* individually, I should have said this earlier.
So if there are 2000 lines, line 1 is different from line 2 etc.

There are 50 columns of data. All the data has to be compared to the *last* column, which is special. Therefore I am performing an RSMD calculation which is not exactly the same as a standard deviation, because the 'average' is that 50th column of data.

Let us for a moment forget there are e.g. 2000 lines of data. Let us imagine there is only 1

my data looks something like this:

2.91187  2.27656  3.3225  2.33938 2.55781 3.05656 2.66063 2.02781... ... 2.31219

where 2.31319 would represent the 50th column.

Ok, so I want to do the following

sqrt( ([2.31319-2.91187]^2 + [2.31319-2.27656]^2 + [2.31319-3.3225]^2 + [2.31319-2.33938]^2 [2.31319-2.55781]^2 ... [2.31319-2.31319]^2) /50 )

or in words

sqrt( ([col.50-col1]^2 + [col.50 - col.2]^2 + [col.50 - col.3]^2 + ... + [col.50 -col.50]^2 ) / 50 )

UPDATE
and yes, you are right, because I parse each line separately, if there are 2000 lines, I will want to end up with 2000 RMSD values lined up in a column.

---------- Post updated at 03:28 PM ---------- Previous update was at 02:08 PM ----------

set mean  = `awk '{++n;sum+=$NF} END{if(n) print sum/n}' slice.txt`

set rmsd  = `awk -v mean=$mean '{++n;sum+=($NF-mean)^2} END{if(n) print sqrt(sum/n)}' slice.txt`

Maybe something like that? But I need to be able to distinguish between columns and lines.

---------- Post updated at 03:29 PM ---------- Previous update was at 03:28 PM ----------

Root-mean-square deviation - Wikipedia, the free encyclopedia

vgersh99 · January 12, 2012, 3:36pm

nawk '{s=0;for(i=1;i<=NF;i++) s+=($NF-$i)^2;print sqrt(s)}' slice.txt

chrisjorg · January 13, 2012, 6:45am

Thanks,
guys this solved the issue.

Am very grateful for your patience, and your support.

X

---------- Post updated 01-13-12 at 05:58 AM ---------- Previous update was 01-12-12 at 04:37 PM ----------

Hi guys,
a little extra question,

what if I wanted the program to work in the reverse manner,

parse ONLY two columns per iteration, e.g. column 1 ($1) and column 50 ($50) and subtract value 1 in column 1 from value 1 in column 50, etc. until I calculate one RMSD number.

THEN move onto column 2 ($2) and column 50 ($50) etc. and obtain a second RMSD number.
etc etc until it is column 50 ($50) vs. column 50 ($50).

?

Currently my script parses each line separately, and it is great, but I want to try the other way around.

---------- Post updated at 06:09 AM ---------- Previous update was at 05:58 AM ----------

so instead of NF I thought of using NR to go down the column instead of across?

---------- Post updated at 06:45 AM ---------- Previous update was at 06:09 AM ----------

So for instance,

If I wanted to obtain the RMSD between column 1 and column 50 I wrote

{s=0;NF==1;for(i=1;i<=NR;i++)} s+=($NF-$50)^2;print sqrt(s/NR)}

but I get an error:

awk -f rmsd4 merge.pmf > test2
awk: syntax error at source line 2 source file rmsd4
 context is
    {s=0;NF==1;for(i=1;i<=NR;i++)} >>>  s+=($NR-$36)^2;print <<<  sqrt(s/NR)}
    extra }
awk: bailing out at source line 4

Corona688 · January 13, 2012, 11:01am

If you could post a sample of your data, a sample of what output you want, and show your calculations that would be much, much better than trying to explain an algorithm in casual English. It's a little more work for you, I understand, but it also means a much greater chance of being understood on the first try. And being all you need is a single line of data to demonstrate, it really isn't so awful. It may also help you get it organized better and see an algorithm yourself.

I don't understand what you're getting at now, and since your code doesn't work, it's not a good demonstration either.

I think I see your syntax error -- an extra bracket:

{s=0;NF==1;for(i=1;i<=NR;i++)} s+=($NF-$50)^2;print sqrt(s/NR)}

There's logic errors too, though.

The statement in blue, though, what did you intend that to do? Right now it's a complete no-op.

NR is the number of lines(records), not the number of fields, I think you want (i=1;i<=NF;i++) and sqrt(s/NF)

I also think that's off by one, since that will include the last column, the average itself, so:

(i=1;i<NF;i++) and sqrt(s/(NF-1))

And then, that thing in green. Since NF doesn't change until the number of columns does, this is always adding the same columns: ($NF-$50) In fact if you have 50 columns, $NF will be column 50, causing the result to always be zero!

vgersh99 · January 13, 2012, 11:18am

I think the OP wants to do the calculations based on columns, not based on rows as initially stated. But I fail to understand the algorithm entirely - using 'NR' instead of 'NF' is not going to achieve the desired results.
One would need to parse all lines and build up a 2-D matrix and navigate it by columns performing the desired calculation.
Once again, this is all somewhat nebulous for me.
If the OP could demonstrate what he/she/it is after given a simple table (say 3x3 - not necessarily Nx50)

1 2 3
4 5 6
7 8 9

and a sample output - that would be helpful.
[/code]

chrisjorg · January 16, 2012, 5:03am

Hi,
ok,
I understand I was not clear enough.

BEGIN {s_0=0;n_0=0}
      {n_0++;s_0+=($50-$1)^2}
END {print sqrt(s_0/n_0)}

BEGIN {s_1=0;n_1=0}
      {n_1++;s_1+=($50-$2)^2}
END {print sqrt(s_1/n_1)}
...
BEGIN {s_50=0;n_50=0}
      {n_50++;s_50+=($50-$50)^2}
END {print sqrt(s_50/n_50)}

I have written a code to work on a document with 50 columns of data to calculate the RMSD of each column of data relative to the last column of data.

The script is pretty primitive, and involves 50 different variables. Is there a way I could make it automatic by simply incrementing the n_x variable instead of having to define the 50 variables manually. Also, instead of defining the last column as $50 how do I do it with an $NF?

Corona688 · January 18, 2012, 11:30am

How about that sample input and output?