Hi, I want a quick way to determine the Pearson correlation between two files. The two files have the same format, with only the 3rd column varying.
Example of file 1:
chr1 0 62
chr1 1 260
chr1 2 474
chr1 3 562
chr1 4 633
chr1 5 870
chr1 6 931
chr1 7 978
chr1 8 1058
chr1 9 1151
Example of file 2:
chr1 0 76
chr1 1 455
chr1 2 806
chr1 3 914
chr1 4 986
chr1 5 1391
chr1 6 1484
chr1 7 1563
chr1 8 1705
chr1 9 1859
So I want to know the correlation between column 3 of the two files.
Thanks
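#!/usr/bin/perl
use strict;
use warnings;

# Read the third whitespace-separated column of each file.
my (@f1_data, @f2_data);
open(my $f1, '<', 'file1') or die "Cannot open file1: $!";
while (<$f1>) {
    next unless /\S/;    # skip blank lines
    push @f1_data, (split ' ')[2];
}
close $f1;
open(my $f2, '<', 'file2') or die "Cannot open file2: $!";
while (<$f2>) {
    next unless /\S/;    # skip blank lines
    push @f2_data, (split ' ')[2];
}
close $f2;
die "No data read from file1\n" unless @f1_data;
die "file1 and file2 have different numbers of rows\n"
    unless @f1_data == @f2_data;

# Mean and population standard deviation of each column.
my ($x_bar, $x_sd) = avg_sd(@f1_data);
my ($y_bar, $y_sd) = avg_sd(@f2_data);
die "Correlation is undefined when a column is constant\n"
    if $x_sd == 0 || $y_sd == 0;

# Sum of the products of the deviations from the two means.
my $numerator = 0;
for my $i (0 .. $#f1_data) {
    $numerator += ($f1_data[$i] - $x_bar) * ($f2_data[$i] - $y_bar);
}

# Pearson's r: covariance divided by the product of the standard deviations.
my $r = $numerator / (@f1_data * $x_sd * $y_sd);
print "$r\n";

# Return the mean and population standard deviation of a list of numbers.
sub avg_sd {
    my @data = @_;
    my ($sum, $sum_of_sq) = (0, 0);
    $sum += $_ for @data;
    my $avg = $sum / @data;
    $sum_of_sq += ($_ - $avg) ** 2 for @data;
    my $sd = sqrt($sum_of_sq / @data);
    return ($avg, $sd);
}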
For the two sample inputs above, saved as file1 and file2, the correlation coefficient is 0.999125083532687.
By the way, if you only have a few data points, I'd suggest using a scientific calculator. I used a Casio FX 991 MS back in college, and I still have it. Masterpiece.
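For reference, this is the formula the script above works out, written as a sketch in my own notation (the symbols are not from the script): x_i and y_i are the third-column values of file1 and file2, n is the number of rows, and the standard deviations are the population ones, dividing by n as avg_sd does. The standard deviation of y is defined the same way as that of x.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \, \sigma_x \, \sigma_y},
\qquad
\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```

Dividing by n - 1 everywhere instead (the sample version) should give the same r, since the factor cancels between numerator and denominator.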