diagonal matrix to square matrix

Hello, all!

I am struggling with a short script to read a diagonal matrix for later retrieval.
1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
1.000 0.587 0.159 0.357 0.258 0.654 0.341
1.000 0.269 0.369 0.687 0.145 0.125
1.000 0.222 0.451 0.134 0.333
1.000 0.112 0.217 0.095
1.000 0.508 0.701
1.000 0.663
1.000

Actually this matrix is the correlation co-efficiency of the gene expression by microarray, so that half matrix contains the same information of the square matrix.

First, the matrix should be aligned with all the 1.000 at the diagonal, i.e.

[RIGHT]1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.050
1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
1.000 0.587 0.159 0.357 0.258 0.654 0.341
1.000 0.269 0.369 0.687 0.145 0.125
1.000 0.222 0.451 0.134 0.333
1.000 0.112 0.217 0.095
1.000 0.508 0.701
1.000 0.663
1.000[/RIGHT]
as each gene has 1.000 correlation coefficiency with itself.
Then, I want to get a square matrix to fill the missing half by Matrix[i][j]=Matrix[j] [i]e.g. Matrix[2][1]= matrix[1][2] etc. I only posted 10 out of 25,000 genes. The real file is a 25,000x25,000 square matrix.
With the sqaure matrix I can easily access any row or column for the co-efficiencies of each individual gene with the others of the genome.
Thanks a lot!

Yifang

Technically, unless I've forgotten my definitions, that is not actually a diagonal matrix. Please share your script and I'm sure someone can easily point out where you went wrong.

Something like this would probably be better done with perl, as you can put the whole matrix in memory.

If you still want to use a shell script, here is one. By the way, I think you meant a symmetric matrix, and not a diagonal matrix.

#!/bin/bash

PATH=/usr/bin:/bin
export PATH

# length of each number in the matrix
rlen=5

# Right-justify the matrix
awk '{ if (ne == "") { ne = NF; } indent = (NR - 1) * (rlen + 1); printf("%" indent "s", ""); print }' rlen="$rlen" > temp.$$

# Fill in the missing parts
cat -n temp.$$ | while read line; do
set -- $line

    \# Get the row number
    n="$1"

    \# Discard the row number and the 1.000 value
    shift 2

    \# Calculate the start and end positions of the column

# If you're using Bourne shell, you'll have to use expr or similiar.
s=$(( ( $n -1 ) * ( $rlen + 1 ) + 1 ))
e=$(( $s + $rlen - 1 ))

    \# Get the values of the column for the preceding rows in the matrix
    head -$n temp.$$ | cut -c$s-$e | tr '\\n' ' '

   \# Output the rest of the row from the input
    echo $*

done

# Clean up
rm -f temp.$$

Here's one way to do it in Perl:

$
$ # display the diagonal matrix
$ cat diagmtx.txt
1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
1.000 0.587 0.159 0.357 0.258 0.654 0.341
1.000 0.269 0.369 0.687 0.145 0.125
1.000 0.222 0.451 0.134 0.333
1.000 0.112 0.217 0.095
1.000 0.508 0.701
1.000 0.663
1.000
$
$ # display the contents of the perl program
$ cat convert.pl
#!/usr/bin/perl -w
@mtx = ();
$file=$ARGV[0];
open (F, $file) or die "Can't open $file: $!";
while (<F>){
  chomp;
  if ($. == 1){
    @x = split/ /;
    push @mtx, [ @x ];
    $num = $#x;
  } else {
    foreach $j (0..$.-2) {
      push @y, $mtx[$j][$.-1];
    }
    @x = split/ /;
    push @mtx, [ @y, @x ];
  }
  @y = ();
}
close(F) or die "Can't close $file: $!";
#
for($row = 0; $row <= $num; $row++) {
  for($col = 0; $col <= $num; $col++) {
    printf("%5.3f ",$mtx[$row][$col]);
  }
  print "\n";
}
$
$ # run the perl program
$ perl convert.pl diagmtx.txt
1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
0.234 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
0.435 0.111 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
0.123 0.412 0.205 1.000 0.587 0.159 0.357 0.258 0.654 0.341
0.012 0.115 0.542 0.587 1.000 0.269 0.369 0.687 0.145 0.125
0.102 0.058 0.335 0.159 0.269 1.000 0.222 0.451 0.134 0.333
0.325 0.091 0.054 0.357 0.369 0.222 1.000 0.112 0.217 0.095
0.412 0.190 0.117 0.258 0.687 0.451 0.112 1.000 0.508 0.701
0.087 0.045 0.203 0.654 0.145 0.134 0.217 0.508 1.000 0.663
0.098 0.058 0.125 0.341 0.125 0.333 0.095 0.701 0.663 1.000
$
$

Given below is a perl program to generate a diagonal matrix of random numbers:

$
$
$ # display the program to generate diagonal matrix
$ cat gendiagmtx.pl
#!/usr/bin/perl -w
# Short script to generate a "diagonal matrix".
# Usage: perl gendiagmtx.pl N
# where integer N > 1
# An example: for N = 5, the output is as follows:
# 1.000 x01 x02 x03 x04
# 1.000 x11 x12 x13
# 1.000 x21 x22
# 1.000 x31
# 1.000
# where xMN = some random decimal number.
#
$num = $ARGV[0];
$iter = $num;
foreach (0..$num-1){
  printf("%5.3f ",1);
  foreach (0..$iter-2){
    printf("%5.3f ",rand(1));
  }
  print "\n";
  $iter--;
}
$
$ # generate a diagonal matrix of order 15
$ perl gendiagmtx.pl 15 > diagmtx_15.txt
$ cat diagmtx_15.txt
1.000 0.364 0.566 0.624 0.643 0.340 0.399 0.074 0.560 0.140 0.056 0.393 0.281 0.374 0.300
1.000 0.656 0.449 0.795 0.504 0.688 0.025 0.934 0.126 0.863 0.320 0.754 0.728 0.237
1.000 0.328 0.132 0.362 0.867 0.290 0.129 0.881 0.595 0.714 0.274 0.072 0.921
1.000 0.748 0.031 0.858 0.991 0.774 0.378 0.028 0.546 0.817 0.990 0.810
1.000 0.887 0.331 0.841 0.626 0.830 0.523 0.555 0.727 0.023 0.713
1.000 0.514 0.327 0.304 0.764 0.192 0.805 0.386 0.794 0.494
1.000 0.615 0.712 0.942 0.013 0.296 0.844 0.701 0.973
1.000 0.488 0.004 0.064 0.690 0.876 0.151 0.872
1.000 0.403 0.305 0.048 0.997 0.181 0.113
1.000 0.658 0.419 0.032 0.445 0.902
1.000 0.521 0.492 0.506 0.331
1.000 0.641 0.599 0.546
1.000 0.928 0.820
1.000 0.802
1.000
$
$ # test the "convert.pl" program
$ perl convert.pl diagmtx_15.txt
1.000 0.364 0.566 0.624 0.643 0.340 0.399 0.074 0.560 0.140 0.056 0.393 0.281 0.374 0.300
0.364 1.000 0.656 0.449 0.795 0.504 0.688 0.025 0.934 0.126 0.863 0.320 0.754 0.728 0.237
0.566 0.656 1.000 0.328 0.132 0.362 0.867 0.290 0.129 0.881 0.595 0.714 0.274 0.072 0.921
0.624 0.449 0.328 1.000 0.748 0.031 0.858 0.991 0.774 0.378 0.028 0.546 0.817 0.990 0.810
0.643 0.795 0.132 0.748 1.000 0.887 0.331 0.841 0.626 0.830 0.523 0.555 0.727 0.023 0.713
0.340 0.504 0.362 0.031 0.887 1.000 0.514 0.327 0.304 0.764 0.192 0.805 0.386 0.794 0.494
0.399 0.688 0.867 0.858 0.331 0.514 1.000 0.615 0.712 0.942 0.013 0.296 0.844 0.701 0.973
0.074 0.025 0.290 0.991 0.841 0.327 0.615 1.000 0.488 0.004 0.064 0.690 0.876 0.151 0.872
0.560 0.934 0.129 0.774 0.626 0.304 0.712 0.488 1.000 0.403 0.305 0.048 0.997 0.181 0.113
0.140 0.126 0.881 0.378 0.830 0.764 0.942 0.004 0.403 1.000 0.658 0.419 0.032 0.445 0.902
0.056 0.863 0.595 0.028 0.523 0.192 0.013 0.064 0.305 0.658 1.000 0.521 0.492 0.506 0.331
0.393 0.320 0.714 0.546 0.555 0.805 0.296 0.690 0.048 0.419 0.521 1.000 0.641 0.599 0.546
0.281 0.754 0.274 0.817 0.727 0.386 0.844 0.876 0.997 0.032 0.492 0.641 1.000 0.928 0.820
0.374 0.728 0.072 0.990 0.023 0.794 0.701 0.151 0.181 0.445 0.506 0.599 0.928 1.000 0.802
0.300 0.237 0.921 0.810 0.713 0.494 0.973 0.872 0.113 0.902 0.331 0.546 0.820 0.802 1.000
$
$

HTH,
tyler_durden

Yes, you are right. The original matrix is NOT a diagonal one. It is upper half symmetric with 1.000 in all the "diagonal" positions.

---------- Post updated at 01:32 PM ---------- Previous update was at 01:30 PM ----------

Thanks, I need to digest your script first. I am just a newbe in shell script and PERL programming.

---------- Post updated at 01:50 PM ---------- Previous update was at 01:32 PM ----------

That's a great solution! Thanks you Tyler!

When I tried to convert my 25000x25000 matrix, I got the "Out of memory!" message and the program stopped. Another problem I noticed is, after I checked the original data, there is ID for each row, i.e.:
"244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
"243903_AT" 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
"244501_AT" 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
"254902_AT" 1.000 0.587 0.159 0.357 0.258 0.654 0.341
"247906_AT" 1.000 0.269 0.369 0.687 0.145 0.125
"242901_AT" 1.000 0.222 0.451 0.134 0.333
"243906_AT" 1.000 0.112 0.217 0.095
"244908_AT" 1.000 0.508 0.701
"294902_AT" 1.000 0.663
"245902_AT" 1.000

and the output square matrix should be like this:
"244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
"243903_AT" 0.234 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
"244501_AT" 0.435 0.111 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
"254902_AT" 0.123 0.412 0.205 1.000 0.587 0.159 0.357 0.258 0.654 0.341
"247906_AT" 0.012 0.115 0.542 0.587 1.000 0.269 0.369 0.687 0.145 0.125
"242901_AT" 0.102 0.058 0.335 0.159 0.269 1.000 0.222 0.451 0.134 0.333
"243906_AT" 0.325 0.091 0.054 0.357 0.369 0.222 1.000 0.112 0.217 0.095
"244908_AT" 0.412 0.190 0.117 0.258 0.687 0.451 0.112 1.000 0.508 0.701
"294902_AT" 0.087 0.045 0.203 0.654 0.145 0.134 0.217 0.508 1.000 0.663
"245902_AT" 0.098 0.058 0.125 0.341 0.125 0.333 0.095 0.701 0.663 1.000

Then I can retrieve each gene by grep the ID of the first column of each row. I should have posted this information first. Sorry about this. Thanks again Tyler!

another perl for your reference:

open FH,"<a.txt";
while(<FH>){
  my @tmp = split;
  $hash{$.}=[@tmp];
  if ($.==1){
   print;
  }
  else{
    for(my $i=1;$i<=$.-1;$i++){
       print $hash{$i}->[$.-$i]," ";
    }
    print $_;
  }
}

Here's a Perl solution for the type of data you posted:

$
$ cat diagmtx.txt
"244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
"243903_AT" 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
"244501_AT" 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
"254902_AT" 1.000 0.587 0.159 0.357 0.258 0.654 0.341
"247906_AT" 1.000 0.269 0.369 0.687 0.145 0.125
"242901_AT" 1.000 0.222 0.451 0.134 0.333
"243906_AT" 1.000 0.112 0.217 0.095
"244908_AT" 1.000 0.508 0.701
"294902_AT" 1.000 0.663
"245902_AT" 1.000
$
$ # show the Perl program
$ cat convert_2.pl
#!/usr/bin/perl -w
@mtx = ();
$file = $ARGV[0];
open (F,$file) or die "Can't open $file: $!";
while(<F>){
  @x = split;
  $id = shift @x;
  $therest = "@x\n";
  shift @x;
  push @mtx, [@x];
  if ($.==1) {
   print;
  } else {
    print $id," ";
    for($i=0;$i<=$.-2;$i++) {
      print $mtx[$i][0]," ";
      shift @{$mtx[$i]};
    }
    print $therest;
  }
}
close (F) or die "Can't close $file: $!";
$
$ perl convert_2.pl diagmtx.txt
"244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
"243903_AT" 0.234 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
"244501_AT" 0.435 0.111 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
"254902_AT" 0.123 0.412 0.205 1.000 0.587 0.159 0.357 0.258 0.654 0.341
"247906_AT" 0.012 0.115 0.542 0.587 1.000 0.269 0.369 0.687 0.145 0.125
"242901_AT" 0.102 0.058 0.335 0.159 0.269 1.000 0.222 0.451 0.134 0.333
"243906_AT" 0.325 0.091 0.054 0.357 0.369 0.222 1.000 0.112 0.217 0.095
"244908_AT" 0.412 0.190 0.117 0.258 0.687 0.451 0.112 1.000 0.508 0.701
"294902_AT" 0.087 0.045 0.203 0.654 0.145 0.134 0.217 0.508 1.000 0.663
"245902_AT" 0.098 0.058 0.125 0.341 0.125 0.333 0.095 0.701 0.663 1.000
$
$

I think summer_cherry's program is a much more optimized version. It -
(i) does not store the entire N X N square matrix in any data structure.
(ii) stores only the information present in the file, since that is sufficient to generate the other half of the square matrix.
(iii) uses hashes for fast access.

My second version uses a multi-dimensional array to store only the necessary information i.e. everything after the "id" and "1.000" per element. It also keeps on chopping the array element right after it is printed, by using the "shift" operator. So while the run time would be higher for this, the memory consumption should be lesser.

HTH,
tyler_durden

1 Like

Thanks you so much! Sometime I found the ID were printed right before the 1.000. Do you have any clue about that? Thanks a lot again!

Could you give a little more detail on how you intend to use the data once you've chosen the ID? I'd like to understand why you even need the square matrix.

Thanks a lot, Tyler! It works out perfectly with small matrix.

Yes, the memory is an issue for me with the 25000x25000 matrix. I will try to use other pc with more RAM. Thank you very much again!

---------- Post updated at 09:08 PM ---------- Previous update was at 08:58 PM ----------

The matrix is the correlation coefficiencies of the expression level of ~25000 genes of the genome. Some genes express similarly that can be indicated by high correlation coefficiency, but most of them not. The ID is used to track the gene name of the genome, and to find the pattern of expression.
Say, if I want see which genes are expressing similarly with 244901_AT, GREP it would give the single row of the correlation coefficiencies. Theoretically, the half matrix contains all the information of the square one, but it is hard for me to retrieve any specific gene(s) of my interest. By the way, I am a geneticist and just started trying programming.

I can't imagine that it would be an easy thing to just grep some gene ID and look at a line with 25000 values. But it seems like that's what you want to see. ???

How high of a correlation coefficient is enough to be considered as expressing similarly? And would it be fair to say that, given a specific gene name, you just want to know which other genes, if any, express similarly? Or, given a gene name and a specified coefficient, you want to know which other genes have a coefficent greater than or equal to the specified value?

A little bit of calculation is in order. Your actual text file is probably a huge one.

$ cat diagmtx.txt
"244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
"243903_AT" 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
"244501_AT" 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
"254902_AT" 1.000 0.587 0.159 0.357 0.258 0.654 0.341
"247906_AT" 1.000 0.269 0.369 0.687 0.145 0.125
"242901_AT" 1.000 0.222 0.451 0.134 0.333
"243906_AT" 1.000 0.112 0.217 0.095
"244908_AT" 1.000 0.508 0.701
"294902_AT" 1.000 0.663
"245902_AT" 1.000
$

There's an ID of 11 characters, followed by 25000 tokens, each of 6 characters in line 1, 24999 in line 2, and so on. So we are looking at at a file of size 1.88 GB approximately.

11*25000 + (6*25000 + 6*24999 + 6*24998 + ... + 6*2 + 6*1)
= 1.88 GB

The perl scripts are going to use approximately this + some more memory for temporary operations. And once done, your final file would be about twice that size ~ 3.75 GB.
That's a huge size for a text file and it's going to be a challenge to process such a file.

If you are trying this out on your home desktop/laptop, then I'd imagine you need one of those computers with 3 - 4 GB RAM (Gamer's Laptops ? - the ones that are optimized for computer games). Otherwise you may want to try it out in a server at your work place.

Finally, you may want to consider not storing the other half of your matrix. A simple script can generate the entire line of the square matrix, given a line number.

$
$ # display diagmtx.txt with the line numbers
$ cat -n diagmtx.txt
     1    "244901_AT" 1.000 0.234 0.435 0.123 0.012 0.102 0.325 0.412 0.087 0.098
     2    "243903_AT" 1.000 0.111 0.412 0.115 0.058 0.091 0.190 0.045 0.058
     3    "244501_AT" 1.000 0.205 0.542 0.335 0.054 0.117 0.203 0.125
     4    "254902_AT" 1.000 0.587 0.159 0.357 0.258 0.654 0.341
     5    "247906_AT" 1.000 0.269 0.369 0.687 0.145 0.125
     6    "242901_AT" 1.000 0.222 0.451 0.134 0.333
     7    "243906_AT" 1.000 0.112 0.217 0.095
     8    "244908_AT" 1.000 0.508 0.701
     9    "294902_AT" 1.000 0.663
    10    "245902_AT" 1.000
$
$ # now show me what the line no. 8 of the square matrix looks like
$
$ ##
$ perl -lne 'BEGIN {$n=8}
>           if ($. < $n) {@x=split; $s .= $x[$n-$.+1]." "}
>           elsif($. == $n) {($id, $therest)=unpack("A11xA*");
>                            print $id," ",$s,$therest}' diagmtx.txt
"244908_AT" 0.412 0.190 0.117 0.258 0.687 0.451 0.112 1.000 0.508 0.701
$
$

HTH
tyler_durden