Randomizing a matrix - Perl / Shell

_man · December 5, 2012, 7:11am

Hello all,

I have a tricky question (at least for me it is!). I'll try to explain it carefully here. I hope you can help me solve all of it, or even parts of it. Here it is:

I have a large 0/1 input table; a very simplified version is shown below.
(The last row and column show the sums and are included only for clarity. They do not exist in the actual input data.)

table1(input):

C1 C2 C3 c4 c5
V1 1 1 1 1 0 =4
V2 1 1 1 0 1 =4
V3 1 1 1 1 1 =5
V4 0 1 0 0 1 =2

=3 =4 =3 =2 =3 =15

As you can see, the number of 1s varies among the rows and columns.

I used the following script to calculate the co-presence of the variables in the table above (thanks to elixir_sinari).

awk 'NR>1{name[NR-1]=$1;for(i=2;i<=NF;i++) if($i==1) { oneset[NR-1,i]=1;count[NR-1]++; q++} val[NR-1]=q; q=0}
END{
for(i=1;i<=(NR-1);i++)
{
 if(i==1)
 {
  # header line: "*" followed by every row name
  print "*"
  for(j=1;j<=(NR-1);j++)
   print name[j]
  printf "\n"
 }
 # row label: name followed by its number of 1s
 print name[i]"("val[i]")"
 for(j=1;j<=(NR-1);j++)
 {
  # n = number of columns where rows i and j both have a 1
  n=0
  for(k=2;k<=NF;k++)
   if(oneset[i,k] && oneset[j,k])
    n++
  print (count[i]==0)?"NA":(n/count[i])
 }
 printf "\n"
}
}' ORS='\t'  OFMT='%.2f' input

And here is what it gives for the input file above:

table2(reference table):

```
        V1      V2      V3      V4
V1(4)   1       0.75    1       0.25
V2(4)   0.75    1       1       0.50
V3(5)   0.80    0.80    1       0.40
V4(2)   0.50    1       1       1
```
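
For reference, the value the script prints for row i and column j can be written as a formula. The symbols are my own notation, not from the script: x_{ik} is the 0/1 entry of variable i in column k.

$$
\mathrm{cop}(i,j) = \frac{\sum_{k} x_{ik}\, x_{jk}}{\sum_{k} x_{ik}}
$$

The denominator is the row sum of i, so the table is not symmetric. For example, V1 and V4 share a 1 only in C2, which gives cop(V1,V4) = 1/4 = 0.25 but cop(V4,V1) = 1/2 = 0.50.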

I am wondering whether some of the high co-presence values happened by chance. To answer this, I would like to randomize my input data a few thousand times (or even more). The randomization must keep the sum of each row and each column the same as in the input.

One example of randomization could be this one:

table3(Random1)

C1 C2 C3 c4 c5
V1 1 1 1 0 1 =4
V2 0 1 1 1 1 =4
V3 1 1 1 1 1 =5
V4 1 1 0 0 0 =2

=3 =4 =3 =2 =3 =15

As you can see, the sums of the rows and columns are the same as in table1(input).

After each random table is created, the script above has to be applied to it, which gives a table in the same format as table2(reference). For table3(Random1) it would look like this:
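
One common way to get such a table is the "checkerboard swap": pick two rows and two columns at random, and if the 2x2 submatrix they form is 1 0 / 0 1 or 0 1 / 1 0, flip it. Each flip leaves every row and column total unchanged. The following is only a minimal sketch of that idea, not necessarily the method from the linked discussion. The function name `randomize_matrix`, the default of 1000 swap attempts, and the optional seed argument are all assumptions of mine, and how many swaps are needed for a well-mixed table depends on the data.

```shell
# randomize_matrix: hypothetical helper (not part of the original script).
# Reads a 0/1 table on stdin (one header line, then "name v1 v2 ...")
# and prints a table with the same row and column sums.
#   $1 = number of swap attempts (assumed default: 1000)
#   $2 = optional seed for the random number generator
randomize_matrix() {
  awk -v swaps="${1:-1000}" -v seed="${2:-}" '
  BEGIN { if (seed != "") srand(seed); else srand() }
  NR == 1 { header = $0; next }          # keep the header line unchanged
  {
    nrow++
    name[nrow] = $1
    ncol = NF - 1
    for (c = 1; c <= ncol; c++) m[nrow, c] = $(c + 1)
  }
  END {
    for (s = 0; s < swaps; s++) {
      r1 = 1 + int(rand() * nrow); r2 = 1 + int(rand() * nrow)
      c1 = 1 + int(rand() * ncol); c2 = 1 + int(rand() * ncol)
      if (r1 == r2 || c1 == c2) continue
      a = m[r1, c1]; b = m[r1, c2]
      # flip only a checkerboard (a b / b a with a != b);
      # this keeps all row and column totals intact
      if (a != b && m[r2, c1] == b && m[r2, c2] == a) {
        m[r1, c1] = b; m[r1, c2] = a
        m[r2, c1] = a; m[r2, c2] = b
      }
    }
    print header
    for (r = 1; r <= nrow; r++) {
      line = name[r]
      for (c = 1; c <= ncol; c++) line = line " " m[r, c]
      print line
    }
  }'
}
```

It could be called as `randomize_matrix 10000 "$i" < input > "random_$i"` inside a loop over `i`. Passing a different seed each time matters because `srand()` without an argument is seeded from the clock, so two runs within the same second would produce the same table.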

table4(Random1_co-presence)

```
        V1      V2      V3      V4
V1(4)   1       0.75    1       0.50
V2(4)   0.75    1       1       0.25
V3(5)   0.80    0.80    1       0.40
V4(2)   1       0.50    1       1
```

And here is the tricky part: the co-presence table from each randomization step has to be compared with table2(reference). For every cell whose value is equal to or greater than the corresponding value in table2(reference), 1 has to be added to that cell in table5(output).

With 1 randomization, table5(output) would look like this:

table5(output):

V1 V2 V3 V4
V1 1 1 1 1
V2 1 1 1 0
V3 1 1 1 1
V4 1 0 1 1

You can see that after 1 randomization and calculating the co-presence table, all values except two are equal to or greater than the values in the corresponding cells of table2(reference). That is why table5(output) has 0 in those two cells and 1 everywhere else.

In the next randomization the same calculation has to be done, and table5(output) has to be updated in each round by adding 1 to every cell whose value is equal to or greater than the reference.

Let's assume that we do the randomization 10 times. Then table5(output) would look something like this:
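
The counting step could be sketched roughly like this. The helper name `tally_ge` and its file layout are assumptions of mine: the first argument is the reference co-presence table, and every further argument is a co-presence table from one randomized matrix, each with one header line followed by rows of the form `name(sum) v1 v2 ...`. How "NA" cells should be treated is not specified above; this sketch simply never counts them.

```shell
# tally_ge: hypothetical helper (not part of the original script).
#   $1      = reference co-presence table (table2)
#   $2 ...  = co-presence tables computed from randomized matrices
# Prints, for every cell, how many randomized tables had a value
# equal to or greater than the reference value.
tally_ge() {
  awk '
  FNR == 1 { nfile++; r = 0; next }   # new file: skip its header line
  NF < 2   { next }                   # ignore blank lines
  {
    r++
    if (nfile == 1) {                 # the first file is the reference
      label[r] = $1
      sub(/\(.*$/, "", label[r])      # "V1(4)" becomes "V1"
      nrow = r; ncol = NF - 1
    }
    for (c = 1; c <= ncol; c++) {
      v = $(c + 1)
      if (nfile == 1)
        ref[r, c] = v
      else if (v != "NA" && ref[r, c] != "NA" && v + 0 >= ref[r, c] + 0)
        hits[r, c]++                  # NA cells are never counted (assumption)
    }
  }
  END {
    line = label[1]
    for (r = 2; r <= nrow; r++) line = line " " label[r]
    print line
    for (r = 1; r <= nrow; r++) {
      line = label[r]
      for (c = 1; c <= ncol; c++) line = line " " (hits[r, c] + 0)
      print line
    }
  }' "$@"
}
```

For example, `tally_ge reference_table random_cop_*` would print one count table for all randomized files at once (the file names here are made up). Because the randomization keeps every row sum fixed, the reference and randomized values in a cell share the same denominator, so comparing the two-decimal values gives the same answer as comparing the underlying counts.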

V1 V2 V3 V4
V1 10 8 10 6
V2 1 10 10 2
V3 10 10 10 10
V4 7 3 10 10

(I found a discussion about generating a random matrix under conditions similar to my problem, which might be useful:
"Randomize matrix in perl, keeping row and column totals the same" on Stack Overflow.)

Thanks for your patience in reading this thread! Any ideas in Perl or Shell would be very helpful!