Weighted adjacency list to adjacency matrix

dietmar13 · September 9, 2013, 4:48pm

dear awk gurus,

i would need a fast (therefore) awk solution for the reformation of an uncomplete weighted adjacency list to a complete sorted adjacency matrix.

example (FS=OFS=,):

 
a,d,0.33
a,b,0.25
b,c,0.11

should give:

 
,a,b,c,d
a,1,0.25,0,0.33
b,0.25,1,0.11,0
c,0,0.11,1,0
d,0.33,0,0,1

I have found this thread but this does not work, because (as I think) there are missing combinations, which should be zero or 1 for self-combinations (a-a, b-b, ...).

my idea would be, to read in a 2-dimensional array with the first two columns as indices (already also with transposed indices), build the union of both indices (should not be necessary, if I already transpose the indices?!?), sort the indices, loop two times over all indices and write line for line the values of the array, including zeros for all missing index-pairs and '1' for all self-index pairs.

I can provide a few code snippets, but some problems are outside my awk-capabilities:

 
awk 'BEGIN { FS =","; OFS="," }
w[$1,$2]=$3; w[$2,$1]=$3
<if i am right, both dimensions of indices should be already complete?>
END {
<first i have to write the column headers>
print OFS; for (i in sorted(w)) {print i OFS }; print "\n";
<now i want loop over the alphabetically sorted indices>
for (i in sorted(w)) {
print i OFS; for (j in sorted(w)) {
if (w(i,j) not defined) print 0;
else if (i==j) print 1;
else print w(i,j)
} print "\n" }
}' ADJlist.txt > ADJmatrix.txt

could you please complete my script, or (as i always learn here) provide a much more elegant solution...

dietmar

Scrutinizer · September 9, 2013, 5:40pm

Your approach seems essentially right to me. Here is one way it could be moulded into awk:

awk '
  BEGIN{
    FS=OFS=","
    h="a,b,c,d"
    n=split(h,F)
    print x,h
  }
  {
    A[$1,$2]=A[$2,$1]=$3
  }
  END{
    for(i=1; i<=n; i++) {
      s=F
      for(j=1; j<=n; j++) s=s OFS (i==j ? 1 : A[F,F[j]]+0)
      print s
    }
  }
' file

dietmar13 · September 9, 2013, 6:18pm

I always forget, you take the input too literally.

of course is my real table much, much larger and has not only A,B,C,D.

some things i do not understand:
what means print x,h (x is not defined befor) and
s=F what is F - an new array, used for what?

what i would need is a loop over the sorted index (which are long names like ENSG000006234, ENSG000001345), and i do not know the number of the indices. they are given after i have read in the complete file.

two things i would need. how do i get the length of the 1. (and therefore also the 2.) dimension of the 2D-index, how can i sort these 1. dim index and than loop over this sorted index.

Scrutinizer · September 9, 2013, 7:24pm

x is an empty string, so ""
F contains the indexes after the split of the header, an n would be the number..

You could solve the header with something like this, but that would still leave the sort part...

awk '
  {
    A[$1,$2]=A[$2,$1]=$3
    if(!P[$1]++) F[++n]=$1
    if(!P[$2]++) F[++n]=$2
  }
  END{
    for(i=1; i<=n; i++) h=h OFS F
    print h
    for(i=1; i<=n; i++) {
      s=F
      for(j=1; j<=n; j++) s=s OFS (i==j?1:A[F,F[j]]+0)
      print s
    }
  }
' FS=, OFS=, file

So you would need to add a bubble sort in the script.

Or you could first extract the indexes, sort them and feed them to the previous suggestion as variable h

dietmar13 · September 10, 2013, 1:08am

thanks again,

i searched a little bit and found the asorti function from gawk...

i only have to sort the array F and loop over this sorted array:

n = asort(F, G)

 
gawk '
BEGIN { FS=OFS="," }  
{
    A[$1,$2]=A[$2,$1]=$3
    if(!P[$1]++) F[++n]=$1
    if(!P[$2]++) F[++n]=$2
  }
  END{
n = asort(F, G)     
for(i=1; i<=n; i++) h=h OFS G
    print h
    for(i=1; i<=n; i++) {
      s=G
      for(j=1; j<=n; j++) s=s OFS (i==j?1:A[G,G[j]]+0)
      print s
    }
  }
' file

It seems to work !

dietmar