Convert a 3 column tab delimited file to a matrix

Hi all,

I have a 3 columns input file like this:

CPLX9PC-4943    CPLX9PC-4943    1
CPLX9PC-4943    CpxID123        0
CPLX9PC-4943    CpxID126        0
CPLX9PC-4943    CPLX9PC-5763    0.5
CPLX9PC-4943    CpxID13 0
CPLX9PC-4943    CPLX9PC-6163    0
CPLX9PC-4943    CPLX9PC-6164    0.04
CPLX9PC-4943    CPLX9PC-6165    0.027027
::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::

I need it in the form of a matrix where column 1 and column 2 are row and column labels and column 3 contains the values to fill in the matrix.

Thanks and Regards

what is the desired output you are looking for. Please provide the example.

I am taking an example data set here:

a       a       1
a       b       2
a       c       4
b       a       2
b       b       1
b       c       7
c       a       4
c       b       7
c       c       1

Now the desired output would be:

        a       b       c
a       1       2       4
b       2       1       7
c       4       7       1

My data set is similar to it but I have 25 variables(in this example, I have 3 variables a, b and c). Basically output would be a similarity matrix with two same halves. Thanks for the reply.

Try this:

awk '
{a[$1,$2]=$3; b[$1];}
END{
  for(i in b)   #print header
    printf("\t%s",i);  
  printf("\n");
  for(i in b) { #each row
     printf("%s\t",i); 
     for(j in b)  #each column
        printf("%d\t", a[i,j]);
     printf("\n");
   } 
}' SUBSEP="\t" input.txt 

Not that the rows/columns are not sorted.

But honestly, for this kind of data-mangling job, you are better off using some other data processing tool. E.g. with GNU R and "reshape" package it's piece of cake:

 d = read.table("mydata.txt")
 library(reshape)
 cast(d, V1 ~ V2 )
2 Likes

If you don't have GNU R and want a table that adjusts to the input given, you could try something like:

awk '
NF == 3 {
        if(!($1 in rh)) {
                # Add a new row heading.
                row[++nr] = $1  
                rh[$1]
                if(length($1) > rw) rw = length($1)
        }
        if(!($2 in ch)) {
                # Add a new column heading.
                col[++nc] = $2
                ch[$2]
                cw[nc] = length($2)
        }
        # Add a datapoint.
        d[$1, $2] = $3
}
END {   printf("%*s", rw, "") 
        for(i = 1; i <= nc; i++) {
                if(cw < 11) cw = 12
                printf(" %*.*s", cw, cw, col)
        }
        printf("\n")
        for(i = 1; i <= nr; i++) {
                printf("%*.*s", -rw, rw, row)
                for(j = 1; j <= nc; j++)
                        if((row, col[j]) in d)
                                printf(" %*.6f", cw, d[row, col[j]])
                        else    printf(" %*s", cw, "")
                printf("\n")
        }
}' input

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk .
With the input file given in the 1st message in this thread, it produces the output:

             CPLX9PC-4943     CpxID123     CpxID126 CPLX9PC-5763      CpxID13 CPLX9PC-6163 CPLX9PC-6164 CPLX9PC-6165
CPLX9PC-4943     1.000000     0.000000     0.000000     0.500000     0.000000     0.000000     0.040000     0.027027

And, with the input given in message #3 in this thread, it produces:

             a            b            c
a     1.000000     2.000000     4.000000
b     2.000000     1.000000     7.000000
c     4.000000     7.000000     1.000000
1 Like

Or..

$ awk '{A[$1,$2]=$3;B[$1]++;C[$2]++}END{for(i in C){s=s?s"\t"i:"\t"i}
print s;
for(i in B){s=i;
for(j in C){s=s"\t"A[j,i]}
print s}}' file

        a       b       c
a       1       2       4
b       2       1       7
c       4       7       1
1 Like

This post results (after adapting the output format) in

              CPLX9PC-4943     CpxID123     CpxID126 CPLX9PC-5763      CpxID13 CPLX9PC-6163 CPLX9PC-6164 CPLX9PC-6165
 CPLX9PC-4943            1            0            0          0.5            0            0         0.04     0.027027
1 Like

When I tried your script it printed 0s and 1s instead of column 3 values. Anyway I followed your suggestion to use R and it worked fine for me. Thanks.