Transforming 3 columns to matrix format

bala06 · November 30, 2012, 2:13pm

Dear All

I have a huge data for 3 columns similar to this

D2cls0		D2cls0		1
D2cls0		D2cls1		0.308
D2cls0		D2cls2		0.554
D2cls0		D2cls3		0.287
D2cls0		D2cls4		0.633
D2cls0		D2cls5		0.341
D2cls0		D2cls6		0.665
D2cls0		D2cls7		0.698
D2cls0		D2cls8		0.625
D2cls0		D2cls9		0.429

I want to transform into matrix like this.


	D2cls0	D2cls1	D2cls2	D2cls3	D2cls4	D2cls5	D2cls6	D2cls7	D2cls8	D2cls9
D2cls0	1	0.308	0.554	0.287	0.633	0.341	0.665	0.698	0.625	0.429

Kindly advice and Please help.

Many Thanks
Balaji

joeyg · November 30, 2012, 2:18pm

What commands have your tried?
have you looked at the paste command with a -s

bala06 · November 30, 2012, 2:21pm

Dear Joeyg

I am familiar in awk or shell scripting.

could you please kindly help me

Many Thanks
Balaji

joeyg · November 30, 2012, 2:45pm

$ cat sample30.txt | cut -f3 | tr "\n" "\t" > sample30x.txt
$ cat sample30.txt | cut -f5 | tr "\n" "\t" >> sample30x.txt

rdrtx1 · November 30, 2012, 4:16pm

try also:

awk '
{ a[NR]=$1; b[NR]=$2; c[NR]=$3; }
END {
  for (i=0;i<=NR;i++) printf ("%-10s",b);
  print "";
  for (i=1;i<=1;i++ )  printf ("%-10s",a);
  for (i=1;i<=NR;i++)  printf ("%-10s",c);
  print "";
}
' input

vgersh99 · November 30, 2012, 5:31pm

a bit more generic solution for multiple instances of field one in the input and where the sequence of field 2 doesn't matter.
Sample file - myFile:

D2cls0          D2cls0          1
D2cls0          D2cls1          0.308
D2cls0          D2cls2          0.554
D2cls0          D2cls3          0.287
D2cls0          D2cls4          0.633
D2cls0          D2cls5          0.341
D2cls0          D2cls6          0.665
D2cls0          D2cls7          0.698
D2cls0          D2cls8          0.625
D2cls0          D2cls9          0.429
D2cls1          D2cls0          1
D2cls1          D2cls1          0.308
D2cls1          D2cls2          0.554
D2cls1          D2cls3          0.287
D2cls1          D2cls4          0.633
D2cls1          D2cls5          0.341
D2cls1          D2cls6          0.665
D2cls1          D2cls7          0.698
D2cls1          D2cls8          0.625
D2cls1          D2cls9          0.429

awk -f bala.awk myFile where bala.awk is:

BEGIN {
   OFS="\t"
}
{
  row[$1,$2]=$3
  if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
  if (col1[c]!=$1)
     col1[++c]=$1
}
END {
  printf("%*s%s\n", length(col1[1])+2, " ",header)
  ncol=split(header,colA,OFS)
  for(i=1;i<=c;i++) {
    printf("%s", col1)
    for(j=1;j<=ncol;j++)
      printf("%s%s%c", OFS, row[col1,colA[j]], (j==ncol)?ORS:"")
  }
}

produces:

        D2cls0  D2cls1  D2cls2  D2cls3  D2cls4  D2cls5  D2cls6  D2cls7  D2cls8  D2cls9
D2cls0  1       0.308   0.554   0.287   0.633   0.341   0.665   0.698   0.625   0.429
D2cls1  1       0.308   0.554   0.287   0.633   0.341   0.665   0.698   0.625   0.429

RudiC · December 1, 2012, 5:14am

This proposal does not depend on column 1 being sorted (as others above do), it will collect lines correctly into the matrix elements even if the key comes up again later with more values. It would not, of course, have the across columns sorted, then. Empty/missing elements are leaving a gap in the output:

$ awk  '{for (i=1; i<=LnCnt; i++) if ($1 == Ln) break; if (i > LnCnt) Ln[++LnCnt]=$1}
        {for (j=1; j<=HdCnt; j++) if ($2 == Hd[j]) break; if (j > HdCnt) Hd[++HdCnt]=$2}
        {Mx[$1,$2] = $3}
        END {printf "%10s", ""; for (j=1; j<=HdCnt; j++)  printf "%10s", Hd[j]; printf "\n";
             for (i=1; i<=LnCnt; i++)  {printf "%10s", Ln;
                                        for (j=1; j<=HdCnt; j++) printf "%10s", Mx[Ln, Hd[j]];
                                        printf "\n"
                                       }
            }
       ' file
              D2cls0    D2cls1    D2cls2    D2cls8    D2cls9    D2cls3    D2cls4    D2cls5    D2cls6    D2cls7
    D2cls0         1     0.308     0.554     0.625     0.429     0.287     0.633               0.665     0.698
    D2cls2       0.1     1.308     1.554               1.429     1.287     1.633     1.341     1.665     1.698

bala06 · December 1, 2012, 10:55am

Hi All

Thanks for all your help and suggestions.

RudiC could you please tell me how to run your awk program.

Many Thanks
Balaji

rudic:

This proposal does not depend on column 1 being sorted (as others above do), it will collect lines correctly into the matrix elements even if the key comes up again later with more values. It would not, of course, have the across columns sorted, then. Empty/missing elements are leaving a gap in the output:

$ awk  '{for (i=1; i<=LnCnt; i++) if ($1 == Ln) break; if (i > LnCnt) Ln[++LnCnt]=$1}
   {for (j=1; j<=HdCnt; j++) if ($2 == Hd[j]) break; if (j > HdCnt) Hd[++HdCnt]=$2}
   {Mx[$1,$2] = $3}
   END {printf "%10s", ""; for (j=1; j<=HdCnt; j++)  printf "%10s", Hd[j]; printf "\n";
   for (i=1; i<=LnCnt; i++)  {printf "%10s", Ln;
   for (j=1; j<=HdCnt; j++) printf "%10s", Mx[Ln, Hd[j]];
   printf "\n"
   }
   }
   ' file
   D2cls0    D2cls1    D2cls2    D2cls8    D2cls9    D2cls3    D2cls4    D2cls5    D2cls6    D2cls7
   D2cls0         1     0.308     0.554     0.625     0.429     0.287     0.633               0.665     0.698
   D2cls2       0.1     1.308     1.554               1.429     1.287     1.633     1.341     1.665     1.698

RudiC · December 1, 2012, 12:00pm

How to run it or how it works internally?
1) copy the code from my post to your command line and replace "file" with your filename.
2)

{for (i=1; i<=LnCnt; i++) if ($1 == Ln) ... check and retain unique row keys.
{for (j=1; j<=HdCnt; j++) if ($2 == Hd[j]) ... check and retain unique col leys.
{Mx[$1,$2] = $3}                           ... fill matrix elements depending on row & col value.
END                 ... print out header line and print all collected rows (loop i) with the resp. col (loop j) values

vgersh99 · December 1, 2012, 12:50pm

RudiC, good idea - should have thought of it myself.
Here's my second take: not depending on column 1 being sorted; taking of the potential gaps/missing rows.

awk 'BEGIN {
   OFS="\t"
}
{
  row[$1,$2]=$3
  w=(length($2)>w)?length($2):w
  if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
  if (!($1 in col1a)) {
     col1[++c]=$1
     col1a[$1]
  }
}
END {
  empty=sprintf("%*s",w," ")
  printf("%*s%s\n", length(col1[1])+2, " ",header)
  ncol=split(header,colA,OFS)
  for(i=1;i<=c;i++) {
    printf("%s", col1)
    for(j=1;j<=ncol;j++) {
      idx=(col1 SUBSEP colA[j])
      printf("%s%s%c", OFS, (idx in row)?row[idx]:empty, (j==ncol)?ORS:"")
    }
  }
}' myFile