awk to convert table-by-row to matrix table

Hello,
I need some help to reformat this table-by-row to matrix?

infile:

site1 A:o,p,q,r,s,t
site1 C:y,u
site1 T:v,w
site1 -:x,z
site2 A:p,r,t,v,w,z
site2 C:u,y
site2 G:q,s
site2 -:o,x
site3 A:o,q,s,t,u,z
site3 C:y
site3 T:v,w,x
site3 -:p,r
outfile:

SITE     o p q r s t v u w x y z      
site1    A A A A A A T C T - C -
site2    - A G A G A A C A - C A
site3    A - A - A A T A T T C A

It seems AWK may do the job easily with array, but I am not sure. Thanks in advance!

Try this adaption from an earlier post ( Transforming 3 columns to matrix format Post: 302738343 ; I'm sorry, I seem not to be able to enter clickable URLs any more ... ) :

awk     '{for (i=1; i<=LnCnt; i++) if ($1 == Ln) break; if (i > LnCnt) Ln[++LnCnt]=$1}
         {for (k=3; k<=NF; k++) {for (j=1; j<=HdCnt; j++) if ($k == Hd[j]) break
                                 if (j > HdCnt) Hd[++HdCnt]=$k
                                 Mx[$1,$k] = $2}
         }
         END {                           printf "%10s", ""
                                         for (j=1; j<=HdCnt; j++)  printf "%3s", Hd[j]
                                         printf "\n";
              for (i=1; i<=LnCnt; i++)  {printf "%10s", Ln;
                                         for (j=1; j<=HdCnt; j++) printf "%3s", Mx[Ln, Hd[j]];
                                         printf "\n"
                                        }
             }
        ' FS="[ :,]" file
            o  p  q  r  s  t  y  u  v  w  x  z
     site1  A  A  A  A  A  A  C  C  T  T  -  -
     site2  -  A  G  A  G  A  C  C  A  A  -  A
     site3  A  -  A  -  A  A  C  A  T  T  T  A
1 Like

Hi,
Another solution in gawk:

gawk -F'[ :,]' '{PROCINFO["sorted_in"]="@ind_str_asc"; A[$1]=1;for (i=3;i<=NF;i++) {C[$i]=$i;B[$1,$i]=$2}}END{printf "SITE\t";for ( jj in C ) printf jj" ";print "" ; for ( ii in A ) {printf ii"\t";for ( jj in C ) {printf B[ii,jj]" "}; print ""}}' infile

Happy gnu year.
Regards.

Thank you both!
Amazing scripts!
disedorgue, Your script is what i was thinking to use, but need more digestion.
More challenging with real data, that some members (i.e. columns) of some site are with missing data as indicated with "-" in previous post, but not listed in the infile, which will be assigned with an empty cell and will cause confusion.

infile: 
site1 A:o,p,q,r,s,t
site1 C:y,u
site1 T:v,w
site1 -:x,z
site2 A:p,r,t,v,w,z
site2 C:u,y
site2 G:q,s
site3 A:o,q,s,t,u,z
site3 C:y
site3 T:v,w,x
output:
SITE    o p q r s t u v w x y z 
site1    A A A A A A C T T - C - 
site2     A G A G A C A A  C A 
site3    A  A  A A A T T T C A 

How to add "-" to those "empty" cells so that the table is aligned? Thanks a lot again!

Try this:

$ awk -F'[ :,]' '{PROCINFO["sorted_in"]="@ind_str_asc"; A[$1]=1;for (i=3;i<=NF;i++) {C[$i]=$i;B[$1,$i]=$2}}END{printf "SITE\t";for ( jj in C ) printf jj" ";print "" ; for ( ii in A ) {printf ii"\t";for ( jj in C ) {if (B[ii,jj]) {printf B[ii,jj]" "} else {printf "- "}}; print ""}}' infile

Regards.

1 Like

Thanks a lot!
Has the "Thank" option been removed for each post on the forum? I could not thank the reply by clicking the "Thank" button anymore.

yifangt,
The :b:Thanks button is still there. Is it possible that you just didn't notice it because disedorgue's code was presented as a single long line and you didn't scroll far enough to the right to see it?

1 Like

Yes, I was aware of the long single line. Something weird with this thread: 1) When I first posted, nothing was there, so that I re-posted, and have 3 duplicates, embarrassing:D!
2)Looking for the Thank you:b:button, but it was not there;
3) Refreshed the page, all came out. But, that was not always the case. I had thought the site is under maintenance, (forgetting the admins of the site are all Expert:p!) Probably my side e.g. browser/coockies etc:confused:.
Thank you anyway!