how to rearrange a matrix with awk

Hi, every one. I have two files ,one is in matrix like this, one is a list with the same data as the matrix.

       AB   AE    AC    AD   AA    AF   
SA     3     4     5       6     4      6
SC     5     7     2      8     4      3
SD     4     6      5     3      8     3 
SE     45    34    5    12     09    1
SB      34   33    34    45    67    23
AA   SA   4
AB   SA   3
AD   SC   8
AF    SB   23
.       .
.       .
.       .
.       .

How can I get the matrix in order like this? Can I do it using awk?

       AA   AB    AC    AD   AE    AF   
SA     
SB     
SC     
SD     
SE      

Thank you!

With gawk:

gawk 'NR==1{for(i=1;i<=NF;i++) y[i+1]=$i;next}
{x[++n]=$1;for(i=2;i<=NF;i++) data[$1,y]=$i}
END{
nx=asort(x);ny=asort(y)
for(i=1;i<=nx;i++)
{
 if(i==1)
 {
  for(j=0;j<=ny;j++)
   print y[j]
  printf "\n"
 }
 print x
 for(j=1;j<=ny;j++)
  print data[x,y[j]]
 printf "\n"
}}' ORS='\t' matrix_file

And, if you don't have gawk, it can also be done with other awks with a user-defined function to sort arrays. Let me know if that is the case with you.

1 Like

Some versions of awk provide a asort() and asorti() functions to sort data.
This example uses the sort utility to sort data and should work whether or not your version of awk has those functions. This would probably also be easier if it read in the three column input file rather than the unsorted matrix input, but I didn't want to take the time to add the entries you listed as ... to test the program. This example also automatically adjusts for varying input field widths:

#!/bin/ksh
# Usage: sortmatrix [unsorted_matrix_file]
# The following awk program reads in a matrix with unsorted row and
# column headings and writes the same data rearranged with sorted row and
# column headings.  The output column widths automatically adjust for
# input data width.  If this script is invoked with no operands, it will
# read unsorted matrix information from a file named "input".
#
# This script is written using ksh and awk on OS X.  Other than changing
# the #! line above, any shell supporting basic POSIX shell syntax
# should work.  If you're using a Solaris system, change "awk" to
# "/usr/xpg4/bin/awk" (nawk might also work, but I don't remember if it
# was updated to support "%*s" to specify a runtime settable field
# width in awk printf() statements).
#
# awk variables name key:
#       c0w     row heading Column Width
#       cc      data Column Count (not counting row heading)
#       cw[]    data Column Width
#       dbg     if non-0/non-null log DeBuGging info into the file named by dlf
#       dlf     name of Debugging Log File
#       i       loop control
#       ich[]   unsorted Input Column Headers
#       och[]   sorted Output Column Headers
#       sc      Sort Command used to execute to sort column and row headings
#       tmpfile TeMPFILE to hold results of sorting row and column headings
tmpfile="sortmatrix.order"
awk -v dbg=0 -v tmpfile="$tmpfile" 'BEGIN{
        sc="sort -o " tmpfile
        if(dbg) dlf = "debug.out"
}
dbg{    printf("input: %s\n", $0) > dlf
}
NR == 1{# Read and sort data oolumn headings from the 1st input line
        for(i = 1; i <= NF; i++) {
                if(dbg) print $i > dlf
                ich = $i
                print $i | sc
        }
        close(sc)
        # Produce the list of sorted output column headings
        cc = 0
        while((getline och[++cc] < tmpfile) == 1) { # read the sorted headings
                cw[och[cc]] = length(och[cc]) # set initial column width
                if(dbg) printf("och[%d]=%s, cw[%s]=%d\n", cc, och[cc], och[cc],
                        cw[och[cc]]) > dlf
        }
        close(tmpfile)
        # Verify that the sort worked
        if(cc != NF) {
                printf("sortmatrix: %d columns read, but sort returned %d\n",
                        NF, cc)
                exit 1
        }
        next
}
{       # Process remaining input lines
        # Feed the row heading into sort
        print $1 | sc
        if(length($1) > c0w) {
                # Increase output column 0 width to match row heading length
                c0w = length($1)
                if(dbg) printf("c0w increeased to %d\n", c0w) > dlf
        }
        for(i = 2; i <= NF; i++) {
                # Save the data in an array indexed by row heading and
                # column heading.
                data[$1,ich[i - 1]] = $i
                if(dbg) printf("data[%s,%s]=%s\n", $1, ich[i-1], $i) > dlf
                if(cw[ich[i - 1]] < length($i)) {
                        # A data field is wider than the column heading,
                        # adjust the column width.
                        cw[ich[i - 1]] = length($i)
                        if(dbg) printf("cw[%s] increased to %d\n",
                                $i, length($i)) > dlf
                }
        }
}
END{    # Finish the row headings sort
        close(sc)
        # Print the column headings
        printf("%*s", c0w, "")
        for(i = 1; i <= cc; i++) {
                cw[och]++ # Add room to put a space between output columns.
                printf("%*s", cw[och], och)
        }
        printf("\n")
        # Read the sorted row headings and print the data
        rowcnt = 1
        while((getline row < tmpfile) == 1) {
                printf("%-*s", c0w, row)
                for(i = 1; i <= cc; i++)
                        printf("%*s", cw[och], data[row,och])
                printf("\n")
                rowcnt++
        }
        # Verify that the sort worked
        if(rowcnt != NR) {
                printf("sortmatrix: Read %d data rows, sort only returned %d\n",
                        NR - 1, rowcnt - 1)
                exit 2
        }
}' ${1:-input}
rm $tmpfile

When the above is saved in a file named sortmatrix and made executable with chmod +x sortmatrix run with a file named input that contains:

       AB   AE    AC    AD   AA    AF   AC12 AClonger
SA     3     4     5       6     4      6 121 long1
SC     5     7     2      8     4      3 123 long3
SD     4     6      5     3      8     3 124 long4
SE     45    34    5    12     09    1 125 long5
SB      34   33    34    45    67    2 122 long2
more   m1 m2 m3 m4 m5 m6 m7-xxx m8

it produces the following output:

     AA AB AC   AC12 AClonger AD AE AF
SA    4  3  5    121    long1  6  4  6
SB   67 34 34    122    long2 45 33  2
SC    4  5  2    123    long3  8  7  3
SD    8  4  5    124    long4  3  6  3
SE   09 45  5    125    long5 12 34  1
more m5 m1 m3 m7-xxx       m8 m4 m2 m6

Note that if you change -v dbg=0 to -v dbg=1 , this script will produce a debugging log file providing data that may be useful if you need to make changes to some part of the program's logic.

3 Likes

Many thanks!

If all of your header names are exactly two alphabetic characters followed by one or more numeric characters, you could change the sort command for the column headings from:

sc="sort -o " tmpfile

to:

csc="sort -k1.1,1.2 -k1.3n,1 -o " tmpfile

and use csc (column sort command) instead of sc when sorting the column headers. But if only some of your names fit this pattern, the easiest thing to do is to rename your headers so they have leading zeros to make an alphabetic sort work, (i.e., SA01, SA02, SA03, SA05, SA10, SA11, SA15, SA25, and SA27). This could be coded into the awk program, but I don't have the time to devote to adding the leading zeros before sorting them, stripping off the leading zeros after sorting them, and figuring out what to do if one of you input column names already as a 0 followed by other numeric characters.

Do you also have a problem with sort order for the first field in the rows in your matrix?

1 Like

No problem! Thank you! I'm really appreciate it.