Awk: conversion of matrix formats

hello,

i would need a fast awk script for conversion of network formats (from 'sif' to 'adjacency' format):

sif (pp means only: protein-protein interaction):
A pp B
A pp C
B pp D
D pp E

in an adjacency n x n matrix:

 
  A B C D E
A 0 1 1 0 0
B 1 0 0 1 0
C 1 0 0 0 0
D 0 1 0 0 1
E 0 0 0 1 0

my idea:

go through all rows and build two indexed arrays (if array-names taken from the input file, i.e. $1, are allowed - i think this is called name substitution):

 
names[$1]=dummy
names[$3]=dummy
$1[$3] = 1
$3[$1] = 1

then loop over all array-names for (i in names) to write the column headers.

then loop nested two times over all array-names for (j in names); for (k in names) and write "1" if j[k] is 1 else "0" . (I hope indices are always sorted the same way).

do you think this could work? and perhaps you can provide some code drafts (I am rather untrained in awk).

if substitution for array names doesn't work, perhaps 'two dimensional' arrays would work?

 
names($1)=dummy
names($3)=dummy
pp($1,$3)=1
pp($3,$1)=1

the rest as above, loop two times over name indices and check if pp(j,k) is 1.

thank you very much...

dietmar

Are names always a single character?
If not, are names always the same length?
If not, is there a maximum length you want to consider, or should the output column widths adjust to the names found in the input?
How big is the array (i.e., number of different names)?

Your thoughts were quite easy to translate into awk speak, at least for the simple sample that you gave:

awk     '!HD[$1]        {HD[$1]++}
         !HD[$3]        {HD[$3]++}
                        {PP[$1,$3] = 1
                         PP[$3,$1] = 1
                        }
         END            {printf "  "
                         for (i in HD)   printf "%s ", i; printf "\n"
                         for (i in HD)  {printf "%s ", i
                                         for (j in HD) printf "%s ", PP[i,j]?PP[i,j]:"0"
                                         printf "\n"
                                        }
                        }
        ' file
  A B C D E 
A 0 1 1 0 0 
B 1 0 0 1 0 
C 1 0 0 0 0 
D 0 1 0 0 1 
E 0 0 0 1 0 

If Don Cragun's suspicions come true, this might need to be seriously reworked, though.
And, for (i in HD) supplies the names in arbitrary order, esp. in large files; it's by sheer luck that they come out sorted in the above case. So, in the end you may need to add some sorting code on top.
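
One possible sketch of such a sorting step (not part of the script above): copy the names into an integer-indexed array and sort that array before printing. The function name sif2adj_sorted and the array S are made up for this illustration, and the insertion sort is O(n^2), so this assumes a modest number of names.

```shell
# Hypothetical helper: reads sif lines from the named files, or from stdin.
sif2adj_sorted() {
    awk '
    { HD[$1]; HD[$3]; PP[$1,$3] = PP[$3,$1] = 1 }
    END {
        n = 0
        for (name in HD) S[++n] = name          # copy names to an indexed array
        for (i = 2; i <= n; i++) {              # insertion sort, string comparison
            v = S[i]
            for (j = i - 1; j >= 1 && S[j] > v; j--) S[j + 1] = S[j]
            S[j + 1] = v
        }
        printf " "
        for (i = 1; i <= n; i++) printf " %s", S[i]
        printf "\n"
        for (i = 1; i <= n; i++) {
            printf "%s", S[i]
            for (j = 1; j <= n; j++) printf " %d", ((S[i], S[j]) in PP) ? 1 : 0
            printf "\n"
        }
    }' "$@"
}

# The sample input in shuffled order still gives rows and columns A..E.
printf 'D pp E\nB pp D\nA pp C\nA pp B\n' | sif2adj_sorted
```

Testing with (i, j) in PP instead of reading PP[i,j] also avoids creating an empty array element for every non-interacting pair.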

names are of any length
output should be tab delimited
there are up to millions of names
(I should have clarified this)

I think RudiC's script is nearly perfect. Only the space delimiter has to be changed to tab (perhaps you could change it, otherwise I will try: if I am right, I only have to remove the "%s " part and set OFS to tab).

and only for my curiosity:
why check for !HD[$1]: when $1 is already in use, then nothing happens? the index is already set. the ++ is only there to create the array entries (to fill the indices), and the number stored for each index is never used?

sorting is no problem, if they come for all three loops in the same order!

thank you very much.

dietmar

---------- Post updated at 02:39 AM ---------- Previous update was at 01:34 AM ----------

now the script works with one exception:

#!/bin/bash

fn=$1
fname=${fn%.*}
echo $fname

awk 'BEGIN {FS="\t"};
    NF >= 3    
    {HD[$1]++; HD[$3]++; PP[$1,$3] = 1; PP[$3,$1] = 1 }
    END    {printf "\t"
        for (i in HD) { printf "%s\t" ,i } printf "\n"
        for (i in HD) {printf "%s\t", i ;
            for (j in HD) { printf "%s\t", PP[i,j]?PP[i,j]:"0" } ;
            printf "\n" } }' $fn > $fname.adj

BUT: I get the complete input file in front of my matrix output file, and I don't see why this happens...

You are absolutely right - the !HD check is redundant, used out of sheer habit to keep HD at the logic levels 1 and 0. Try this simplified version with <TAB>s as separators in the printf formats; OFS will not work here, because it only applies to print, not to printf:

awk     '               {HD[$1]++
                         HD[$3]++
                         PP[$1,$3] = PP[$3,$1] = 1
                        }
         END            {printf "\t"
                         for (i in HD)   printf "%s\t", i; printf "\n"
                         for (i in HD)  {printf "%s\t", i
                                         for (j in HD) printf "%s\t", PP[i,j]?PP[i,j]:"0"
                                         printf "\n"
                                        }
                        }
        ' file
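
A quick way to see why OFS would not help here (a minimal demonstration, not part of the script above): OFS is inserted only between the comma-separated arguments of print, while printf writes exactly what its format string says.

```shell
printf 'A pp B\n' | awk 'BEGIN { OFS = "\t" }
{
    print $1, $3               # OFS is used here:  A<TAB>B
    printf "%s %s\n", $1, $3   # OFS is ignored:    A B
}'
```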

---------- Post updated at 08:48 ---------- Previous update was at 08:41 ----------

That's because the pattern NF >= 3 is true for those lines, and it is separated by a newline from what you want to be its action. awk therefore applies the default action, print $0, to the pattern, and runs the following block for every input line.
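
A small demonstration of the difference, using made-up two-line input (a sketch, not taken from the scripts above):

```shell
# Pattern and action on separate lines: two independent rules.
# Prints "a b c" (default action), then "block ran for 2 lines".
printf 'a b c\nx y\n' | awk 'NF >= 3
{ n++ }
END { print "block ran for " n " lines" }'

# Opening brace on the same line as the pattern: one rule.
# Prints only "block ran for 1 lines".
printf 'a b c\nx y\n' | awk 'NF >= 3 { n++ }
END { print "block ran for " n " lines" }'
```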

RudiC,

thank you - now it works perfectly.

what newline does in awk is new for me...

Hi dietmar,
If any names are more than 7 characters long (assuming tab stops set at every 8 column positions), your output headings won't line up with the data values. If this is a concern to you, you could change the script to print the top headings vertically instead of horizontally and adjust the printing of the row headings to make the 1st column in your output be the width of the longest name.

Furthermore, with millions of rows and columns, the output produced will not be a text file (due to excessive line lengths), so you will be limited in the utilities you can use to post-process your output.

RudiC has already commented on the part removed above.

Assuming that there are no empty or blank lines in your input files, that column and row headings are less than 8 characters long (or you don't care about column alignment), and that you don't want the extra tab at the end of each line that your script currently produces, you could also try this slightly simplified script:

#!/bin/bash
fn=$1
fname=${fn%.*}
echo $fname

awk '{  HD[$1]; HD[$3]; PP[$1,$3] = PP[$3,$1] = 1}
END {   for (i in HD) printf "\t%s", i
        printf "\n"
        for (i in HD) {
                printf "%s", i
                for (j in HD) printf "\t%d", PP[i,j]?1:0
                printf "\n"
        }
}' $fn > $fname.adj

which puts:

	A	B	C	D	E
A	0	1	1	0	0
B	1	0	0	1	0
C	1	0	0	0	0
D	0	1	0	0	1
E	0	0	0	1	0

in your output file when the file named by $1 contains the input given in the 1st message in this thread. (Again as RudiC stated, the order of rows and columns may vary, but the row headings and the column headings should be in the same order.)

If you wanted to run this on a Solaris/SunOS system, you would need to use /usr/xpg4/bin/awk or nawk instead of awk .

dear don cragun,

i use the table for another program, therefore i don't need a pretty print. I need tabs.

and there i found another little problem:
after each line there seems to be a tab, but I don't need a tab after the last field in each line.

how can i easily replace the last tab with a newline?

fn=$1
fname=${fn%.*}
echo $fname

awk 'BEGIN {FS="\t"};
    NF >= 3 {HD[$1]++; HD[$3]++; PP[$1,$3] = PP[$3,$1] = 1 }
    END    {printf "\t"
        for (i in HD) { printf "%s\t" ,i } printf "\n" #<- here replace the tab with newline
        for (i in HD) {printf "%s\t", i ;
            for (j in HD) { printf "%s\t", PP[i,j]?PP[i,j]:"0" } printf "\n" #<- here replace the last tab with newline
            } }' $fn > $fname.adj

---------- Post updated at 07:55 AM ---------- Previous update was at 07:47 AM ----------

i should read more carefully:

here is the solution:

awk 'BEGIN {FS="\t"};
    NF >= 3 {HD[$1]++; HD[$3]++; PP[$1,$3] = PP[$3,$1] = 1 }
    END    {
        for (i in HD) { printf "\t%s" ,i } printf "\n"
        for (i in HD) {printf "%s", i ;
            for (j in HD) { printf "\t%s", PP[i,j]?PP[i,j]:"0" } printf "\n"
            } }' $fn > $fname.adj

This proposal makes use of the fact that we're dealing with a symmetrical matrix, needing to retain only the upper triangular matrix. A sorting step would be much easier to implement, and only half of the "interaction" array elements would be needed:

awk     '               {B1 = B3 = 0                                    # boolean variable for finding headers
                         for (i=1; i<=n; i++)                           # check all headers found up to now
                                {B1 = B1 || (HD[i] == $1)            # if $1 or $3 found in headers array,
                                 B3 = B3 || (HD[i] == $3)            # record in respective boolean var
                                }
                           if (!B1) HD[++n] = $1                        # if new header, record in new
                           if (!B3) HD[++n] = $3                        # header array element
                         PP[$1,$3] = 1                                  # record protein interaction
                        }
         END            {printf "\t"                                    # a header sort step may slip in here!
                         for (i=1; i<=n; i++)   printf "%s\t", HD[i] # print column headers
                                                printf "\n"
                         for (i=1; i<=n; i++)
                                {printf "%s\t",HD[i]
                                 for (j=1; j<i; j++)
                                        printf "%d\t", PP[HD[j],HD[i]] # print lower triangular matrix
                                 printf "%d\t", 0                       # diagonal elements are always zero!
                                 for (j=i+1; j<=n; j++)
                                        printf "%d\t", PP[HD[i],HD[j]] # print upper triangular matrix
                                 printf "\n"
                                }
                        }
        ' file
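
A possible variant of the same idea, offered as a sketch rather than a drop-in replacement. Two points motivate it: the scan over all headers for every input line gets slow with many names, and PP[$1,$3] alone misses a pair whose first name was registered after its second (e.g. "C pp A" after A is already known). The sketch assumes an awk with user-defined functions (nawk, gawk, mawk); sif2adj_ordered, idx and IDX are hypothetical names. It keeps first-seen order, looks names up through a second array, and stores each pair once with the smaller index first.

```shell
# Hypothetical helper: reads sif lines from the named files, or from stdin.
sif2adj_ordered() {
    awk '
    function idx(name) {        # index of name, registering it on first sight
        if (!(name in IDX)) { IDX[name] = ++n; HD[n] = name }
        return IDX[name]
    }
    NF >= 3 {
        a = idx($1); b = idx($3)
        if (a > b) { t = a; a = b; b = t }      # smaller index first
        if (a != b) PP[a, b] = 1                # diagonal stays zero
    }
    END {
        for (i = 1; i <= n; i++) printf "\t%s", HD[i]
        printf "\n"
        for (i = 1; i <= n; i++) {
            printf "%s", HD[i]
            for (j = 1; j <= n; j++) {
                lo = (i < j) ? i : j; hi = (i < j) ? j : i
                printf "\t%d", ((lo, hi) in PP) ? 1 : 0
            }
            printf "\n"
        }
    }' "$@"
}

printf 'A pp B\nA pp C\nB pp D\nD pp E\n' | sif2adj_ordered
```

Because the row and column order is held in HD[1..n], a sort of that array could still be slipped in at the start of the END block.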

Please look again at the script I suggested in my last post in this thread. It got rid of the trailing tabs. The spots in your script that need to change to get rid of the trailing tabs are the printf "\t" at the start of the END block and the "%s\t" formats in the lines you marked with "#<- here replace" comments: print the tab before each field ("\t%s") rather than after it.

The BEGIN clause in your script (BEGIN {FS="\t"}) does not match the sample input you provided, so I took it out. If you really have tab delimiters in your input AND have spaces in some of your field names, put that clause back in the script I provided. The ++'s in HD[$1]++ and HD[$3]++ don't affect the output of your program, but will make it run a little slower. (It won't be noticeable with your sample data, but if you have millions of lines of input, it will make a difference. You will note that I removed them in the script I provided.)
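
A small demonstration of why the ++ is not needed (a sketch of standard awk array behavior, not taken from the scripts above): merely referencing an element creates it with an empty value, which is all that for (i in HD) needs, whereas the in operator tests membership without creating anything.

```shell
awk 'BEGIN {
    HD["A"]                                   # a bare reference creates the element
    if ("A" in HD) print "A exists"
    if (!("B" in HD)) print "B not created by the in test"
    if (HD["C"] == "") print "C read as empty"
    if ("C" in HD) print "C now exists"       # reading HD["C"] created it
}'
```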

@Don Cragun and @RudiC

The task is solved, my downstream application works and I have learned a lot about programming in awk...

thank you!

dietmar