Help making simple perl or bash script to create a simple matrix

Hello all!
This is my first post and I'm very new to programming. I would like help creating a simple perl or bash script that I will be using in my work as a junior bioinformatician.

Essentially, I would like to take a tab-delimted or .csv text with 3 columns and write them to a "3D" matrix:

Input:

gene   sample   identifier
a         1        @
b         2        #
c         3,4      @
d         5        %  
d         5        *

Output:

          
        1         2          3         4         5

a       @        

b                 #

c                            @         @

d                                                %,*

Is this easy to do?
Thanks in advance

Jonathon

Not really, no. :wink: Is this actual input? Are there any boundaries for these rows/columns? always single letter / single number?

Sorry I'm just realizing the format of my message messed up when I posted. I'm trying to put it in a table now. Or is there a quick way to add an example excel file?
Jon

not sure about excel. just select the area and click "CODE"

is the input file actually an .xls? will you be needing this conversion to work with standard unix utilities right? I've already got it mocked up in awk assuming space or tab delimiters.

I updated my post, it should be more clear now.
The input will be a tab-delimited text file saved from excel.
Thank you so much for replying so fast. I'm loving this community already.

$ ./gene  input
gene      1         2         3         4         5
a         @
b                   #
c                             @         @
d                                                 %,*

it's an awk script.

#!/usr/bin/awk -f

# keep list of unique genes, in order
!($1 in genes_uniq) { genes_uniq[$1]; genes[gene_idx++]=$1; }

{
        split($2, cols, /,/)
        for (col in cols) {
                if (cols[col] > max_col) max_col=cols[col]
                matrix[$1,cols[col]] = matrix[$1,cols[col]] "," $3
        }
}

END {
        # print header
        printf("%-9s ", "gene")
        for (col = 1; col <= max_col; col++)
                printf("%-9d ", col);
        printf("\n")

        for (i = 0; i < gene_idx; i++) {
                printf("%-9s ", genes);
                for (col = 1; col <= max_col; col++)
                        printf("%-9s ", substr(matrix[genes,col],2));
                printf("\n");
        }
}

I'm assuming it can be run as

$ awk -f code.awk input.txt > output.txt

?

Hi neutronscott,
I'm trying to run your script but it seems to be giving me an endless loop.
I tried running:

$ awk -f code.awk test.txt >output.txt

and

$ awk code.awk test.txt >output.txt

I've attached the example test.txt file and how I want the output to look.

Thanks again for you time

Ok. Three things I missed then.

  1. I forgot to skip the header, which gave a non-numerical value and really messed up the column loop.
  2. My output used spaces rather than tabs
  3. Didn't expect quotes around field 2

new code:

#!/usr/bin/awk -f

BEGIN { FS="\t" }
NR==1 {next}    # skip header

# keep list of unique genes, in order
!($1 in genes_uniq) { genes_uniq[$1]; genes[gene_idx++]=$1; }

{
        # unquote
        gsub(/(^"|"$)/,"",$2)
        split($2, cols, /,/)
        for (col in cols) {
                if (cols[col] > max_col) max_col=cols[col]
                matrix[$1,cols[col]] = matrix[$1,cols[col]] "," $3
        }
}

END {
        # print header
        printf("gene\t")
        for (col = 1; col <= max_col; col++)
                printf("%d%c", col, (col==max_col)?"\n":"\t");

        for (i = 0; i < gene_idx; i++) {
                printf("%s\t", genes);
                for (col = 1; col <= max_col; col++)
                        printf("%s%c", substr(matrix[genes,col],2),
                                (col==max_col)?"\n":"\t");
        }
}

also you can use awk -f script.awk input >output or chmod a+x script.awk and simply run ./script.awk input >output

It's very close. It seems to work when there is a single identifier for each gene or sample, but there is a problem with gene c and d. For gene c, two samples (3 and 4) share the same identifier, but it is printing sample 4 incorrectly.

For gene d and sample 5, i was hoping % and * would be printed in the same cell, seperated by a comma. It bugs with the second identifier for gene d.

I've attached the output your script gave.

Thanks again.

Output

 gene 1 2 3 4 5
a     @
 
b       #
 
c         @
      @
 
d             %
,*

ah. windows formatted input file is adding \r to those identifiers. add this line near top somewhere (like after the NR==1 rule)

{sub(/\r$/,"")}

It works perfectly with the test input I had, but it's reproducing the endless loop with my data file. I wonder, in my data file all three of sample, gene, identifier will be strings, not integers. I wonder if that is the issue.

gene sample identifier
C10orf107 NJATRTSP228B,ATRTSP228,ATRT34 pL132F
C10orf11 ATRT5B,ATRT5 
C10orf111 ATRT2B,ATRT2,ATRT4B,ATRT4,ATRT16B,ATRT16 pR62W
C10orf113 ATRT15B,ATRT15 
C10orf113 ATRT63 pA312T
C10orf12 ATRT33 pL314P
C10orf12 ATRT63 pE396G
C10orf12 ATRT45B,ATRT45 pP988A
C10orf12 ATRT46B,ATRT46 pR191C

Yes. Almost changed that but give me an hour or so. I left work :slight_smile:

Hey man, I am in no position to complain about time, you've been such an extremely big help, hopefully I'll be able to pay you back. Take your time, this is a favor!
Cheers

no favor. is hobby. keeps my skills going.
it's hard to see if lines up on terminal, how's this look in spreadsheet?

#!/usr/bin/awk -f

# TAB delimited
BEGIN { FS="\t" }

# skip header
NR==1 { next }

# strip DOS line endings
{sub(/\r$/,"")}

# keep list of unique genes, in order
!($1 in genes_by_name) { genes_by_name[$1]=genes_idx; genes_by_num[genes_idx++]=$1; }

{
        # unquote
        gsub(/(^"|"$)/,"",$2)
        split($2, cols, /,/)
        for (col in cols) {
                # keep list of unique samples, in order
                if (! (cols[col] in samples_by_name)) {
                        samples_by_name[cols[col]]=samples_idx
                        samples_by_num[samples_idx++]=cols[col]
                }
                matrix[$1,cols[col]] = matrix[$1,cols[col]] "," $3
        }
}

END {
        # print header
        printf("gene\t")
        for (i = 0; i < samples_idx; i++)
                printf("%s\t", samples_by_num)
        printf("\n")

        for (i = 0; i < genes_idx; i++) {
                printf("%s\t", genes_by_num);
                for (j = 0; j < samples_idx; j++)
                        printf("%s\t", substr(matrix[genes_by_num,samples_by_num[j]],2))
                printf("\n")
        }
}

It looks like this worked. Simply amazing. Thanks so much for your help. If I find an error I will let you know.
Cheers.

Just giving a shoot: :slight_smile:

Using Python3.2:

f = open('/path/to/input/file.txt' , 'r')
matrix={}
f.readline()
for line in f:
	(gene,sample,id) = line.split()
	sample = sample.split(',')
	for sam in sample:
		matrix.setdefault(gene,{}).setdefault(sam,[]).append(id)

when printing the 'matrix'

The Output is:-
print(matrix)
>>> {'a': {'1': ['@']}, 'c': {'3': ['@'], '4': ['@']}, 'b': {'2': ['#']}, 'd': {'5': ['%', '*']}}

for example
1- If I want gene "c" in sample "4" I write print(matrix['c']['4'][0])
2- If I want gene "c" in sample "3" I write print(matrix['c']['3'][0])
etc...

:D:D:D:D