Hello all!
This is my first post and I'm very new to programming. I would like help creating a simple perl or bash script that I will be using in my work as a junior bioinformatician.
Essentially, I would like to take a tab-delimited or .csv text file with 3 columns and write it out as a "3D" matrix:
Input:
gene sample identifier
a 1 @
b 2 #
c 3,4 @
d 5 %
d 5 *
Sorry I'm just realizing the format of my message messed up when I posted. I'm trying to put it in a table now. Or is there a quick way to add an example excel file?
Jon
not sure about excel. just select the area and click "CODE"
is the input file actually an .xls? and you need this conversion to work with standard unix utilities, right? I've already got it mocked up in awk, assuming space or tab delimiters.
I updated my post, it should be more clear now.
The input will be a tab-delimited text file saved from excel.
Thank you so much for replying so fast. I'm loving this community already.
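A tab-delimited file saved from Excel typically has DOS line endings, and a cell that contains a comma (such as 3,4) is often wrapped in double quotes. Here is a minimal Python sketch of reading such a file with the standard csv module, which handles both. The function name read_rows and the sample text are made up for illustration, and the three-column layout is assumed from the example input above:

```python
import csv
import io

def read_rows(handle):
    """Yield (gene, samples, identifier) tuples from a tab-delimited file.

    `samples` is a list, because one cell may hold several
    comma-separated samples such as "3,4".
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or short lines
        gene, sample_cell, identifier = row[0], row[1], row[2]
        samples = [s.strip() for s in sample_cell.split(",") if s.strip()]
        yield gene, samples, identifier

# Illustrative input only: a quoted "3,4" cell and DOS line endings.
example = 'gene\tsample\tidentifier\r\na\t1\t@\r\nc\t"3,4"\t@\r\n'
rows = list(read_rows(io.StringIO(example, newline="")))
print(rows)
```

For a real file, open it with open(path, newline="") and pass the handle to read_rows instead of the StringIO object.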
It's very close. It seems to work when there is a single identifier for each gene or sample, but there is a problem with gene c and d. For gene c, two samples (3 and 4) share the same identifier, but it is printing sample 4 incorrectly.
For gene d and sample 5, I was hoping % and * would be printed in the same cell, separated by a comma. It breaks on the second identifier for gene d.
It works perfectly with the test input I had, but it's reproducing the endless loop with my data file. In my data file all three of sample, gene and identifier are strings, not integers; I wonder if that is the issue.
Hey man, I am in no position to complain about time, you've been such an extremely big help, hopefully I'll be able to pay you back. Take your time, this is a favor!
Cheers
no favor. is hobby. keeps my skills going.
it's hard to see if lines up on terminal, how's this look in spreadsheet?
#!/usr/bin/awk -f
# TAB delimited
BEGIN { FS="\t" }
# skip header
NR==1 { next }
# strip DOS line endings
{ sub(/\r$/,"") }
# keep list of unique genes, in order
!($1 in genes_by_name) { genes_by_name[$1]=genes_idx; genes_by_num[genes_idx++]=$1 }
{
    # unquote
    gsub(/(^"|"$)/,"",$2)
    # split() returns the element count; walk by index so sample order is kept
    n = split($2, cols, /,/)
    for (c = 1; c <= n; c++) {
        # keep list of unique samples, in order
        if (! (cols[c] in samples_by_name)) {
            samples_by_name[cols[c]]=samples_idx
            samples_by_num[samples_idx++]=cols[c]
        }
        # append the identifier; the leading comma is stripped when printing
        matrix[$1,cols[c]] = matrix[$1,cols[c]] "," $3
    }
}
END {
    # print header
    printf("gene\t")
    for (i = 0; i < samples_idx; i++)
        printf("%s\t", samples_by_num[i])
    printf("\n")
    # one row per gene, one cell per sample
    for (i = 0; i < genes_idx; i++) {
        printf("%s\t", genes_by_num[i])
        for (j = 0; j < samples_idx; j++)
            printf("%s\t", substr(matrix[genes_by_num[i],samples_by_num[j]],2))
        printf("\n")
    }
}
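matrix = {}
with open('/path/to/input/file.txt', 'r') as f:
    f.readline()  # skip the header line
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        # split on whitespace, which covers tabs and drops the line ending
        (gene, sample, identifier) = line.split()
        # a sample cell such as "3,4" lists several samples; Excel may quote it
        for sam in sample.strip('"').split(','):
            matrix.setdefault(gene, {}).setdefault(sam, []).append(identifier)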
For example:
1- If I want gene "c" in sample "4" I write print(matrix['c']['4'][0])
2- If I want gene "c" in sample "3" I write print(matrix['c']['3'][0])
etc...
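Each cell of that nested dictionary is a list, so gene "d" in sample "5" holds both % and *. To get the table that was asked for, with several identifiers joined by a comma in one cell, something like the following sketch might work. The function name format_table is made up for illustration, and the matrix is rebuilt here from the example input in the first post so the snippet runs on its own:

```python
def format_table(matrix):
    """Return the gene-by-sample table as tab-delimited text."""
    # Collect sample names in the order they were first seen.
    samples = []
    for by_sample in matrix.values():
        for sample in by_sample:
            if sample not in samples:
                samples.append(sample)
    lines = ["\t".join(["gene"] + samples)]
    for gene, by_sample in matrix.items():
        # Join all identifiers of a cell with commas; empty cell if none.
        cells = [",".join(by_sample.get(s, [])) for s in samples]
        lines.append("\t".join([gene] + cells))
    return "\n".join(lines)

# Example input from the first post, as (gene, samples, identifier).
rows = [
    ("a", ["1"], "@"),
    ("b", ["2"], "#"),
    ("c", ["3", "4"], "@"),
    ("d", ["5"], "%"),
    ("d", ["5"], "*"),
]
matrix = {}
for gene, sample_list, identifier in rows:
    for sam in sample_list:
        matrix.setdefault(gene, {}).setdefault(sam, []).append(identifier)

table = format_table(matrix)
print(table)
```

This relies on dictionaries keeping insertion order (Python 3.7 or later), so genes and samples come out in the order they appear in the input.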