How to transpose a column into rows?

Salvatore_espos · December 28, 2018, 4:26am

Hello guys,
First of all happy holidays and happy new year.
I'm new to bioinformatics and this is my first post on this forum, so sorry if I make some mistakes.
I'm writing to ask for your help with a problem:
I have a file like this:

gene1	GO:0016491|GO:0055114
gene2	GO:0004665|GO:0006571|GO:0008977|GO:0055114
gene3	GO:0005515
gene4	GO:0016491|GO:0055114
gene5	GO:0004427|GO:0009678|GO:0015992|GO:0016020

I have to modify the file in order to have an output like this:

gene1      GO:0016491
gene1      GO:0055114
gene2      GO:0004665
gene2      GO:0006571
gene2      GO:0008977
gene2      GO:0055114

Could anyone help me to modify the file, please?
Thank you all in advance for your help.
Sal

bakunin · December 28, 2018, 5:20am

Yes. My suggestion is to enter "column" and "row" as keywords into the advanced search form, select "Search titles only" (directly under the keywords) and hit <ENTER>. You will be presented with the same plethora of hits as I was, because we have had this question over and over again. If there are still questions left, you are welcome to come back and ask them.

I hope this helps.

bakunin

Salvatore_espos · December 28, 2018, 6:13am

Hi Bakunin,
thank you for your suggestion. However, I have already searched different discussions here. There are many examples, but I did not find anything helpful for me.
Could you suggest a specific discussion that I can read in depth, please?
Thank you for replying to me.

bakunin · December 28, 2018, 7:05am

This is because you didn't describe your problem well enough. I mean, you know what you have and you know what you would like to get out of it, but as far as I can see you haven't thought through what it would take to get from here to there. There is no programming involved, just plain thinking:

You have this:

gene1	GO:0016491|GO:0055114
gene2	GO:0004665|GO:0006571|GO:0008977|GO:0055114
gene3	GO:0005515
gene4	GO:0016491|GO:0055114
gene5	GO:0004427|GO:0009678|GO:0015992|GO:0016020

and want to get this:

gene1      GO:0016491
gene1      GO:0055114
gene2      GO:0004665
gene2      GO:0006571
gene2      GO:0008977
gene2      GO:0055114

Now, the first step is: which lines of the output correspond to which input line? Obviously, this line:

gene2	GO:0004665|GO:0006571|GO:0008977|GO:0055114

accounts for these:

gene2      GO:0004665
gene2      GO:0006571
gene2      GO:0008977
gene2      GO:0055114

I suppose the reason why "gene3", "gene4" and "gene5" are missing from what you showed as output is that it is simply a sample and you cut it off somewhere - yes? Or are there filters in place you haven't told us about? If so, which ones?

Now, concentrating on transforming the one line, what did we find:

1) The input line has a first field (like "gene1", "gene2", etc.), which should show up as the first part of the respective output line(s). The field is delimited by the start of the line and the first tab character, if I interpret your data correctly.

2) The second field consists of several sub-fields which are delimited by pipe characters ("|"). For every such sub-field there should be a separate line in the output with the first field and the respective sub-field.
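
The two observations above map almost word for word onto awk. What follows is only a minimal sketch, assuming the two columns really are separated by a single tab; the function name split_go is made up for illustration and is not something from the thread.

```shell
# Hypothetical helper: reads the tab-separated table on stdin.
split_go() {
    awk -F'\t' '
    {
        # Observation 2: break the second field into sub-fields at each pipe.
        n = split($2, terms, "|")
        # Observation 1: repeat the first field in front of every sub-field.
        for (i = 1; i <= n; i++)
            print $1 "\t" terms[i]
    }'
}

# Two sample lines taken from the question.
printf 'gene1\tGO:0016491|GO:0055114\ngene3\tGO:0005515\n' | split_go
```

For a real file one would call it as split_go < input.txt, where input.txt stands for whatever the table is called.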

If this is correct as I described it, the necessary code to implement it already "springs out" of that, no? So try that and show your efforts, then we will go over what you have written and - if necessary - correct it. Some questions you should answer for yourself, though, just to know if you have to guard against such possibilities in your code:

a) Could there be input lines with no second field, like "gene2" here:

gene1	GO:0016491|GO:0055114
gene2	
gene3	GO:0005515

And, if yes, what do you want to do with them?

b) Will the sub-fields in the second field always be of this form ("G"-"O"-":" plus 7 digits 0-9) or might there be something else, like:

gene1	GO:0016491|who knows|GO:0055114

If yes, what should be done with these?

c) Could there be "double entries" like one of these:

gene1	GO:0016491|GO:0016491
gene2	GO:0005515
gene2	GO:0005515

Again, if yes: what do you want to do with these?

d) What about long lines? Might it happen that the line has so many sub-fields that it is broken into the next line like this:

gene1	GO:0016491|GO:0016492|GO:0016493|GO:0016494|GO:0016495|....(a long line)
		GO:0016600|GO:0016601

I am sure you get what this aims at and you surely know your data better than me, so maybe some of my points are moot - but it pays to make oneself aware of the point being moot. So sit down, analyse your problem, try to write some code and show it here. We gladly help, but we help you help yourself, we won't do your work for you.
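
As an illustration of how answers to questions a) to d) could end up in code, here is one possible defensive variant. Every policy in it is an assumption, not something the thread decided: empty second fields and malformed sub-fields are skipped with a warning on stderr, each gene/term pair is printed only once, and a line starting with a tab is treated as a continuation of the previous gene. The name split_go_checked is hypothetical.

```shell
# Hypothetical helper; the handling of a) to d) below is an assumed policy.
split_go_checked() {
    awk -F'\t' '
    NF == 0 { next }                               # ignore blank lines
    {
        # d) a line starting with a tab continues the previous gene
        gene = ($1 != "") ? $1 : last
        last = gene
        # a) no second field, or an empty one: warn and skip the line
        if (NF < 2 || $NF == "") {
            print "no GO terms for " gene | "cat 1>&2"
            next
        }
        n = split($NF, terms, "|")
        for (i = 1; i <= n; i++) {
            # b) anything that is not "GO:" plus digits: warn and skip it
            if (terms[i] !~ /^GO:[0-9]+$/) {
                print "odd entry for " gene ": " terms[i] | "cat 1>&2"
                continue
            }
            # c) print every gene/term pair only once
            if (seen[gene, terms[i]]++)
                continue
            print gene "\t" terms[i]
        }
    }'
}
```

If the real data never contains any of these cases, the extra checks do no harm, but the plain version is then enough.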

I hope this helps.

bakunin

RudiC · December 28, 2018, 8:02am

Welcome to the forum.

On top of what bakunin suggested, you'll find quite a few threads dealing with similar topics and giving you ideas / starting points at the lower left of this page under "More UNIX and Linux Forum Topics You Might Find Helpful". Especially this one comes close to a solution to your problem.

Salvatore_espos · December 29, 2018, 5:55am

Hi guys,
First of all, I would like to say thank you to both of you. I'm sorry if I didn't explain the problem well. Of course I had thought about it, but it was not easy to explain it fully. I'll try to explain the problem better, along with what I've done.
I read what RudiC suggested, and I tried the following code on my table:

BEGIN { FS = OFS = "|" }
{ for (fld = 1; fld <= NF; fld++) {
  print $0
  }
}

After that my file changed from its native form:

VIT_AGLc1g09770.1    GO:0016491|GO:0055114
VIT_AGLc9g366030.1    GO:0004665|GO:0006571|GO:0008977|GO:0055114
VIT_AGLc6g304750.1    GO:0005515
VIT_AGLc1g09770.2    GO:0016491|GO:0055114
VIT_AGLc11g42510.1    GO:0004427|GO:0009678|GO:0015992|GO:0016020
VIT_AGLc11g41480.1    GO:0004672|GO:0005524|GO:0006468
VIT_AGLc1g09770.3    GO:0016491|GO:0055114
VIT_AGLc6g304750.3    GO:0005515
VIT_AGLc1g09770.4    GO:0016491|GO:0055114
VIT_AGLc5g276360.1    GO:0004672|GO:0005524|GO:0006468

To:

VIT_AGLc1g09770.1    GO:0016491|GO:0055114
VIT_AGLc1g09770.1    GO:0016491|GO:0055114
VIT_AGLc9g366030.1    GO:0004665|GO:0006571|GO:0008977|GO:0055114
VIT_AGLc9g366030.1    GO:0004665|GO:0006571|GO:0008977|GO:0055114
VIT_AGLc9g366030.1    GO:0004665|GO:0006571|GO:0008977|GO:0055114
VIT_AGLc9g366030.1    GO:0004665|GO:0006571|GO:0008977|GO:0055114
VIT_AGLc6g304750.1    GO:0005515
VIT_AGLc1g09770.2    GO:0016491|GO:0055114
VIT_AGLc1g09770.2    GO:0016491|GO:0055114

Hence, whenever the second column of my table holds more than one field (separated by |), the whole row is printed once for each field.
This is part of what I was looking for and I was happy to get it. However, I should have only one field in the second column on each row.
Indeed, the expected result should look like this:

VIT_AGLc1g09770.1    GO:0016491
VIT_AGLc1g09770.1    GO:0055114
VIT_AGLc9g366030.1    GO:0004665
VIT_AGLc9g366030.1    GO:0006571
VIT_AGLc9g366030.1    GO:0008977
VIT_AGLc9g366030.1    GO:0055114
VIT_AGLc6g304750.1    GO:0005515
VIT_AGLc1g09770.2    GO:0016491
VIT_AGLc1g09770.2    GO:0055114

Anyway, I'm grateful to you both.
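
The repeated rows come from the print $0 inside the loop: with FS set to "|", NF counts the pipe-separated pieces, so the loop runs once per piece but prints the whole line every time. A minimal correction of that attempt might look like the sketch below. It assumes the name and the first term are separated by a tab, and the function name fix_attempt is made up for illustration.

```shell
# Hypothetical helper: the attempt above with the loop body corrected.
fix_attempt() {
    awk '
    BEGIN { FS = "|" }
    {
        # With FS = "|", $1 is "name<TAB>first-term": separate them at the tab.
        split($1, head, "\t")
        print head[1] "\t" head[2]
        # The remaining pipe-separated fields are already single terms.
        for (fld = 2; fld <= NF; fld++)
            print head[1] "\t" $fld
    }'
}

printf 'VIT_AGLc1g09770.1\tGO:0016491|GO:0055114\n' | fix_attempt
```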

nezabudka · December 29, 2018, 10:03am

awk -F'\t|\\|' '{for(i=2;i<=NF;i++)print $1"\t"$i}'

Salvatore_espos · December 29, 2018, 11:12am

What nezabudka suggested works well. Thank you very much.
One last question: what does -F'\t|\\|' mean? I know that -F'\t' tells awk that my table is separated by tabs, but I don't know this one.

Thank you so much for the help!

nezabudka · December 29, 2018, 11:52am

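-F - sets the field separator, here a regular expression that matches either a tab (\t) or a pipe (|)
| - the unescaped one in the middle means "or"
\\| - an escaped, literal | character; the backslash is doubled because awk first processes escape sequences in the -F string and only then uses the result as a regular expression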
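
To see the separator at work, the same split can also be written with a bracket expression, where | needs no escaping because it is not special inside [...]. This is just a sketch for comparison; both function names are hypothetical, and both variants are expected to give the same output on tab-and-pipe data like the sample in this thread.

```shell
# Alternation: a tab OR a literal pipe, as in the one-liner above.
split_alternation() {
    awk -F'\t|\\|' '{ for (i = 2; i <= NF; i++) print $1 "\t" $i }'
}

# Bracket expression: any single character that is a tab or a pipe.
split_bracket() {
    awk -F'[\t|]' '{ for (i = 2; i <= NF; i++) print $1 "\t" $i }'
}

printf 'gene1\tGO:0016491|GO:0055114\n' | split_alternation
printf 'gene1\tGO:0016491|GO:0055114\n' | split_bracket
```

Because the loop starts at field 2, a line whose second column is empty produces one output row with an empty term, so such lines would need separate handling.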