Hi all,
Newbie here, so please bear over with my stupid question
I have used far too long time today on figuring this out, so I hope that someone here can help me move on.
I have some annotation data for a transcriptome where I want to split a column containing NCBI accession IDs into a column with GIs and REFs (I still want keep the rest of the columns - just split the IDs)
This is an example of a row in my data set:
TRINITY_DN17272_c0_g1_i1 gi|242003970|ref|XP_002422928.1| 57.5 127 54 0 2 382 65 191 1.2e-41 176.0
What I want:
TRINITY_DN17272_c0_g1_i1 242003970 XP_002422928.1 57.5 127 54 0 2 382 65 191 1.2e-41 176.0
I have tried the following:
cut -d'|' -f2,4 File.m8 | tr '|' '\t'
It splits perfectly - too perfectly because I loose everything else besides the two columns:
242003970 XP_002422928.1
Can someone please help me out?
Thanks! Birgitte
RudiC
June 19, 2017, 10:26am
2
Your target field separator is not clear - do you want the two elements to be each a field of its own with a <TAB> between them, or both making up one field, separated by a space or two?
For the first case, try
awk '{split ($2, T, "|"); $2 = T[2] OFS T[4]}1' OFS="\t" file
TRINITY_DN17272_c0_g1_i1 242003970 XP_002422928.1 57.5 127 54 0 2 382 65 191 1.2e-41 176.0
1 Like
Aia
June 19, 2017, 10:30am
3
perl -pae '$F[1] = join "\t", (split /\|/, $F[1])[1,3] and $_ = join "\t", @F' BioBing.example
1 Like
A less brute-force method:
tr '|' '\t' < infile > outfile
1 Like
Aia
June 19, 2017, 11:31am
5
Hi Corona688,
The OP requires more than just substituting every pipe symbol for a tab.
1 Like
If you columns are fixed, then perhaps you could:-
tab=$(printf "\t")
tr '|' "$tab" < input_file | cut -d "$tab" -f1,3,5,7-
Does that help? What I'm trying is to convert every |
to a tab and then cut the fields you want separating on a tab character. I hope it does what you wanted. I've set the variable tab just to make it clear. Feel free to replace it as it works best for you.
Kind regards,
Robin
1 Like
Aia
June 19, 2017, 11:06pm
7
biobing:
Hi all,
Newbie here, so please bear over with my stupid question
I have used far too long time today on figuring this out, so I hope that someone here can help me move on.
I have some annotation data for a transcriptome where I want to split a column containing NCBI accession IDs into a column with GIs and REFs (I still want keep the rest of the columns - just split the IDs)
This is an example of a row in my data set:
TRINITY_DN17272_c0_g1_i1 gi|242003970|ref|XP_002422928.1| 57.5 127 54 0 2 382 65 191 1.2e-41 176.0
What I want:
TRINITY_DN17272_c0_g1_i1 242003970 XP_002422928.1 57.5 127 54 0 2 382 65 191 1.2e-41 176.0
I have tried the following:
cut -d'|' -f2,4 File.m8 | tr '|' '\t'
It splits perfectly - too perfectly because I loose everything else besides the two columns:
242003970 XP_002422928.1
Can someone please help me out?
Thanks! Birgitte
Since gi and ref seems to be always there, perhaps another suggestion:
perl -pe 's/\|?(?:\s+gi|ref)?\|\s*/\t/g' file
Output:
TRINITY_DN17272_c0_g1_i1 242003970 XP_002422928.1 57.5 127 54 02382 65 191 1.2e-41 176.0
1 Like
Thank you! This worked like a charm!
aia:
Since gi and ref seems to be always there, perhaps another suggestion:
perl -pe 's/\|?(?:\s+gi|ref)?\|\s*/\t/g' file
Output:
TRINITY_DN17272_c0_g1_i1 242003970 XP_002422928.1 57.5 127 54 02382 65 191 1.2e-41 176.0