How to split a column based on |?

BioBing · June 19, 2017, 9:45am

Hi all,

Newbie here, so please bear over with my stupid question

I have used far too long time today on figuring this out, so I hope that someone here can help me move on.

I have some annotation data for a transcriptome where I want to split a column containing NCBI accession IDs into a column with GIs and REFs (I still want keep the rest of the columns - just split the IDs)

This is an example of a row in my data set:

TRINITY_DN17272_c0_g1_i1	gi|242003970|ref|XP_002422928.1|	57.5	127	54	0	2	382	65	191	1.2e-41	176.0

What I want:

TRINITY_DN17272_c0_g1_i1	242003970   XP_002422928.1   57.5	127	54	0	2	382	65	191	1.2e-41	176.0

I have tried the following:

cut -d'|' -f2,4 File.m8 | tr '|' '\t'

It splits perfectly - too perfectly because I loose everything else besides the two columns:

242003970   XP_002422928.1

Can someone please help me out?

Thanks! Birgitte

RudiC · June 19, 2017, 10:26am

Your target field separator is not clear - do you want the two elements to be each a field of its own with a <TAB> between them, or both making up one field, separated by a space or two?
For the first case, try

awk '{split ($2, T, "|"); $2 = T[2] OFS T[4]}1' OFS="\t" file
TRINITY_DN17272_c0_g1_i1    242003970    XP_002422928.1    57.5    127    54    0    2    382    65    191    1.2e-41    176.0

Aia · June 19, 2017, 10:30am

perl -pae '$F[1] = join "\t", (split /\|/, $F[1])[1,3] and $_ = join "\t", @F' BioBing.example

Corona688 · June 19, 2017, 10:56am

A less brute-force method:

tr '|' '\t' < infile > outfile

Aia · June 19, 2017, 11:31am

Hi Corona688,

The OP requires more than just substituting every pipe symbol for a tab.

rbatte1 · June 19, 2017, 12:15pm

If you columns are fixed, then perhaps you could:-

tab=$(printf "\t")
tr '|' "$tab" < input_file | cut -d "$tab" -f1,3,5,7-

Does that help? What I'm trying is to convert every | to a tab and then cut the fields you want separating on a tab character. I hope it does what you wanted. I've set the variable tab just to make it clear. Feel free to replace it as it works best for you.

Kind regards,
Robin

Aia · June 19, 2017, 11:06pm

biobing:

Hi all,

Newbie here, so please bear over with my stupid question

I have used far too long time today on figuring this out, so I hope that someone here can help me move on.

I have some annotation data for a transcriptome where I want to split a column containing NCBI accession IDs into a column with GIs and REFs (I still want keep the rest of the columns - just split the IDs)

This is an example of a row in my data set:
TRINITY_DN17272_c0_g1_i1	gi|242003970|ref|XP_002422928.1|	57.5	127	54	0	2	382	65	191	1.2e-41	176.0
What I want:
TRINITY_DN17272_c0_g1_i1	242003970   XP_002422928.1   57.5	127	54	0	2	382	65	191	1.2e-41	176.0
I have tried the following:
cut -d'|' -f2,4 File.m8 | tr '|' '\t'
It splits perfectly - too perfectly because I loose everything else besides the two columns:
242003970   XP_002422928.1
Can someone please help me out?

Thanks! Birgitte

Since gi and ref seems to be always there, perhaps another suggestion:

perl -pe 's/\|?(?:\s+gi|ref)?\|\s*/\t/g' file

Output:

TRINITY_DN17272_c0_g1_i1        242003970       XP_002422928.1  57.5    127     54      02382     65      191     1.2e-41 176.0

BioBing · June 20, 2017, 3:14am

Thank you! This worked like a charm!

aia:

Since gi and ref seems to be always there, perhaps another suggestion:
perl -pe 's/\|?(?:\s+gi|ref)?\|\s*/\t/g' file
Output:
TRINITY_DN17272_c0_g1_i1        242003970       XP_002422928.1  57.5    127     54      02382     65      191     1.2e-41 176.0