How to split a column based on |?

Hi all,

Newbie here, so please bear over with my stupid question :slight_smile:

I have used far too long time today on figuring this out, so I hope that someone here can help me move on.

I have some annotation data for a transcriptome where I want to split a column containing NCBI accession IDs into a column with GIs and REFs (I still want keep the rest of the columns - just split the IDs)

This is an example of a row in my data set:

TRINITY_DN17272_c0_g1_i1	gi|242003970|ref|XP_002422928.1|	57.5	127	54	0	2	382	65	191	1.2e-41	176.0

What I want:

TRINITY_DN17272_c0_g1_i1	242003970   XP_002422928.1   57.5	127	54	0	2	382	65	191	1.2e-41	176.0

I have tried the following:

cut -d'|' -f2,4 File.m8 | tr '|' '\t'

It splits perfectly - too perfectly because I loose everything else besides the two columns:

242003970   XP_002422928.1

Can someone please help me out?

Thanks! Birgitte

Your target field separator is not clear - do you want the two elements to be each a field of its own with a <TAB> between them, or both making up one field, separated by a space or two?
For the first case, try

awk '{split ($2, T, "|"); $2 = T[2] OFS T[4]}1' OFS="\t" file
TRINITY_DN17272_c0_g1_i1    242003970    XP_002422928.1    57.5    127    54    0    2    382    65    191    1.2e-41    176.0
1 Like
perl -pae '$F[1] = join "\t", (split /\|/, $F[1])[1,3] and $_ = join "\t", @F' BioBing.example
1 Like

A less brute-force method:

tr '|' '\t' < infile > outfile
1 Like

Hi Corona688,

The OP requires more than just substituting every pipe symbol for a tab.

1 Like

If you columns are fixed, then perhaps you could:-

tab=$(printf "\t")
tr '|' "$tab" < input_file | cut -d "$tab" -f1,3,5,7-

Does that help? What I'm trying is to convert every | to a tab and then cut the fields you want separating on a tab character. I hope it does what you wanted. I've set the variable tab just to make it clear. Feel free to replace it as it works best for you.

Kind regards,
Robin

1 Like

Since gi and ref seems to be always there, perhaps another suggestion:

perl -pe 's/\|?(?:\s+gi|ref)?\|\s*/\t/g' file

Output:

TRINITY_DN17272_c0_g1_i1        242003970       XP_002422928.1  57.5    127     54      02382     65      191     1.2e-41 176.0
1 Like

Thank you! This worked like a charm!