Split command

siya1 · March 13, 2013, 12:14pm

Hi, I have a sequence file that looks like this:

# PH01000000
PH01000000G0240 P.he_genemodel_v1.0 CDS 120721 121773 . - . ID=PH01000000G0240.CDS;Parent=PH01000000G0240
PH01000001G0190 P.he_genemodel_v1.0 mRA 136867 137309 . - . ID=PH01000001G0190.mRNA;Parent=PH01000001G0190
.............................................
PH01278028G0010 P.he_genemodel_v1.0 CDS 27 501.. . - . ID=PH01278028G0010;Description="oereed"
PH01278104G0010 P.he_genemodel_v1.0 CDS 34 171 . - . ID=PH01278104G0010.CDS;Parent=PH01278104G0010

I want to split the first column into two columns, with the first 10 characters as column 1 and the remaining characters as column 2, and keep all the other columns as they are:

PH01000000  G0240 P.he_genemodel_v1.0 CDS 120721 121773 . - . ID=PH01000000G0240.CDS;Parent=PH01000000G0240
PH01000001  G0190 P.he_genemodel_v1.0 mRA 136867 137309 . - . ID=PH01000001G0190.mRNA;Parent=PH01000001G0190
.............................................
PH01278028   G0010 P.he_genemodel_v1.0 CDS 27 501.. . - . ID=PH01278028G0010;Description="oereed"
PH01278104   G0010 P.he_genemodel_v1.0 CDS 34 171 . - . ID=PH01278104G0010.CDS;Parent=PH01278104G0010

I am doing this because I want to modify the first column and then merge the two parts again after the modification.
So is it possible to first split the 1st column into two, and then merge them again after my modification?

What command can I use to split and merge them?

RudiC · March 13, 2013, 12:19pm

One way would be

sed 's:.:& :10' file

to split, and

sed 's: ::' file

to merge again.
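As a quick check of the two commands above, here is a minimal round trip on one sample line from the question. The variable names are only illustrative. The merge step removes the first space on each line, so it assumes the edit made in between does not put a space into column 1.

```shell
# One sample line from the question (fields separated by single spaces).
line='PH01000000G0240 P.he_genemodel_v1.0 CDS 120721 121773 . - . ID=PH01000000G0240.CDS;Parent=PH01000000G0240'

# Split: the numeric flag 10 makes sed replace only the 10th match of ".",
# i.e. the 10th character, with itself followed by a space.
split_line=$(printf '%s\n' "$line" | sed 's:.:& :10')

# Merge: delete the first space on the line, which is the one inserted above.
merged_line=$(printf '%s\n' "$split_line" | sed 's: ::')

printf '%s\n' "$split_line"
printf '%s\n' "$merged_line"
```

The first line printed starts with `PH01000000 G0240`, and the second is identical to the input.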

Don_Cragun · March 13, 2013, 2:07pm

What makes you think you need to split the 1st column before modifying it?

Why not just modify the 1st 10 characters on the line instead of splitting, modifying the 1st 10 characters on the line, and merging?
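One way to modify the first 10 characters in place, sketched here as an illustration and not as the final answer: take the six characters after PH01 as a number, so leading zeros drop out, and rebuild the field. This assumes every ID of interest is PH01 followed by six digits, and the replacement prefix "string" is taken from later in the thread. Assigning to $1 makes awk rejoin the fields with single spaces; for tab-separated input, -F'\t' -v OFS='\t' would be needed.

```shell
line='PH01000001G0190 P.he_genemodel_v1.0 mRA 136867 137309 . - . ID=PH01000001G0190.mRNA;Parent=PH01000001G0190'

result=$(printf '%s\n' "$line" | awk '
# Only touch lines whose first field starts with PH01 and is longer than 10 characters.
substr($1, 1, 4) == "PH01" && length($1) > 10 {
    num  = substr($1, 5, 6) + 0    # "000001" becomes 1; adding 0 forces a number
    rest = substr($1, 11)          # everything after the first 10 characters
    $1   = "string" num rest
}
1')   # the trailing 1 prints every line, changed or not

printf '%s\n' "$result"
```

Lines such as the `# PH01000000` comment or the row of dots do not pass the test on the first field and are printed unchanged.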

siya1 · March 13, 2013, 2:32pm

My required result is:

string0G0240 P.he_genemodel_v1.0 CDS 120721 121773 . - . ID=PH01000000G0240.CDS;Parent=PH01000000G0240
string1G0190 P.he_genemodel_v1.0 mRA 136867 137309 . - . ID=PH01000001G0190.mRNA;Parent=PH01000001G0190
.............................................
string278028G0010 P.he_genemodel_v1.0 CDS 27 501.. . - . ID=PH01278028G0010;Description="oereed"
string278104G0010 P.he_genemodel_v1.0 CDS 34 171 . - . ID=PH01278104G0010.CDS;Parent=PH01278104G0010

So to make this happen, I need to replace entries of the form
PH01 with string directly in the first column.
But if I do that,

the entry
PH01278028G0010 will become string278028G0010, as I require,
but the entry
PH01000000G0240 will become string000000G0240, which I want as string0G0240.
So I thought I would split after the first 10 characters and do a selective replacement on the first column only.

Is my approach too roundabout?
Thanks, by the way, for your feedback!

Don_Cragun · March 13, 2013, 3:28pm

siya,
Your description of what you are trying to do is not at all clear. Looking at the "required result" in message #4 in this thread, I'm guessing that you want to replace PH01 immediately followed by up to four zeros with string. If that is what you want, the following awk script will do that for you:

awk 'match($1, /^PH010{0,4}/) {
        $1 = "string" substr($1, RLENGTH+1)
}
1' input

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk.

If the file input contains the data specified in message #1 in this thread, the output is:

string0G0240 P.he_genemodel_v1.0 CDS 120721 121773 . - . ID=PH01000000G0240.CDS;Parent=PH01000000G0240
string1G0190 P.he_genemodel_v1.0 mRA 136867 137309 . - . ID=PH01000001G0190.mRNA;Parent=PH01000001G0190
.............................................
string278028G0010 P.he_genemodel_v1.0 CDS 27 501.. . - . ID=PH01278028G0010;Description="oereed"
string278104G0010 P.he_genemodel_v1.0 CDS 34 171 . - . ID=PH01278104G0010.CDS;Parent=PH01278104G0010

which matches what you specified in message #4 in this thread.
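For readers checking the interval expression against the sample IDs: `0{0,N}` means "between zero and N zeros", and the match is greedy. Each sample ID has six digits after PH01, so the upper bound decides how many zeros survive. With a bound of four, an ID such as PH01000000G0240 would keep two zeros. The sketch below is a sed variant, not the script posted above. It assumes the goal is to strip PH01 plus up to five leading zeros, so at least one digit always remains, and it uses POSIX basic-regex interval syntax. The sample IDs come from this thread; the variable names are illustrative.

```shell
ids='PH01000000G0240
PH01000001G0190
PH01270000G0010
PH01278028G0014'

# ^PH01      anchors the match at the start of the line, so the ID= attribute
#            later on a full line would be left alone.
# 0\{0,5\}   consumes up to five zeros after PH01 (assumed bound, see above).
converted=$(printf '%s\n' "$ids" | sed 's/^PH010\{0,5\}/string/')

printf '%s\n' "$converted"
```

On these four IDs the result is string0G0240, string1G0190, string270000G0010 and string278028G0014. Adjust the bound if your IDs have a different number of digits.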

siya1 · March 13, 2013, 5:30pm

Hi,
Sorry for the confusion!!

I want to basically convert ONLY the first column of my entire sequence
from

PH01000000G0240                   to                       string0G0240
PH01000001G0190                   to                       string1G0190
PH01000002G0120                   to                       string2G0120

,....
....

PH01270000G0010                   to                       string270000G0010   
PH01278028G0014                   to                       string278028G0014   
PH012781040010                     to                       string278104G0010

With respect to the code, why does it have {0,4} in the initial part?

I didn't understand this part of the code: awk 'match($1, /^PH010{0,4}/)
Please advise.
Thanks

Don_Cragun · March 13, 2013, 9:35pm

Apparently my script didn't work for you. That is because you won't describe in English the transformation that is to be performed. I explained in my last post what the script I gave you would do. And, it made all of the transformations your 5 examples showed.

But, it will not insert the G shown in red in your new example. That G did not appear at all in the 1st string whether or not we would break it into an initial 10 character field and a 2nd field with the remaining characters, or left it as a single field.

PLEASE explain in English what you are trying to do instead of giving a small set of inconsistent examples!

siya1 · March 14, 2013, 7:42pm

Thanks for the suggestions. I was able to figure out a solution myself, though.