find common entries and match the number with long sequence and cut that sequence in output

manigrover · September 18, 2012, 2:18am

Hi all,

I have a file like this:

ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT   05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2004, sequence version 1.
DT   05-SEP-2012, entry version 71.
FT   COILED       59    140       Potential.
FT   COILED      169    272       Potential.
SQ   SEQUENCE   393 AA;  43499 MW;  3693431765F90FDC CRC64;
     MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI
     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR
     RLAKEAQQET QKAALRYERA VSMHNAAREM VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV
     NEAEEERLRG EREHQRVTRL CQQAEARVQA LQKTLRRAIG KSRPYFELKA QFSQILEEHK
     AKVTELEQQV AQAKTRYSVA LRNLEQISEQ IHARRRGGLP PHPLGPRRSS PVGAEAGPED
     MEDGDSGIEG AEGAGLEEGS SLGPGPAPDT DTLSLLSLRT VASDLQKCDS VEHLRGLSDH
     VSLDGQELGT RSGGRRGSDG GARGGRHQRS VSL

The expected output is:

First line: whatever follows ID, i.e. the alphanumeric code before "_HUMAN".

Next lines: every "FT" line containing "COILED", with its two numbers.

Next lines: take the numbers given after the word COILED and cut the sequence below accordingly.

So 59-140 should be one sequence and 169-272 the other sequence.

3BP5L
COILED       59    140       
COILED      169    272
MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI
     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR
     RLAKEAQQET QKAALRYERA VSMHNAAREM (this should be the sequence for 59-140; I cut it by hand just as an example)

VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV
     NEAEEERLRG EREHQRVTRL CQQAEARVQA LQKTLRRAIG KSRPYFELKA QFSQILEEHK
     AKVTELEQQV AQAKTRYSVA LRNLEQISEQ IHARRRGGLP PHPLGPRRSS PVGAEAGPED
     MEDGDSGIEG AEGAGLEEGS SLGPGPAPDT DTLSLLSLRT VASDLQKCDS VEHLRGLSDH
     VSLDGQELGT RSGGRRGSDG GARGGRHQRS VSL (this should be the sequence for 169-272; I cut it by hand just as an example)

In the output above I put the sequences in just as an example. What I want is to cut the ranges given by these numbers out of the whole sequence and write them to the output.
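A minimal sketch of the cutting step, in Python rather than shell, assuming the positions are 1-based and inclusive and that the blanks in the sequence lines are only there for grouping. The function name cut_range is made up for this example.

```python
def cut_range(seq_lines, start, end):
    """Return residues start..end (assumed 1-based, inclusive) from wrapped lines."""
    # Join the lines and drop every blank used for indentation or grouping.
    sequence = "".join("".join(line.split()) for line in seq_lines)
    # Python slices are 0-based and exclude the end, hence start - 1.
    return sequence[start - 1:end]


# First three sequence lines from the posted entry (residues 1-180).
seq_lines = [
    "     MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI",
    "     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR",
    "     RLAKEAQQET QKAALRYERA VSMHNAAREM VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV",
]

first = cut_range(seq_lines, 59, 140)
print(len(first), first[:10], first[-10:])  # 82 RIQEELEHLN QKAALRYERA
```

Under that assumption the 59-140 piece starts two residues before the end of the first line and ends after the second group of the third line, so it is not the same as the hand-cut example in the post.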

DGPickett · September 19, 2012, 1:15pm

Is 59 zero-based, as in start with the 60th character, and is 140 one-based, as in stop with the 140th character? (I guess it could also be zero-based: stop before the character at offset 140.)
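The two readings differ by one residue at the start, which a tiny example makes concrete. This is only a sketch in Python with made-up function names. As far as I know, UniProt feature positions count from 1 and include both ends, but that is worth checking against the expected output.

```python
def slice_one_based_inclusive(seq, start, end):
    # Positions count from 1 and both ends are included.
    return seq[start - 1:end]


def slice_zero_based_half_open(seq, start, end):
    # Offsets count from 0 and the end offset is excluded.
    return seq[start:end]


seq = "ABCDEFGHIJ"
print(slice_one_based_inclusive(seq, 3, 5))   # CDE
print(slice_zero_based_half_open(seq, 3, 5))  # DE
```

For 59 and 140 the first reading gives 140 - 59 + 1 = 82 residues and the second gives 140 - 59 = 81.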

The values are indented and divided into groups -- is that what the file looks like? You want the same indentation and division in the output?
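If the same layout is wanted in the output, one option is to cut from the joined sequence and regroup afterwards. A small Python sketch, assuming groups of 10 residues, 6 groups per line and a 5-blank indent as in the posted file. The name format_blocks and its parameters are invented for illustration.

```python
def format_blocks(seq, block=10, per_line=6, indent="     "):
    """Re-wrap a plain sequence into indented, blank-separated groups."""
    # Split into fixed-size groups; the last one may be shorter.
    blocks = [seq[i:i + block] for i in range(0, len(seq), block)]
    # Put per_line groups on each output line, separated by single blanks.
    lines = [" ".join(blocks[i:i + per_line])
             for i in range(0, len(blocks), per_line)]
    return "\n".join(indent + line for line in lines)


print(format_blocks("MAELRQVPGGRETPQGELRPEVVED", per_line=2))
```

Note that the groups restart at the first residue of the cut piece, so they will not line up with the groups of the original file.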

ksh/bash can read the 6 fields into an array of variables, save the pairs of numbers in a secondary array, mathematically decompose the numbers into line, field and offset, and compose the output with the same white space. Certain field values on a line set a state variable, and the right state starts the field cutting. The initial state says to capture and print from the ID line. The second state is to look for and print the FT COILED lines, or to find SQ SEQUENCE. SQ SEQUENCE then says to start looking for the subset of output fields. I assume the COILED lines come in ascending order.
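The state-driven approach can be sketched as follows. This is Python, not ksh/bash, and it simplifies: it joins the groups into one string and slices it instead of composing output field by field. The helper locate shows the line/field/offset arithmetic separately. Assumptions: positions are 1-based and inclusive, lines hold 6 groups of 10 residues, and one entry per input. All names are made up for the example.

```python
SAMPLE = """\
ID   3BP5L_HUMAN             Reviewed;         393 AA.
AC   Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
FT   COILED       59    140       Potential.
FT   COILED      169    272       Potential.
SQ   SEQUENCE   393 AA;  43499 MW;  3693431765F90FDC CRC64;
     MAELRQVPGG RETPQGELRP EVVEDEVPRS PVAEEPGGGG SSSSEAKLSP REEEELDPRI
     QEELEHLNQA SEEINQVELQ LDEARTTYRR ILQESARKLN TQGSHLGSCI EKARPYYEAR
     RLAKEAQQET QKAALRYERA VSMHNAAREM VFVAEQGVMA DKNRLDPTWQ EMLNHATCKV
     NEAEEERLRG EREHQRVTRL CQQAEARVQA LQKTLRRAIG KSRPYFELKA QFSQILEEHK
     AKVTELEQQV AQAKTRYSVA LRNLEQISEQ IHARRRGGLP PHPLGPRRSS PVGAEAGPED
     MEDGDSGIEG AEGAGLEEGS SLGPGPAPDT DTLSLLSLRT VASDLQKCDS VEHLRGLSDH
     VSLDGQELGT RSGGRRGSDG GARGGRHQRS VSL
"""


def parse_entry(lines):
    """Walk the lines once; the state decides what each line may mean."""
    state = "ID"                      # ID -> FT -> SQ
    entry_id, ranges, groups = None, [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if state == "ID" and fields[0] == "ID":
            entry_id = fields[1].split("_")[0]   # part before _HUMAN
            state = "FT"
        elif state == "FT" and fields[:2] == ["FT", "COILED"]:
            ranges.append((int(fields[2]), int(fields[3])))
        elif state == "FT" and fields[0] == "SQ":
            state = "SQ"
        elif state == "SQ":
            if fields[0] == "//":     # entry terminator, if the file has one
                break
            groups.extend(fields)     # each field is one group of residues
    return entry_id, ranges, "".join(groups)


def locate(pos, per_line=60, per_group=10):
    """Decompose a 1-based position into 0-based (line, field, offset)."""
    line, within = divmod(pos - 1, per_line)
    field, offset = divmod(within, per_group)
    return line, field, offset


entry_id, ranges, seq = parse_entry(SAMPLE.splitlines())
print(entry_id)
for start, end in ranges:
    print("COILED", start, end)
for start, end in ranges:
    print(seq[start - 1:end])         # assumed 1-based, inclusive
print(locate(59), locate(140))        # (0, 5, 8) (2, 1, 9)
```

Joining first keeps the cutting to a single slice. The locate arithmetic only matters if the output has to be built directly from the original lines and fields.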