extracting substrings

smriti_shridhar · December 1, 2008, 1:34am

Hi guys,
I am stuck on this problem. Please help.

I have two files.
FILE1 (with records starting with '>')
>TC1723_3 similar to Scific_A7Q9Q3
EMSPSQDYCDDYFKLTYPCTAGAQYYGRGALPVYWNYNYGAIGEALKLDLLNHPEYIEQN
ATMAFQAAIWRWMNPMKKGQPSAHDAFVGNWKP
>TC214_2 similar to Quiet_Ref100_Q8W2B2 Cluster; Capsule catabar holesome, partial (58%)
S**ELSSCYQRRKMRYSFLIFLTLALLLTTSSAQQCGKQAGGRVCANKLCCSQYGFCGS
SRNYCGAGCQSNCRSVASGNTESEAANAHRKNLPGHSN*SCYSF*FTMNIIMFHVCLLR
TTNKN

FILE2 (with 3 columns: col1 is the ID; col2 and col3 are the substring coordinates). It is a single-space-separated file, but shown with '-' for clarity.
TC1723_3 - 10 - 40
TC214_2 - 5 - 115

I need the OUTPUT FILE as -
>TC1723_3 similar to Scific_A7Q9Q3 (Region 10 - 40 of 95)
DYFKLTYPCTAGAQYYGRGALPVYWNYNYGA
>TC214_2 similar to Quiet_Ref100_Q8W2B2 Cluster; n=1; Capsule catabar holesome, partial (58%) (Region 5 - 115 of 125)
SSCYQRRKMRYSFLIFLTLALLLTTSSAQQCGKQAGGRVCANKLCCSQYGFCGSSRNYC
GAGCQSNCRSVASGNTESEAANAHRKNLPGHSN*SCYSFFTMNIIMFHV
where (Region 10 - 40 of 95) gives the region of the substring, and 95 is the total length of the full sequence following the line beginning with '>'.

Thanks in advance.

otheus · December 5, 2008, 7:48am

awk '/^>/ { last=$0; next; } !/^>/ { print last,substr($0,10,30); }' FILE1

In the unlikely case that awk reports that it doesn't know about "substr", use "nawk", "mawk", or "gawk".
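A quick way to check whether your awk has substr is to run it on a short literal string. This small check (the test string is just an illustration) also shows that substr counts characters from 1 and takes a length, not an end position:

```shell
# substr(s, 3, 4): start at the 3rd character, take 4 characters.
check=$(echo ABCDEFGHIJ | awk '{ print substr($0, 3, 4) }')
echo "$check"   # prints CDEF
```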

Also, you said 10 - 40. I'm assuming that means starting at the 10th character and stopping at, and including, the 39th character (the 30 characters that substr($0,10,30) returns).
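The one-liner above hard-codes one region and cuts each sequence line separately. A fuller sketch that reads the coordinates from FILE2 and joins the wrapped sequence lines before cutting might look like the following. It rests on several assumptions, not confirmed by the poster: the ID is the first word after '>'; the coordinates are 0-based and inclusive (the sample output for TC1723_3 starts at the 11th residue and has 31 characters, which suggests this; set base=1 if they are 1-based); and the output is wrapped at 60 characters. The demo/ directory and the function name flush_record are made up for the example.

```shell
#!/bin/sh
# Sample inputs copied from the question (first record only); FILE2 uses single spaces.
mkdir -p demo
cat > demo/FILE1 <<'EOF'
>TC1723_3 similar to Scific_A7Q9Q3
EMSPSQDYCDDYFKLTYPCTAGAQYYGRGALPVYWNYNYGAIGEALKLDLLNHPEYIEQN
ATMAFQAAIWRWMNPMKKGQPSAHDAFVGNWKP
EOF
printf 'TC1723_3 10 40\n' > demo/FILE2

# base=0: coordinates count from 0 (assumption taken from the sample output).
# width=60: wrap length for sequence lines (assumption).
awk -v base=0 -v width=60 '
# Print the requested region of the record collected so far, then reset.
function flush_record(   id, region) {
    if (header == "") return
    id = substr(header, 2)
    sub(/[ \t].*/, "", id)                 # ID is the first word after ">"
    if (id in from) {
        region = substr(seq, from[id] - base + 1, to[id] - from[id] + 1)
        printf "%s (Region %d - %d of %d)\n", header, from[id], to[id], length(seq)
        while (length(region) > width) {   # re-wrap the extracted region
            print substr(region, 1, width)
            region = substr(region, width + 1)
        }
        if (region != "") print region
    }
    header = ""; seq = ""
}
NR == FNR { from[$1] = $2; to[$1] = $3; next }   # first file: FILE2 coordinates
/^>/      { flush_record(); header = $0; next }  # new record starts
          { seq = seq $0 }                       # join wrapped sequence lines
END       { flush_record() }
' demo/FILE2 demo/FILE1 > demo/OUTPUT

cat demo/OUTPUT
```

Records whose ID is not listed in FILE2 are skipped. The sequence as posted has 93 residues, so the header reads "of 93" rather than the 95 shown in the question; the extracted line is DYFKLTYPCTAGAQYYGRGALPVYWNYNYGA, matching the requested output.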