how to fetch substring from records into another file

Hi all,
Im stuck in findind solution to ths problem. Please guide me if u have any ideas.

I have two files.

===FILE1===
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

=====FILE2====
bi|2138271|geb|AAC15885----- 20 ---- 32
bi|2138271|geb|AAC15885 ----- 92 ---- 110
bi|19526601|geb|AAL8700 ----- 74 ---- 92
bi|2138271|geb|AAC15885 ----- 26 ---- 38
bi|21083|em|CAA26939.1| ----- 19 ---- 37
bi|21083|em|CAA26939.1| ----- 52 ---- 70

please note that ----- is not present in actual file i just wrote to make the columns visible otherwise it is a tab separated file.

It has three columns-
column1 contains patterns from FILE1
column 2 contains the start position to cut the string
and column3 contains end position til the string has to be cut
WHERE POSITION 1 STARTS FROM THE LINE CONTAINING ALL THE ALPHABETS IN CAPS WITHOUT ANY SPACE AND ENDS WHERE NEXT '>' SYMBOL APPEARS

I want to have a new file that contains header(LINE BEGINING WITH '>' SYMBOL) maching first column of FILE2 and then only a substring

for eg- in case of first record of FILE2

The ouput file should contain-
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-32)
SVGIQGIDYPSV

and so on------ for all the other records of FILE2

Your contribution is highly appretiated.
Thanks
Smriti :slight_smile:

I am not sure that the output of the following script is what what you are looking for.
Try and adapt it.

awk -F'\t' '

NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;
   k = ">" substr($0,1, Key_len-1);
   n = ++Keys[k];
   From[k,n] = $2;
     To[k,n] = $3;
    Len[k,n] = $3 - $2;
   next;
}

function print_selected(    i,k) {
   if (selected) {
      k = substr(Header, 1, Key_len);
      for (i=1; i<=Keys[k]; i++) {
         printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
         print substr(Alphabets, From[k,i], Len[k,i]);
      }
   }
}

/^>/ {
   print_selected();
   selected  = (substr($0, 1, Key_len) in Keys);
   Header    = $0;
   Alphabets = "";
   next;
}

selected {
   Alphabets = Alphabets $0;
}

END {
   print_selected();
}
' FILE2 FILE1

Input file FIL1 :

>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

Input file FILE2 :

bi|2138271|geb|AAC15885 20      32
bi|2138271|geb|AAC15885 92      110
bi|19526601|geb|AAL8700 74      92
bi|2138271|geb|AAC15885 26      38
bi|21083|em|CAA26939.1| 19      37
bi|21083|em|CAA26939.1| 52      70

Output:

>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-32)
SVGIQGIDYPSV
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (92-110)
LWPCGTQRNQQWTFYTDD
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (26-38)
IDYPSVSFNLAG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (19-37)
LCFGSTSGWSFTLEDNNI
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (52-70)
TVQSYTNFIRAVRGRLTT
>bi|19526601|geb|AAL87006.1| chain A [Viscum album] (74-92)
NLYVVAYQAGDQSYFLRD

Jean-Pierre.

hey! this is wht I ws exactly looking for. Thank you so much Jean.

you know what Jean? It didn't work initially. Then I removed the field separator - tab to keep the space as default value and I also changed it in FILE2. Then it worked perfectly.

Before also I noticed that ths field separator value does't work in my shell. Tell me if you know why this can happen.

Thanks again. :slight_smile:
smriti

I run that script on files which need to extract longer substrings and the format of output file is getting disturbed. I tried some options of printf but coudn't correct it. Please help me out.

===FILE1====
>bi|37779709|geb|AAP20876.1| kinetic [ ternata]
MASKLLLFLLPAILGLIIPRPAVAVGTNYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTAN
KGRDCKLTLTDRGELVINNGEGSAVWRSGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLG
NVPFTCNMLFSGQVLYGDGKITARNHMLVMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKD
DDFKSIWSSQSSSKQGDYVFILQDNGYGVIYGPAIWATSSKRSVAAQETMIGMVTEKVN
>bi|146403769|geb|ABQ32294.1| pure [an eg]
MAKLLLFLLPAILGLLIPRSAVALGTNYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANN
GRDCKLTLTDYGELVIKNGDGSTVWRSRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRN
IPFTDNLLFSGQVLYGDGRLTAKNHQLVMQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDD
DFRPSGAAVPAPSR

===FILE2====
bi|37779709|geb|AAP20876.1| 28 264
bi|146403769|geb|ABQ32294.1| 27 224

===OUTPUTFILE===
>gi|37779709|gb|AAP20876.1| lectin [Pinellia ternata] (28-264)
NYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTANKGRDCKLTLTDRGELVINNGEGSAVWRSGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLGNVPFTCNMLFSGQVLYGDGKITARNHMLVMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKDDDFKSIWSSQSSSKQGDYVFILQDNGYGVIYGPAIWATSSKRSVAAQETMIGM
>gi|146403769|gb|ABQ32294.1| lectin [Colocasia esculenta] (27-224)
NYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANNGRDCKLTLTDYGELVIKNGDGSTVWRSRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRNIPFTDNLLFSGQVLYGDGRLTAKNHQLVMQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDDDFRPSGAAVPAPS

where as the output should be like this:
>gi|37779709|gb|AAP20876.1| lectin [Pinellia ternata] (28-264)
NYLLSGETLDTDGHLKNGDFDFIMQEDCNAVLYNGNWQSNTANKGRDCKLTLTDRGELVINNGEGSAVWR
SGSQSAKGNYAAVLHPEGKLVIYGPSVFKINPWVPGLNSLRLGNVPFTCNMLFSGQVLYGDGKITARNHML
VMQGDCNLVLYGGKCDWQSNTHGNGEHCFLRLNHKGELIIKDDDFKSIWSSQSSSKQGDYVFILQDNGY
GVIYGPAIWATSSKRSVAAQETMIGM
>gi|146403769|gb|ABQ32294.1| lectin [Colocasia esculenta] (27-224)
NYLLSGQTLNTDGHLKNGDFDLVMQNDCNLVLYNGNWQSNTANNGRDCKLTLTDYGELVIKNGDGSTVWR
SRAKSVKGNYAAVLHPDGRLVVFGPSVFKIDPWVPGLNSLRFRNIPFTDNLLFSGQVLYGDGRLTAKNHQLV
MQGDCNLVLYGGKYGWQSNTHGNGEHCFLRLNHKGELIIKDDDFRPSGAAVPAPS

where each line after header line should not have more than 70 characters.

I will be thankful to you. :slight_smile:
-smriti

Should the overlong lines be truncated or folder over multiple lines, or what?

To truncate (this truncates to eight characters; feel free to add more dots):

sed -e 's/^\(........\).*/\1/'

To fold, have a look at the fold command, or maybe use something like

sed -e 's/\(.......\)\(.\)/\1\
\2/g'

(Yes, that's a literal newline, to do the folding.) Again, this example folds at every eight characters; feel free to add more dots. (Actually the first line will be one character shorter. I'm too lazy to fix that. Or rather, the fix will depend on your sed dialect, and I don't want to go there.)

This could just as well be done with awk or perl or the truncation even with cut.

I don't want to truncate it and cut will also do that only. one more thing is that i want to apply ths folding only on lines after the line with '>' symbol.

and
sed -e 's/\(.......\)\(.\)/\1\
\2/g'
is giving a vague output ---
It is giving a / after every eight characters.

I gave the command as
sed -e 's/\(.......\)\(.\)/\1\\2/g' filename

Thanks
-smriti

No, there should be a literal newline between \1\ and \2

Anyway, fold should not touch lines which are shorter than 70 characters. As long as your > lines are shorter than that, you should be fine.

Here's another one for you:

perl -pe 's/(.{70})(?!$)/$1\n/g unless m/^>/'

I tried
perl -pe 's/(.{70})(?!$)/$1\n/g unless m/^>/' file_test

it is giving the error---

perl -pe 's/(.{70})(?file_test)/$1\n/g unless m/^>/' file_test
Sequence (?f...) not recognized in regex; marked by <-- HERE in m/(.{70})(?f <-- HERE ile_test)/ at -e line 1.

I tried this

perl -pe 's/(.{70})(?!$)/$1\n/g unless m/^>/' file_test

and it gave the error

Sequence (?f...) not recognized in regex; marked by <-- HERE in m/(.{70})(?f <-- HERE ile_test)/ at -e line 1.

what exactly is (?!$) doing?

Replace the print_selected function by :

function print_selected(    i,k,p,str) {
   if (selected) {
      k = substr(Header, 1, Key_len);
      for (i=1; i<=Keys[k]; i++) {
         printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
         str = substr(Alphabets, From[k,i], Len[k,i]);
         p = 1
         while (p<=Len[k,i]) {
            print substr(str, p, 70);
            p += 70;
         }
      }
   }
}

jean-Pierre.

Do you have a really old version of Perl? In Perl 5 regular expressions, ?! is a negative assertion lookahead; it checks that the text just ahead of the match does not match the given regular expression (in this case, end of line). I used that to avoid the problem I had with the sed script, which caused the first line to be wrapped one character earlier than subsequent ones.

You could try this as well:

perl -pe 'unless (m/^>/) { chomp if (length() % 70 == 1); s/(.{70})/$1\n/g; }'

As aigles suggests, if the output from the awk script is wrong, fixing the awk script is probably better, though.

your code is giving the desired output. Thanks.

ya, I'll prefer to change awk script coz that makes my task easier.

I hope one day i'll learn to code better, like you guys. :slight_smile:

-smriti

The newer version is not working in case the substring is smaller than 70 characters.

Actually FILE2 is ouput of some other program and the substring can be as small as 10 characters and as long as 300 characters or more..

for the same FILE1, FILE2 can be--
bi|37779709|geb|AAP20876.1| 28 264
bi|146403769|geb|ABQ32294.1| 27 224
bi|146403769|geb|ABQ32294.1| 20 34

in which the third row needs to extract only 14 characters.. :confused:

-smriti

I tried it again and its running with the following FILE2

bi|37779709|geb|AAP20876.1| 28 264
bi|146403769|geb|ABQ32294.1| 27 224
bi|146403769|geb|ABQ32294.1| 20 34

BUT IF DIFFERENT ID LIES BETWEEN TWO SAME IDs

LIKE-
bi|146403769|geb|ABQ32294.1| 20 34
bi|37779709|geb|AAP20876.1| 28 264
bi|146403769|geb|ABQ32294.1| 27 224

THEN IT IS SKIPPED AND I GET THE RESULT ONLY FOR
bi|146403769|geb|ABQ32294.1| 20 34
bi|146403769|geb|ABQ32294.1| 27 224

AND NOT FOR
bi|37779709|geb|AAP20876.1| 28 264

but i think i can solve this by sorting FILE2 first and then using it.

Thanks
-smriti

after trying many other combinations I realised that it is related to length of COLUMN ONE of FILE2 , if it varies the output varies.

and sometimes it gives a blank line at the end in the output and if i put ths last entry of FILE2 somewhere in the middle all the entries after that are not processed. And I can't understand this.

i'll try by making the length of column 1 same in all entries.

Thanks for all your support.
-smriti

The order of the output correspond to the order of headers in file1.
The logic is :
For each record in file 1
Proceed extractions specified in file 2 relative to this record

The script:

awk '

NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;
   k = ">" substr($0,1, Key_len-1);
   n = ++Keys[k];
   From[k,n] = $2;
     To[k,n] = $3;
    Len[k,n] = $3 - $2;
   next;
}

function print_selected(    i,k,p,str) {
   if (selected) {
      k = substr(Header, 1, Key_len);
      for (i=1; i<=Keys[k]; i++) {
         printf("%s (%s-%s)\n", Header, From[k,i], To[k,i]);
         str = substr(Alphabets, From[k,i], Len[k,i]);
         p = 1
         while (p<=Len[k,i]) {
            print substr(str, p, 70);
            p += 70;
         }
      }
   }
}

/^>/ {
   print_selected();
   selected  = (substr($0, 1, Key_len) in Keys);
   Header    = $0;
   Alphabets = "";
   next;
}

selected {
   Alphabets = Alphabets $0;
}

END {
   print_selected();
}
' sm2.dat sm1.dat

Input file 1 (sm1.dat):

>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra]
MRVIAAAMLYLYIVVLAICSVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNII
GRDGLCVDVRNGYDTDGTPLQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMIFNCSTAVEN
AIKWEVTIDGSIINPSSG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis]
MKPGGNTIVIWMYAVATWLCFGSTSGWSFTLEDNNIFPKQYPIINFTTAGATVQSYTNFIRAVRGRLTTG
ADVRHEIPVLPNRVGLPINQRFILVELSNHAELSVTLALDVTNAYVVGYRAGNSAYFFHPDNQEDAEAIT
HLFTDVQNRYTFAFGGNYDRLEQLAGNLRENIELGNGPLEEAISALYYYSTGGTQLPTL
>bi|19526601|geb|AAL87006.1| chain A [Viscum album]
YERLRLRVTHQTTGEEYFRFITLLRDYVSSGSFSNEIPLLRQSTIPVSDAQRFVLVELTNEGGDSITAAI
DVTNLYVVAYQAGDQSYFLRDAPRGAETHLFTGTTRSSLPFNGSYPDLERYAGHRDQIPLGIDQLIQSVT
ALRFPGGNTRTQARSILILIQMISEAARFNPILWRARQYINSGASFLPDVY

Input file 2 (sm2.dat) :

bi|2138271|geb|AAC15885 92      110
bi|19526601|geb|AAL8700 74      92
bi|2138271|geb|AAC15885 20      132
bi|21083|em|CAA26939.1| 19      37
bi|21083|em|CAA26939.1| 52      70
bi|2138271|geb|AAC15885 26      38

Output :

>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (92-110)
LWPCGTQRNQQWTFYTDD
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (20-132)
SVGIQGIDYPSVSFNLAGAKSATWDFLRMPHDLVGEDNKYNDGEPITGNIIGRDGLCVDVRNGYDTDGTP
LQLWPCGTQRNQQWTFYTDDTIRSMGKCMTANGLSNGSNIMI
>bi|2138271|geb|AAC15885.1|precursor [Sambucus nigra] (26-38)
IDYPSVSFNLAG
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (19-37)
LCFGSTSGWSFTLEDNNI
>bi|21083|em|CAA26939.1| precursor [Ricinus communis] (52-70)
TVQSYTNFIRAVRGRLTT
>bi|19526601|geb|AAL87006.1| chain A [Viscum album] (74-92)
NLYVVAYQAGDQSYFLRD

Jean-Pierre.

My script assumes that all keys in file 2 are same length.
The key length is given by the key of the first line.

NR==FNR {
   if (NR==1)
      Key_len = length($1) + 1;
   k = ">" substr($0,1, Key_len-1);

Jean-Pierre.

Ya I got your point but can we do it other way round also like-

For each record in file 2

Proceed extractions in file 1 relative to this record

If we can match first column of FILE2 as pattern in FILE1 header line and when it is found we proceed to extract the subtring in the lines following the header line till the next '>' appears by using values of column 2 and 3 of FILE2.

Just want to know your opinion on this logic as I am in learning phase. I'll try to work this out on my own if you think it will not be unnecessarily complicated.

Just want to develop my logics :slight_smile:

Give your suggestions and hints to elaborate it if you find it ok.

Thanks Jean
-smriti

The answer depends of the key length (first column of file2).
If keys are fixed length, there is no problem to invert the logic.

Read file1: Memorize each header and datas records in an array indexed by the key.
Read file2:For each record, if the record had been memorized (key is an index in the array) proceed extraction from memorized datas.

If keys are variable length, the header and records from file1 will be memorized in a array indexed by record number, we can't use another index because we don't know what part of the record is the key.
For each extraction specified in file 2, we will scan sequentially the array until we found the right record (more time consuming).

Jean-Pierre.

Ya Keys can be of fixed length.. that is not a problem.
So I'll try this out and will ask you if I'll have any doubts.

Thanks Jean,
-smriti