rahim42
December 26, 2013, 7:25am
1
I have a text file, input.fasta contains some protein sequences. input.fasta is shown below.
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA
I would like to extract the parts of the above sequences based on another file ids.txt . It is shown below.
P02649 3 23
P02649 130 162
P17427 25 27
For example, From the sequence P02649, I need to extract the positions from 3rd character to 23rd character. The output would be
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE
I tried the following python code.
#!/usr/bin/python
import sys
import socket
import string
import random
import os
import re
FASTA= sys.argv[1]
BED= sys.argv[2]
fasta= open(FASTA, 'U')
fasta_dict= {}
for line in fasta:
line= line.strip()
if line == '':
continue
if line.startswith('>'):
seqname= line.lstrip('>')
seqname= re.sub('\..*', '', seqname)
fasta_dict[seqname]= ''
else:
fasta_dict[seqname] += line
fasta.close()
bed= open(BED, 'U')
for line in bed:
line= line.strip().split('\t')
outname= line[0] + ':' + line[1] + '-' + line[2]
print('>' + outname)
s= int(line[1])
e= int(line[2])
print(fasta_dict[line[0]][s:e])
bed.close()
sys.exit()
But I got the error like this
Traceback (most recent call last):
File "domainseq.py", line 29, in <module>
outname= line[0] + ':' + line[1] + '-' + line[2]
IndexError: list index out of range
your help would be appreciated
Klashxx
December 26, 2013, 8:37am
2
Use:
line= line.strip().split()
Instead of:
line= line.strip().split('\t')
rahim42
December 26, 2013, 8:57am
3
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE
My desired output is
>P02649:3-23 VLWAALLVTFLAGCQAKVEQA >P02649:130-162 CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL >P17427:25-27 SKE
An Awk approch :
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA
$ cat ids.txt
P02649 3 23
P02649 130 162
P17427 25 27
$ awk '{print $0,length($0)}' input.fasta
>P02649 7
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT 60
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA 60
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY 60
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG 60
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK 60
VQAAVGTSAAPVPSDNH 17
>P17427 7
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC 60
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL 60
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP 60
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA 60
$ awk 'FNR==NR{A[FNR]=$1 FS $2 FS $3;next}{x=$1;for(i in A){split(A,S);if(x == ">"S[1]){print x":"S[2]"-"S[3];getline;print substr($0,S[2],S[3]-S[2])}} }' ids.txt input.fasta
>P02649:3-23
VLWAALLVTFLAGCQAKVEQ
>P02649:130-162
>P17427:25-27
SK
rahim42
December 26, 2013, 9:07am
5
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE
My desired output is
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE
---------- Post updated at 09:04 AM ---------- Previous update was at 09:00 AM ----------
An Awk approch :
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA
$ cat ids.txt
P02649 3 23
P02649 130 162
P17427 25 27
$ awk '{print $0,length($0)}' input.fasta
>P02649 7
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT 60
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA 60
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY 60
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG 60
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK 60
VQAAVGTSAAPVPSDNH 17
>P17427 7
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC 60
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL 60
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP 60
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA 60
$ awk 'FNR==NR{A[FNR]=$1 FS $2 FS $3;next}{x=$1;for(i in A){split(A,S);if(x == ">"S[1]){print x":"S[2]"-"S[3];getline;print substr($0,S[2],S[3]-S[2])}} }' ids.txt input.fasta
>P02649:3-23
VLWAALLVTFLAGCQAKVEQ
>P02649:130-162
>P17427:25-27
SK
---------- Post updated at 09:07 AM ---------- Previous update was at 09:04 AM ----------
Dear Hedge,
Thanks for your answer. But your output is different from my desired output.
rahim42:
Thank you for your answer. I corrected the code. But the first character of each sequence is not printing. I got the output like this
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE
My desired output is
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE
---------- Post updated at 09:04 AM ---------- Previous update was at 09:00 AM ----------
Sorry for late reply I was bit busy .. Here is what you expect....
$ cat ids.txt
P02649 3 23
P02649 130 162
P17427 25 27
$ cat input.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA
awk '
function char(){
n=split(X[p],Z)
for(i=1;i<=n;i+=2){print p":"Z"-"Z[i+1] RS substr(s,Z,Z[i+1]-Z+1)}
}
FNR==NR{
X[">"$1] = X[">"$1] ? X[">"$1] FS $2 FS $3 : $2 FS $3
next
}
!/>/{s = s $0}
/>/{if(s){ char();s = ""}p=$1}
END{char()}
' ids.txt input.fasta
>P02649:3-23
VLWAALLVTFLAGCQAKVEQA
>P02649:130-162
CGRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
SKE
1 Like
Yoda
December 26, 2013, 11:38am
7
Another approach:
awk '
NR == FNR {
ID[NR] = $0
next
}
/>/ {
i = $0
}
!/>/ {
IN = IN ? IN $0 : $0
}
END {
for ( j in ID )
{
for ( k in IN )
{
split ( ID[j], A )
if ( k ~ A[1] )
{
print k ":" A[2] "-" A[3]
print substr ( IN[k], A[2], A[3] - A[2] + 1 )
}
}
}
}
' ids.txt input.fasta
1 Like
This is easily done with the package bedtools with FastaFromBed function
https://code.google.com/p/bedtools/
# cat input1.fasta
>P02649
MKVLWAALLVTFLAGCQAKVEQAVETEPEPELRQQTEWQSGQRWELALGRFWDYLRWVQT
LSEQVQEELLSSQVTQELRALMDETMKELKAYKSELEEQLTPVAEETRARLSKELQAAQA
RLGADMEDVCGRLVQYRGEVQAMLGQSTEELRVRLASHLRKLRKRLLRDADDLQKRLAVY
QAGAREGAERGLSAIRERLGPLVEQGRVRAATVGSLAGQPLQERAQAWGERLRARMEEMG
SRTRDRLDEVKEQVAEVRAKLEEQAQQIRLQAEAFQARLKSWFEPLVEDMQRQWAGLVEK
VQAAVGTSAAPVPSDNH
>P17427
MPAVSKGDGMRGLAVFISDIRNCKSKEAEIKRINKELANIRSKFKGDKALDGYSKKKYVC
KLLFIFLLGHDIDFGHMEAVNLLSSNRYTEKQIGYLFISVLVNSNSELIRLINNAIKNDL
ASRNPTFMGLALHCIANVGSREMAEAFAGEIPKILVAGDTMDSVKQSAALCLLRLYRTSP
DLVPMGDWTSRVVHLLNDQHLGVVTAATSLITTLAQKNPEEFKTSVSLAVSRLSRIVTSA
# cat pos1.bed
P02649 3 23
P02649 130 162
P17427 25 27
# ./bin/bedtools-2.17.0/bin/fastaFromBed -fi input1.fasta -bed pos1.bed -fo out1.fasta
# cat out1.fasta
>P02649:3-23
LWAALLVTFLAGCQAKVEQA
>P02649:130-162
GRLVQYRGEVQAMLGQSTEELRVRLASHLRKL
>P17427:25-27
KE
make sure you adjust for the 0 and 1-based coordinate system
2 Likes