Ibk
September 21, 2018, 3:45pm
1
Hi all,
I have a fasta file of a reference sequnce, I will like to retrieve sequences corresponding to a list of start and end position in another file
>my_ref_seq
GCCCTATAAGGGCAGAAGCTTGTCCTTCTTGTGCCAGTTATGACGTTTGTCCTAACTGCACATCTGGTAG
CATCCCCGATGATGGGTCTTCCAAAGGACAGATCCCCTCCTGGGAAGATGTTACAAAAACTTCTACCTAC
TCTCTCTTGTTGTCTGAGGACACATCCGACGAACTGCACCCTGACGACCTGGTTAACGTTGCTGCTCATA
TCCGAAAGGCCTTGTCCACTCAGTCTCACCCGGCAAATGTTGATATGTGCAAAGAACAGCTCACATCTCT
GTTGGTTATGGCTGAAGCAATGCTACCCCAACGCTCTAGGTCGACACTCCCACTACACCAAAAATATGTG
position.txt
my_ref_seq 1550 1590
my_ref_seq 2399 2429
my_ref_seq 3008 3038
expected result
seq_id Start End sequence
my_ref_seq 1550 1590 TCCGAAAGGCCTTGTCCACTCAGTCTCACC
my_ref_seq 2399 2429 TCTCTCTTGTTGTCTGAGGACACATCCGAC
my_ref_seq 3008 3038 GAACTGTCCGAAAGGCCTTGTCCACTCAAA
Thank you for the help.
RudiC
September 21, 2018, 4:28pm
2
Unfortunately, this can't be tested with the data given. Howsoever, try
awk '
NR == 1 {print "seq_id Start End sequence"
next}
NR == FNR {SEQ = SEQ $0
next
}
{print $0, substr (SEQ, $2, $3-$2+1)
}
' fastafile position.txt
Ibk
September 22, 2018, 4:21am
3
Thank you Rudic,
The code is almost perfect but returns the input fasta file as header before the expected output.
I don't want the input fasta file added to the output.
Thanks
------ Post updated 09-22-18 at 04:21 AM ------
Thank you Rudic
Fixed it
Ibk
September 24, 2018, 5:23pm
4
Hi,
After trying to retrieve the sequence using
awk 'NR == 1 {print "seq_id Start End Nul_Seq"} NR == FNR {SEQ = SEQ $0;next}{if ($2>$3) print $0, substr (SEQ, $3, $2-$3+1); else print $0, substr (SEQ, $2, $3-$2+1)}' my_seq.fasta file_2.txt > output.txt
I have a sequence NC_004064.1 (from NCBI) in a file and another file with Query ID, start position and end position
>my_querry_seq.fasta
GTGATTTAATTATAGAGAGATAGTGACTTTCACTTTTCTTTCTTTGCATTTCGCTGGACTCAAACTGTTC
CATAATGGCTCCAGTTGTTTCCAGAGATCAATGTAAACCTAAAACACCAAAGCCCCACCGACCAGCCCCT
CCCCACAGGTGCACTACCAGGTGCCCTGAGGATTGTGGGTGGTACGTGGGTCGCTGCTCCTGCCCCAACG
TCTGCCAACGCGAGGGTTGGGACGACTTCTTCGTAGCTGACAAGGTGAAGCCTCCTAGCTACGTCGCCTC
AAAGACCTCGGTCGCAGATGTTGTCGACTGGCTACTTGAAGAAGACCCTGCTACGGACGGCCCCTCGGAA
TTTGACCTCACGCAATTTTTCCAGGCTTACACGGACAAGTCCCACCAGATCCACCGTGACTATGCCCCGG
ACCAGCTTGCTCAGGCCCTTGACATGGCTTACATACTATCGGTGGACCCCCCGGACATCAAGCTCCCAGA
GTATGAGGCAACTAGGTTCACCCATGACACCTCCTACAAGGGCAAACTGCCTAAGTGGCTGCGTGTGTAT
GGCATCAAGTCCCGAGAGCTAGCGAAGAAGGCTGTAACCAACATTCGGGGTGGAGCCCATTGGGCCAAGG
GGCTGTTCAAGCAAATGTGGGACTCGTTGCCTGGTTGGAGTGAGGTGGAGGCGTACTTCAAAGCTTTCTT
CGCTGGCATCATCACCGGTGTTGAGGACGCGCTGTCGAAGTCTCCGTCATCGGTGTGGACCTCACTCAAG
CTCACGCCGTTGCTATACATATGGCGCAACATCAACGAGTGCTCTGATATTGCTGTGATACTAGGAGCCT
TCTGGGCGACACTGGAACTTTACAACATCCCATCCAAAGTGTACGACCTGGTGTCAACTGCACTTGGCCC
file_2.txt
ID start end
my_seq 10 40
my_seq 35 65
my_seq 50 20
my_seq 76 96
I want to use the second file to retrieve the sequence from the start to end position printed in front of each
Note that in line 3 the start (50) position is greater than the end (20) but want this to print from 50 to 20
I want the sequence printed in front in this format
Expected Output
ID Start End Sequence
I will really appreciate your help.
Thanks
RudiC
September 25, 2018, 8:34am
5
I can't immediately see an error, so - where are you stuck, what doesn't work? Error messages?