Xterra
February 20, 2011, 12:57pm
1
I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:
>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I need to convert them to Phylip format so that they look like this:
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
The first line gives the number of sequences followed by the length of the sequences.
The first column is the sequence ID, which must be exactly 8 characters long, followed by 2 blank spaces and then the sequence itself. If the ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to pad it to 8. In my example I used underscores to show the alignment, but in the actual output file they should be blank spaces.
Any help will be greatly appreciated!
drl
February 20, 2011, 1:35pm
2
Hi.
Looks like Sequence Manipulator has a number of format conversion codes, including
Fasta2Phylip.pl: convert sequence file in fasta format to sequential phylip format
Input: fasta sequence file.
Output: phylip sequence file.
Good luck ... cheers, drl
---------- Post updated at 12:35 ---------- Previous update was at 12:26 ----------
Hi.
I Googled for:
convert fasta to phylip format awk OR perl
and these were the first 2 hits of about 1500 ... cheers, drl
Xterra
February 20, 2011, 5:24pm
3
Helpful website, but I still need an AWK script that I can modify and combine with the other steps in my bash script.
Any help will be greatly appreciated!
Try:
awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> file
$ awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> file
SeqID1    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence  AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
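The same idea can be spelled out more explicitly with printf, which pads and truncates the ID in one step and does not depend on a hand-typed run of spaces. This is only a sketch: the function name fasta_rows is made up for illustration, and it assumes each record starts with a ">" header line.

```shell
# Hypothetical helper: print one "ID  sequence" row per FASTA record.
fasta_rows() {
  awk '
    BEGIN { RS = ">"; FS = "\n" }   # one record per ">" block, one field per line
    NR > 1 {                        # record 1 is the empty text before the first ">"
      seq = ""
      for (i = 2; i <= NF; i++)     # join the wrapped sequence lines
        seq = seq $i
      # %-8.8s left-justifies the ID in 8 columns and cuts it at 8 characters;
      # two blank spaces follow, then the joined sequence.
      printf "%-8.8s  %s\n", $1, seq
    }
  ' "$1"
}
```

Called as fasta_rows file, it prints only the ID and sequence rows; the count and length line is not produced here.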
awk -vRS=">" -vFS="\n" -vOFS="" '$0!=""{$1=substr($1,1,8);$1=sprintf ("%-10s",$1)}$0!=""' file > file.tmp; awk 'NR==1{"wc -l < file.tmp"|getline a; print a+0,length($2)}1' file.tmp
Code ugly as hell, but working.
rdcwayx
February 20, 2011, 7:14pm
6
For the other request, based on Scrutinizer's code
$ awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> file |awk '{s=length($2)}END{print NR-1, s}'
3 100
Xterra
February 23, 2011, 5:35pm
7
I need to combine both codes
awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> file
awk '{s=length($2)}END{print NR-1, s}' file
So I can get the desired output
3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
I have been trying, but I just cannot get the code to do what I want.
Can anyone explain how I can combine them?
Thanks!
try:
awk -v RS=">" -v FS="\n" '{printf "%s\t",$1;for(i=2;i<=NF;i++) printf "%s",$i;print ""}' fasta
Xterra
February 23, 2011, 9:32pm
9
I am not getting the expected output file.
What I need is to combine these 2 awk codes into one
awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> file
awk '{s=length($2)}END{print NR-1, s}' file
The first one is getting the SeqIDs and the sequences but it does not list in the first line the number of sequences or the length.
The second one takes care of the number of sequences and the length but it does not output the IDs and sequences.
I would like to modify the first code so it can include the number of sequences and length in the very first line.
Thanks!
awk '$1=substr($1"        ",1,8)"  "' FS="\n" OFS= RS=\> fastafile |awk 'NF>0{s=length($2);t=t"\n"$0}END{print NR-1,s"\n"t}'
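The two steps can also be folded into a single awk program that stores the records and prints the count and length line first in the END block. This is a rough sketch rather than a tested replacement for the pipeline above: the function name fasta2phylip is made up, it reads the file line by line instead of using RS=">", and it assumes every sequence has the same length, so the length of the first one is reported.

```shell
# Hypothetical helper: convert one FASTA file to sequential Phylip on stdout.
fasta2phylip() {
  awk '
    /^>/ { n++; id[n] = substr($0, 2); next }  # header line: start a new record
    n    { seq[n] = seq[n] $0 }                # sequence line: append to the current record
    END {
      # First line: number of sequences and length of the first sequence
      # (assumes all sequences are equally long).
      printf "%d %d\n", n, length(seq[1])
      for (i = 1; i <= n; i++)
        printf "%-8.8s  %s\n", id[i], seq[i]   # 8-column ID, 2 spaces, sequence
    }
  ' "$1"
}
```

Inside a bash script it could be used as fasta2phylip input.fasta > output.phy, where both file names are placeholders.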
Xterra
February 23, 2011, 10:11pm
11
I only had to remove the "\n" to get the right format.
Thank you very, very much!
Xterra
February 23, 2011, 10:24pm
12
The code seems to have some issues when the Sequence IDs are shorter than 8 characters. I have uploaded the corresponding input and output files.
Try:
^M is a literal carriage return, typed as Ctrl+V followed by Ctrl+M.
sed '$!N;s/^M\n/\t/;s/>//' input|awk '{s=length($2);t=(NR==1?$0:t"\n"$0)}END{print NR,s"\n"t}' > output
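If the carriage returns come from DOS line endings, one way to avoid typing a literal ^M is to strip them with tr before any other processing. This is a minimal sketch; the function name strip_cr is made up, and whether carriage returns are the only problem in the uploaded files is an assumption.

```shell
# Hypothetical helper: write a file to stdout with every carriage return removed.
strip_cr() {
  tr -d '\r' < "$1"
}
```

The cleaned text can then be piped into the awk commands above, for example strip_cr input | awk '...' > output.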
Xterra
February 24, 2011, 4:06pm
14
It works great on Cygwin and VirtualBox, but not on my Linux box (Red Hat).