Changing from FASTA to PHYLIP format

Xterra · February 20, 2011, 12:57pm

I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:

>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

I need to convert them to PHYLIP format so that they look like this:

3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

The first line at the very top gives the number of sequences followed by the length of the sequences.
The first column is the sequence ID, which needs to be 8 characters long, followed by 2 blank spaces and then the actual sequence. If the sequence ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to bring it up to 8. In my example I have used underscores to keep the sequences aligned and show how the output file should look, but in the output file they should be blank spaces.
Any help will be greatly appreciated!
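
The ID rule described above (cut to 8 characters, pad to 8 with blanks, then 2 more blanks) can be written as a single awk printf format. This is only a minimal sketch, assuming a POSIX awk; phylip_id is a hypothetical helper name, and the trailing | is there only to make the blanks visible:

```shell
# Format one ID: "%-8.8s" left-justifies in 8 columns and truncates to 8 characters;
# the two spaces after it are the separator before the sequence.
# phylip_id is a hypothetical helper name, not something from this thread.
phylip_id() {
    awk -v id="$1" 'BEGIN { printf "%-8.8s  |\n", id }'
}

phylip_id "SeqID1"        # SeqID1    |
phylip_id "Sequence 22"   # Sequence  |
```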

drl · February 20, 2011, 1:35pm

Hi.

Looks like Sequence Manipulator has a number of format conversion codes, including

Fasta2Phylip.pl: convert sequence file in fasta format to sequential phylip format

Input: fasta sequence file.

Output: phylip sequence file.

Good luck ... cheers, drl

---------- Post updated at 12:35 ---------- Previous update was at 12:26 ----------

Hi.

I Googled for:

convert fasta to phylip format awk OR perl

and these were the first 2 hits of about 1500 ... cheers, drl

Xterra · February 20, 2011, 5:24pm

Helpful website, but I still need an AWK script that I can modify and combine with the other steps in my bash script.
Any help will be greatly appreciated!

Scrutinizer · February 20, 2011, 5:55pm

Try:

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

$ awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

SeqID1    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence  AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
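
The one-liner above is dense, so here is an expanded sketch of the same idea with the separators spelled out. It assumes a POSIX awk; fasta_rows is a hypothetical function name. Unlike the one-liner, it skips the empty record before the first ">" instead of printing it as a blank line:

```shell
# Expanded form of the one-liner; fasta_rows is a hypothetical name.
fasta_rows() {
    awk '
    BEGIN {
        RS  = ">"    # one record per FASTA entry (the text between ">" marks)
        FS  = "\n"   # field 1 is the ID line, the other fields are sequence lines
        OFS = ""     # when the record is rebuilt, fields are glued with nothing between
    }
    $1 != "" {       # skip the empty record that precedes the first ">"
        # Append blanks, keep the first 8 characters, then add the 2 separator blanks.
        # Assigning to $1 makes awk rebuild $0 from the fields using OFS.
        $1 = substr($1 "        ", 1, 8) "  "
        print
    }' "$1"
}
```

Calling fasta_rows file prints one row per sequence, without the header line holding the count and length.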

bartus11 · February 20, 2011, 6:02pm

awk -vRS=">" -vFS="\n" -vOFS="" '$0!=""{$1=substr($1,1,8);$1=sprintf ("%-10s",$1)}$0!=""' file > file.tmp; awk 'NR==1{"wc -l < file.tmp"|getline a; print a+0,length($2)}1' file.tmp

Code ugly as hell, but working.

rdcwayx · February 20, 2011, 7:14pm

For the other request, based on Scrutinizer's code

$  awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file |awk '{s=length($2)}END{print NR-1, s}'

3 100

Xterra · February 23, 2011, 5:35pm

I need to combine both codes

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

awk '{s=length($2)}END{print NR-1, s}' file

So I can get the desired output

3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

I have been trying, but I just cannot get the code to do what I want.
Can anyone explain how I can combine them?
Thanks!

yinyuemi · February 23, 2011, 6:45pm

try:

awk -v RS=">" -v FS="\n" '{printf "%s\t",$1;for(i=2;i<=NF;i++) printf "%s",$i;print ""}' fasta

Xterra · February 23, 2011, 9:32pm

I am not getting the expected output file.
What I need is to combine these 2 awk codes into one

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

awk '{s=length($2)}END{print NR-1, s}' file

The first one is getting the SeqIDs and the sequences but it does not list in the first line the number of sequences or the length.
The second one takes care of the number of sequences and the length but it does not output the IDs and sequences.
I would like to modify the first code so it can include the number of sequences and length in the very first line.
Thanks!

yinyuemi · February 23, 2011, 9:54pm

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> fastafile |awk 'NF>0{s=length($2);t=t"\n"$0}END{print NR-1,s"\n"t}'

Xterra · February 23, 2011, 10:11pm

I only had to remove "\n" to get the right format.
Thank you very, very much!
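
For anyone who wants the two steps in a single awk call rather than a pipe, one possible sketch is to buffer the rows and print the header in the END block. It assumes a POSIX awk and that all sequences have the same length, as in the example, since only the last length is reported. fasta2phylip is a hypothetical function name:

```shell
# Single-pass FASTA to sequential PHYLIP; fasta2phylip is a hypothetical name.
# All rows are held in memory until the end so the header can be printed first.
fasta2phylip() {
    awk '
    BEGIN { RS = ">"; FS = "\n" }                   # one record per FASTA entry
    $1 != "" {                                      # skip the empty record before the first ">"
        seq = ""
        for (i = 2; i <= NF; i++) seq = seq $i      # join the wrapped sequence lines
        row[++n] = sprintf("%-8.8s  %s", $1, seq)   # 8-character ID, 2 blanks, sequence
        len = length(seq)                           # length of the most recent sequence
    }
    END {
        print n, len                                # header: number of sequences, length
        for (i = 1; i <= n; i++) print row[i]
    }' "$1"
}
```

Because the length is measured on the joined sequence rather than on $2 of a formatted row, an ID such as "Seq 1" with a blank inside its first 8 characters does not throw off the count.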

Xterra · February 23, 2011, 10:24pm

The code seems to have some issues when the Sequence IDs are shorter than 8 characters. I have uploaded the corresponding input and output files.

yinyuemi · February 24, 2011, 12:38am

Try the following. The ^M is a literal carriage return; enter it by typing Ctrl+V and then Ctrl+M:

sed '$!N;s/^M\n/\t/;s/>//' input|awk '{s=length($2);t==0?t=$0:t=t"\n"$0}END{print NR,s"\n" t}' > output

Xterra · February 24, 2011, 4:06pm

It works great on Cygwin and VirtualBox but not on my Linux box (Red Hat).
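
The thread does not say why the command fails on the Red Hat machine. The ^M in the sed command suggests the input files have Windows (CRLF) line endings, so one thing worth trying is to strip the carriage returns with tr before the earlier two-stage awk pipeline, which avoids typing a literal ^M at all. This is only a sketch under that assumption; crlf_fasta2phylip is a hypothetical function name:

```shell
# Remove carriage returns, then reuse the two awk stages from earlier in the thread.
# crlf_fasta2phylip is a hypothetical name; CRLF input is an assumption, not confirmed.
crlf_fasta2phylip() {
    tr -d '\r' < "$1" |
    # Stage 1: one row per entry, ID padded/truncated to 8 characters plus 2 blanks.
    # The empty record before the first ">" comes out as a line of blanks.
    awk -v RS='>' -v FS='\n' -v OFS= '$1 = substr($1 "       ", 1, 8) "  "' |
    # Stage 2: collect the rows, then print "count length" followed by the rows.
    # NR - 1 discounts the blank first line produced by stage 1.
    awk 'NF > 0 { s = length($2); t = t "\n" $0 } END { print NR - 1, s t }'
}
```

Usage would be crlf_fasta2phylip input > output. As with the original pipeline, the second stage measures $2, so it assumes the first 8 characters of each ID contain no blanks.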