Changing from FASTA to PHYLIP format

I really need some help with this task. I have a bunch of FASTA files with hundreds of DNA sequences that look like this:

>SeqID1
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Sequence 22
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
>Seq-39
AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGAC
TGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

I need to convert them to PHYLIP format so that they look like this:

3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

The first number at the very top is the number of sequences followed by the length of the sequences.
The first column is the sequence ID, which must be exactly 8 characters long, followed by 2 blank spaces and then the actual sequence. If the sequence ID is longer than 8 characters, the extra characters should be removed. If it is shorter than 8, blank spaces should be added to pad it to 8. In my example I have used underscores to keep the sequences aligned and to show how the output file should look, but in the actual output file they should be blank spaces.
Any help will be greatly appreciated!
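
The padding rule described above (ID truncated or padded to 8 characters, then 2 spaces, then the sequence) can be sketched with awk's printf. This is only a minimal illustration, not code from the thread; the function name phylip_line is hypothetical.

```shell
# Hypothetical helper: format one PHYLIP line from an ID and a sequence.
# "%-8.8s" truncates the ID to at most 8 characters and left-justifies it
# in a field of width 8; two literal spaces then separate it from the sequence.
phylip_line() {
    awk -v id="$1" -v seq="$2" 'BEGIN { printf "%-8.8s  %s\n", id, seq }'
}
```

For example, phylip_line "Sequence 22" "ACGT" should give "Sequence  ACGT", and phylip_line "Seq-39" "ACGT" should give "Seq-39    ACGT".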

Hi.

Looks like Sequence Manipulator has a number of format conversion codes, including

Fasta2Phylip.pl: convert sequence file in fasta format to sequential phylip format

Input: fasta sequence file.

Output: phylip sequence file.

Good luck ... cheers, drl

---------- Post updated at 12:35 ---------- Previous update was at 12:26 ----------

Hi.

I Googled for:

convert fasta to phylip format awk OR perl

and these were the first 2 hits of about 1500 ... cheers, drl

Helpful website, but I still need an AWK script that I can modify and couple with all the other steps in my bash script.
Any help will be greatly appreciated!

Try:

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file
$ awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file

SeqID1    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence  AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39    AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
awk -v RS=">" -v FS="\n" -v OFS="" '$0!=""{$1=sprintf("%-10s",substr($1,1,8))}$0!=""' file > file.tmp; awk 'NR==1{"wc -l < file.tmp"|getline a; print a+0,length($2)}1' file.tmp

Code ugly as hell, but working.

For the other request, based on Scrutinizer's code

$  awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file |awk '{s=length($2)}END{print NR-1, s}'

3 100

I need to combine both codes

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file
awk '{s=length($2)}END{print NR-1, s}' file

So I can get the desired output

3 100
SeqID1____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Sequence__AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT
Seq-39____AACCATGACAGAGGAGATGTGAACAGATAGAGGGATGACAGATGACAGATAGACCCAGACTGACAGGTTCAAAGGCTGCAGTGCAGTGACGTGACGATTT

I have been trying, but I just cannot get the code to do what I want.
Can anyone explain how I can combine them?
Thanks!

try:

awk -v RS=">" -v FS="\n" 'NR>1{printf "%s\t", $1; for(i=2;i<=NF;i++) printf "%s", $i; print ""}' fasta

I am not getting the expected output file.
What I need is to combine these 2 awk codes into one

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> file
awk '{s=length($2)}END{print NR-1, s}' file

The first one is getting the SeqIDs and the sequences but it does not list in the first line the number of sequences or the length.
The second one takes care of the number of sequences and the length but it does not output the IDs and sequences.
I would like to modify the first code so it can include the number of sequences and length in the very first line.
Thanks!
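
The request above, one awk program that prints the count and length first and the padded lines after, can be sketched by storing the records and emitting everything in the END block. This is an illustrative sketch under stated assumptions, not the code posted in this thread: the wrapper name fasta2phylip is hypothetical, all sequences are assumed to have the same length, and the input is assumed to have Unix line endings.

```shell
# Hypothetical wrapper: reads one FASTA file ($1), writes PHYLIP to stdout.
fasta2phylip() {
    awk '
        /^>/ {                        # header line: start a new record
            n++
            id[n] = substr($0, 2)     # drop the leading ">"
            next
        }
        n { seq[n] = seq[n] $0 }      # join wrapped sequence lines
        END {
            # Header: number of sequences and length of the first one
            # (assumes every sequence has the same length).
            print n + 0, length(seq[1])
            for (i = 1; i <= n; i++)
                printf "%-8.8s  %s\n", id[i], seq[i]
        }
    ' "$1"
}
```

It could be called as fasta2phylip file > file.phy inside a larger bash script. Holding every sequence in memory is the price for printing the header line first in a single pass.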

awk '$1=substr($1"       ",1,8)"  "' FS="\n" OFS= RS=\> fastafile |awk 'NF>0{s=length($2);t=t"\n"$0}END{print NR-1,s"\n"t}'

I only had to remove "\n" to get the right format.
Thank you very, very much!

The code seems to have some issues when the Sequence IDs are shorter than 8 characters. I have uploaded the corresponding input and output files.

Try:
Here ^M is a literal carriage return, typed as Ctrl+V followed by Ctrl+M:

sed '$!N;s/^M\n/\t/;s/>//' input | awk '{s=length($2); t=(NR==1 ? $0 : t "\n" $0)} END{print NR, s "\n" t}' > output

It works great on Cygwin and VirtualBox, but not on my Linux box (Red Hat).
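
The sed command above depends on a literal ^M in the script and on how the local sed treats \t. One way to sidestep both is to strip carriage returns with tr before any awk or sed step. This is a minimal sketch, assuming the input files have Windows (CRLF) line endings; the thread does not establish that this is the cause of the Linux failure, and the function name strip_cr is hypothetical.

```shell
# Hypothetical helper: delete every carriage return from file $1,
# so CRLF line endings become plain LF on stdout.
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, strip_cr input > input.unix produces a cleaned copy that the awk commands in this thread can then read.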