parse fasta file to tabular file

yifangt · December 23, 2011, 4:55pm

Hello,
This is a problem usually handled with BioPerl, but I thought it could be done with awk: convert a FASTA file to tabular format. (Note: the line length is not the same for each entry, as the entries were combined from different files!)

input.fasta:

>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACC
ACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAAT
ACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGG
TAATTCTAGAGCTAATACATGCTGTTGTGCCCGACTCACGAAGGGCGGTATTTATTAGATATCAGCCAATA
AGCATCTGCTATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence
ATTACCCAATCCTGACTCAGGGAGGTAGTGACAAGAAATAATGGGTCGGGGTTCTGCCCCGGGACTGCA
GGGCACCACCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

I want to convert it to the tabular format as:

output.tab:

>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACA......CACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAG......CATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence GAAACTGCGAATGGCTCA......ATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence ATTACCCAATCCTGACTC......CCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT

i.e. each row has two columns: the first is the header (the sequence name and description), and the second is the DNA sequence. This is a quite common daily task in bioinformatics.
I am aware that BioPerl is the right tool for the job, but I am trying to level up my awk after reading about the RS variable. I am not sure how to set the RS and FS variables for this situation.
Thanks a lot!

kato · December 23, 2011, 5:24pm

try this:

awk 'BEGIN{RS=">"}{gsub("\n","",$0); print ">"$0}' file

yifangt · December 23, 2011, 7:02pm

Thanks! It worked, except that the output field separator is missing: the header and the sequence are not delimited as needed. I added OFS="\t", but it did not work.

awk 'BEGIN{RS=">"; OFS="\t"}{gsub("\n","",$0); print ">"$0}' file
-----------------------------
output is:
>seq0FQTWEEFSRAAEKLYLADPMKVRVVLKYRHVDGNLCIKVTDDLVCLVYRTDQAQDVKKIEKF
>seq1KYRTWEEFTRAAEKLYQADPMKVRVVLKYRHCDGNLCIKVTDDVVCLLYRTDQAQDVKKIEKFHSQLMRLME LKVTDNKECLKFKTDQAQEAKKMEKLNNIFFTLM
>seq2EEYQTWEEFARAAEKLYLTDPMKVRVVLKYRHCDGNLCMKVTDDAVCLQYKTDQAQDVKKVEKLHGK
>seq3MYQVWEEFSRAVEKLYLTDPMKVRVVLKYRHCDGNLCIKVTDNSVCLQYKTDQAQDVK
>YAL069W-1.334 Putative promoter sequenceCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACAC

Any clue?
YF
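
A note on why setting OFS has no effect here: awk inserts OFS only between comma-separated arguments of print, or when it rebuilds $0 after a field has been assigned. In print ">"$0 the two parts are concatenated into a single argument, so OFS is never used. The sketch below shows the difference on a made-up one-line sample (the sample text and the variable names are only for illustration, not from the thread):

```shell
# Hypothetical sample: a header and a sequence separated by one space.
sample='>seq0 ACGT'

# Concatenation: $1 and $2 form one print argument, so OFS is not inserted.
no_ofs=$(printf '%s\n' "$sample" | awk 'BEGIN{OFS="\t"}{print $1 $2}')

# Comma-separated arguments: print places OFS (a tab) between them.
with_ofs=$(printf '%s\n' "$sample" | awk 'BEGIN{OFS="\t"}{print $1, $2}')

printf '%s\n' "$no_ofs" "$with_ofs"
```

The first line printed has the header and sequence run together; the second has them separated by a tab.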

Franklin52 · December 24, 2011, 8:57am

Maybe something like this?

awk '/^>/ && NR>1{$0=RS $0}{printf "%s", $0}END{print ""}' file

kato · December 24, 2011, 10:18am

You could try replacing each newline with a tab instead of with nothing:

awk 'BEGIN{RS=">"}{gsub("\n","\t",$0); print ">"$0}' file

yifangt · December 24, 2011, 6:22pm

Thanks Kato!
Your second version is much better. Is it possible to remove the tabs within the sequence field? That is, merge the sequence into a single field instead of having it split by tabs: replace the first "\n" with "\t", but replace the second and any later "\n" with nothing. It is one step from what I want.
Merry Christmas!!!

kato · December 25, 2011, 5:58pm

Merry Christmas! With a few improvements after @Franklin52:

awk 'BEGIN{RS=">"}NR>1{sub("\n","\t"); gsub("\n",""); print RS$0}' file
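
For reference, the one-liner above splits records on every ">", so it assumes that ">" never appears inside a header or sequence. If that cannot be guaranteed, a line-based variant that keeps the default RS may be safer. What follows is only a sketch: the function name fasta_to_tab and the sample data are made up for illustration, and it assumes Unix line endings.

```shell
# fasta_to_tab: print one line per FASTA record as "header<TAB>sequence".
# Reads the files given as arguments, or standard input if there are none.
fasta_to_tab() {
  awk '
    /^>/ {                        # header line starts a new record
      if (seen) printf "\n"       # close the previous record first
      printf "%s\t", $0           # header followed by a single tab
      seen = 1
      next
    }
    { printf "%s", $0 }           # sequence line: append with no newline
    END { if (seen) printf "\n" } # close the last record
  ' "$@"
}

# Made-up sample; the second header contains an extra ">" on purpose.
sample='>seqA first test record
ACGT
TTGA
>seqB second > record
GGCC'

result=$(printf '%s\n' "$sample" | fasta_to_tab)
printf '%s\n' "$result"
```

On a real file the call would be fasta_to_tab input.fasta > output.tab. Using printf "%s" rather than printf $0 keeps a stray "%" in a header from being read as a format specifier.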