Hello,
I have a problem that is normally handled with BioPerl, but I thought it could be done with awk: converting FASTA format to tabular format. (Note: the line length is not the same for every entry, because the entries were combined from different files!)
input.fasta:
>YAL069W-1.334 Putative promoter sequence
CCACACCACACCCACACACCCACACACCACACCACACACC
ACACCACACCCACACACACACATCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAAT
ACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence
GAAACTGCGAATGGCTCATTAAAACAGTTATAGTTTATTTGGTAATCAAACTTACATGGATAACCGTGG
TAATTCTAGAGCTAATACATGCTGTTGTGCCCGACTCACGAAGGGCGGTATTTATTAGATATCAGCCAATA
AGCATCTGCTATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence
ATTACCCAATCCTGACTCAGGGAGGTAGTGACAAGAAATAATGGGTCGGGGTTCTGCCCCGGGACTGCA
GGGCACCACCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT
I want to convert it to tabular format, like this:
output.tab:
>YAL069W-1.334 Putative promoter sequence CCACACCACACCCACACACCCACACA......CACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
>YAL068C-7235.2170 Putative promoter sequence TACGAGAATAATTTCTCATCATCCAG......CATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
>gi|31044174|gb|AY143560.1| Tintinnopsis fimbriata 18S ribosomal RNA gene, partial sequence GAAACTGCGAATGGCTCA......ATTGTGGTGACTCATAGTAACTTAATCGGATCGCATGGGCTTGTCCCGCGACAAACCATT
>gi|31044185|gb|AY143571.1| Codonellopsis americana 18S ribosomal RNA gene, partial sequence ATTACCCAATCCTGACTC......CCAGGCGTGGAGCTTGCGGCTCAATTTGACTCAACACGGGGAAACTTACCAGGTCCAGACAT
i.e. each row has two columns: the first is the header with the sequence name and description, and the second is the DNA sequence joined into a single line. This is a very common task in day-to-day bioinformatics work.
I am aware that BioPerl is the right tool for the job, but I am trying to level up my awk, and this came to mind while I was reading about the RS variable. I am not sure how to set the RS and FS variables to handle this situation.
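For reference, here is a rough sketch of the kind of thing I have in mind, not a tested solution. It assumes that ">" only ever appears at the start of a header line (never inside a description or a sequence), that the two columns should be separated by a tab, and that the file has Unix line endings. The function name fasta_to_tab is just a placeholder I made up. The idea is to set RS to ">" so each FASTA entry becomes one record, and FS to a newline so the header is field 1 and the wrapped sequence lines are the remaining fields:

```shell
# Hypothetical helper: print one "header<TAB>sequence" row per FASTA entry.
# Assumes ">" occurs only at the start of header lines.
fasta_to_tab() {
    awk '
        BEGIN { RS = ">"; FS = "\n"; OFS = "\t" }
        NR > 1 {                       # record 1 is the empty text before the first ">"
            seq = ""
            for (i = 2; i <= NF; i++)  # fields 2..NF are the wrapped sequence lines
                seq = seq $i           # join them; empty trailing fields add nothing
            print ">" $1, seq          # field 1 is the header line without its ">"
        }
    ' "$1"
}
```

With that defined, I would expect to run it as: fasta_to_tab input.fasta > output.tab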
Thanks a lot!