I am trying to write a bash script that reads DNA sequences from a file. Each line in the file is one sequence, and sequences are separated by an empty line. I then need to find the amino acid that each codon (each group of three letters) encodes. For example, if I have a file with the sequence:
GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG
then, starting from GCA (the first three letters), I want to decode the DNA into amino acids based on the following table:
Codon(s) Amino-acid
TTT,TTC Phe
TTA,TTG,CTT,CTC,CTA,CTG Leu
ATT,ATC,ATA Ile
ATG Met
GTT,GTC,GTA,GTG Val
TCT,TCC,TCA,TCG Ser
CCT,CCC,CCA,CCG Pro
ACT,ACC,ACA,ACG Thr
GCT,GCC,GCA,GCG Ala
TAT,TAC Tyr
TAA,TAG Stop
CAT,CAC His
CAA,CAG Gln
AAT,AAC Asn
AAA,AAG Lys
GAT,GAC Asp
GAA,GAG Glu
TGT,TGC Cys
TGA Stop
TGG Trp
CGT,CGC,CGA,CGG Arg
AGT,AGC Ser
AGA,AGG Arg
GGT,GGC,GGA,GGG Gly
that is, I need to get:
AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu
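One way to encode the table is as a lookup array in awk, wrapped in a shell function. The following is only a minimal sketch under a few assumptions: the function name `translate` and the awk helper `add` are made up for illustration, each non-empty line is treated as one sequence, lowercase input is uppercased, a trailing group of fewer than three letters is dropped, and a codon missing from the table is printed as `Unk`.

```sh
#!/bin/sh
# translate: hypothetical helper that reads DNA sequences (one per line)
# from the given files or from stdin and prints the amino-acid names.
translate() {
    awk '
    # Register every codon in a comma-separated list under one name.
    function add(name, list,    codons, n, k) {
        n = split(list, codons, ",")
        for (k = 1; k <= n; k++) aa[codons[k]] = name
    }
    BEGIN {
        add("Phe",  "TTT,TTC")
        add("Leu",  "TTA,TTG,CTT,CTC,CTA,CTG")
        add("Ile",  "ATT,ATC,ATA")
        add("Met",  "ATG")
        add("Val",  "GTT,GTC,GTA,GTG")
        add("Ser",  "TCT,TCC,TCA,TCG,AGT,AGC")
        add("Pro",  "CCT,CCC,CCA,CCG")
        add("Thr",  "ACT,ACC,ACA,ACG")
        add("Ala",  "GCT,GCC,GCA,GCG")
        add("Tyr",  "TAT,TAC")
        add("Stop", "TAA,TAG,TGA")
        add("His",  "CAT,CAC")
        add("Gln",  "CAA,CAG")
        add("Asn",  "AAT,AAC")
        add("Lys",  "AAA,AAG")
        add("Asp",  "GAT,GAC")
        add("Glu",  "GAA,GAG")
        add("Cys",  "TGT,TGC")
        add("Trp",  "TGG")
        add("Arg",  "CGT,CGC,CGA,CGG,AGA,AGG")
        add("Gly",  "GGT,GGC,GGA,GGG")
    }
    # NF skips the empty separator lines between sequences.
    NF {
        seq = toupper($1)
        out = ""
        # Walk the sequence three letters at a time, starting at letter 1.
        for (i = 1; i + 2 <= length(seq); i += 3) {
            codon = substr(seq, i, 3)
            if (codon in aa) out = out aa[codon]
            else             out = out "Unk"
        }
        print out
    }' "$@"
}
```

Called as `translate somefile.txt` (the file name is a placeholder), this prints one line of amino-acid names per sequence; for the example sequence it yields the line shown above.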
Then I need to print the name of each amino acid and how many times it occurs. For example:
Ala: 4
Cys: 4
and so on. I have hundreds of files with DNA sequences in them, but I am not that good at bash. I tried awk and tr, but I did not know how to encode the table in a bash script.
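For the counting step, a possible sketch is shown here. It assumes the decoded text is already available as lines of concatenated names, such as `AlaCysCys...`, where every name starts with a capital letter followed by lowercase letters (so the four-letter `Stop` is handled too). The function name `count_amino_acids` is made up for illustration, and the counts are totals over all input given to one call.

```sh
#!/bin/sh
# count_amino_acids: hypothetical helper that tallies the capitalized
# names (Ala, Cys, Stop, ...) found in the given files or on stdin.
count_amino_acids() {
    LC_ALL=C awk '
    {
        line = $0
        # Peel one capitalized name off the front of the line at a time.
        while (match(line, /^[A-Z][a-z]+/)) {
            count[substr(line, 1, RLENGTH)]++
            line = substr(line, RLENGTH + 1)
        }
    }
    END {
        for (name in count) printf "%s: %d\n", name, count[name]
    }' "$@" | LC_ALL=C sort
}
```

To get a separate tally for each of many files, the function can be called once per file inside a `for` loop over the decoded files.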