Convert a DNA sequence into amino acids

I am trying to write a bash script that reads DNA sequences from a file (each line in the file is a sequence, and sequences are separated by an empty line). I then need to find the amino acid that each codon (each group of three letters) encodes. For example, if I have a file with the sequence:

   GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG

then, starting from GCA (the first three letters), I want to decode the DNA into amino acids based on the following table:

    Codon(s)                   Amino acid
    TTT,TTC                    Phe
    TTA,TTG,CTT,CTC,CTA,CTG    Leu
    ATT,ATC,ATA                Ile
    ATG                        Met
    GTT,GTC,GTA,GTG            Val
    TCT,TCC,TCA,TCG            Ser
    CCT,CCC,CCA,CCG            Pro
    ACT,ACC,ACA,ACG            Thr
    GCT,GCC,GCA,GCG            Ala
    TAT,TAC                    Tyr
    TAA,TAG                    Stop
    CAT,CAC                    His
    CAA,CAG                    Gln
    AAT,AAC                    Asn
    AAA,AAG                    Lys
    GAT,GAC                    Asp
    GAA,GAG                    Glu
    TGT,TGC                    Cys
    TGA                        Stop
    TGG                        Trp
    CGT,CGC,CGA,CGG            Arg
    AGT,AGC                    Ser
    AGA,AGG                    Arg
    GGT,GGC,GGA,GGG            Gly

that is, I need to get:

    AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu

Then I need to print the name of each amino acid and how many times it occurs. For example:

    Ala: 4
    Cys: 4

and so on. I have hundreds of files with DNA sequences in them, but I am not that good at bash. I tried awk and tr, but I did not know how to code the table into a bash script.

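OK... a bit messy, but done very quickly. First I created a sed script (call the file dna.sed):
(keep the g flag at the end of each command: a codon can occur more than once in a sequence, and without g only its first occurrence on the line would be replaced. Again, I created this quickly.)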

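s/TTT /Phe /g
s/TTC /Phe /g
s/TTA /Leu /g
s/TTG /Leu /g
s/CTT /Leu /g
s/CTC /Leu /g
s/CTA /Leu /g
s/CTG /Leu /g
s/ATT /Ile /g
s/ATC /Ile /g
s/ATA /Ile /g
s/ATG /Met /g
s/GTT /Val /g
s/GTC /Val /g
s/GTA /Val /g
s/GTG /Val /g
s/TCT /Ser /g
s/TCC /Ser /g
s/TCA /Ser /g
s/TCG /Ser /g
s/CCT /Pro /g
s/CCC /Pro /g
s/CCA /Pro /g
s/CCG /Pro /g
s/ACT /Thr /g
s/ACC /Thr /g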
s/ACA /Thr /g
s/ACG /Thr /g
s/GCT /Ala /g
s/GCC /Ala /g
s/GCA /Ala /g
s/GCG /Ala /g
s/TAT /Tyr /g
s/TAC /Tyr /g
s/TAA /Stop /g
s/TAG /Stop /g
s/CAT /His /g
s/CAC /His /g
s/CAA /Gln /g
s/CAG /Gln /g
s/AAT /Asn /g
s/AAC /Asn /g
s/AAA /Lys /g
s/AAG /Lys /g
s/GAT /Asp /g
s/GAC /Asp /g
s/GAA /Glu /g
s/GAG /Glu /g
s/TGT /Cys /g
s/TGC /Cys /g
s/TGA /Stop /g
s/TGG /Trp /g
s/CGT /Arg /g
s/CGC /Arg /g
s/CGA /Arg /g
s/CGG /Arg /g
s/AGT /Ser /g
s/AGC /Ser /g
s/AGA /Arg /g
s/AGG /Arg /g
s/GGT /Gly /g
s/GGC /Gly /g
s/GGA /Gly /g
s/GGG /Gly /g

Then a script to process the DNA sequence lines (it assumes one sequence per line):

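# Read one DNA sequence per line from stdin; dna.sed must be in the current directory.
while IFS= read -r dna; do
  # Put a space after every codon, then translate each codon with dna.sed.
  aawork=$(printf '%s\n' "$dna" | sed -n -e 's/\(...\)/\1 /gp' | sed -f dna.sed)
  # Print the amino acid chain with the separating spaces removed.
  echo "$aawork" | sed 's/ //g'
  # One name per line, sorted and counted, printed as "Name: count".
  echo "$aawork" | tr ' ' '\012' | sort | sed '/^$/d' | uniq -c | sed 's/[ ]*\([0-9]*\) \(.*\)/\2: \1/'
done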
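
To see what the first sed in that loop does on its own, here is a small sketch. The function name split_codons is made up for illustration. It puts a space after every complete group of three letters; because of -n and the p flag, a line with fewer than three letters (including an empty separator line) produces no output at all.

```shell
# split_codons is a hypothetical helper name, not part of the script above.
# It reads sequences on stdin and appends a space to each complete codon.
split_codons() {
  sed -n -e 's/\(...\)/\1 /gp'
}

# The trailing "GA" is not a full codon, so it gets no space after it.
printf '%s\n' 'GCATGCTGCGA' | split_codons
```

Since every rule in dna.sed matches a codon followed by a space, a leftover like "GA" at the end of a sequence is passed through untranslated.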

Again, the script expects to read the sequences one per line from standard input, so you can redirect a file into it, feed it from a pipe, etc.
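
Since you have hundreds of files, one way to run the script over all of them is a loop like the sketch below. The helper name translate_dir and the .dna and .aa suffixes are assumptions for the example, so adjust them to your own file names.

```shell
# translate_dir is a hypothetical helper; the .dna/.aa suffixes are assumptions.
# Usage: translate_dir SCRIPT DIR
# Runs SCRIPT once per DIR/*.dna file and writes the result next to it as *.aa.
translate_dir() {
  script=$1
  dir=$2
  for f in "$dir"/*.dna; do
    [ -e "$f" ] || continue            # no matching files: skip the unexpanded pattern
    "$script" < "$f" > "${f%.dna}.aa"  # the script reads the sequences from stdin
  done
}
```

For example, translate_dir ./dna.sh . would process every .dna file in the current directory. Run it from the directory that holds dna.sed, because the script refers to that file by a relative name.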

The example below uses just the sample line you provided.

$ dna.sh
GCATGCTGCGATAACTTTGGCTGAACTTTGGCTGAAGCATGCTGCGAAACTTTGGCTGAACTTTGGCTG
AlaCysCysAspAsnPheGlyStopThrLeuAlaGluAlaCysCysGluThrLeuAlaGluLeuTrpLeu
Ala: 4
Asn: 1
Asp: 1
Cys: 4
Glu: 3
Gly: 1
Leu: 4
Phe: 1
Stop: 1
Thr: 2
Trp: 1

Here is what I did:

  • Created a file dna.sed
  • Created a bash script file with your code (starting with the usual #!)
  • Named the shell script "conversion.sh" and made it executable with chmod
  • Ran it like this: ./conversion.sh < dna_input.dna
  • Got the expected result for the test input file.

Thank you for your help.
How would one put this into a single script file?

The sed script file can be replaced by a set of sed "-e" options. You can pass multiple -e options, or put several commands in one -e and separate them with semicolons.
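
Here is a rough sketch of a single-file version. Instead of typing 64 -e options by hand, it builds the sed commands from the codon table in the question and passes them to sed as one -e argument (sed also accepts newlines between commands). The names codon_table, sed_rules and translate are made up for this example.

```shell
#!/bin/bash
# Sketch only: the rules from dna.sed are generated from the table below
# rather than read from a separate file.

codon_table='TTT,TTC Phe
TTA,TTG,CTT,CTC,CTA,CTG Leu
ATT,ATC,ATA Ile
ATG Met
GTT,GTC,GTA,GTG Val
TCT,TCC,TCA,TCG Ser
CCT,CCC,CCA,CCG Pro
ACT,ACC,ACA,ACG Thr
GCT,GCC,GCA,GCG Ala
TAT,TAC Tyr
TAA,TAG Stop
CAT,CAC His
CAA,CAG Gln
AAT,AAC Asn
AAA,AAG Lys
GAT,GAC Asp
GAA,GAG Glu
TGT,TGC Cys
TGA Stop
TGG Trp
CGT,CGC,CGA,CGG Arg
AGT,AGC Ser
AGA,AGG Arg
GGT,GGC,GGA,GGG Gly'

# Turn each table row into one "s/XXX /Name /g" command per codon.
sed_rules=$(printf '%s\n' "$codon_table" |
  awk '{ n = split($1, c, ","); for (i = 1; i <= n; i++) printf "s/%s /%s /g\n", c[i], $2 }')

# Translate every sequence on stdin: print the chain, then the counts.
translate() {
  while IFS= read -r dna; do
    aawork=$(printf '%s\n' "$dna" | sed -n -e 's/\(...\)/\1 /gp' | sed -e "$sed_rules")
    echo "$aawork" | sed 's/ //g'
    echo "$aawork" | tr ' ' '\012' | sort | sed '/^$/d' | uniq -c |
      sed 's/[ ]*\([0-9]*\) \(.*\)/\2: \1/'
  done
}
```

To use it as a script, add a final line that just says translate, then run it the same way as before: ./conversion.sh < dna_input.dna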