Fixing a shell script

faizlo · October 9, 2015, 4:13am

I have this shell script that I wrote to check an input file to see if it is empty or not, and then clean the file from any line that starts with the sign "<" (without quotation marks" and then spell the number of line of the file, and the empty lines, too. The script then will create two output files, DNA.out and RNA.out.
First I get an error message at that says:

./script.sh: line 3: [: dna_input.txt: integer expression expected

but it gives me the results I want.
Here is the code:

#!/bin/bash                                                                                                                      
#check to see if there is an input file:                                                                                         
if [ $1 -lt 1 ]
then
  echo "Usage: $0 file ..."
  exit 1
fi

#Check if the file is empty or not                                                                                               
file=$1
if [[ -s $1 ]]
then
  echo ""
  echo "**** $file has data."
  echo "Number of non-empty lines:"
  grep -cve '^\s*$' $file
  echo "Mumber of empty lines:"
  grep -ce '^\s*$' $file
  #grep -cvP '^\s*$' $file -- above line originally was like this one                                                            
  echo ""
  cat $1 | sed 's/>/\n>/g' > temp1.txt
  cat temp1.txt | sed '/^>/ d' > temp2.txt
  #Remove duplicate empty lines:                                                                                                 
  awk '!NF{if(++n <=1) print; next}; {n=0; print}' < temp2.txt > DNA.out
  echo ""
  #convert to mRNA and remove temp1, temp2 to avoid confusion                                                                    
  tr ACGT UGCA < DNA.out > RNA.out
  rm temp1.txt temp2.txt

else
  echo "**** $1 has no data, or file does not exist."
  echo "**** done!"
  echo ""
fi;

After I get the two output files (DNA.out and RNA.out) I use another script to convert the contents of these two files into Amino Acids. The conversion script is:

#!/bin/sh                                                                                                                        
while read rna;do
  aawork=$(echo "${rna}" |sed -n -e 's/\(...\)/\1 /gp' | sed -f rna.sed)
  echo "$aawork" | sed 's/ //g'
  echo "$aawork" | tr ' ' '\012' | sort | sed '/^$/d' | uniq -c | sed 's/[ ]*\([0-9]*\) \(.*\)/\2: \1/'
done

This is how I use it:

./conversion.sh < RNA.out

where rna.sed is:

s/UUU /Phe /g
s/UUC /Phe /g
s/UUA /Leu /g
s/UUG /Leu /g
s/UCU /Ser /g
s/UCC /Ser /g
s/UCA /Ser /g
s/UCG /Ser /g
s/UAU /Tyr /g
s/UAC /Tyr /g
s/UAA /STOP /g
s/UAG /STOP /g
s/UGU /Cys /g
s/UGC /Cys /g
s/UGA /STOP /g
s/UGG /Trp /g
s/CUU /Leu /g
s/CUC /Leu /g
s/CUA /Leu /g
s/CUG /Leu /g
s/CCU /Pro /g
s/CCC /Pro /g
s/CCA /Pro /g
s/CCG /Pro /g
s/CAU /His /g
s/CAC /His /g
s/CAA /Gln /g
s/CAG /Gln /g
s/CGU /Arg /g
s/CGC /Arg /g
s/CGA /Arg /g
s/CGG /Arg /g
s/AUU /Ile /g
s/AUC /Ile /g
s/AUA /Ile /g
s/AUG /Met /g
s/ACU /Thr /g
s/ACC /Thr /g
s/ACA /The /g
s/ACG /Thr /g
s/AAU /Asn /g
s/AAC /Asn /g
s/AAA /Lys /g
s/AAG /Lys /g
s/AGU /Ser /g
s/AGC /Ser /g
s/AGA /Arg /g
s/AGG /Arg /g
s/GUU /Val /g
s/GUC /Val /g
s/GUA /Val /g
s/GUG /Val /g
s/GCU /Ala /g
s/GCC /Ala /g
s/GCA /Ala /g
s/GCG /Ala /g
s/GAU /Asp /g
s/GAC /Asp /g
s/GAA /Glu /g
s/GAG /Glu /g
s/GGU /Gly /g
s/GGC /Gly /g
s/GGA /Gly /g
s/GGG /Gly /g

Now I want to know if I can put these two scripts together in one file and if possible to clean up the script.

My input file (sample) to be used with the first script (script.sh) is here:

>Header_Sequence_1
GTACGACGGAGTGTTATAAGATGGGAAATCGGATACCAGATGAAATTGTGGATCGGTGCAAAA
GTCGGCAGATATCGTTGAAGTCATAGGTGATTATGTTCAATTAAAGAAGCAAGGCCGAAACTAC
TTTGGACTCTGTCCTTTTCATGGAGAAAGCACACCTTCGTTTTCCGTATCGCCCGACAAACAGAT
TTTTCATTGCTTTGGCTGCGGAGCGGGCGGCAATGTTTTCTCTTTTTTAAGGCAGATGGAAGGCT
ATTCTTTTGCCGAGTCGGTTTCTCACCTTGCTGACAAATACCAAATTGATTTTCCAGATGATATAA
CAGTCCATTCCGGAGCCCGGCCAGAG

>Header_Sequence_2
TCTTCTGGAGAACAAAAAATGGCTGAGGCACATGAGCTCCTGAAGAAATTTTACCATCATTTGT
TAATAAATACAAAAGAAGGTCAAGAGGCACTGGATTATCTGCTTTCTAGGGGCTTTACGAAAGA
GCTGATTAATGAATTTCAGATTGGCTATGCTCTTGATTCTTGGGACTTTATCACGAAATTCCTTGT
AAAGAGGGGATTTAGTGAGGCGCAAATGGAAAAAGCGGGTCTCCTGATCAGACGCGAAGACGGAAGCGGATATTTCGACCGCTTCAGAAACC
GTGTCATGTTTCCGATCCATGATCATCACGGGGCTGTTGTTGCTTTCTCAGGCAGGGCTCTTGG

>Header_Sequence_3
CCGCTGTATTCTCAGCCAAGCGGTATAGTCTCCGCTGTATTCTCAGCCCCAGCCGTTCCACTCAG
AGGAACTTTAAAGGATGTTCCTGTTGAGGGCTCATCATCGTCATCGTCATCATCATCATCATCAT
CATCATCATCATCATCAACATCAACCGTCGCACCAGCAAATAAGGCAAGAACTGGAGAAGACGC
AGAAGGCAGTCAAGATTCTAGTGGTACTGAAGCTTCTGGTAGCCAGGGTTCTGAAGAGGAAGG
TAGTGAAGACGATGGCCAAACTAGTGCTGCTTCCCAACCCACTACTCCAGCTCAAAGTGAAGGC
GCAACTACCGAAACCATAGAAGCTACTCCAAAAGAAGAATGCGGCACTTCATTTGTAATGTGGT
TCGGAGAAGGTACCCCAGCTGCGACATTGAAGTGTGGTGCCTACACTATCGTCTATGCACCTAT
AAAAGACCAAACAGATCCCGCACCAAGATATATCTCTGGTGAAGTTACATCTGTAACCTTTGAA
AAGAGTGATAATACAGTTAAAATCAAGGTTAACGGTCAGGATTTCAGCACTCTCTCTGCTAATTC
AAGTAGTCCAACTGAAAATGGCGGATCTGCGGGTCAGGCTTCATCAAGATCAAGAAGATCACT
CTCAGAGGAAACCAGTGAAGCTGCTGCAACCGTCGATTTGTTTGCCTTTACCCTTGATGGTGGT
AAAAGAATTGAAGTGGCTGTACCAAACGTCGAAGATGCATCTAAAAGAGACAAGTACAGTTTG
GTTGCAGACGATAAACCTTTCTATACCGGCGCAAACAGCGGCACTACCAATGGTGTCTACAGGT
TGAATGAGAACGGAGACTTGGTTGATAAGGACAACACAGT

to sum up what I do:

./script.sh input.txt

Which generates: DNA.out and RNA.out (with the error I mentioned), then:

./conversion.sh < RNA.out

where conversion.sh uses rna.sed

I hope I could make my questions clear and I appreciate your help.

RudiC · October 9, 2015, 6:35am

This one creates three files: DNA.OUT, RNA.OUT, and AminoAcids from the input file, and prints the numbers of non-empty and empty lines in the file:

awk '
/^[     ]*$/    {EMP++
                 next
                }
/^>/    {$0=""
        }

        {print > "DNA.out"
         gsub (/A/, "U")
         gsub (/C/, "c")
         gsub (/G/, "C")
         gsub (/T/, "A")
         gsub (/c/, "G")
         print > "RNA.OUT"
         gsub (/.../, "& ")
         gsub (/UU[CU] /, "Phe ")
         gsub (/UA[UC] /, "Tyr ")
         gsub (/GC[ACGU] /, "Ala ")
         gsub (/GG[ACGU] /, "Gly ")
         gsub (/CC[ACGU] /, "Pro ")
         gsub (/AC[ACGU] /, "Thr ")
         gsub (/GU[ACGU] /, "Val ")
         gsub (/(CG[ACGU]|AG[AG]) /, "Arg ")
         gsub (/(CU[ACGU]|UU[AG]) /, "Leu ")
         gsub (/(UC[ACGU]|AG[CU]) /, "Ser ")
         print > "AminoAcids"
        }

END     {print "lines: ", NR-EMP
         print "empty: ", EMP
        }
' file

I did NOT convert ALL of the RNA-Amino combinations from your sed file; should be easliy doable for you with the samples given...

sea · October 9, 2015, 6:53am

Hello Faizlo
Your error:

./script.sh: line 3: [: dna_input.txt: integer expression expected

is produced by comparing a string with an integer.

#!/bin/bash                                                                                                                      
#check to see if there is an input file:                                                                                         
if [ $1 -lt 1 ]
then
  echo "Usage: $0 file ..."
  exit 1
fi

You probably wanted either:
if [ -z "$1" ] (if its empty, show error)
or
if [ ${#1} -eq 1 ] (if its just one string/word)

hth

vbe · October 9, 2015, 8:35am

I wonder if

#!/bin/bash                                                                                                                      
#check to see if there is an input file:                                                                                         
if [ $1 -lt 1 ]
then
  echo "Usage: $0 file ..."
  exit 1
fi

which is wrong as you saw the error : You cant compare CHARS with Integer which is what is expected, should not be more like:

#!/bin/bash                                                                                                                      
#check to see if there is an input file:                                                                                         
if [ "$#" -lt 1 ]
then
  echo "Usage: $0 file ..."
  exit 1
fi

faizlo · October 9, 2015, 2:33pm

Thank you all for your help. The error has gone.

RudiC: thank you for your awk script.

Can my scripts be combined together in just one script? My script gives the frequencies of each Amino Acid. Also, I ask to learn how to add a while loop in a script if it takes an input of its own.

Thank you all again

jasahasch · October 10, 2015, 1:15am

I like my code to run in /tmp file when I am using a static files.

#!/bin/bash                                    
for i in /home/*/Desktop/test.txt; do cp $i /tmp/test.txt && echo "Usage: $i file ... " || echo "File is not where you think it should be";
sleep 3
done
exit 0

Consider trapping ( think that is what it called. Do a google) the file. So after checking that's the file exists with code above, follow up with ...

cd /tmp
sudo ./test.txt

And then within test.txt. Have this line of code. Which also checks the file was placed in /tmp. Then runs your test.txt file.
This is called trapping and insure the program will stop running when it is completed. You can also run sub routines under the trap before running main code body.

FILE=/tmp/$(basename $0)
trap ` cp /your/file/ /to/anydir && echo " ... " ; ` EXIT
if [ -e $FILE ]; then

RudiC · October 10, 2015, 6:24am

Yes. Done.

That wasn't too clear from your spec. Try

awk  '
BEGIN           {split ("UU[CU] UA[UC] GC[ACGU] GG[ACGU] CC[ACGU] AC[ACGU] GU[ACGU] (CG[ACGU]|AG[AG]) (CU[ACGU]|UU[AG]) (UC[ACGU]|AG[CU])", TMP1)
                 for (i=split ("Phe Tyr Ala Gly Pro Thr Val Arg Leu Ser", TMP2); i>0; i--)      {AACID[TMP2]
                                                                                                 BASES[TMP2]=TMP1
                                                                                                }
                 if (DEBUG) {for (t in TMP1) print TMP2[t], TMP1[t]}
                }


/^[     ]*$/    {EMP++
                 next
                }
/^>/            {$0=""
                }

                {print > "DNA.OUT"
                 gsub (/A/, "U")
                 gsub (/C/, "c")
                 gsub (/G/, "C")
                 gsub (/T/, "A")
                 gsub (/c/, "G")
                 print > "RNA.OUT"
                 gsub (/.../, "& ")
                 for (a in AACID) ACNT[a] += gsub (BASES[a], a)
                 print > "AminoAcids"
                }

END             {print "lines: ", NR-EMP
                 print "empty: ", EMP
                 for (a in ACNT) print a, ACNT[a]
                }
' file
lines:  28
empty:  2
Ser 54
Val 39
Tyr 16
Ala 18
Gly 25
Pro 30
Thr 48
Leu 45
Arg 36
Phe 37

Please note that there are several incomplete base triples in your file and that I did not cover all triples, just a few to prove the method working.

faizlo · October 10, 2015, 5:42pm

I have added other missing bases and AAcids to your awk code. Here is what I have now:

awk  '
BEGIN           {split ("UU[CU] UA[UC] GC[ACGU] GG[ACGU] CC[ACGU] AC[ACGU] GU[ACGU] (CG[ACGU]|AG[AG]) (CU[ACGU]|UU[AG]) (UC[ACGU]|AG[CU]) AU[ACU] AUG (UA[AG]|UAG) CA[AU] CA[AG] AA[UC] AA[AG] GA[CU] GA[AG] UG[CU] UGG", TMP1)
    for (i=split ("Phe Tyr Ala Gly Pro Thr Val Arg Leu Ser Ile Met STOP His Gln Asn Lys Asp Glu Cys Trp", TMP2); i > 0; i--)      {AACID[TMP2]
                                                                                                 BASES[TMP2]=TMP1
    }
    if (DEBUG) {for (t in TMP1) print TMP2[t], TMP1[t]}
}


/^[     ]*$/    {EMP++
                 next
}
/^>/            {$0=""
}

                {print > "DNA.OUT"
                    gsub (/A/, "U")
                    gsub (/C/, "c")
                    gsub (/G/, "C")
                    gsub (/T/, "A")
                    gsub (/c/, "G")
                 print > "RNA.OUT"
                 gsub (/.../, "& ")
                 for (a in AACID) ACNT[a] += gsub (BASES[a], a)
                 print > "AminoAcids"
                }

                END             {print "lines: ", NR-EMP
                    print "empty: ", EMP
                    for (a in ACNT) print a, ACNT[a]
                }
' file

Please pardon my question, but how do I run this from the command line? My input file that has the DNA sequences is called "input.txt"

This is how I tried to run it:

faizlo@faizlo $ awk -f rudic.awk input.txt 
awk: rudic.awk:1: awk  '
awk: rudic.awk:1:      ^ invalid char ''' in expression
awk: rudic.awk:1: awk  '
awk: rudic.awk:1:      ^ syntax error

I also tried to add:

#!/usr/bin/awk -f

at the beginning of the code but got the smae error(s.)

** rudic.awk is the script that has your awk script.

Don_Cragun · October 10, 2015, 9:41pm

Change the last line of rudic.awk from:

' file

to:

' input.txt

Then execute the script with:

sh rudic.awk

or make

rudic.awk

executable and execute it directly:

chmod +x rudic.awk
./rudic.awk

faizlo · October 10, 2015, 11:09pm

Interesting. I have done the executable step before but did not work!
It does now.
Thank you both for your help. I appreciate it.

---------- Post updated at 10:09 PM ---------- Previous update was at 10:03 PM ----------

I have one more question!

What should I do if I want the frequency from each line (sequence,) and not the total frequency of all sequences in the input file?

RudiC · October 11, 2015, 4:44am

Would this do:

awk  -vDEBUG=1 '
BEGIN           {C1 = split ("UU[CU] UA[UC] GC[ACGU] GG[ACGU] CC[ACGU] AC[ACGU] GU[ACGU] CG[ACGU]|AG[AG] CU[ACGU]|UU[AG] "\
                                "UC[ACGU]|AG[CU] AU[ACU] AUG UA[AG]|UAG CA[AU] CA[AG] AA[UC] AA[AG] GA[CU] GA[AG] UG[CU] UGG", TMP1)
                 for (C2=i=split ("Phe Tyr Ala Gly Pro Thr Val Arg Leu Ser Ile Met STOP His Gln Asn Lys Asp Glu Cys Trp", TMP2); i > 0; i--)    {AACID[TMP2]
                                                                                                                                                 BASES[TMP2]=TMP1
                                                                                                                                                }
                 if (DEBUG) {print C1, C2; for (t in TMP1) print TMP2[t], TMP1[t]}
                }


/^[     ]*$/    {EMP++
                 next
                }
/^>/            {$0=""
                }

                {print > "DNA.OUT"
                 gsub (/A/, "U")
                 gsub (/C/, "c")
                 gsub (/G/, "C")
                 gsub (/T/, "A")
                 gsub (/c/, "G")
                 print > "RNA.OUT"
                 gsub (/.../, "& ")
                 for (a in AACID)       {TMP = gsub (BASES[a], a)
                                         print NR, a, TMP                
                                         ACNT[a] += TMP
                                        }
                 print > "AminoAcids"
                }

END             {print "lines: ", NR-EMP
                 print "empty: ", EMP
                 for (a in ACNT) print a, ACNT[a]
                }
' file

?

---------- Post updated at 10:44 ---------- Previous update was at 10:25 ----------

A bit more structured approach:

awk  -vDEBUG=1 '
BEGIN           {Str1 = "UU[CU] UA[UC] GC[ACGU] GG[ACGU] CC[ACGU] AC[ACGU] GU[ACGU] CG[ACGU]|AG[AG] CU[ACGU]|UU[AG] "\
                        "UC[ACGU]|AG[CU] AU[ACU] AUG UA[AG]|UAG CA[AU] CA[AG] AA[UC] AA[AG] GA[CU] GA[AG] UG[CU] UGG" 
                 Str2 = "Phe Tyr Ala Gly Pro Thr Val Arg Leu Ser Ile Met STOP His Gln Asn Lys Asp Glu Cys Trp"
                 C1 = split (Str1, TMP1)
                 for (C2=i=split (Str2, TMP2); i > 0; i--)      {AACID[TMP2]
                                                                 BASES[TMP2]=TMP1
                                                                }
                 if (DEBUG) {print C1, C2; for (t in TMP1) print TMP2[t], TMP1[t]}

                 C1 = split ("ACGTc", PAT, "")
                 C2 = split ("UcCAG", REP, "")
                 if (DEBUG) {print C1, C2; for (p in PAT) print PAT[p], REP[p]}
                }


/^[     ]*$/    {EMP++
                 next 
                }
/^>/            {$0=""
                }

                {print > "DNA.OUT" 
                 for (i=1; i<=C1; i++)  gsub (PAT, REP)
                 print > "RNA.OUT" 
                 gsub (/.../, "& ")
                 for (a in AACID)       {TMP = gsub (BASES[a], a)
                                         print NR, a, TMP
                                         ACNT[a] += TMP  
                                        }
                 print > "AminoAcids"
                }

END             {print "lines: ", NR-EMP
                 print "empty: ", EMP   
                 for (a in ACNT) print a, ACNT[a]
                }
' file

(The parentheses in Str1 were a heritage from a former version - not needed anymore)

faizlo · October 11, 2015, 12:25pm

@RudiC:
Thank you so much for your help. I can't appreciate it more.
It will take me some time to understand the script as awk seems to be huge indeed.
Thank you so much again.