Tricky task with DNA sequences.

Xterra · August 31, 2011, 3:56pm

I am trying to reverse and complement my DNA sequences. The file format is FASTA, something like this:

Now, to reverse the sequence, I should start reading from right to left. At the same should be complemented. Thus, "A" should be read as "T"; "C" should be read as "G"; "T" should be converted into "A"; and "G" must be seen as "C" in the reversed-complemented sequence. From my example, the resulting file should look something like this:

Any help will be greatly appreciated!

sulti · August 31, 2011, 4:13pm

You can use sth like that to reverse and complement dna chain (given as parameter to the script):

#!/bin/bash

# get chain as parameter
dna=$1

function rrev() {
	case $1 in
		"T")
		echo -n A
		;;
		"A")
		echo -n T
		;;
		"G")
		echo -n C
		;;
		"C")
		echo -n G
		;;
	esac
}

reversed=`echo $dna | rev`

for i in $(seq 0 $((${#dna}-1))); do
	rrev ${reversed:$i:1};
done
echo

Run script as follow:

$ ./rev.sh ATCATATCCA
TGGATATGAT

radoulov · August 31, 2011, 4:13pm

Is there some algorithm behind or this is a fixed mapping?
I mean there are only A, C, G and T and T, G, C and A?

radoulov · August 31, 2011, 4:18pm

If the answer is yes:

perl -nle'BEGIN {
  @map{ A, C, G, T } = ( T, G, C, A )
  }
  print /^>/ ?
    $_ :
      join //, map $map{ $_ }, split //, scalar reverse
  ' infile

sulti · August 31, 2011, 4:20pm

Ok, got one, which takes file as parameter:

#!/bin/bash

file=$1

function rrev() {
	case $1 in
		"T")
		echo -n A
		;;
		"A")
		echo -n T
		;;
		"G")
		echo -n C
		;;
		"C")
		echo -n G
		;;
	esac
}

while read line; do

	if [ `echo $line | grep -c "^>"` -eq 0 ]; then
		orig_revd=`echo $line | rev`
		for i in $(seq 0 $((${#line}-1))); do
        		rrev ${orig_revd:$i:1};
		done
		echo
	else
		echo $line
	fi

done < $file

Edit:
@up - looks much simplier in perl

radoulov · August 31, 2011, 4:34pm

With awk:

awk 'BEGIN {
  j = n = split("A C G T", t)
  for (i = 0; ++i <= n;)
    map[t] = t[j--]
  }
{
  if (/^>/) print
  else {
    for (i = length; i; i--)
	  printf "%s", map[substr($0, i, 1)]
    print x
	}	  
  }'  infile

Xterra · September 1, 2011, 9:36am

That's great! The only part that is missing is the headers (>Sequence ID) for each sequence. How can I modify your perl script so it can include the headers for each and every sequence?
Thanks!

radoulov · September 1, 2011, 9:39am

I think they are included in the output:

zsh-4.3.12[t]% cat infile 
>Influenza
TGGATATGAT
>Dengue
AGTTACTCCGAAA
>Varicella
CTTTGGATA
zsh-4.3.12[t]% perl -nle'BEGIN {
quote>   @map{ A, C, G, T } = ( T, G, C, A )
quote>   }
quote>   print /^>/ ?
quote>     $_ :
quote>       join //, map $map{ $_ }, split //, scalar reverse
quote>   ' infile
>Influenza
ATCATATCCA
>Dengue
TTTCGGAGTAACT
>Varicella
TATCCAAAG

Am I missing something?

Xterra · September 1, 2011, 10:13am

This is the output I am getting

Now I am not sure, I am missing something?

---------- Post updated at 10:13 AM ---------- Previous update was at 10:12 AM ----------

Input file was not the correct one!
It is working like a charm!
Thanks!