Reverse complement

Xterra · July 25, 2015, 10:27pm

I want to reverse some DNA sequences and complement them at the same time. Thus, A changes to T; C to G; T to A and G to C.
example:
infile

>GHL8OVD01CMQVT SHORT1
TTGATGT
>GHL8OVD01CMQVT SHORT2
TTGATGT

outfile:

>GHL8OVD01CMQVT SHORT1
ACATCAA
>GHL8OVD01CMQVT SHORT2
ACATCAA

The Identifier (> XXXXX) should not be modified
This is the code I want to modify:

awk ' !(NR%2) ' infile | rev | tr ACGT TGCA

However, the Ids are not being printed. If I include NR%2, the Ids will also be reverse complemented.
I know I can always use perl:

perl -nle'BEGIN {
  @map{ A, C, G, T } = ( T, G, C, A )
  }
  print /^>/ ?
    $_ :
      join //, map $map{ $_ }, split //, scalar reverse
  ' infile

But I am trying to simplify the script so I can explain it better

Don_Cragun · July 25, 2015, 10:55pm

Can't you apply what you learned from your thread Cut & awk four days ago to this thread? You have exactly the same problem: you are assuming that some elements of a pipeline will only process some of the lines they are fed, or that lines thrown away by some element of a pipeline will still magically appear in your output.

You didn't ask any questions about the suggestions you were given there, so we assume that you understand how those suggestions work.
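
To see the point being made here, it can help to look at what each stage of the original pipeline actually receives. This is only a sketch: the make_sample, stage1 and full_pipeline helpers are made up for the example and simply recreate the sample data from the first post.

```shell
# Hypothetical helper: recreate the sample input from the first post.
make_sample() {
    printf '%s\n' '>GHL8OVD01CMQVT SHORT1' 'TTGATGT' \
                  '>GHL8OVD01CMQVT SHORT2' 'TTGATGT'
}

# Stage 1: awk keeps only the even-numbered lines, so the ">" lines are
# discarded here and never reach rev or tr.
stage1() { make_sample | awk '!(NR % 2)'; }

# Stages 2 and 3 can only transform what stage 1 passed on.
full_pipeline() { stage1 | rev | tr ACGT TGCA; }

stage1          # prints the two sequence lines only
full_pipeline   # prints the two reverse complements, still without Ids
```

Whatever prints the Ids therefore has to be the same program that decides which lines get transformed.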

Aia · July 26, 2015, 12:57am

perl -ple 'y/ACGT/TGCA/ and $_ = reverse unless /^>/' infile
>GHL8OVD01CMQVT SHORT1
ACATCAA
>GHL8OVD01CMQVT SHORT2
ACATCAA

RudiC · July 26, 2015, 2:39am

sed approach (not necessarily easier to explain...):

sed '/>/n; y/ATCG/TAGC/;s/^.*$/X&X/;:x;s/\(X.\)\(.*\)\(.X\)/\3\2\1/;tx;s/X//g' file
>GHL8OVD01CMQVT SHORT1
ACATCAA
>GHL8OVD01CMQVT SHORT2
ACATCAA

or, if you have GNU sed with its extensions:

sed '/>/n; y/ATCG/TAGC/;s/^/echo /;s/$/ | rev/;e' file
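
For anyone trying to follow the first sed command, here is the same script spread over several lines with comments. This is a sketch rather than a replacement: revcomp_sed is a name made up for the example, and it assumes the sequences never contain a literal X, since X is used as the marker. Putting each command on its own line also avoids relying on a semicolon after the :x label, which not every sed accepts.

```shell
# Hypothetical wrapper around the sed script above, one command per line.
revcomp_sed() {
    sed '
# Id line: print it, then load the next (sequence) line.
/>/n
# Complement each base.
y/ATCG/TAGC/
# Put an X marker at each end of the line.
s/^.*$/X&X/
:x
# Swap the character after the left X with the character before the
# right X; both markers move one step inward.
s/\(X.\)\(.*\)\(.X\)/\3\2\1/
# If a swap happened, go round again.
tx
# Markers have met in the middle: remove them.
s/X//g
' "$@"
}

# Usage: XAACTACAX -> AXACTACXA -> ACXCTAXAA -> ACAXTXCAA -> ACATCAA
printf '>id one\nTTGATGT\n' | revcomp_sed
```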

Don_Cragun · July 26, 2015, 5:53pm

Private message from Xterra
Subject: linux commands "inside" awk

Don
Maybe I did not explain myself clearly. I would like to learn if I can use regular Linux commands such as cut, tr, rev, inside the awk script.
My question about awk and cut was answered using substr (a nice alternative, though), and I have been working on this:

Code:
awk '{ for(i=length;i!=0;i--) x=(x substr($0,i,1))}{print x;x=""}'

I love Aia's solution for this task

Code:
perl -ple 'y/ACGT/TGCA/ and $_ = reverse unless /^>/'

Which is nicer than the code I have been using thus far

Code:
perl -nle'BEGIN {
@map{ A, C, G, T } = ( T, G, C, A )
}
print /^>/ ?
$_ :
join //, map $map{ $_ }, split //, scalar reverse
'

Still, I would like to know if rev | tr can be used within the awk code for this particular example, so that it is applied only to the even-numbered lines rather than to all lines of the infile. I do not want to transform the infile into a temporary file, reverse and complement, and then rebuild the file like my example for sort

Code:
awk '{printf("%s%s",$0,(NR%2)?"\t":"\n")}' input|sort -rk 3|tr '\t' '\n'

(thanks once again for that)
I have several solutions but now I am trying to learn "easier" ways to write my scripts. Maybe this is not possible and that's why I asked. I have been working on it, I just do not seem to get it to work to satisfy my requirements
Cheers!

Hi Xterra,
You can use system() to run shell commands inside awk, but invoking a shell to invoke rev and tr once for each even numbered line in your file will take at least two orders of magnitude longer to run than building equivalent functionality into your awk script. If we write an awk script to print odd numbered lines and feed even numbered lines through rev and tr:

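#!/bin/ksh
IAm=${0##*/}
# Temporary file named after this script and its process ID
tmpf="$IAm.$$"
awk -v tmpf="$tmpf" '
# Odd numbered lines (Ids): print unchanged
FNR % 2
# Even numbered lines (sequences): hand each one to rev | tr
!(FNR % 2) {
	print > tmpf
	# Close so the data is on disk and the next print starts a fresh file
	close(tmpf)
	system("rev \"" tmpf "\" | tr ACGT TGCA")
}
' "${1:-infile}"
rm -f "$tmpf"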

it is easy to understand and, with an input file containing 10,000 copies of your sample input file, the average of timing 10 runs (with output redirected to a file) is about:

real	1m5.37s
user	0m41.09s
sys	0m49.33s
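
If the temporary file is what bothers you, awk can also read the output of a command directly with cmd | getline. The following is only a sketch: revcomp_getline is a name invented for the example, and it assumes sequence lines contain nothing but letters, because the line is pasted into a shell command without quoting. It still starts a shell, rev and tr for every sequence line, so I would not expect it to be much faster than the script above (I have not timed it).

```shell
# Hypothetical wrapper: run "rev | tr" from inside awk via cmd | getline.
revcomp_getline() {
    awk '
    # Odd numbered lines (Ids) pass through untouched.
    FNR % 2 { print; next }
    {
        # Build a pipeline for this one line. ASSUMPTION: the line holds
        # only letters; anything special to the shell would need quoting.
        cmd = "echo " $0 " | rev | tr ACGT TGCA"
        cmd | getline out    # read the single line the pipeline prints
        close(cmd)           # so an identical command can be run again
        print out
    }' "$@"
}

# Usage: reads the named files, or standard input if none are given.
printf '>a\nTTGATGT\n>b\nAACC\n' | revcomp_getline
```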

A similar awk script building the rev and tr functionality into an internal function:

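#!/bin/ksh
awk '
# c[] maps each base to its complement
BEGIN {	c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }
# Return the reverse complement of the current line; i and o are locals
function revcomp(	i, o) {
	o = ""
	for(i = length; i > 0; i--)
		o = o c[substr($0, i, 1)]
	return(o)
}
# Rewrite even numbered (sequence) lines in place
!(FNR % 2) {$0 = revcomp()}
# Print every line, changed or not
1' "${1:-infile}"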

produces exactly the same output and takes about:

real	0m0.16s
user	0m0.15s
sys	0m0.00s

In other words, this awk script processes a little more than 800 lines in the time it takes to process 2 lines when firing up a pipeline to process the even lines.
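
The figure of a little more than 800 lines follows from the two real times quoted above (1m5.37s is 65.37 seconds). As a quick check of the arithmetic, with speed_ratio being a throwaway name for the example:

```shell
# Ratio of the two "real" times, and how many lines the faster script
# handles while the slower one handles 2 (one Id plus one sequence).
speed_ratio() {
    awk 'BEGIN { r = 65.37 / 0.16; printf "%.0f %.0f\n", r, 2 * r }'
}

speed_ratio   # prints: 409 817
```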

The average timing for Aia's perl suggestion was:

real	0m0.03s
user	0m0.02s
sys	0m0.01s

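The BSD based sed on OS X produced the wrong output (with leading and trailing X characters on even numbered lines; the lines had been translated but not reversed) without producing any diagnostics when running RudiC's sed script. The most likely reason is that this sed treats everything after the : up to the end of the line as the label name, so the substitution loop and the final s/X//g were never run as commands. But an equivalent command (splitting on semicolons into separate sed editing commands):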

sed -e '/>/n' -e 'y/ATCG/TAGC/' -e 's/^.*$/X&X/' -e ':x' -e 's/\(X.\)\(.*\)\(.X\)/\3\2\1/' -e 'tx' -e 's/X//g' infile

produced the expected output with average timing output of:

real	0m0.09s
user	0m0.09s
sys	0m0.00s
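
To repeat this kind of comparison on your own machine, a sketch along the following lines builds a larger test file and checks that two of the approaches agree before you time them. The make_big, revcomp_awk and revcomp_sed_e names and the file names are made up for the example, and your timings will differ from the ones quoted in this thread.

```shell
# Hypothetical helper: write COPIES copies of the 4-line sample to OUTFILE.
make_big() {   # usage: make_big COPIES OUTFILE
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++)
            print ">GHL8OVD01CMQVT SHORT1\nTTGATGT\n>GHL8OVD01CMQVT SHORT2\nTTGATGT"
    }' > "$2"
}

# The awk approach with the reverse complement built in.
revcomp_awk() {
    awk '
    BEGIN { c["A"] = "T"; c["C"] = "G"; c["G"] = "C"; c["T"] = "A" }
    !(FNR % 2) {
        o = ""
        for (i = length($0); i > 0; i--) o = o c[substr($0, i, 1)]
        $0 = o
    }
    1' "$@"
}

# The sed approach, one -e per editing command.
revcomp_sed_e() {
    sed -e '/>/n' -e 'y/ATCG/TAGC/' -e 's/^.*$/X&X/' -e ':x' \
        -e 's/\(X.\)\(.*\)\(.X\)/\3\2\1/' -e 'tx' -e 's/X//g' "$@"
}

tmp=$(mktemp -d)                      # scratch directory for the test files
make_big 10000 "$tmp/big.fa"          # 10,000 copies = 40,000 lines
revcomp_awk   "$tmp/big.fa" > "$tmp/out.awk"
revcomp_sed_e "$tmp/big.fa" > "$tmp/out.sed"
cmp "$tmp/out.awk" "$tmp/out.sed" && echo "outputs match"
```

In ksh or bash, where time is a shell keyword, each call can then be prefixed with time to get numbers comparable to the ones above.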