Reporting characters after string

Xterra · April 10, 2016, 1:46am

I have a file that looks like this:

>ID 1
AATAATTCCGGATCGTGC
>ID 2
TTTGACAGTAGAC
>ID 3
AGACGATGACGAT

I am using the following script to report if AATTCCGGATCG is present in any sequence:

awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }

However, what I really need is the four characters right after the given string (AATTCCGG) , in my example= ATCG . Importantly, the string can be found reversed GGCCTTAA and complemented A=T; T=A; C=G and G=C , originating the following string = CCGGAATT in the sequence. If the string is found reversed and complemented, the four characters after the string must be reported as reversed and complemented. Thus, the desired output from a file containing the following sequences:

>ID 1
AATAATTTTGGATCGTGC
>ID 2
TTTGACGTTCCGGAATTCAGTAGAC
>ID 3
AGACGATGACGAT

would be AACG , since sequence 2 contains the corresponding string, only reversed and complemented.
My script can deal with the fact that the sequence is reversed/complemented. However, if any of the positions after the string is mutated, it will not detect it. That's is why I would rather get the characters instead
Any help will be greatly appreciated
Thanks
PS. The string , in this case AATTCCGG or CCGGAATT will never be mutated in a real scenario.

RudiC · April 10, 2016, 2:34am

If I see it correctly, that reversed/complemented string sits BEFORE your search pattern? If my memory serves me right, you had been given some sort of "algorithm/function" to reverse/complement strings; mayhap you could apply those?

Scrutinizer · April 10, 2016, 6:51am

Something like this?

awk -v len=4 -v string=AATTCCGG '
  BEGIN {
    FS=RS; RS=">"; OFS=""
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"  
  }
  function reverse_complement(s,	t,i) {
    for(i=1;i<=length(s);i++)
      t=C[substr(s,i,1)] t
    return t
  } 
  FNR==1{
    split(FILENAME, F, ".")
    next
  } 
  { 
    label=$1
    $1=""
    rec=$0 FS reverse_complement($0)
    while(match(rec,string)) { 
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+1)
    }
  }
' file

Output:

file1:ID 1:ATCG
file2:ID 2:AACG

Xterra · April 10, 2016, 7:32pm

Thanks! It works; however, it is taking a lot of time to run. I will be searching hundreds of strings among thousands of files; and I suspect it will take way too long. This is the time I got when using your script in one real dataset searching for only one string:

real    0m28.877s
user    0m28.734s
sys     0m0.015s

Same dataset with my old script:

awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }'

I got this:

real    0m0.074s
user    0m0.031s
sys     0m0.030s

It is pretty fast but it does not report the last 4 characters so it's no good.
I kinda get what I want using the following sed scripts:

sed -n 's/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p' file
sed -n 's/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p' file | tr ATGC TACG

The timing for the individual script, again using the same dataset mentioned above:

real    0m0.482s
user    0m0.405s
sys     0m0.138s

I would like to combine both sed script into one. Maybe using an IF statement in bash. Even though I would like to avoid bash if at all possible
I am also not sure how to modify so it can output the last three characters after the string

Almost there

 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)CCGGAATT[ATCG]*$/\4\3\2\1/p; y/ATCG/TAGC/'

However, the last script y/ATCG/TAGC/ , is being ignored
I solved the problem with y/ATCG/TAGC/

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; /AATTCCGG/!y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

But if I add 0,/y/ATCG/TAGC/ to limit the extent of the script to the first occurrence exclusively
I guess I got it:

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

Scrutinizer · April 10, 2016, 11:29pm

About the awk approach:
What OS are you using and what awk ?
How many lines are there and how long are they?
Could you try the same using mawk , which usually is the fastest awk available (perhaps you can install a package onto your system?)...

Also, could you try replacing:

rec=substr(rec, RSTART+1)

with

rec=substr(rec, RSTART+RLENGTH+len+1)

What it did was re-examing the string that was found to see if there is another match with a part of the same sequence, perhaps that is not necessary..

Xterra · April 11, 2016, 8:02pm

I am using Biolinux 8
My files contain thousand of lines; some of those lines contain >300,000 characters
I replace rec=substr(rec, RSTART+1) with rec=substr(rec, RSTART+RLENGTH+len+1) but I am not getting the desired output

Scrutinizer · April 11, 2016, 9:35pm

OK, that should be:

rec=substr(rec, RSTART+RLENGTH+len)

could you try again ?

Also since you are using GNU awk, you could try replacing the function with

  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F] t
    return t
  }

which should also work for mawk ..

A mawk package is available for Bio-Linux you should be able to just install it, if it is not installed on your system by default. Could you try and test with that and see if the results are correct and what the performance results are? The difference can be extraordinary in certain cases..

Xterra · April 11, 2016, 10:26pm

Your script with the latest mods:

 real 9m13.651s
 user 9m13.210s
 sys 0m0.436s

my sed script:

 sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p'

This is what I got:

 real 1m10.950s
 user 1m10.715s
 sys 0m0.234s

I was wondering if there is any way I can limit the extent of either script to let say the first 10 occurrences only? That will significantly reduce the running time, and still allow me to 'sample' the data sufficiently to identify the consensus string for each file
Thanks a TON!

Scrutinizer · April 11, 2016, 11:39pm

10 occurrences per line ? Or per file?
What is you expected output?

Could you repeat the results with mawk , do you know how to install it?

--
Your GNU sed script will only find one occurrence per line and one occurrence per set of files of regular and reversed/complemented versions (the latter because only part of the file is reversed). Any additional patterns will not be shown and neither will it be shown which files or records these belong to, is that as intended? In the sample in post #1 one it printed the filename and could take multiple files...

Since it only reverse part of the file(s) and searches the whole file(s) there is a risk that it will find a reversed match in a non-reversed part of the file, which would mean a false positive .
To counteract that, you would need something like this, using GNU sed:

sed -n '0,/AATTCCGG/s/^[ATCG]*AATTCCGG\(.\)\(.\)\(.\)\(.\)[ATCG]*$/\1\2\3\4/p; 0,/CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}' file*

So the sed script it looking for very different things than the awk script us, and is only suited to investigate of there is one occurrence of either pattern in a single (set of) files, and for multiple files you would need a shell loop, which would significantly slow down processing, whereas the awk version can scan multiple files at once.

Xterra · April 12, 2016, 2:52pm

I will install mawk and report back
Sorry, my code should be as follows:

 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/p; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/p;}'

I can search all occurrences in each and every line using global :

 sed -n '/AATTCCGG/s/^[ATCG]*AATTCCGG\(....\)[ATCG]*$/\1/pg; /CCGGAATT/{y/ATCG/TAGC/; s/^[ATCG]*\(.\)\(.\)\(.\)\(.\)GGCCTTAA[ATCG]*$/\4\3\2\1/pg;}'

I still wondering how would you limit your awk script to only 10 occurrences in the file

Scrutinizer · April 12, 2016, 3:22pm

No, that will not fly. This new sed code will match multiple lines per file, but whether you use global or not, this will still only one occurrence of a regular match or one reversed/complemented match per line, the latter only if there is no regular match on that line...

--
You could limit to 10 matches per file, like so, try:

awk -v len=4 -v string=AATTCCGG -v max=10 '
  BEGIN {
    FS=RS; RS=">"; OFS=""
    C["A"]="T"; C["T"]="A"; C["C"]="G"; C["G"]="C"  
  }
  function reverse_complement(s,        t,i,n,F) {
    n=split(s,F,"")
    for(i=1;i<=n;i++)
      t=C[F] t
    return t
  }
  FNR==1{
    split(FILENAME, F, ".")
    c=1
    next
  } 
  { 
    label=$1
    $1=""
    rec=$0 FS reverse_complement($0)
    while(c<=max && match(rec,string)) { 
      print F[1] ":" label ":" substr(rec,RSTART+RLENGTH, len)
      rec=substr(rec, RSTART+RLENGTH+len)
      c++
    }
  }
' file*.txt