I have a file that looks like this:
>ID 1
AATAATTCCGGATCGTGC
>ID 2
TTTGACAGTAGAC
>ID 3
AGACGATGACGAT
I am using the following script to report whether AATTCCGGATCG is present in any sequence:
awk 'FNR==1{n=substr(FILENAME,1,index(FILENAME,".")-1)} { print n "\t" (/AATTCCGGATCG|CGATCCGGAATT/ ? "ATCG" : "NOT Present" ) }'
However, what I really need is the four characters right after the given string (AATTCCGG), which in my example are ATCG. Importantly, the string can also occur reverse-complemented, that is, reversed (GGCCTTAA) and then complemented (A=T; T=A; C=G; G=C), which gives CCGGAATT in the sequence. If the string is found reverse-complemented, the four characters after it (which, in the sequence as written, are the four characters just before CCGGAATT) must also be reported reverse-complemented. Thus, the desired output from a file containing the following sequences:
>ID 1
AATAATTTTGGATCGTGC
>ID 2
TTTGACGTTCCGGAATTCAGTAGAC
>ID 3
AGACGATGACGAT
would be AACG, since sequence 2 contains the corresponding string, only reversed and complemented.
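For illustration, the behaviour described above could be sketched in awk roughly as follows. This is only a minimal sketch under several assumptions: the function name flank4 is made up for this example, each sequence sits on a single line, only the first hit per orientation on a line is reported, and sequences contain only uppercase A, C, G and T.

```shell
# flank4 is a hypothetical helper name; the motif AATTCCGG is taken from the question.
flank4() {
  awk -v motif="AATTCCGG" '
    # Reverse-complement a DNA string (A<->T, C<->G).
    function revcomp(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--)
        out = out comp[substr(s, i, 1)]
      return out
    }
    BEGIN {
      comp["A"] = "T"; comp["T"] = "A"; comp["C"] = "G"; comp["G"] = "C"
      rc = revcomp(motif)              # CCGGAATT for the motif above
    }
    /^>/ { next }                      # skip FASTA header lines
    {
      # Forward hit: print the 4 characters right after the motif.
      p = index($0, motif)
      if (p && length($0) >= p + length(motif) + 3)
        print substr($0, p + length(motif), 4)
      # Reverse-complement hit: take the 4 characters just before it
      # and print them reverse-complemented.
      p = index($0, rc)
      if (p > 4)
        print revcomp(substr($0, p - 4, 4))
    }
  ' "$@"
}
```

Called with a file name as its argument (or with the sequences on standard input), this should print one line per hit. Because it matches only the 8-character string and then reads the neighbouring characters, a mutation in the four flanking positions would not prevent a hit.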
My script can deal with the fact that the sequence is reversed/complemented. However, if any of the positions after the string is mutated, it will not detect it. That is why I would rather get the characters instead.
Any help will be greatly appreciated.
Thanks
PS. The string, in this case AATTCCGG or CCGGAATT, will never be mutated in a real scenario.