Correct use of substr

Xterra · April 18, 2017, 2:47pm

I have a file that looks like this:

 >ID_1
 ATGCATGC
 >ID_2
 ATGCATGC
 >ID_3
 ATGCATGC
 >ID_4
 ATGCATGC

And I am using the following script to "extract" specific positions from the sequences:

 awk '/^>/{id=$0; next}{ print id "\n" substr( $1,1,1 ) substr ($1,4,2 ) substr ($1,7,1) }' test.txt

It actually works, but I suspect it is the wrong way to use substr. This is the output:

 >ID_1
 ACAG
 >ID_2
 ACAG
 >ID_3
 ACAG
 >ID_4
 ACAG

Ideally, what I would like to do is use a file, positions.txt, containing the sites I would like to extract:

I would appreciate it if anyone could point me in the right direction.
Thanks in advance!

Corona688 · April 18, 2017, 3:24pm

Actually -- I see nothing wrong. That's how strings, substr, concatenation, and variables work in awk.
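
To make that concrete, here is a minimal sketch of how substr(string, start, length) and concatenation behave, using the sequence from the question. The names seq and picked are only there for illustration.

```shell
# substr(s, start, length) counts from 1; putting strings side by side concatenates them.
picked=$(awk 'BEGIN {
    seq = "ATGCATGC"
    # 1 char at position 1, 2 chars starting at position 4, 1 char at position 7
    print substr(seq, 1, 1) substr(seq, 4, 2) substr(seq, 7, 1)
}')
printf '%s\n' "$picked"    # prints ACAG
```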

Anyway, the code you wanted:

awk '   NR==FNR { POS[++P]=$1+0 ; next } # Load into array POS while in file 1
        /^>/ { print ; next } # Print IDs immediately
        {
                S="";
                for(N=1; N in POS; N++) S=S substr($0, POS[N], 1); # Assemble substrings
                print S; # Print
        }' positions.txt inputfile

NR==FNR is an old trick. NR is the total cumulative number of lines, while FNR is the line number in the current file. The two are equal only while awk is processing its first file.
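
A small sketch of how the two counters move across two files. The file names first.txt and second.txt are throwaway examples, not anything from the original question.

```shell
# Two throwaway files (hypothetical names) to show NR and FNR side by side.
tmp=$(mktemp -d)
printf '1\n4\n' > "$tmp/first.txt"
printf 'x\ny\nz\n' > "$tmp/second.txt"

# For every input line, print both counters and whether NR==FNR holds.
nr_demo=$(awk '{ print "NR=" NR, "FNR=" FNR, (NR==FNR ? "first" : "later") }' \
    "$tmp/first.txt" "$tmp/second.txt")
printf '%s\n' "$nr_demo"
rm -rf "$tmp"
```

One caveat: if the first file is empty, NR==FNR is also true while reading the second file, so the trick assumes the first file has at least one line.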

Xterra · April 18, 2017, 3:38pm

Thanks a TON Corona!
Could you please explain the following part of your code:

 POS[++P]=$1+0

Once again thank you very much!

Corona688 · April 18, 2017, 3:41pm

++P is the pre-increment operator, which increments the variable before it's used, so the indexes go POS[1], POS[2], POS[3], ...

If I'd used the post-increment operator, P++, it would do POS[""], POS[1], POS[2], ... because unset variables are blank strings.

$1+0 is to make sure awk stores it as a number, not a string. Doing any arithmetic on a string converts it into a number. Might not be necessary here.
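
A quick sketch of what the +0 does to a field. The two input lines (04 and 4bp) are made-up examples, not values from the actual positions.txt.

```shell
# Show each field as read, and the number awk makes of it when 0 is added.
coerced=$(printf '04\n4bp\n' | awk '{ print $1 " -> " ($1 + 0) }')
printf '%s\n' "$coerced"
# 04 -> 4
# 4bp -> 4
```

substr converts its position argument to a number anyway, which is why the +0 is probably optional in this script.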

Xterra · April 18, 2017, 3:56pm

Got it! Just one more quick question, how could I change the output field separator for substr from "" to " " ? In other words, how can I modify your script so I can generate the following output:

 >ID_1
 A C A G
 >ID_2
 A C A G
 >ID_3
 A C A G
 >ID_4
 A C A G

MadeInGermany · April 18, 2017, 4:42pm

Interesting: shouldn't the integer operator p++ immediately cast to an integer, i.e. give 0?
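
A quick way to check, as a sketch assuming a POSIX-style awk (gawk, mawk and the like): store one element with the post-increment form and print the index it ended up under. The variable idx is just for this test.

```shell
idx=$(awk 'BEGIN {
    POS[P++] = "first"                  # post-increment on an unset variable
    for (k in POS) printf "[%s]\n", k   # show the index that was actually used
}')
printf '%s\n' "$idx"    # prints [0]
```

The value of P++ is the numeric value of P before the increment, so the first index is 0, not an empty string.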

---------- Post updated at 15:42 ---------- Previous update was at 14:56 ----------

Because the output is assembled in a variable there is no simple OFS option.
Two solutions,

with a separator variable

                S=sep=""
                for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" " } # Assemble substrings

with an embedded if clause

                S=""
                for(N=1; N in POS; N++) S=S (S=="" ? S : " ") substr($0, POS[N], 1) # Assemble substrings

Xterra · April 18, 2017, 6:49pm

I guess I am doing something wrong because I am only printing the headers.
I can modify the file using sed, but I would really like to get a feel for how to do it with awk.

RudiC · April 19, 2017, 3:25am

Mind showing us WHAT you're doing wrong?

Xterra · April 19, 2017, 7:33am

Rudi,
I was referring to MadeInGermany's post:

Because the output is assembled in a variable there is no simple OFS option.
Two solutions,

with a separator variable
S=sep="" for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" " }# Assemble substrings 
with an embedded if clause
 S="" for(N=1; N in POS; N++) S=S (S=="" ? S : " ") substr($0, POS[N], 1) # Assemble substrings 
 

For the moment I solved the problem with sed:

 sed '/^>/!s/A/A\t/g'

MadeInGermany · April 19, 2017, 7:58am

You have joined the two lines that I gave you.
If you do that, you need a semicolon between them.
And of course you still need the print S.

Xterra · April 19, 2017, 8:38am

Got it!

awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=""; for(N=1; N in POS; N++) S=S (S=="" ? S : " " ) substr($0, POS[N], 1); print S; }' positions.txt test.txt

 >ID_1
 G C A G
 >ID_2
 A C A G
 >ID_3
 T C A G
 >ID_4
 C C A G

I cannot make this one work, though:

 awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=sep=""; for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); print S; sep=" "; }}' positions.txt test.txt

 >ID_1
 G
 G C
 G C A
 G C A G
 >ID_2
 A
 A C
 A C A
 A C A G
 >ID_3
 T
 T C
 T C A
 T C A G
 >ID_4
 C
 C C
 C C A
 C C A G

MadeInGermany · April 19, 2017, 10:20am

The latter has the print S inside the loop; it needs to come after the loop:

awk ' NR==FNR { POS[++P]=$1+0; next } /^>/ { print; next } { S=sep=""; for (N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" "; } print S; }' positions.txt test.txt
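
For anyone who wants to try this end to end, here is a self-contained sketch. The contents of positions.txt were not shown in the thread, so the layout below (one 1-based position per line: 1, 4, 5, 7) is an assumption, and test.txt is cut down to two records from the first post.

```shell
tmp=$(mktemp -d)
# test.txt: first two records as shown at the top of the thread
printf '>ID_1\nATGCATGC\n>ID_2\nATGCATGC\n' > "$tmp/test.txt"
# positions.txt: ASSUMED layout, one 1-based position per line
printf '1\n4\n5\n7\n' > "$tmp/positions.txt"

spaced=$(awk '
    NR==FNR { POS[++P] = $1 + 0; next }   # first file: load positions
    /^>/    { print; next }               # header lines pass through
    {
        S = sep = ""
        for (N = 1; N in POS; N++) { S = S sep substr($0, POS[N], 1); sep = " " }
        print S                           # print once, after the loop
    }' "$tmp/positions.txt" "$tmp/test.txt")
printf '%s\n' "$spaced"
rm -rf "$tmp"
```

With those assumed positions each sequence line comes out as A C A G.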

Xterra · April 19, 2017, 10:59am

Got it!
Thanks!