Correct use of substr

I have a file that looks like this:

 >ID_1
 ATGCATGC
 >ID_2
 ATGCATGC
 >ID_3
 ATGCATGC
 >ID_4
 ATGCATGC

And I am using the following script to "extract" specific positions from the sequences:

 awk '/^>/{id=$0; next}{ print id "\n" substr( $1,1,1 ) substr ($1,4,2 ) substr ($1,7,1) }' test.txt
 

It actually works, but I suspect this is the wrong way to use substr. This is the output:

 >ID_1
 ACAG
 >ID_2
 ACAG
 >ID_3
 ACAG
 >ID_4
 ACAG
 

Ideally, I would like to use a file, positions.txt, containing the sites I want to extract:

 1
 4
 5
 7
 

I would appreciate it if anyone could point me in the right direction.
Thanks in advance!

Actually -- I see nothing wrong. That's how strings, substr, concatenation, and variables work in awk.
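
For reference, substr(s, m, n) returns at most n characters of s starting at the 1-based position m, and writing expressions next to each other concatenates them. A minimal check of the one-liner's logic on the sample sequence (the shell variable name result is only for illustration):

```shell
# substr(s, m, n): n characters of s, starting at 1-based position m.
# Adjacent expressions are concatenated into one string.
result=$(awk 'BEGIN { s = "ATGCATGC"; print substr(s, 1, 1) substr(s, 4, 2) substr(s, 7, 1) }')
echo "$result"    # ACAG
```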

Anyway, the code you wanted:

awk '   NR==FNR { POS[++P]=$1+0 ; next } # Load into array POS while in file 1
        /^>/ { print ; next } # Print IDs immediately
        {
                S="";
                for(N=1; N in POS; N++) S=S substr($0, POS[N], 1); # Assemble substrings
                print S; # Print
        }' positions.txt inputfile

NR==FNR is an old trick. NR is the total cumulative number of lines, while FNR is the line number in the current file. The two are equal only while awk is processing its first file.
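
To see the two counters side by side, here is a small sketch that prints NR and FNR for every line of two throwaway files. The file contents are made up, and mktemp is assumed to be available:

```shell
# Two small temporary files (names come from mktemp; contents are made up).
f1=$(mktemp); f2=$(mktemp)
printf 'a\nb\n' > "$f1"
printf 'c\nd\n' > "$f2"

# Print NR, FNR and the line itself for every record of both files.
result=$(awk '{ print NR, FNR, $0 }' "$f1" "$f2")
echo "$result"
rm -f "$f1" "$f2"
```

NR keeps counting across files (1 to 4) while FNR restarts at 1 for the second file, so NR==FNR holds only for the lines of the first file.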


Thanks a TON Corona!
Could you please explain the following part of your code to me:

 POS[++P]=$1+0
 

Once again thank you very much!

++P is the pre-increment operator, which increments the variable before it's used. Which means it goes POS[1], POS[2], POS[3], ...

If I'd used the post-increment operator, P++, it would do POS[""], POS[1], POS[2], ... because unset variables are blank strings.

$1+0 is to make sure awk stores it as a number, not a string. Doing any arithmetic on a string converts it into a number. Might not be necessary here.
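
Both points can be tried in isolation. This is only a sketch; the values "a", "b" and "4abc" are made up for the demonstration:

```shell
# ++P increments first, so the array is filled from index 1 upward,
# which is what the "for (N=1; N in POS; N++)" loop expects.
filled=$(awk 'BEGIN { POS[++P] = "a"; POS[++P] = "b"; for (N = 1; N in POS; N++) print N, POS[N] }')
echo "$filled"

# Arithmetic forces a numeric value: only the leading numeric part survives.
coerced=$(awk 'BEGIN { x = "4abc"; print x + 0 }')
echo "$coerced"   # 4
```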


Got it! Just one more quick question, how could I change the output field separator for substr from "" to " " ? In other words, how can I modify your script so I can generate the following output:

 >ID_1
 A C A G
 >ID_2
 A C A G
 >ID_3
 A C A G
 >ID_4
 A C A G
 

Interesting: shouldn't the increment operator P++ immediately cast to an integer, i.e. give 0?
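
A quick check, assuming a POSIX-style awk such as gawk or mawk, suggests that it does: the value of P++ is numeric, so the first subscript comes out as 0 rather than as an empty string.

```shell
# Use P++ on an unset variable as a subscript, then list the subscripts.
result=$(awk 'BEGIN { POS[P++] = "x"; for (k in POS) print "[" k "]" }')
echo "$result"    # [0]
```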


Because the output is assembled in a variable there is no simple OFS option.
Two solutions:

  1. with a separator variable
                S=sep=""
                for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" " } # Assemble substrings
  2. with an embedded conditional expression
                S=""
                for(N=1; N in POS; N++) S=S (S=="" ? S : " ") substr($0, POS[N], 1) # Assemble substrings

I guess I am doing something wrong, because only the headers are printed.
I can modify the file using sed, but I would really like to get a feel for how to do it with awk.

Mind showing us WHAT you're doing wrong?

Rudy,
I was referring to MadeinGermany's post.

For the moment I have solved the problem with sed:

 sed '/^>/!s/A/A\t/g'
 

You have joined the two lines that I gave.
If you do that, you need a semicolon between them.
And of course you still need the print S.

Got it!

awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=""; for(N=1; N in POS; N++) S=S (S=="" ? S : " " ) substr($0, POS[N], 1); print S; }' positions.txt test.txt
 
 >ID_1
 G C A G
 >ID_2
 A C A G
 >ID_3
 T C A G
 >ID_4
 C C A G
 

I cannot make this work though

 awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=sep=""; for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); print S; sep=" "; }}' positions.txt test.txt
 
  
 >ID_1
 G
 G C
 G C A
 G C A G
 >ID_2
 A
 A C
 A C A
 A C A G
 >ID_3
 T
 T C
 T C A
 T C A G
 >ID_4
 C
 C C
 C C A
 C C A G
 

The latter has the print S inside the loop; it needs to come after the loop:

awk ' NR==FNR { POS[++P]=$1+0; next } /^>/ { print; next } { S=sep=""; for (N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" "; } print S; }' positions.txt test.txt
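
For anyone who wants to try this end to end, here is a self-contained sketch that recreates the first two records of the sample file and positions.txt from the first post in temporary files. The shell variable names seqs, pos and result are illustrative, and mktemp is assumed to be available:

```shell
# Recreate the sample data from the first post in temporary files.
seqs=$(mktemp); pos=$(mktemp)
printf '>ID_1\nATGCATGC\n>ID_2\nATGCATGC\n' > "$seqs"
printf '1\n4\n5\n7\n' > "$pos"

result=$(awk '
    NR==FNR { POS[++P] = $1 + 0; next }    # first file: load the positions
    /^>/    { print; next }                # header lines: print unchanged
    {
        S = sep = ""
        for (N = 1; N in POS; N++) {
            S = S sep substr($0, POS[N], 1)
            sep = " "                      # empty only before the first character
        }
        print S                            # print once, after the loop
    }' "$pos" "$seqs")
echo "$result"
rm -f "$seqs" "$pos"
```

With positions 1, 4, 5 and 7, each ATGCATGC sequence line comes out as A C A G.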

Got it!
Thanks!