Xterra
April 18, 2017, 2:47pm
1
I have a file that looks like this:
>ID_1
ATGCATGC
>ID_2
ATGCATGC
>ID_3
ATGCATGC
>ID_4
ATGCATGC
And I am using the following script to "extract" specific positions from the sequences:
awk '/^>/{id=$0; next}{ print id "\n" substr( $1,1,1 ) substr ($1,4,2 ) substr ($1,7,1) }' test.txt
It actually works, but I suspect it is the wrong way to use substr. This is the output:
>ID_1
ACAG
>ID_2
ACAG
>ID_3
ACAG
>ID_4
ACAG
Ideally, what I would like to do is use a file, positions.txt, containing the sites I would like to extract:
1
4
5
7
I would appreciate if anyone can point me in the right direction.
Thanks in advance!
Actually -- I see nothing wrong. That's how strings, substr, concatenation, and variables work in awk.
Anyway, the code you wanted:
awk ' NR==FNR { POS[++P]=$1+0 ; next } # Load into array POS while in file 1
/^>/ { print ; next } # Print IDs immediately
{
S="";
for(N=1; N in POS; N++) S=S substr($0, POS[N], 1); # Assemble substrings
print S; # Print
}' positions.txt inputfile
NR==FNR is an old trick. NR is the total cumulative number of lines, while FNR is the line number in the current file. The two are equal only while awk is processing its first file.
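To see the two counters side by side, here is a small sketch. The file names a.txt and b.txt and their contents are made up for illustration; each output line shows the input line, NR, FNR, and whether NR==FNR holds:

```shell
# Two throwaway input files (hypothetical names) in a temp directory.
dir=$(mktemp -d)
printf '1\n4\n' > "$dir/a.txt"
printf 'x\ny\nz\n' > "$dir/b.txt"

# NR keeps counting across files, FNR restarts at 1 for each file.
nr_demo=$(awk '{ print $0, NR, FNR, (NR==FNR ? "first" : "later") }' "$dir/a.txt" "$dir/b.txt")
printf '%s\n' "$nr_demo"
# 1 1 1 first
# 4 2 2 first
# x 3 1 later
# y 4 2 later
# z 5 3 later

rm -rf "$dir"
```

One caveat: if the first file is empty, NR==FNR is also true while awk reads the second file, so the trick assumes the first file has at least one line.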
Xterra
April 18, 2017, 3:38pm
3
Thanks a TON Corona!
Could you please explain the following part of your code to me:
POS[++P]=$1+0
Once again thank you very much!
++P is the pre-increment operator, which increments the variable before it's used. Which means it goes POS[1], POS[2], POS[3], ...
If I'd used the post-increment operator, P++, it would do POS[""], POS[1], POS[2], ... because unset variables are blank strings.
$1+0 is to make sure awk stores it as a number, not a string. Doing any arithmetic on a string converts it into a number. Might not be necessary here.
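As a rough illustration of what the loading step leaves behind, this sketch feeds the four positions from the first post on standard input and dumps the array in an END block (the END block is only here for inspection, it is not part of the script above):

```shell
# Load the positions with POS[++P]=$1+0, then list the array in index order.
pos_demo=$(printf '1\n4\n5\n7\n' | awk '
{ POS[++P] = $1 + 0 }             # ++P yields 1, 2, 3, ... as the index
END {
    for (N = 1; N in POS; N++)    # same loop shape as in the script
        print "POS[" N "]=" POS[N]
}')
printf '%s\n' "$pos_demo"
# POS[1]=1
# POS[2]=4
# POS[3]=5
# POS[4]=7
```

Because the indices start at 1 and have no gaps, the for(N=1; N in POS; N++) loop visits every stored position and stops at the first missing index.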
Xterra
April 18, 2017, 3:56pm
5
Got it! Just one more quick question: how could I change the output field separator for substr from "" to " "? In other words, how can I modify your script so I can generate the following output:
>ID_1
A C A G
>ID_2
A C A G
>ID_3
A C A G
>ID_4
A C A G
Interesting: shouldn't the increment operator P++ immediately cast to a number, i.e. give 0?
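A quick way to check this is the sketch below; the array name A and its values are made up. POSIX specifies that the result of a post-increment is numeric, so with an unset P the first subscript should come out as 0 rather than the empty string:

```shell
# Use P++ as a subscript on an unset variable and list the resulting indices.
postinc_demo=$(awk 'BEGIN {
    A[P++] = "first"                      # P is unset, so P++ yields 0
    A[P++] = "second"                     # now P++ yields 1
    for (k = 0; k in A; k++)
        print "A[" k "]=" A[k]
    print "empty index present:", (("" in A) ? "yes" : "no")
}')
printf '%s\n' "$postinc_demo"
# A[0]=first
# A[1]=second
# empty index present: no
```

So with P++ the array would start at POS[0], and a loop that begins with N=1 would skip the first stored position.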
---------- Post updated at 15:42 ---------- Previous update was at 14:56 ----------
Because the output is assembled in a variable there is no simple OFS option.
Two solutions,
with a separator variable
S=sep=""
for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" " }# Assemble substrings
with an embedded if clause
S=""
for(N=1; N in POS; N++) S=S (S=="" ? S : " ") substr($0, POS[N], 1) # Assemble substrings
Xterra
April 18, 2017, 6:49pm
7
I guess I am doing something wrong, because I am only printing the headers.
I can modify the file using sed, but I would really like to get a feel for how to do it with awk.
RudiC
April 19, 2017, 3:25am
8
Mind showing us WHAT you're doing wrong?
Xterra
April 19, 2017, 7:33am
9
Rudy
I was referring to MadeinGermany's post:
Because the output is assembled in a variable there is no simple OFS option.
Two solutions,
with a separator variable
S=sep="" for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" " }# Assemble substrings
with an embedded if clause
S="" for(N=1; N in POS; N++) S=S (S=="" ? S : " ") substr($0, POS[N], 1) # Assemble substrings
For now I have worked around the problem with sed:
sed '/^>/!s/A/A\t/g'
You have joined the two lines that I gave you.
If you do that, you need a semicolon between them.
And of course you still need the print S.
Xterra
April 19, 2017, 8:38am
11
Got it!
awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=""; for(N=1; N in POS; N++) S=S (S=="" ? S : " " ) substr($0, POS[N], 1); print S; }' positions.txt test.txt
>ID_1
G C A G
>ID_2
A C A G
>ID_3
T C A G
>ID_4
C C A G
I cannot make this work though
awk ' NR==FNR { POS[++P]=$1+0 ; next } /^>/ { print ; next } { S=sep=""; for(N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); print S; sep=" "; }}' positions.txt test.txt
>ID_1
G
G C
G C A
G C A G
>ID_2
A
A C
A C A
A C A G
>ID_3
T
T C
T C A
T C A G
>ID_4
C
C C
C C A
C C A G
The latter has the print S within the loop; it needs to come after the loop:
awk ' NR==FNR { POS[++P]=$1+0; next } /^>/ { print; next } { S=sep=""; for (N=1; N in POS; N++) { S=S sep substr($0, POS[N], 1); sep=" "; } print S; }' positions.txt test.txt
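For anyone who wants to reproduce this from scratch, here is a self-contained sketch of the same command spread over several lines. It recreates a shortened test.txt (two records) and positions.txt from the first post in a temporary directory; the temp directory and the variable name extracted are only there for the demonstration:

```shell
# Recreate the sample inputs from the first post in a temp directory.
dir=$(mktemp -d)
printf '>ID_1\nATGCATGC\n>ID_2\nATGCATGC\n' > "$dir/test.txt"
printf '1\n4\n5\n7\n' > "$dir/positions.txt"

extracted=$(awk '
NR==FNR { POS[++P] = $1 + 0; next }   # first file: load the positions
/^>/    { print; next }               # header lines pass through unchanged
{
    S = ""; sep = ""
    for (N = 1; N in POS; N++) {
        S = S sep substr($0, POS[N], 1)   # append one character per position
        sep = " "                         # separator is used from the 2nd item on
    }
    print S                               # print once, after the loop
}' "$dir/positions.txt" "$dir/test.txt")
printf '%s\n' "$extracted"
# >ID_1
# A C A G
# >ID_2
# A C A G

rm -rf "$dir"
```

The order of the file arguments matters: positions.txt has to come first so that the NR==FNR block loads it before any sequence line is processed.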