More efficient awk parser

I have an awk parser that works great if the data starts with NC_0000 (four zeros), but if it does not, the data is not parsed the way I need. I'm not sure of the most efficient way to obtain the desired output. Thank you :).

Code:

 awk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' ${id}.txt > ${id}_parse.txt 

For example:

NC_000013.10:g.20763466G>A

or

NC_00001.10:g.20763477C>G

would be parsed into the desired output of

13 20763466 20763466 G A

or

1 20763477 20763477 C G

,

but

NC_000004.11:g.41749507G>T

would not work. The desired output format is listed below and is always that way. Thank you :).

Parse rules:

4 zeros after the NC_ (not always the case) and the digits before the .

digits after the g. repeated twice separated by a tab

letter before the >

letter after the >
[MOD]As has been stated many times before, PLEASE use CODE tags when displaying sample input and output as well as when displaying code segments.

It works for me:

[user@host ~]$ cat file
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
[user@host ~]$ awk 'match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' file
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G
[user@host ~]$

By the way, why do you have FNR>1 ; do you want to skip the first line? Does the first line have NC_000004.... ?

Yes, the first line is a header, so FNR>1 is used to skip it. I attached the input file that contains the data to be parsed. The issue with the parser the way it is now is that the first output line (the one starting with 04) causes an error in a Perl script I use later. Line 1 needs to look like line 3, without the leading zero, in order for it to be used, and I am not sure how to do this. Thank you :).

 
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
 
 
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G 

Did you try to print a[1]+0, ... ?

Since the digits after the g. might also vary:

 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt 

would this skip the header row and parse the third column? Thanks.

Did you test that? What was the result?

One awk feature is that, when you perform arithmetic on a field, it uses the leading digits only, dropping everything from the first non-digit onward. So $4+0 would yield the desired number regardless of its length. And a sub ($4+0, "", $4) would give the trailing char.
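A minimal sketch of the numeric-prefix behaviour described above. The helper name split_pos is hypothetical, and the sketch strips the digits with an anchored regex on a copy of the field rather than using the number itself as the pattern; that is a substitution on my part, chosen so the field itself is left untouched.

```shell
# split_pos is a hypothetical helper name used only for this illustration.
split_pos() {
    awk '{
        num = $1 + 0              # arithmetic keeps only the leading digits
        rest = $1
        sub(/^[0-9]+/, "", rest)  # remove the leading digits, keep the tail
        print num, rest
    }'
}

echo '41749507G' | split_pos
```

With the sample value from this thread, the number and the trailing letter come out as two separate fields.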

If I do the below, the format is incorrect, presumably because of the header in the input file.

 awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Target.txt
0
4004    244     244     G       A               NC
3924    288     288     C       A               NC
3924    385     385     G       A               NC 

However, the below gives an error, I think because of the 'FNR > 1, but I'm not sure. Thank you :).

 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt 
-bash: syntax error near unexpected token `(' 

Definitely not. There's a single quote too many.
Where does the NC last field come from?

Lose the ' following FNR > 1:

awk -F'[_.>]' 'FNR > 1 {a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS='\t' ${id}.txt > ${id}_parse.txt 
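To make the quoting point concrete, here is a small sketch that runs the corrected command on a throwaway file. The file contents (a one-word header plus two variants taken from this thread) and the variable names tmp and parsed are assumptions for illustration only, not the poster's real input.

```shell
# Hypothetical sample file: one header line, then one variant per line.
tmp=$(mktemp)
printf '%s\n' 'Variant' 'NC_000004.11:g.41749507G>T' 'NC_000013.10:g.20763466G>A' > "$tmp"

# The pattern (FNR > 1) and the action ({ ... }) sit inside ONE pair of
# single quotes, so the shell hands awk the whole program as a single word.
parsed=$(awk -F'[_.>]' 'FNR > 1 {a=length($4); b=substr($4,1,a-1); print $2+0, b, b, substr($4,a), $5}' OFS='\t' "$tmp")

printf '%s\n' "$parsed"
rm -f "$tmp"
```

With the stray quote, the shell closes the string right after FNR > 1 and then tries to interpret the braces and parentheses itself, which is where the bash syntax error comes from.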

My guess is that it parses out the NC_004004.4 incorrectly. If I use a file with just the variants in it (no header) it works fine. The problem is that the input normally has a header that needs to be skipped. I thought I had it or was close, but it errors and I'm not that good at debugging yet. Thank you :).

 
awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Test.txt > output.txt

I attached the output of the command, which runs, but doesn't look right. Thank you :).

 awk -F'[_.>]' 'FNR > 1 {a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS='\t' ${id}.txt > ${id}_parse.txt 
 Desired Output
13 20763477 20763477 C T
4 41749507 41749507 G T
4 41749410 41749410 C T

Try

awk -F"[_.>]" '{X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Test.txt 
13      20763477        20763477        C       T
4       41749507        41749507        G       T
4       41749410        41749410        C       T

awk -F"[_.>\t]" 'FNR>1 {X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Target.txt 
4004    244     244     G       A
3924    288     288     C       A
3924    385     385     G       A

The output of the second awk skips the header but the first awk has the desired output.

The input will change each time so to represent this ${id}_position.txt is used. An example of the input file is attached. I tried:

 awk -F"[_.>]" 'FNR > 1 {X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 

but that didn't work.

Output of script

 4004	79	79	G	A		NC 

Desired Output

 13     20763642     20763642     C     T 

Thank you very much and have a nice weekend:).

I modified the code a bit and it works perfectly :) thank you for your help.

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4])-1);print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 

As you mentioned in a post: what if a > is not present, how should that be parsed? I am encountering that more now, but will post in a new thread. Thank you :)

Provided your data strictly follows the pattern above, you could select sets by number:

awk -F"[_.>\t]+" 'FNR>1 {N=(set-1)*5; X=$(N+4)+0; sub(X, "", $(N+4)); print $(N+2)+0, X, X, $(N+4), $(N+5)}' OFS="\t" set=5 /tmp/GJB-1_position.txt 
5266355    79    79    G    A
for i in 1 2 3 4 5; do awk -F"[_.>\t]+" 'FNR>1 {N=(set-1)*5; X=$(N+4)+0; sub(X, "", $(N+4)); print $(N+2)+0, X, X, $(N+4), $(N+5)}' OFS="\t" set=$i /tmp/GJB-1_position.txt~; done
4004    79    79    G    A
13    20763642    20763642    C    T
4004    79    79    G    A
5266354    79    79    G    A
5266355    79    79    G    A
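The set arithmetic can be sketched on a tiny input. The helper name pick_set and the two-column sample row are hypothetical (built from variants quoted earlier in this thread, not from the attached file), and the letter is extracted from a copy of the field with an anchored regex, which is my variation on the command above.

```shell
# pick_set is a hypothetical helper; $1 is the number of the set to extract.
pick_set() {
    awk -F'[_.>\t]+' -v OFS='\t' -v set="$1" 'FNR > 1 {
        N = (set - 1) * 5         # each variant splits into five fields
        X = $(N + 4) + 0          # position: leading digits of field 4 in the set
        ref = $(N + 4)
        sub(/^[0-9]+/, "", ref)   # what remains is the letter before ">"
        print $(N + 2) + 0, X, X, ref, $(N + 5)
    }'
}

# Hypothetical input: a header line, then two tab-separated variants.
row=$(printf 'h1\th2\nNC_000013.10:g.20763642C>T\tNC_000004.11:g.41749507G>T')
printf '%s\n' "$row" | pick_set 2
```

Because tab is part of the field separator, every variant on a line contributes exactly five fields (NC, accession digits, version plus :g, position plus letter, final letter), so set k starts right after field (k-1)*5.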