More efficient awk parser

cmccabe · March 12, 2015, 11:40am

I have an awk parser that works great if the data starts with NC_0000 (four zeros), but if it does not, the data is not parsed correctly. I'm not sure of the most efficient way to obtain the desired output. Thank you :).

Code:

 awk 'FNR > 1 && match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' ${id}.txt > ${id}_parse.txt

For example:

NC_000013.10:g.20763466G>A

or

NC_00001.10:g.20763477C>G

would be parsed into the desired output of

13 20763466 20763466 G A

or

1 20763477 20763477 C G

,

but

NC_000004.11:g.41749507G>T

would not work. The desired output format is listed below and is always that way. Thank you :).
parse rules:

4 zeros after the NC_ (not always the case) and the digits before the .

digits after the g. repeated twice separated by a tab

letter before the >

letter after the >
[MOD]As has been stated many times before, PLEASE use CODE tags when displaying sample input and output as well as when displaying code segments.

balajesuri · March 12, 2015, 12:19pm

It works for me:

[user@host ~]$ cat file
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
[user@host ~]$ awk 'match($0, /NC_0000([0-9]*)\..*g\.([0-9]+)(.)>(.)/, a){ print a[1], a[2], a[2], a[3], a[4] }' OFS='\t' file
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G
[user@host ~]$

By the way, why do you have FNR>1 ; do you want to skip the first line? Does the first line have NC_000004.... ?

cmccabe · March 12, 2015, 12:36pm

Yes, the first line is a header, so FNR>1 is used to skip it. I attached the input file that contains the data to be parsed. The issue with the parser as it stands is that the line in bold causes an error in a Perl script I use later. Line 1 needs to look like line 3 in order to be usable, and I am not sure how to do this. Thank you :).

 
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
 
 
04      41749507        41749507        G       T
13      20763466        20763466        G       A
1       20763477        20763477        C       G

RudiC · March 12, 2015, 2:44pm

Did you try to print a[1]+0, ... ?
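A minimal sketch of what the + 0 does, assuming a POSIX awk; the function name strip_zeros is made up for illustration. Adding 0 forces awk to treat the captured string as a number, so a leading zero disappears when it is printed:

```shell
# Hypothetical helper: print the first field of each input line as a number.
strip_zeros() {
  awk '{ print $1 + 0 }'
}

# The three values captured as a[1] in the output shown earlier in the thread.
printf '04\n13\n1\n' | strip_zeros
# prints:
# 4
# 13
# 1
```

In the original command this would mean printing a[1]+0 in place of a[1] and leaving the rest unchanged.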

cmccabe · March 12, 2015, 6:45pm

Since the digits after the g. might also vary:

 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt

would this skip the header row and parse the third column? Thanks.

RudiC · March 13, 2015, 8:09am

Did you test that? What was the result?

One awk feature is that when you perform arithmetic on a field, it uses the leading digits only, dropping everything after the first non-digit. So $4+0 would yield the desired number regardless of its length, and a sub($4+0, "", $4) would give the trailing char.
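The two steps can be sketched as follows, assuming a POSIX awk and a field of the form digits followed by one letter; the function name split_pos_base and the variable names pos and base are illustrative only. The field is copied before sub() is applied so the original value stays intact:

```shell
# Hypothetical helper: split a value such as 41749507G into position and base.
split_pos_base() {
  awk -v OFS='\t' '{
    pos = $1 + 0          # arithmetic keeps only the leading digits
    base = $1
    sub(pos, "", base)    # delete those digits; the trailing letter remains
    print pos, base
  }'
}

printf '41749507G\n20763477C\n' | split_pos_base
```

This should print each position and its base separated by a tab, whatever the number of digits in the position.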

cmccabe · March 13, 2015, 10:25am

If I do the below, the format is incorrect, presumably because of the header in the input file.

 awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Target.txt
0
4004    244     244     G       A               NC
3924    288     288     C       A               NC
3924    385     385     G       A               NC

However, the below gives an error, I think because of the 'FNR > 1 , but I'm not sure. Thank you :).

 awk -F"[_.>]" 'FNR > 1 '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" ${id}.txt > ${id}_parse.txt 
-bash: syntax error near unexpected token `('

RudiC · March 13, 2015, 11:05am

Definitely not. There's a single quote too many.
Where does the NC last field come from?

vgersh99 · March 13, 2015, 11:11am

lose the ' following FNR > 1

awk -F'[_.>]' 'FNR > 1 {a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS='\t' ${id}.txt > ${id}_parse.txt
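For reference, a sketch of the corrected quoting in isolation, assuming a POSIX awk and reading from standard input rather than ${id}.txt; the function name parse_variants and the header text are hypothetical. The whole awk program, pattern and action together, sits inside one pair of single quotes, so the shell never sees the parentheses. OFS is passed with -v here instead of as a trailing assignment, which is a swapped-in choice and not part of the command above:

```shell
# Hypothetical wrapper around the corrected one-liner.
parse_variants() {
  awk -F'[_.>]' -v OFS='\t' 'FNR > 1 {
    a = length($4)              # $4 looks like 41749507G
    b = substr($4, 1, a - 1)    # position: everything except the last character
    print $2 + 0, b, b, substr($4, a), $5
  }'
}

# One header line, then two variants taken from earlier in the thread.
printf 'header\nNC_000004.11:g.41749507G>T\nNC_000013.10:g.20763466G>A\n' | parse_variants
```

The header line is skipped by FNR > 1, and $2 + 0 turns 000004 into 4.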

cmccabe · March 13, 2015, 11:15am

My guess is that it parses out the NC_004004.4 incorrectly. If I use a file with just the variants in it (no header) it works fine. The problem is that the input normally has a header that needs to be skipped. I thought I had it or was close, but it errors and I'm not that good at debugging yet. Thank you :).

 
awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Test.txt > output.txt

cmccabe · March 13, 2015, 11:24am

I attached the output of the command, which runs, but doesn't look right. Thank you :).

 awk -F'[_.>]' 'FNR > 1 {a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS='\t' ${id}.txt > ${id}_parse.txt

 Desired Output
13 20763477 20763477 C T
4 41749507 41749507 G T
4 41749410 41749410 C T

RudiC · March 13, 2015, 5:15pm

Try

awk -F"[_.>]" '{X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Test.txt 
13      20763477        20763477        C       T
4       41749507        41749507        G       T
4       41749410        41749410        C       T

awk -F"[_.>\t]" 'FNR>1 {X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Target.txt 
4004    244     244     G       A
3924    288     288     C       A
3924    385     385     G       A

cmccabe · March 14, 2015, 11:46am

The output of the second awk skips the header but the first awk has the desired output.

The input will change each time so to represent this ${id}_position.txt is used. An example of the input file is attached. I tried:

 awk -F"[_.>]" 'FNR > 1 {X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

but that didn't work.

Output of script

 4004	79	79	G	A		NC

Desired Output

 13     20763642     20763642     C     T

Thank you very much and have a nice weekend:).

cmccabe · March 14, 2015, 2:22pm

I modified the code a bit and it works perfectly. Thank you for your help.

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4])-1);print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

As you mentioned in a post, what if a > is not present? How would that be parsed? I am encountering that more now, but will post in a new thread. Thank you.

RudiC · March 14, 2015, 2:46pm

Provided your data strictly follows the pattern above, you could select sets by number:

awk -F"[_.>\t]+" 'FNR>1 {N=(set-1)*5; X=$(N+4)+0; sub(X, "", $(N+4)); print $(N+2)+0, X, X, $(N+4), $(N+5)}' OFS="\t" set=5 /tmp/GJB-1_position.txt 
5266355    79    79    G    A

for i in 1 2 3 4 5; do awk -F"[_.>\t]+" 'FNR>1 {N=(set-1)*5; X=$(N+4)+0; sub(X, "", $(N+4)); print $(N+2)+0, X, X, $(N+4), $(N+5)}' OFS="\t" set=$i /tmp/GJB-1_position.txt~; done
4004    79    79    G    A
13    20763642    20763642    C    T
4004    79    79    G    A
5266354    79    79    G    A
5266355    79    79    G    A
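The set arithmetic can be tried without the attached file. The sketch below assumes a POSIX awk and uses made-up sample data (a header row, then one row holding two tab-separated variants); the function name pick_set and the header labels are hypothetical. With the separator class `[_.>\t]+`, each variant contributes five fields, so set number n starts at field (n-1)*5 + 1:

```shell
# Made-up sample: a header row, then one row with two tab-separated variants.
sample=$(printf 'variant1\tvariant2\nNC_004004.4:c.79G>A\tNC_000013.10:g.20763642C>T\n')

# Hypothetical helper: $1 is the set (variant column) to extract, counting from 1.
pick_set() {
  printf '%s\n' "$sample" |
    awk -F'[_.>\t]+' -v OFS='\t' -v set="$1" 'FNR > 1 {
      N = (set - 1) * 5        # each variant contributes 5 fields
      X = $(N + 4) + 0         # leading digits of a field such as 20763642C
      sub(X, "", $(N + 4))     # what remains is the letter before the >
      print $(N + 2) + 0, X, X, $(N + 4), $(N + 5)
    }'
}

pick_set 1
pick_set 2
```

With this sample, set 1 should give 4004 79 79 G A and set 2 should give 13 20763642 20763642 C T, tab-separated.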