I have an awk parser that works great if the data is NC_0000 (four digits), but if it is not, the data is parsed incorrectly. I'm not sure of the most efficient way to obtain the desired output.
The desired output format is listed below and is always that way. Thank you :).
parse rules:
4 zeros after the NC_ (not always the case) and the digits before the .
digits after the g. repeated twice separated by a tab
letter before the >
letter after the >
[MOD] As has been stated many times before, PLEASE use CODE tags when displaying sample input and output as well as when displaying code segments.
Yes, the first line is a header, so FNR>1 is used to skip it. I attached the input file that contains the data to be parsed. The issue with the parser as it stands is that the line in bold causes an error in a Perl script I use later. Line 1 needs to look like line 3 in order for it to be used, and I am not sure how to do this. Thank you :).
NC_000004.11:g.41749507G>T
NC_000013.10:g.20763466G>A
NC_00001.10:g.20763477C>G
04 41749507 41749507 G T
13 20763466 20763466 G A
1 20763477 20763477 C G
One awk feature is that when you perform arithmetic on a field, it uses only the leading digits and drops everything from the first non-digit onward. So $4+0 would yield the desired number regardless of its length, and a sub($4+0, "", $4) would then leave the trailing character in $4.
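To illustrate the numeric coercion described above, here is a minimal sketch run against two of the sample lines from this thread. The function name parse_variants is hypothetical, and the sketch assumes the input has no header line. Unlike the sub($4+0, "", $4) suggestion, it strips the digits from a copy of the field with a regex, so $4 itself is left untouched.

```shell
# parse_variants is a hypothetical helper name, not part of the original thread.
# It reads variant strings such as NC_000004.11:g.41749507G>T from stdin or files.
parse_variants() {
    awk -F'[_.>]' -v OFS='\t' '{
        chrom = $2 + 0              # arithmetic drops leading zeros: "000004" becomes 4
        pos   = $4 + 0              # only leading digits are used: "41749507G" becomes 41749507
        ref   = $4
        sub(/^[0-9]+/, "", ref)     # remove the digits from the copy, leaving the letter before ">"
        print chrom, pos, pos, ref, $5
    }' "$@"
}

# Assumed sample input, taken from the lines posted above (no header line).
printf '%s\n' 'NC_000004.11:g.41749507G>T' 'NC_00001.10:g.20763477C>G' | parse_variants
```

Because both the chromosome and the position go through +0, the result should not depend on how many zeros follow NC_.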
If I do the below, the format is incorrect, presumably because of the header in the input file.
awk -F"[_.>]" '{a=length($4);b=substr($4,1,a-1);print $2+0,b,b,substr($4,a),$5}' OFS="\t" Target.txt
0
4004 244 244 G A NC
3924 288 288 C A NC
3924 385 385 G A NC
However, the below gives an error, I think because of the FNR>1, but I'm not sure. Thank you :).
My guess is that it parses out the NC_004004.4 incorrectly. If I use a file with just the variants in it (no header), it works fine. The problem is that the input normally has a header that needs to be skipped. I thought I had it or was close, but it errors and I'm not that good at debugging yet. Thank you :).
awk -F"[_.>]" '{X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Test.txt
13 20763477 20763477 C T
4 41749507 41749507 G T
4 41749410 41749410 C T
awk -F"[_.>\t]" 'FNR>1 {X=$4+0; sub(X, "", $4); print $2+0, X, X, $4, $5}' OFS="\t" /tmp/Target.txt
4004 244 244 G A
3924 288 288 C A
3924 385 385 G A
for i in 1 2 3 4 5; do awk -F"[_.>\t]+" 'FNR>1 {N=(set-1)*5; X=$(N+4)+0; sub(X, "", $(N+4)); print $(N+2)+0, X, X, $(N+4), $(N+5)}' OFS="\t" set=$i /tmp/GJB-1_position.txt~; done
4004 79 79 G A
13 20763642 20763642 C T
4004 79 79 G A
5266354 79 79 G A
5266355 79 79 G A