Using awk to parse multiple conditions

cmccabe · March 14, 2015, 2:43pm

There are 4 ways the user can input data, and unfortunately the parse rules for each are slightly different. The first condition works great, and the input file for the second condition is attached. Conditions 3 and 4 will follow; I'm sure I will have trouble with them and need help as well. I apologize for the long post, but I just wanted to provide all the details. Thank you :).

The code below parses condition 1 perfectly:

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

 
1. c.79G>A
parse rules:
1 the digits between the NC_ and the . without the leading zeros (usually four zeros, but not always)
2 the digits after the g. (printed twice: g.###   g.###)
3 letter before the >
4 letter after the >
Desired Output:  13     20763642     20763642     C     T

2. c.35delG
1 the digits between the NC_ and the . without the leading zeros (usually four zeros, but not always)
2 the digits after the g. (printed twice: g.###   g.###)
3 letter before the del
4 "-" after the del
Desired Output:  13     20763686     20763686     C     -

3. c.575_576delCA 
4. c.34_35delGGinsT
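
To see how the parse rules for condition 1 map onto the awk command above, it can help to print every piece that split() produces. This is only a sketch: show_pieces is a hypothetical helper name, and the input is the condition 1 string quoted later in this thread, fed in directly rather than read from ${id}_position.txt.

```shell
#!/bin/sh
# Sketch only: show_pieces is a hypothetical helper, not part of the original script.
# It prints each piece that split() produces for one input string.
show_pieces() {
    printf '%s\n' "$1" | awk '{
        n = split($0, a, "[_.>]")      # split on "_", "." or ">"
        for (i = 1; i <= n; i++)
            printf "a[%d]=%s\n", i, a[i]
    }'
}

show_pieces 'NC_000013.10:g.20763642C>T'
# a[1]=NC
# a[2]=000013
# a[3]=10:g
# a[4]=20763642C
# a[5]=T
```

Here a[2]+0 turns 000013 into 13, a[4] holds the position with the reference letter still attached, and a[5] is the letter after the >.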

RudiC · March 14, 2015, 2:51pm

Where does the C T or C - come from?

cmccabe · March 14, 2015, 2:58pm

The C and T come from the ${id}_position.txt file, which is parsed by the awk command in the post.

The C - also comes from ${id}_position.txt. In that case the field being parsed, NC_000013.10:g.20763686delC, contains a del: the letter after the del goes first, and a "-" is used in the second position. The attached file has this in it as well. Thank you :).
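
The deletion rule described here can be sketched as follows. This is a minimal illustration, not the thread's accepted answer: parse_del is a hypothetical helper name, the string is passed in directly instead of being read from ${id}_position.txt, and it assumes the field always contains the literal text del followed by the deleted letter(s).

```shell
#!/bin/sh
# Sketch only: parse_del is a hypothetical helper for the "del" form.
parse_del() {
    printf '%s\n' "$1" | awk -v OFS='\t' '{
        split($0, a, "[_.]")           # a[2]="000013", a[4]="20763686delC"
        p   = index(a[4], "del")       # position where the string "del" starts
        pos = substr(a[4], 1, p - 1)   # digits before "del"
        ref = substr(a[4], p + 3)      # letter(s) after "del"
        print a[2] + 0, pos, pos, ref, "-"
    }'
}

parse_del 'NC_000013.10:g.20763686delC'
# prints (tab-separated): 13  20763686  20763686  C  -
```

index() looks for the three-character string del as a whole, so the digits and the letter on either side of it can be cut out with substr().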

cmccabe · March 16, 2015, 2:08pm

I don't know if something like the below would work. Also, how does awk know which parser to use?

 awk 'NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

There are two conditions: the first has a > and the second has a del in it:

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

will parse the >, but not the del. So do I need some identifier for the correct parser to be used? Thank you :).

Maybe:

 echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

echo 'del' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt
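
A note on the two commands above: when awk is given a file name to read, it reads that file and does not look at the text piped in by echo, so the echo cannot select a parser. A common alternative is to let awk test the field itself with a regular expression. The sketch below only classifies a string; classify is a hypothetical helper name and the two inputs are the examples from this thread.

```shell
#!/bin/sh
# Sketch only: classify is a hypothetical helper showing how awk patterns
# can choose an action based on the content of the line itself.
classify() {
    printf '%s\n' "$1" | awk '
        />/   { print "snp";      next }   # contains ">": substitution parser
        /del/ { print "deletion"; next }   # contains "del": deletion parser
              { print "unknown" }          # anything else
    '
}

classify 'NC_000013.10:g.20763642C>T'    # snp
classify 'NC_000013.10:g.20763686delC'   # deletion
```

Conditions 3 and 4 also contain del, so they would presumably need more specific tests placed before the plain /del/ one.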

cmccabe · March 16, 2015, 3:27pm

I tried the below code on the file attached.

 echo 'del' | awk 'NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[-]}' OFS="\t" del_position.txt >del_parse.txt

awk: cmd. line:1: NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[-]}
awk: cmd. line:1:                                                                                                            ^ syntax error
awk: cmd. line:1: error: invalid subscript expression

 Desired Output: 
 13     20763686     20763686     C     -

Thank you :).

vgersh99 · March 16, 2015, 3:58pm

couple of questions:

what do you think this will do? split($2,a,"[_.del]")
what do you think this will do? length(a[4]-1)
what's the meaning of this? a[-]

cmccabe · March 16, 2015, 4:13pm

Here is what I am trying to do and my attempt. Thank you :).

If ">" in field then use first code, but if "del" in the field then use second code.
Example:
NC_000013.10:g.20763642C>T - uses code 1
NC_000013.10:g.20763686delC - uses code 2

split($2,a,"[_.del]") - split on the _ . del
length(a[4]-1) - capture all field 3 digits
a[-] - typo -[5] - put a "-" in field

 echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt   # SNP

echo 'del' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),-[5]}' OFS="\t" ${id}_position.txt >${id}_parse.txt   # Deletion

Thank you :).

vgersh99 · March 16, 2015, 4:27pm

1.  split($2,a,"[_.del]") - split on the _ . del

No, this will split $2 on EITHER '_' or '.' or 'd' or 'e' or 'l' (single characters).

2.  length(a[4]-1) - capture all field 3 digits

Field 3? It looks like you're using a[4]. Where does "field 3" come into play?
If element 4 of array 'a' has a value of 15, what do you think the following should return? length(a[4]-1)

3.  a[-] - typo -[5] - put a "-" in field

Hmm.... I don't understand what's being said here....
Thank you
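
The first two points can be checked directly in awk. This is a small demonstration, not a fix: demo_split_and_length is a hypothetical name, the first string is the deletion example from this thread, and 100C is a made-up value chosen so that the two length expressions differ.

```shell
#!/bin/sh
# Sketch only: demo_split_and_length is a hypothetical helper.
demo_split_and_length() {
    awk 'BEGIN {
        s  = "NC_000013.10:g.20763686delC"
        n1 = split(s, a, "[_.del]")    # bracket expression: any ONE of _ . d e l
        n2 = split(s, b, "[_.]|del")   # alternation: _ or . or the string "del"
        print "bracket expression pieces:", n1
        print "alternation pieces:", n2

        x = "100C"                     # made-up value: digits plus one letter
        # x - 1 is arithmetic: "100C" is read as 100, so x - 1 is 99
        print length(x - 1), length(x) - 1
    }'
}

demo_split_and_length
# bracket expression pieces: 7
# alternation pieces: 5
# 2 3
```

With "[_.del]" the d, e and l each act as a separator, which leaves two empty pieces in the middle. length(a[4]-1) measures the result of a subtraction, while length(a[4])-1 is the string length minus one; the two happen to agree for the 8-digit positions in this thread, but not in general.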


cmccabe · March 17, 2015, 10:00am

Would the below work?

 
echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt   # SNP

echo 'del' | awk 'NR==2 {split($2,a,"[_.'del']");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),-[5]}' OFS="\t" ${id}_position.txt >${id}_parse.txt   # Deletion

split on the > or del
return all digits after g. (regardless of length) so if it was 15 or 1025 all the #'s are returned
depending what is split on (the > or del), defines the parser to use.
So, if there is a ">" in the split then the first code is used
if there is "del" in the split then the second code is used. Thank you :).

---------- Post updated 03-17-15 at 09:00 AM ---------- Previous update was 03-16-15 at 04:08 PM ----------

Does this make sense or is there a better way to do this? Thank you :).
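
One possible way to handle both forms in a single awk call is sketched below, under stated assumptions: parse_variant is a hypothetical name, the two-line sample file is made up to mimic "header on line 1, data on line 2 with the NC_ string in field 2" (the attached file is not shown in this thread), and conditions 3 and 4 are not handled.

```shell
#!/bin/sh
# Sketch only: parse_variant is a hypothetical helper covering conditions 1 and 2.
parse_variant() {
    awk -v OFS='\t' 'NR == 2 {
        split($2, a, "[_.]")                   # a[2]: chromosome, a[4]: "20763642C>T" or "20763686delC"
        chr = a[2] + 0                         # "+ 0" drops the leading zeros
        if (match(a[4], /^[0-9]+/) == 0) next  # no leading position digits: skip
        pos  = substr(a[4], 1, RLENGTH)        # all leading digits, whatever their count
        rest = substr(a[4], RLENGTH + 1)       # "C>T" or "delC"
        if (rest ~ /^[A-Z]>[A-Z]$/) {          # condition 1: substitution
            split(rest, r, ">")
            print chr, pos, pos, r[1], r[2]
        } else if (rest ~ /^del[A-Z]+$/) {     # condition 2: deletion
            print chr, pos, pos, substr(rest, 4), "-"
        }
    }' "$1"
}

# Usage with a made-up sample file
tmp=$(mktemp)
printf 'id\tposition\nsample\tNC_000013.10:g.20763686delC\n' > "$tmp"
parse_variant "$tmp"    # prints (tab-separated): 13  20763686  20763686  C  -
rm -f "$tmp"
```

The position is taken with match() and RLENGTH, so it does not depend on how many digits follow the g., and the text after the digits decides which branch runs.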

vgersh99 · March 17, 2015, 11:21pm

I'd like to ask moderators to close this thread.
Thank you.