Using awk to parse multiple conditions

There are 4 ways the user can input data, and unfortunately the parse rules for each are slightly different. The first condition works great, and the input file for the second condition is attached. Conditions 3 and 4 will follow; I'm sure I will have trouble with them and need help as well.

I apologize for the long post, but I wanted to provide all the details. Thank you :).

The code below parses condition 1 perfectly:

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 
 
1. c.79G>A
parse rules:
1 the digits between the NC_ and the first . with the leading zeros removed (often four zeros, but not always)
2 the digits after the g. (used for both position columns)
3 letter before the >
4 letter after the >
Desired Output:  13     20763642     20763642     C     T
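
For what it's worth, the four rules above can be sketched step by step as below. This is only a sketch under assumptions: it assumes the string always looks like NC_000013.10:g.20763642C>T, the helper name parse_snp is made up, and it takes the string as an argument instead of reading ${id}_position.txt.

```shell
# parse_snp is a hypothetical helper, not part of the original post.
# $1 = a substitution string such as NC_000013.10:g.20763642C>T
parse_snp() {
  printf '%s\n' "$1" | awk -v OFS='\t' '{
    split($1, a, "[_.>]")                 # NC | 000013 | 10:g | 20763642C | T
    n   = length(a[4])
    pos = substr(a[4], 1, n - 1)          # rule 2: everything except the last character
    ref = substr(a[4], n)                 # rule 3: last character, the letter before the >
    print a[2] + 0, pos, pos, ref, a[5]   # rule 1: adding 0 drops the leading zeros
  }'
}

snp_out=$(parse_snp 'NC_000013.10:g.20763642C>T')
printf '%s\n' "$snp_out"
```

Note that the length is taken first and 1 is subtracted afterwards, i.e. length(a[4]) - 1.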

2. c.35delG
parse rules:
1 the digits between the NC_ and the first . with the leading zeros removed (often four zeros, but not always)
2 the digits after the g. (used for both position columns)
3 letter after the del
4 a "-" in the last column
Desired Output:  13     20763686     20763686     C     -
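
A possible sketch for the deletion rules, again under assumptions: the string is assumed to look like NC_000013.10:g.20763686delC with a single deleted letter, and parse_del is a made-up helper name. The separator is written as an alternation so that del counts as one separator, and the "-" is printed as a plain string.

```shell
# parse_del is a hypothetical helper, not part of the original post.
# $1 = a single-base deletion such as NC_000013.10:g.20763686delC
parse_del() {
  printf '%s\n' "$1" | awk -v OFS='\t' '{
    split($1, a, /[_.]|del/)              # NC | 000013 | 10:g | 20763686 | C
    print a[2] + 0, a[4], a[4], a[5], "-" # "-" is a string literal, not an array element
  }'
}

del_out=$(parse_del 'NC_000013.10:g.20763686delC')
printf '%s\n' "$del_out"
```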

3. c.575_576delCA
4. c.34_35delGGinsT

Where does the C T or C - come from?

The C T comes from ${id}_position.txt, which is parsed by the awk in the post.

The C - also comes from ${id}_position.txt. In the field being parsed, NC_000013.10:g.20763686delC, the del is what produces the C then -: the letter after the del goes first, and a "-" is used in the second position. The attached file has this in it as well. Thank you :).

I don't know if something like the below would work. Also, how does awk know which parser to use?

 awk 'NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 

There are two conditions: the first has a > and the second has a del in it. This code:

 awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 

will parse the >, but not the del. So do I need some identifier for the correct parser to be used? Thank you :).
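
One way awk can pick the parser by itself is to put a pattern in front of each action, so a line only runs the block whose pattern matches field 2. The sketch below is an illustration under assumptions, not a tested solution for the real file: the layout (one header line, then the genomic string in column 2) and the contents of column 1 are guesses, and it handles every line after the header rather than only NR==2.

```shell
# Assumed input layout: header line, then column 2 holds the genomic string.
tmp=$(mktemp)
printf '%s\n' \
  'Input Genomic' \
  'c.79G>A NC_000013.10:g.20763642C>T' \
  'c.35delG NC_000013.10:g.20763686delC' > "$tmp"

dispatch_out=$(awk -v OFS='\t' '
  NR == 1 { next }                        # skip the assumed header line
  $2 ~ />/ {                              # substitution: field 2 contains a >
    split($2, a, "[_.>]")
    n = length(a[4])
    pos = substr(a[4], 1, n - 1)
    print a[2] + 0, pos, pos, substr(a[4], n), a[5]
    next
  }
  $2 ~ /del/ {                            # deletion: field 2 contains del
    split($2, a, /[_.]|del/)
    print a[2] + 0, a[4], a[4], a[5], "-"
    next
  }
' "$tmp")
rm -f "$tmp"
printf '%s\n' "$dispatch_out"
```

A string such as c.34_35delGGinsT also contains del, so conditions 3 and 4 would need their own, more specific patterns placed before the del rule. They are not attempted here because their desired output has not been given.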

Maybe:

 echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt

echo 'del' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt 
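
A note on the echo idea: when awk is given a file name as an argument, it reads that file and never looks at what is piped in, so the echoed text cannot select anything. A small demonstration, using a throwaway temp file that is only an assumption for the example:

```shell
tmp=$(mktemp)
printf 'line from file\n' > "$tmp"

# awk has a file operand, so the piped text is never read
piped_out=$(echo 'del' | awk '{ print }' "$tmp")
rm -f "$tmp"
printf '%s\n' "$piped_out"
```

Only the line from the file is printed; the echoed del never reaches the awk program.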

I tried the below code on the file attached.

 echo 'del' | awk 'NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[-]}' OFS="\t" del_position.txt >del_parse.txt

awk: cmd. line:1: NR==2 {split($2,a,"[_.del]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[-]}
awk: cmd. line:1:                                                                                                            ^ syntax error
awk: cmd. line:1: error: invalid subscript expression
 Desired Output: 
 13     20763686     20763686     C     - 

Thank you :).

couple of questions:

  1. what do you think this will do? split($2,a,"[_.del]")
  2. what do you think this will do? length(a[4]-1)
  3. what's the meaning of this? a[-]

Here is what I am trying to do, along with my attempt. Thank you :).

If there is a ">" in the field then use the first code, but if there is a "del" in the field then use the second code.
Example:
NC_000013.10:g.20763642C>T - uses code 1
NC_000013.10:g.20763686delC - uses code 2

  1. split($2,a,"[_.del]") - split on the _ . del
  2. length(a[4]-1) - capture all field 3 digits
  3. a[-] - typo for -[5] - put a "-" in the field

 echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt   # SNP

echo 'del' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),-[5]}' OFS="\t" ${id}_position.txt >${id}_parse.txt   # Deletion 

Thank you :).

no, this will split $2 by EITHER '_', or '.' or 'd' or 'e' or 'l' (single characters)
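
To see the difference, here is a small sketch that prints the pieces split() produces for each separator. The helper name show_fields is made up for the illustration; the sample string is the one from the thread.

```shell
# show_fields is a hypothetical helper: $1 = string, $2 = separator regex.
# It prints the number of pieces, then each piece in square brackets.
show_fields() {
  awk -v s="$1" -v re="$2" 'BEGIN {
    n = split(s, a, re)
    out = n ":"
    for (i = 1; i <= n; i++) out = out " [" a[i] "]"
    print out
  }'
}

class_out=$(show_fields 'NC_000013.10:g.20763686delC' '[_.del]')
alt_out=$(show_fields 'NC_000013.10:g.20763686delC' '[_.]|del')
printf '%s\n%s\n' "$class_out" "$alt_out"
```

With the bracket expression, d, e and l are each a separator, which leaves two empty pieces and pushes the C into a[7]. With the alternation [_.]|del the whole word del is one separator and the C lands in a[5].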

"Field 3"? It looks like you're using a[4]. Where does "field 3" come into play?
if field 4 of an array 'a' has a value of 15, what do you think the following should return? length(a[4]-1)
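
A quick way to check is to print both forms side by side. In the sketch below the sample values are only illustrations: the first is the 15 from the question, the last is the piece the posted code actually gets from NC_000013.10:g.20763642C>T.

```shell
len_demo=$(awk 'BEGIN {
  n = split("15 100C 20763642C", v, " ")
  for (i = 1; i <= n; i++)
    # value, length(value - 1), length(value) - 1
    print v[i], length(v[i] - 1), length(v[i]) - 1
}')
printf '%s\n' "$len_demo"
```

length(a[4]-1) subtracts 1 from the numeric value and then counts the digits of the result, while length(a[4])-1 counts the characters and then drops one. For 20763642C both happen to give 8, which is why the posted code looks correct, but for 100C they give 2 and 3.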

Regarding point 3, a[-] (typo -[5] - put a "-" in field):
Hmm... I don't understand what's being said here.
Thank you

Would the below work?

 
echo '>' | awk 'NR==2 {split($2,a,"[_.>]");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),a[5]}' OFS="\t" ${id}_position.txt > ${id}_parse.txt   # SNP

echo 'del' | awk 'NR==2 {split($2,a,"[_.'del']");b=substr(a[4],1,length(a[4]-1));print a[2]+0,b,b,substr(a[4],length(a[4])),-[5]}' OFS="\t" ${id}_position.txt >${id}_parse.txt   # Deletion 

  1. split on the > or the del
  2. return all digits after the g. regardless of length, so whether it was 15 or 1025, all the digits are returned
  3. whatever is split on (the > or the del) defines the parser to use: if there is a ">" in the field then the first code is used, and if there is a "del" in the field then the second code is used.

Thank you :).

---------- Post updated 03-17-15 at 09:00 AM ---------- Previous update was 03-16-15 at 04:08 PM ----------

Does this make sense or is there a better way to do this? Thank you :).

I'd like to ask moderators to close this thread.
Thank you.