In the following tab-delimited input, I am checking $7 for the keyword intronic. If that keyword is found, $2 is split on the .[ in each line, and if the number after the + or - that follows the leading digits is >10, that line is deleted. This will always be the case for intronic. If $7 is exonic, nothing is done and the next line is processed.
For example, using the first line in input: $7=intronic, so $2, which is c.[433+79A>G]+[433+79A>G], is split on the .[ and the number after the + (79) is >10, so that line is removed.
awk
awk -F'\t' -v OFS='\t' FNR==NR 'if ($7 ==/intronic/) ; {split($2,f2,"[[digits]");a[f2[1]];next} $2 in a' or ; {split($2,f3,"[[digits]");a[f3[1]];next} $2 in a' input
input
Index Mutation Call Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene Sanger
1 c.[433+79A>G]+[433+79A>G] 40556922 40556922 T C intronic PPT1
2 c.[362+8C>T]+[=] 40557656 40557656 G A intronic PPT1
3 c.276-31delG 43396570 43396570 C - intronic SLC2A1
20 c.[5109C>T]+[=] 166245425 166245425 C T exonic SCN2A synonymous SNV
21 c.[5139C>T]+[=] 166848646 166848646 G A exonic SCN1A synonymous SNV
desired output
Index Mutation Call Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene Sanger
2 c.[362+8C>T]+[=] 40557656 40557656 G A intronic PPT1
20 c.[5109C>T]+[=] 166245425 166245425 C T exonic SCN2A synonymous SNV
21 c.[5139C>T]+[=] 166848646 166848646 G A exonic SCN1A synonymous SNV
Index Mutation Call Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene Sanger
2 c.[362+8C>T]+[=] 40557656 40557656 G A intronic PPT1
3 c.276-31delG 43396570 43396570 C - intronic SLC2A1
20 c.[5109C>T]+[=] 166245425 166245425 C T exonic SCN2A synonymous SNV
21 c.[5139C>T]+[=] 166848646 166848646 G A exonic SCN1A synonymous SNV
Edit: Line 3 is not deleted as it didn't contain .[ - your sample output still has it deleted. Is the .[ not important?
It doesn't look like the .[ is important, as long as every line with a +/- is removed when it meets the condition. However, $7 could be intronic, UTR5, or UTR3; is it possible to include all of these in one awk? @Shamrock, the exonic lines will be used in another awk, but I'm not quite sure of the details yet. Thank you :).
Thank you, Chubler_XL, for the nice code. Could you please help me with one point of confusion? When I print the value of the variable named v (which holds the 2nd field's value), I get the following.
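One possible way to cover intronic, UTR5 and UTR3 in a single awk is sketched below. This is not the code posted in the thread, only a minimal illustration under a few assumptions: the file really is tab-delimited, the first line is a header, and every "+digits" or "-digits" run in $2 counts as an offset to test. The function name filter_variants is made up for the example.

```shell
# Hypothetical helper; reads tab-delimited rows from the named files or stdin.
# Usage: filter_variants input > output
filter_variants() {
  awk -F'\t' -v OFS='\t' '
    FNR == 1 { print; next }                 # always keep the header row
    $7 ~ /^(intronic|UTR5|UTR3)$/ {          # only these classes are filtered
      s = $2
      # Walk every "+digits" or "-digits" offset in the mutation call.
      while (match(s, /[+-][0-9]+/)) {
        if (substr(s, RSTART + 1, RLENGTH - 1) + 0 > 10)
          next                               # offset > 10: drop this line
        s = substr(s, RSTART + RLENGTH)      # continue after this offset
      }
    }
    { print }                                # exonic rows and surviving rows
  ' "$@"
}
```

Because the test uses match() on $2 directly, it does not depend on the .[ being present, so a value such as c.276-31delG is handled the same way as c.[433+79A>G]+[433+79A>G]. Rows whose $7 is anything else (for example exonic) are printed unchanged.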
Index Mutation Call Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene Sanger
79A>G]+[433+79A>G] #### Value of variable v
8C>T]+[=] #### Value of variable v
2 c.[362+8C>T]+[=] 40557656 40557656 G A intronic PPT1
8C>T]+[=]
2 c.[362+8C>T]+[=] 40557656 40557656 G A UTR3 PPT1
31delG
20 c.[5109C>T]+[=] 166245425 166245425 C T exonic SCN2A synonymous SNV
21 c.[5139C>T]+[=] 166848646 166848646 G A exonic SCN1A synonymous SNV
So, as we can see above, the value of variable v is 79A>G]+[433+79A>G]. I understand that by doing v+0 we are telling awk to treat it as a number before comparing it with 10. My doubt is that the value contains letters as well as digits after the 79, so will awk still consider only 79 for the comparison? Could you please guide me here? I will be grateful to you, sir.
Hello cmccabe,
Could you please try the following and let me know if this helps you too? My solution takes the exact digits between the -/+ and the </> and then adds 0 to them before comparing.
Yes, that is how awk works when converting a string value to a number. It reads characters until it reaches one that cannot be part of a number and ignores the rest of the string. Try this code for some examples:
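awk '
# Print the original string, a tab, then its value after numeric conversion.
function try(A) {
	print A "\t" A + 0
}
BEGIN {
	# Each string starts with a numeric prefix followed by other characters.
	try("27.2A")
	try("3..1415")
	try(".23A27")
	try("009001A")
} '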
Note that some alphabetic characters are interpreted as parts of numeric values when awk converts a string to a number. The standards require that exponential notation be recognized (e.g., 1.05E-3) and allow, but do not require, awk to recognize hexadecimal constants of the forms 0xhexdigits and 0Xhexdigits. The standards also allow, but do not currently require, that the strings inf and infinity (case-insensitive for both) be treated as infinity and that the string NaN (also case-insensitive) be treated as not-a-number.
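To tie this back to the values of v printed earlier in the thread, here is a small sketch that feeds those strings through the same string-to-number conversion and compares the result with 10. The helper name numeric_prefix is made up for this example, and the strings are passed with -v, which assumes they contain no backslashes.

```shell
# Hypothetical helper: for each argument, print the string, the number awk
# derives from it with "+ 0", and the keep/remove decision for the > 10 test.
numeric_prefix() {
  for s in "$@"; do
    awk -v s="$s" 'BEGIN {
      n = s + 0                        # only the leading numeric part is used
      printf "%s\t%s\t%s\n", s, n, (n > 10 ? "remove" : "keep")
    }'
  done
}

numeric_prefix '79A>G]+[433+79A>G]' '8C>T]+[=]' '31delG'
```

With the three strings above, the numeric column should come out as 79, 8 and 31, so only the 8C>T]+[=] value stays under the limit. A string such as 1e2A is worth trying as well, since the exponent is taken as part of the number.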