Remove line based on condition in awk

cmccabe · August 25, 2016, 3:35pm

In the following tab-delimited input, I am checking $7 for the keyword intronic. If that keyword is found, then $2 is split at the .[ in each line, and if the number after the +/- (which follows the first run of digits) is >10, that line is deleted. This rule always applies to intronic lines. If $7 is exonic, nothing is done and the next line is processed.

For example, using the first line in input:
$7=intronic, so $2, which is c.[433+79A>G]+[433+79A>G], is split at the .[ and the number after the first + (79) is >10, so that line is removed.

awk

awk -F'\t' -v OFS='\t' FNR==NR 'if ($7 ==/intronic/) ; {split($2,f2,"[[digits]");a[f2[1]];next} $2 in a' or ; {split($2,f3,"[[digits]");a[f3[1]];next} $2 in a' input

input

Index    Mutation Call    Start    End    Ref    Alt    Func.refGene    Gene.refGene    ExonicFunc.refGene    Sanger
1    c.[433+79A>G]+[433+79A>G]    40556922    40556922    T    C    intronic    PPT1        
2    c.[362+8C>T]+[=]    40557656    40557656    G    A    intronic    PPT1        
3    c.276-31delG    43396570    43396570    C    -    intronic    SLC2A1    
20    c.[5109C>T]+[=]    166245425    166245425    C    T    exonic    SCN2A    synonymous SNV    
21    c.[5139C>T]+[=]    166848646    166848646    G    A    exonic    SCN1A    synonymous SNV

desired output

Index    Mutation Call    Start    End    Ref    Alt    Func.refGene    Gene.refGene    ExonicFunc.refGene    Sanger
2    c.[362+8C>T]+[=]    40557656    40557656    G    A    intronic    PPT1         
20    c.[5109C>T]+[=]    166245425    166245425    C    T    exonic    SCN2A    synonymous SNV    
21    c.[5139C>T]+[=]    166848646    166848646    G    A    exonic    SCN1A    synonymous SNV

Chubler_XL · August 25, 2016, 4:00pm

Try this:

awk -F'\t' '
$7=="intronic" {
   v=$2
   sub(/.*\.\[[^+-]+[+-]/,"",v)
   if(v + 0 > 10) next
}
1' input

output:

Index   Mutation        Call    Start   End     Ref     Alt     Func.refGene    Gene.refGene    ExonicFunc.refGene      Sanger
2       c.[362+8C>T]+[=]        40557656        40557656        G       A       intronic        PPT1
3       c.276-31delG    43396570        43396570        C       -       intronic        SLC2A1
20      c.[5109C>T]+[=] 166245425       166245425       C       T       exonic  SCN2A   synonymous      SNV
21      c.[5139C>T]+[=] 166848646       166848646       G       A       exonic  SCN1A   synonymous      SNV

Edit: Line 3 is not deleted because it doesn't contain .[, but your sample output has it deleted. Is the .[ not important?

shamrock · August 25, 2016, 4:24pm

So what is to be done with those lines that have neither intronic nor exonic in them?

cmccabe · August 25, 2016, 4:34pm

It doesn't look like the .[ is important, as long as all the +/- lines get removed if they meet the condition. However, $7 could be intronic, UTR5, or UTR3. Is it possible to handle all three in one awk?
@Shamrock the exonic lines will be used in another awk, but I'm not quite sure of the details yet. Thank you :).

Maybe

awk -F'\t' '
$7=="intronic || UTR3 || UTR5" {
   v=$2
   sub(/.*[^+-]+[+-]/,"",v)
   if(v + 0 > 10) next
}
1' input

Chubler_XL · August 25, 2016, 6:22pm

Try:

awk -F'\t' '
$7 ~ "^(intronic|UTR3|UTR5)$" {
   v=$2
   sub(/^[^+-]+[+-]/,"",v)
   if(v + 0 > 10) next
}
1' input
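
The difference between this condition and the earlier == comparison can be seen in isolation. In the sketch below, the anchored regular expression accepts any one of the three words and nothing else, whereas a string comparison against "intronic || UTR3 || UTR5" would only succeed if the field contained that exact text. The function name classify and the sample values are made up for illustration, and the test is applied to $1 rather than $7 to keep the input short.

```shell
# Hypothetical helper: report whether a Func.refGene value would be checked
# by the anchored pattern. The value is passed on stdin as a one-field line.
classify() {
  printf '%s\n' "$1" | awk '
    $1 ~ "^(intronic|UTR3|UTR5)$" { print "checked"; next }
    { print "skipped" }'
}

classify intronic         # prints: checked
classify UTR5             # prints: checked
classify exonic           # prints: skipped
classify ncRNA_intronic   # prints: skipped, because ^ and $ anchor the match
```

Without the ^ and $ anchors, any field that merely contains one of the words would also be checked.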

RavinderSingh13 · August 25, 2016, 11:36pm

Thank you, Chubler_XL, for the nice code. Could you please help me with one point of confusion? I print the value of the variable named v (which holds the value of the 2nd field) as follows.

awk '
$7 ~ "^(intronic|UTR3|UTR5)$" {
   v=$2
   sub(/^[^+-]+[+-]/,"",v)
   print v;if(v + 0 > 10) next
}
1' Input_file

Output will be as follows then.

Index    Mutation Call    Start    End    Ref    Alt    Func.refGene    Gene.refGene    ExonicFunc.refGene    Sanger
79A>G]+[433+79A>G]   #### Value of variable v
8C>T]+[=]                     #### Value of variable v
2    c.[362+8C>T]+[=]    40557656    40557656    G    A    intronic    PPT1        
8C>T]+[=]
2    c.[362+8C>T]+[=]    40557656    40557656    G    A    UTR3    PPT1
31delG
20    c.[5109C>T]+[=]    166245425    166245425    C    T    exonic    SCN2A    synonymous SNV    
21    c.[5139C>T]+[=]    166848646    166848646    G    A    exonic    SCN1A    synonymous SNV

As we can see above, the value of variable v is 79A>G]+[433+79A>G]. I understand that v+0 tells awk to treat it as a number before comparing it with 10. My doubt is that the string contains letters as well as digits after the 79, so will awk still use only 79 for the comparison? Could you please guide me here? I would be grateful.

Hello cmccabe,

Could you please try the following and let me know if it helps you too? My solution tries to take the exact digits between the -/+ and the </>, then adds 0 to the result before comparing it.

awk -F'\t' '($7 ~ /intronic|UTR3|UTR5/){v=$2;a=sub(/^[^+-]+\+/,X,v);if(a){sub(/([><]).*/,X,v)};if((v+0)>10){next}} 1'  Input_file

Thanks,
R. Singh

Chubler_XL · August 25, 2016, 11:55pm

Yes, that is how awk works when converting a string value to a number. It reads characters until it reaches one that cannot be part of a number and ignores the rest of the string. Try this code for some examples:

awk '
function try(A) {
  print A "\t" A + 0
}
BEGIN {
  try("27.2A")
  try("3..1415")
  try(".23A27")
  try("009001A")
} '

Don_Cragun · August 26, 2016, 2:27am

Note that some alphabetic characters are interpreted as parts of numeric values when awk converts a string to a number. The standards require that exponential notation be recognized (e.g., 1.05E-3) and allow, but do not require, awk to recognize hexadecimal constants of the forms 0xhexdigits and 0Xhexdigits. The standards also allow, but do not currently require, that the strings inf and infinity (both case-insensitive) be treated as infinity and that the string NaN (also case-insensitive) be treated as not-a-number.
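
A quick way to check how a particular awk handles these cases is to feed it a few strings and add 0 to each. The sketch below assumes a POSIX-style awk on the PATH; the helper name to_num is made up for illustration. Hexadecimal, inf, and NaN strings are deliberately left out, since their handling varies between implementations.

```shell
# Hypothetical helper: print the numeric value awk assigns to a string.
# The strings used here contain no backslashes, so -v passes them unchanged.
to_num() {
  awk -v s="$1" 'BEGIN { print s + 0 }'
}

to_num '79A>G]+[433+79A>G]'   # prints 79: conversion stops at the A
to_num '8C>T]+[=]'            # prints 8
to_num '1.05E-3xyz'           # prints 0.00105: the exponent is part of the number
```

For the Mutation Call values in this thread, the character after the offset is a nucleotide letter, a d (del), or similar, so the exponent rule should not normally interfere, but it is worth keeping in mind for other data.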

cmccabe · August 26, 2016, 9:36am

@RavinderSingh13 here is the result of the awk. Thank you all for your help :).

awk -F'\t' '($7 ~ /intronic|UTR3|UTR5/){v=$2;a=sub(/^[^+-]+\+/,X,v);if(a){sub(/([><]).*/,X,v)};if((v+0)>10){next}} 1'  Input_file

output

Index    Mutation Call    Start    End    Ref    Alt    Func.refGene    Gene.refGene    ExonicFunc.refGene    Sanger
3    c.276-31delG    43396570    43396570    C    -    intronic    SLC2A1        
4    c.[13-22T>C]+[13-22T>C]    160090674    160090674    T    C    intronic    ATP1A2        
5    c.13-11_13-8delTCCT    160090685    160090688    TCCT    -    intronic    ATP1A2        
12    c.[268-28A>G]+[268-28A>G]    166153499    166153499    A    G    intronic    SCN2A        
13    c.[1035-3T>C]+[1035-3T>C]    166170127    166170127    T    C    intronic    SCN2A        
15    c.[1672-16C>T]+[1672-16C>T]    166179650    166179650    C    T    intronic    SCN2A        
16    c.[2994C>T]+[=]    166210776    166210776    C    T    exonic    SCN2A    synonymous SNV    
17    c.[3400-71C>T]+[3400-71C>T]    166221582    166221582    C    T    intronic    SCN2A        
18    c.4552-40delT    166243216    166243216    T    -    intronic    SCN2A        
19    c.[4914T>A]+[4914T>A]    166245230    166245230    T    A    exonic    SCN2A    synonymous SNV    
20    c.[5109C>T]+[=]    166245425    166245425    C    T    exonic    SCN2A    synonymous SNV    
21    c.[5139C>T]+[=]    166848646    166848646    G    A    exonic    SCN1A    synonymous SNV    
22    c.3152_3153insAACCACT    166892841    166892841    -    AGTGGTT    exonic    SCN1A    frameshift insertion    TP
23    c.2044-5delT    166898947    166898947    A    -    intronic    SCN1A        
24    c.[1663-47T>G]+[1663-47T>G]    166900606    166900606    A    C    intronic    SCN1A        
25    c.1530_1531insA    166901684    166901684    -    T    exonic    SCN1A    frameshift insertion    FP
26    c.[-44C>T]+[=]    66094008    66094008    C    T    UTR5    KCTD7
27    c.[*68A>G]+[=]    125880589    125880589    T    C    UTR3    ALDH7A1

@Chubler_XL

The awk is missing the last line (Index 27), presumably because of the *. Would adding \*\ capture that condition? Thank you :).

Adding * to the bracket expressions was what I needed to do: sub(/^[^*+-]+[*+-]/,"",v)
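
Put together with the earlier condition on $7, the change can be tried on a few rows from this thread. This is only a sketch: the wrapper name filter_variants is made up, the rows are rebuilt with real tabs using printf, and it assumes the offset of interest always follows the first *, + or - in $2.

```shell
# Sketch: drop intronic/UTR lines whose number after the first *, + or - is > 10.
filter_variants() {
  awk -F'\t' '
    $7 ~ "^(intronic|UTR3|UTR5)$" {
      v = $2
      sub(/^[^*+-]+[*+-]/, "", v)   # strip everything up to the first *, + or -
      if (v + 0 > 10) next          # the leading digits of v are the offset
    }
    1'
}

# Rows 1, 2 and 27 from the thread; only row 2 (offset 8) should survive.
printf '1\tc.[433+79A>G]+[433+79A>G]\t40556922\t40556922\tT\tC\tintronic\tPPT1\n2\tc.[362+8C>T]+[=]\t40557656\t40557656\tG\tA\tintronic\tPPT1\n27\tc.[*68A>G]+[=]\t125880589\t125880589\tT\tC\tUTR3\tALDH7A1\n' |
  filter_variants | cut -f1    # prints 2
```

Inside a bracket expression the * needs no backslash, and keeping the - as the last character stops it from being read as a range.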

RudiC · August 26, 2016, 9:55am

Try also

awk '$7 ~/intronic|UTR(3|5)/{match ($2, /[+-][0-9]*/); if (10 < 0+substr($2, RSTART+1, RLENGTH-1)) next}1' file

cmccabe · August 26, 2016, 1:07pm

works great @RudiC... thank you :).