Perl to update field based on a specific set of rules

cmccabe · July 21, 2017, 8:39am

In the perl below, which does execute, I am having trouble with the else in Rule 3. The digit in f{8} is extracted and used to update f[55] accordinly along with the value in f[13].
There can be either - * or + before the number that is extracted but the same logic applies, that is if the value is greater than 10 and f[13] is greater than 0.01 f[55] is Likely Benign, if the value is less than 10
and f[13] is less than 0.01 f[55] is VUS in. It is possible for f[13] to be . (dot) but that is the same as zero. However, currently the value does not seem to be extracted by the perl below and I
get different output then the desired. If I comment this rule out it fixes some of the lines but causes other lines to be incorrect. I am not sure what is causing the issue or how to fix it. Thank you :).

I added a description to explain each line of the output and what should be happening, which currently is not :).

file

R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    .    .    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    .    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    .    .    .

desired output tab-delimeted

R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    Likely Benign    .    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    VUS    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    VUS    .    .

description

line 1 is Likely Benign in f[55] because in NM_001134408:exon3:c.415-43A>G in f[8] the 43 after the minus is extracted and since that value is greater than 10 f[55] is Likely Benign
line 2 is VUS in f[55] because in NM_003239:exon6:c.927-1G>C in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS
line 3 is VUS in f[55] because in NM_001879:exon12:c.1442-5C>G in f[8] the 1 after the minus is extracted and since that value is less than 10 and f[13] is less than 0.01 f[55] is VUS

perl

# disables certain Perl expressions that could behave unexpectedly or are difficult to debug, turning them into errors
use strict;
use warnings;

# Read and display the first line of the file passed at command line.
print scalar <>;

# Read line by line the file given at the command line, could be the stdin if no file is give as argument.
while (<>)
{  # start conditional block

# Make tokens out of the line, using the tab as separator.
    my @f = split /\t/;

    # Select 4 tokens and define @f.
    my ($FuncrefGene,$PopFreqMax,$GeneDetailrefGene,$ClinSig) = @f[6,13,8,46];

    # Rule1: Default classification VUS.
    my $classification = 'VUS';

    # map to 0 if it doesn't have numeric meaning.
    $GeneDetailrefGene = 0 if $GeneDetailrefGene eq '.';
    $PopFreqMax = 0 if $PopFreqMax eq '.';

    # Rule2: Change to Likely Benign if conditions occurs.
    if ($PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }

     #Rule3: GeneDetail condition
    if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/^\D(\d+)$/) {   # capture the digits after any non-digit into $1
                if ($1 > 10) {   # 
                      $1 //= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
   else {
                 if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
                 }
           }

    # Rule4: ClinVar condition for ACMG
           if ($ClinSig =~ /Pathogenic|Likely pathogenic/i) {
               $classification = 'VUS';
           }

# token 55 is classification.
    $f[55] = $classification;

    # display results and update @f.
    print join "\t", @f;
}   # end conditional block

current output

R_Index    Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    Inheritence    ExonicFunc.refGene    AAChange.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    dpsi_max_tissue    dpsi_zscore    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Score    Classification    HGMD    Sanger
47    chr16    10032451    10032451    T    C    splicing    GRIN2A    NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G    .    .    .    rs760077182    0.0004    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    0.0004    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    0.7699    1.365    .    .    .    .    .    GOOD    93    het    9    Likely Benign    n    .
77    chr14    76427420    76427420    C    G    splicing    TGFB3    NM_003239:exon6:c.927-1G>C    .    .    .    rs767548724    0.0001    .    .    .    .    .    .    .    .    0.0001    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    1    D    .    .    -3.9123    -2.334    .    .    .    .    .    GOOD    190    het    13    Likely Benign    .    .
160    chr3    186944313    186944313    G    C    splicing    MASP1    NM_001879:exon12:c.1442-5C>G    .    .    .    rs138989954    0.0055    0.0018    .    .    .    0.005    0.0041    0.0036    0.0008    0.0032    .    0.0007    0.0043    0.0037    0.0055    0.0025    0.0005    0.0036    .    .    .    .    .    .    .    .    .    .    .    .    .    -0.8646    -1.388    .    .    .    .    .    GOOD    265    het    13    Likely Benign    .    .

durden_tyler · July 21, 2017, 10:02am

The problem lies in the regular expression you are using with "$GeneDetailrefGene".

See below:

cmccabe:

...
...

 ...
 ...
my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
...
 ...

...
...

What you are essentially saying is that in the string $GeneDetailrefGene, look for one of the following:

a) a dot character (".") followed by 1 or more digits [0-9] followed by any one of the characters "+", "*" or "-"

\.\d+[+*-]

== OR ==

b) a non-digit character

\D

whichever comes first.

If either a) or b) is followed by 1 or more digits [0-9], then capture those digits and set them to $transcript.

The "whichever comes first" part is crucial here.
If you have a regex like this: "A|B", Perl will match either A or B, whichever comes first - reading the string from left to right.

So, when your program sees this value for the $GeneDetailrefGene:

NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G

it matches the "_001134408" to "\D(\d+)" and therefore sets $transcript to "001134408".

For a clearer explanation, the red part of the regex below matches the red part of the string $GeneDetailrefGene. Ditto with the blue part, which is eventually assigned to $transcript.

Line 1:

NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G
/(?:\.\d+[+*-]|\D)(\d+)/

Line 2:

NM_003239:exon6:c.927-1G>C
/(?:\.\d+[+*-]|\D)(\d+)/

Line 3:

NM_001879:exon12:c.1442-5C>G
/(?:\.\d+[+*-]|\D)(\d+)/

cmccabe · July 21, 2017, 10:53am

if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-]|\D)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

changed to:

if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript ? I am not sure how to also use f[13} in this rule. In the cases that have multiple f[8] values, like in line 1, the first can be used.

In line 1 f[8] is

NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G

aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004 , f[55] is VUS . Thank you :).

durden_tyler · July 21, 2017, 11:56am

cmccabe:

...
...

if ($FuncrefGene !~ /exonic/i) {
   my ($transcript) = ($GeneDetailrefGene) =~ /(?:[+*-]d=)(\d+)/;   # Get a numeric value if exists using (.) and (+/*/-) and capture digits into $transcript.
   $transcript //= 0;  # Give it a value of zero if no numeric value was found.
   $classification = 'Likely Benign' if $transcript > 10; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10

should capture the 43 in NM_001134408:exon3:c.415-43A>G and that wiill be the value of $transcript ?
...
...

No, it will not match because the pattern "[+-]d=" is not present in $GeneDetailrefGene. The characters "d" and "=" do not follow any one of ("+", "", "-").
Note that in a regex, "d" matches the character "d", but "\d" matches a single digit in the range [0-9].

Also, the stream of 1 or more digits is to be matched before [+*-].
So, you may want to use this regex:

/(?:\.\d+[+*-])(\d+)/

 ...
 ...
 I am not sure how to also use f[13} in this rule.
 ...
 ...

Just use it the way you laid down the rules in your first post.
Here's the relevant excerpt from your first post:

So, since the "value" you talk about is $transcript and f[13] is $PopFreqMax, your logic would be something like:

 if ($transcript > 10 and $PopFreqMax > 0.01) {
     $classification = "Likely Benign";
 } elsif ($transcript < 10 and $PopFreqMax < 0.01) {
     $classification = "VUS";
 }

Which you have already taken care of in line # 23 of your Perl code in your post # 1, so no worries.

I don't know why you run "Rule 2" (regarding f[13] or PopFreqMax) on its own in line # 25 through 28 of your Perl code in post # 1.

Since you have to use it in conjunction with $transcript as per your logic above, use it after you have determined the value of $transcript.

cmccabe:

...
...
In the cases that have multiple f[8] values, like in line 1, the first can be used.

In line 1 f[8] is
NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-43A>G;NM_000833:exon4:c.415-43A>G
aand the ; (semi-colon) indicates the start of a new value. NM_001134408:exon3:c.415-43A>G would be the first value, so 43 is read into the $transcript variable and since f[13] is 0.0004 , f[55] is VUS . ...

And the first one will be used, as the regex reads from the left of the string and tries to match as early as possible.

See below:

$
$ echo "NM_001134408:exon3:c.415-43A>G;NM_001134407:exon3:c.415-44A>G;NM_000833:exon4:c.415-45A>G" | perl -lne '/(?:\.\d+[+*-])(\d+)/ and print $1'
43
$
$

However, if your first "value" within $GeneDetailrefGene (where "values" are delimited by ";") does not match the pattern, then that pattern will be attempted in the second "value" within $GeneDetailrefGene.

In the example below, I have replaced the first "-" character by "#", so the pattern will not match anything in the first "value".

$
$ echo "NM_001134408:exon3:c.415#43A>G;NM_001134407:exon3:c.415-44A>G;NM_000833:exon4:c.415-45A>G" | perl -lne '/(?:\.\d+[+*-])(\d+)/ and print $1'
44
$
$

Perl keeps looking forward and extracts the first substring it encounters that matches the pattern.
This substring happens to be in the second "value" inside $GeneDetailrefGene.

cmccabe · July 26, 2017, 7:30am

Thank you very much :).

cmccabe · September 11, 2017, 9:48am

wrong thread, please ignore. I apologize.

I think the below will capture lines 2-6 , but not line 1 (looks like 018328) is being captured by the regex . Is the syntax correct or is there a better way? Thank you :).

perl

    if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/\D(\d+)/) {   # capture the digits into $1
              if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) {   # capture the digits after any non-digit into $1
                 if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i) {
                 my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])/;   # Get a numeric value if exists using (.) and (+/-) and capture digits into $transcript.
                           $transcript ||= 0;  # Give it a value of zero if no numeric value was found.
                             $classification = 'Likely Benign' if $transcript > 11; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
         }
     }