Perl to change value based on set of rules

cmccabe · August 4, 2017, 11:49am

In the perl there is a default rule that sets f[55] to VUS , and then a seris of rules that will change f[55] based on the result that is
obtained from the rule. The code below is a rule that is supposed to be applicable to lines 2-4 because this rule just looks at the digit in f[8] . So in line 2 f[8] is 27
and that value is greater than 10, so f[55] would be Likely Benign . Since the symbol before the digit could be either a > or + or - in the regex I use
\D to look for any non-digit before the number.
The else portion of the rule is supposed to be applicable to lines 1 and 5 as it uses the regex to parse out the digit after the - ot + or * in the string
that begins with NM_ in in f[8] . I am currently only getting the second line's f[55] value to be correct and I am not sure what I am doing incorrect. I have tried
changing the regex but not to the correct one (maybe there is something else I am missing). Thank you :).

file

R_Index Chr Start End Ref Alt Func.refGene Gene.refGene GeneDetail.refGene Inheritence ExonicFunc.refGene AAChange.refGene avsnp147 PopFreqMax 1000G_ALL 1000G_AFR 1000G_AMR 1000G_EAS 1000G_EUR 1000G_SAS ExAC_ALL ExAC_AFR ExAC_AMR ExAC_EAS ExAC_FIN ExAC_NFE ExAC_OTH ExAC_SAS ESP6500siv2_ALL ESP6500siv2_AA ESP6500siv2_EA CG46 SIFT_score SIFT_pred Polyphen2_HDIV_score Polyphen2_HDIV_pred Polyphen2_HVAR_score Polyphen2_HVAR_pred LRT_score LRT_pred MutationTaster_score MutationTaster_pred MutationAssessor_score MutationAssessor_pred dpsi_max_tissue dpsi_zscore CLINSIG CLNDBN CLNACC CLNDSDB CLNDSDBID Quality Reads Zygosity Score Classification HGMD Sanger
28 chr2 149246946 149246946 T C splicing MBD5 NM_018328:exon12:c.3055-9T>C . . . rs370173652 0.0043 0.0008 0.003 . . . . 0.0003 0.003 0.0004 . . 0.0001 . . 0.0015 0.0043 . . . . . . . . . . . . . . -2.3896 -2.011 other|Benign|Uncertain significance "not_specified|Mental_retardation,_autosomal_dominant_1|Intellectual_Disability,_Dominant" RCV000188062.2|RCV000230037.1|RCV000392347.1 MedGen|Gene:MedGen:OMIM:Orphanet|MedGen CN169374|100820633:C1969562:156200:ORPHA228402|CN239282 GOOD 174 het 9 . . .
211 chr15 68522107 68522107 C G upstream CLN6 27 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GOOD 5 het . . . .
212 chr15 68522115 68522115 A G upstream CLN6 35 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GOOD 6 het . . . .
43 chr2 166930214 166930214 T A splicing SCN1A >50 . . . rs566839 1 0.99 0.95 0.99 1 1 1 . . . . . . . . . . . 1 . . . . . . . . . . . . 1.4402 1.752 . . . . . GOOD 108 hom 31 . . .
60 chr3 11078886 11078886 C A UTR3 SLC6A1 NM_003042:c.*234C>A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . GOOD 8 het . Likely Benign n .

perl

     if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) {   # capture the digits after any non-digit into $1
                if ($1 > 11) {   # 
                      $1 //= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
   else {
                 if ($FuncrefGene !~ /exonic/i) {
                    my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])(\d+)/;   # Get a numeric value if exists using (.) and (+/-) and capture digits into $transcript.
                             $transcript //= 0;  # Give it a value of zero if no numeric value was found.
                                $classification = 'Likely Benign' if $transcript > 11; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
                 }
           }

desired output in f[55]

VUS
Likely Benign
Likely Benign
Likely Benign
Likely Benign
Likely Benign

current output in f[55]

Likely Benign
Likely Benign
VUS
VUS
VUS
VUS

durden_tyler · August 8, 2017, 7:38pm

That's wrong!
The regex looks for "a non-digit followed by one or more digits" ("\D\d+").
Line numbers 2 and 3 do not have that, so it will not match.
Line number 4 has that, so it will match.

That's wrong again.
Line # 2 will never match "\D\d+" so 27 will never be extracted and hence never compared to anything.

Wrong again.
There are cases where a symbol does not exist in the first place.
For example, lines 2 and 3 do not have the symbol at all.
You did not do anything for those cases hence those lines fail to match your regex.
Line 5 does have the symbol, so it matches your regex.

Yes, but before the control goes to the "else" portion, it will go to the "if" portion.
And the "if" portion will match your lines 1 and 5 because both of them have "a non-digit followed by one or more digit" ("\D\d+") in their f[8] values.

So the "else" portion will not even get a chance to execute for lines 1 and 5.

Here's some diagnostic output for your data file:

$ 
$ cat -n process_files.pl
     1    #!perl
     2    
     3    $file = $ARGV[0];
     4    open(FH, '<', $file) or die "Can't open $file: $!";
     5    while (<FH>) {
     6        next if $. == 1;
     7        chomp;
     8        @x = split /\s+/;
     9        $FuncrefGene = $x[6];
    10        $GeneDetailrefGene = $x[8];
    11        printf("Row no. = [%d]\n", $.);
    12        printf("FuncrefGene = [%s]\n", $FuncrefGene);
    13        printf("GeneDetailrefGene = [%s]\n", $GeneDetailrefGene);
    14        if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) {
    15            printf("In IF branch:\n");
    16            printf("==> \$1 = [%s]\n", $1);
    17            if ($1 > 11) {
    18                printf("==> Inside \$1 > 11\n");
    19                $1 //= 0;
    20                $classification = 'Likely Benign';
    21            }
    22            printf("==> Last line in IF branch: classification = [%s]\n", $classification);
    23        } else {
    24            printf("In ELSE branch:\n");
    25            if ($FuncrefGene !~ /exonic/i) {
    26                my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])(\d+)/;
    27                printf("==> transcript = [%s]\n", $transcript);
    28                $transcript //= 0;
    29                $classification = 'Likely Benign' if $transcript > 11;
    30            }
    31            printf("==> Last line in ELSE branch: classification = [%s]\n", $classification);
    32        }
    33        printf("%s\n\n", "="x60);
    34    }
    35    close(FH) or die "Can't close $file: $!";
    36    
$ 
$ perl process_files.pl data.txt
Row no. = [2]
FuncrefGene = [splicing]
GeneDetailrefGene = [NM_018328:exon12:c.3055-9T>C]
In IF branch:
==> $1 = [_018328]
==> Last line in IF branch: classification = []
============================================================

Row no. = [3]
FuncrefGene = [upstream]
GeneDetailrefGene = [27]
In ELSE branch:
==> transcript = []
==> Last line in ELSE branch: classification = []
============================================================

Row no. = [4]
FuncrefGene = [upstream]
GeneDetailrefGene = [35]
In ELSE branch:
==> transcript = []
==> Last line in ELSE branch: classification = []
============================================================

Row no. = [5]
FuncrefGene = [splicing]
GeneDetailrefGene = [>50]
In IF branch:
==> $1 = [>50]
==> Last line in IF branch: classification = []
============================================================

Row no. = [6]
FuncrefGene = [UTR3]
GeneDetailrefGene = [NM_003042:c.*234C>A]
In IF branch:
==> $1 = [_003042]
==> Last line in IF branch: classification = []
============================================================

$ 
$

cmccabe · August 29, 2017, 9:00am

Thank you for the diagnostics, they help, is it more or less I am trying to capture to many conditions with the regex ? What would you recommend? Thank you :).

durden_tyler · August 30, 2017, 3:30pm

Yes, from your other Perl related posts, I do get the impression that you are trying to use the regexes for too many things. That should be avoided.
However, for this particular piece of code, I think, you may want to deepen your understanding of regexes.

You have two types of data in F[8] column.

Type 1:

27
35
>50

and

Type 2:

NM_018328:exon12:c.3055-9T>C
NM_003042:c.*234C>A

So use regular expressions that work specifically with each type of data.
Your regex "\D\d+" is meant for Type 1, but it will actually match Type 2 as well.
Why?
Because "\D" means "non-digit character" and so it matches the "_" after "NM".
And then that is followed by "\d+" - "one or more digits". That's why the regex doesn't work the way you want.

Here's a demonstration:

$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = _
\d+ or $2 = 018328

And for line # 5:

$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = _
\d+ or $2 = 003042

As you can see, the regex meant for Type 2 data is working on Type 1 data as well.

So, determine what exactly is there in Type 1 and Type 2 data that differentiates them? Here are a few observations:

(1) Type 1 has "\d+" - "one or more digits"
(2) Type 1 may or may not have a non-digit at the front. This non-digit could be ">", "+" or "-". But nothing else.
(3) If there is a non-digit at the front, there is only one such non-digit. There cannot be more than one. So you need: "zero or one non-digit". For that, you could use "\D{0,1}" or "\D?".

Let's test this on the one-liner above.
First, notice that "\D\d+" will not work on both ">50" and "50".

$
$ perl -le '$x = ">50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

That's because there is nothing before "50" in the second case, but the regex "\D\d+" demands exactly one non-digit at the beginning.
Since there was no non-digit, the match failed.

Now notice how "\D?\d+" works for both cases:

$
$ perl -le '$x = ">50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$
$ perl -le '$x = "50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 =
\d+ or $2 = 50
$
$

Now, we make the regex more robust. We know that the "non-digit" character at the beginning is one of ">", "+" or "-".
So we use the bracket notation: "[>+-]"
This will match exactly one of the characters inside the brackets.
And since there can be 0 or 1 of such characters, we use "?" after the brackets: "[>+-]?"
In other words, we simply replaced "\D" by "[>+-]"
"\D" matches any non-digit character; it could match "#" or "A" or ">" etc.
"[>+-]" matches only one of the characters inside the brackets.

Testing again:

$
$ perl -le '$x = ">50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 =
\d+ or $2 = 50
$
$

Finally, we only want the sequence of digits at the end.
So we can remove the parentheses around the non-digits at the beginning.
We can also put the "beginning of string anchor", which is "^" to specify that the non-digits are at the beginning of the string.
The updated regex is "^[1]?(\d+)"

Testing again:

 $
$ perl -le '$x = ">50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 50
\d+ or $2 =
$
$ perl -le '$x = "50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 50
\d+ or $2 =
$
$

So that takes care of Type 1 data.

Now for Type 2 data.
Your regex "/(?:\.\d+[+*-])(\d+)/" looks for the following:

(1) A single dot character "." followed by
(2) One or more digits "\d+" followed by
(3) Exactly one of the characters "+", "*", "-" followed by
(4) One or more digits "\d+"

It matches (1), (2), (3) together but does not "group" them into $1 (due to "?:" at the beginning).
It matches (4) and groups the sequence of digits into $1.

Now, if you look at your Line # 5:

NM_003042:c.*234C>A

the data has:
(1) Single dot character "."
(2) But no sequence of digits after the dot!! There is a "*" after the dot "."

Hence your regex fails.
Here's the demonstration:

$
$ # Matches Line # 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = .3055-
\d+ or $2 = 9
$
$ # But does not match Line # 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

So what are the special characteristics of Type 2 data that distinguish it from Type 1 data? And how do we create the regex to match Type 2 data?

Firstly, if all Type 2 data start with "NM_", you could use that in your regex. So we have "NM_"

Now, it has a dot ">" at some point further on. So we get the regex "NM_.\."
Here "." passes through "maximum number of characters till it reaches the right-most dot (.) character". It's a greedy search.

The dot character may or may not have a sequence of digits after it. (Line 1 has, Line 5 does not have.) "\d*" matches "zero or more digits" - "more" means "1 or more", so "zero or 1 or more than 1 digits".
So, we get: "NM_.*\.\d*"

After that, we definitely have one of the following characters "+", "", "-".
So we use "[+-]" for that. The regex now becomes "NM_.*\.\d*[+*-]"

Finally, that is followed by a sequence of digits that we want to capture.
Sequence of digits is "\d+". So the final regex is: "NM_.*\.\d*[+*-](\d+)"

Let's test this on Line 1 and Line 5 data:

$
$ # Line 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 9
\d+ or $2 =
$
$ # Line 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D  or $1 = 234
\d+ or $2 =
$
$

Because of the "NM_" at the beginning of the regex, we are guaranteed that it will not match Type 1 data.
But let's confirm that that is really the case:

$
$ # Line 2. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "27"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 3. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "35"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 4. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = ">50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Other Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "+50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

Let's also confirm that the regex for Type 1 data does not match Type 2 data!

 $
$
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D  or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$

Hope that helps.
If you are unable to incorporate the regexes in your script, do post the problem here.

>+- ↩︎

cmccabe · September 11, 2017, 9:51am

I think the below will capture lines 2-6 , but not line 1 (looks like 018328) is being captured by the regex . Is the syntax correct or is there a better way? Thank you :).

perl

    if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/\D(\d+)/) {   # capture the digits into $1
              if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i && $GeneDetailrefGene=~/(\D\d+)/) {   # capture the digits after any non-digit into $1
                 if ($1 > 11) {   # 
                      $1 ||= 0;  # Give it a value of zero if no numeric value was found.
                        $classification = 'Likely Benign';  # Reclassify intronic variants (with distance only) based on distance to exon > 10 to Likely Benign
         }
     }
    else {
              if ($FuncrefGene !~ /exonic/i) {
                 my ($transcript) = ($GeneDetailrefGene) =~ /(?:\.\d+[+*-])/;   # Get a numeric value if exists using (.) and (+/-) and capture digits into $transcript.
                           $transcript ||= 0;  # Give it a value of zero if no numeric value was found.
                             $classification = 'Likely Benign' if $transcript > 11; # Reclassify intronic variants (following c. nomenclature) to Likely Benign if distance greater than 10
         }
     }