Classify lines in file using perl

cmccabe · April 8, 2017, 7:11pm

The Perl below runs and classifies each of the 3 lines in file.txt. Lines 2 and 3 are correct, as they fit the criteria for Rule 2.
The problem is that line one should be classified VUS: it does not meet the criteria for Rule 1, so Rule 3 should be used.
However, Rule 2 currently changes the classification to Likely Benign; if I comment that rule out I get the expected result. I am not sure why that rule is even executed on that line, as its first criterion is $FuncIDPrefGene !~ "exonic" (the field is not exonic), but in line one that field is exonic.
I have included comments in the code; each rule is designed to follow a specific set of criteria. I have tried changing the order of the rules, but the result is the same. Thank you.

#!/usr/bin/perl
use strict;

while (<>)
{
        $.<2 and print and next;
          my @f=split/\t/;
         #my @f=split/\s+/;
          my ($FuncIDPrefGene,$AAChangeIDPrefGene,$PopFreqMax,$GeneDetailIDPrefGene,$ClinSig,$Score)=@f[6,11,13,8,46,54];
# Check score for exonic set to 5
         $FuncIDPrefGene eq "exonic" && abs($Score) < 5 and &pj(\@f,"Likely Benign") and next; # Rule 1. Set classification to Likely benign based on score less than 5 for exons

# Check score for everything else set to 5 with GeneDetail following c. nomenclature
        $FuncIDPrefGene !~ "exonic" and abs($Score) < 5 and $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/; # this will capture the digits after    +/- into $1
        $1 < 10 and &pj(\@f,"Likely Benign") and next; # Rule 2. Reclassify intronic variants (with c.) less than 10 based on score

# PopFreqMax VUS
         &pj(\@f,"VUS"); # Rule 3.  If none of the above tests succeeded, and the PopFreqMax < 0.011 set the Classification field to the string VUS.
}
sub pj
{
    my $fr=shift;
       $fr->[55]=shift;
       print join("\t",@{$fr}); # add separator ,"\n"
}

desired result in [55] Classification

VUS
Likely Benign
Likely Benign

Aia · April 8, 2017, 10:59pm

Perhaps this might help you:

#!/usr/bin/perl
use strict;
use warnings;

my $header = scalar <>;
while (<>)
{
    my @f = split /\t/;
    my ( $FuncIDPrefGene,
         $AAChangeIDPrefGene,
         $PopFreqMax,
         $GeneDetailIDPrefGene,
         $ClinSig,
         $Score ) = @f[6,11,13,8,46,54];

     print "\$FuncIDPrefGene = $FuncIDPrefGene and you're trying to abs($Score)\n";

}

Using it with the example you posted it outputs:

perl showme.pl file.txt

$FuncIDPrefGene = exonic and you're trying to abs(12)
$FuncIDPrefGene = splicing and you're trying to abs(2)
$FuncIDPrefGene = intronic and you're trying to abs(.)

You also have precedence issues with the and operators. I suggest you make use of if/else.
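
To illustrate the precedence issue: in post #1, Rule 2 is written as two separate statements, so the $1 < 10 test runs whether or not the regex on the previous line was even attempted. When no match has captured anything, $1 is undefined, which compares as 0 and is therefore less than 10. The sketch below reproduces that with two hypothetical helpers (rule2_as_written and rule2_fixed are names made up for this example, and the sample field values are invented).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: Rule 2 written as in post #1, as two statements.
sub rule2_as_written {
    my ($func, $score, $detail) = @_;
    my $fired = 0;
    no warnings 'uninitialized';    # $1 may be undef here, as in the original
    $func !~ "exonic" and abs($score) < 5 and $detail =~ /\.\d+[\+\-](\d+)/;
    $1 < 10 and $fired = 1;         # evaluated even if the line above stopped early
    return $fired;
}

# Hypothetical helper: the same rule with explicit if blocks.
sub rule2_fixed {
    my ($func, $score, $detail) = @_;
    return 0 if $func =~ /exonic/ or abs($score) >= 5;
    if ($detail =~ /\.\d+[+-](\d+)/) {
        return $1 < 10 ? 1 : 0;     # $1 is only read after a successful match
    }
    return 0;
}

# Invented sample values: function, score, gene detail.
for my $row (['exonic', 12, '.'], ['splicing', 2, 'c.123+5']) {
    printf "%-8s as written: %d, fixed: %d\n",
        $row->[0], rule2_as_written(@$row), rule2_fixed(@$row);
}
```

With these definitions, the exonic sample fires under the original two-statement form and does not fire under the if-based form, which matches the symptom described in post #1.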

cmccabe · April 8, 2017, 11:10pm

I am not sure I follow completely. Is the logic not right? Thank you :).

Aia · April 8, 2017, 11:25pm

abs() is a function for numeric values, and a dot is not numeric. Turning on the warnings pragma would have shown you that at some point.
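
As a small illustration of that point: under use warnings, abs('.') reports that the argument isn't numeric and then treats it as 0. One way to make the handling explicit is looks_like_number from the core Scalar::Util module. The helper name numeric_score and the decision to map non-numeric values to 0 are assumptions made for this sketch, not something taken from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical helper: return the absolute value of a numeric field,
# or 0 when the field is a placeholder such as '.' (assumed meaning).
sub numeric_score {
    my ($raw) = @_;
    return looks_like_number($raw) ? abs($raw) : 0;
}

# Invented sample values for the score column.
for my $raw ('12', '2', '.', '-7') {
    printf "%-3s -> %s\n", $raw, numeric_score($raw);
}
```

Checking the value before calling abs() keeps the script free of numeric warnings and makes the treatment of the placeholder a visible decision rather than a side effect.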

If the code runs but does not produce the desired result, then the logic must not be correct.
This appears to be the flow you are following, but it is flawed because you call abs() regardless of whether the field holds a numeric value. I cannot tell what $f[54] is supposed to mean when it does not contain a numeric value.

#!/usr/bin/perl
use strict;
use warnings;

print scalar <>;
while (<>)
{
    my @f = split /\t/;
    my ( $FuncIDPrefGene,
         $AAChangeIDPrefGene,
         $PopFreqMax,
         $GeneDetailIDPrefGene,
         $ClinSig,
         $Score ) = @f[6,11,13,8,46,54];

    if (abs($Score) < 5) {
        if($FuncIDPrefGene eq 'exonic') {
            pj(\@f,'Likely Benign');
        }
        else {
            my $scored = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;
            pj(\@f, 'Likely Benign') if $scored < 10;
        }
    }
    else {
        pj(\@f, 'VUS');
    }
}
sub pj
{
    my $fr = shift;
    $fr->[55] = shift;
    print join "\t", @{$fr};
}

Test:

perl test.pl file.txt 2>/dev/null | perl -naF'\t' -le 'print $F[55]'

Classification
VUS
Likely Benign
Likely Benign

cmccabe · April 9, 2017, 9:17am

If $f[54] has a . in it, the value associated with it is zero. I use a . in these fields to prevent column shifting due to null values.
So, I think I follow, but just to make sure: abs($Score) should only be used if $f[54] is not a . , is that right? Also, could you please comment the code so I can try to learn more from it, if possible? Thank you very much :).

#!/usr/bin/perl    # call perl
use strict;     # use exactdefined criteria
use warnings;   # display warning messages

print scalar <>;  # skip header line
while (<>)    # start conditional checks
{
    my @f = split /\t/;      # split on tabs
    my ( $FuncIDPrefGene,    # field 1
         $AAChangeIDPrefGene, # field 2
         $PopFreqMax,         # field 3
         $GeneDetailIDPrefGene, # field 4
         $ClinSig,              # field 5
         $Score ) = @f[6,11,13,8,46,54];   # field 6 and define field locations using 0 coordinate

    if (abs($Score) < 5) {      # check field 6 for value and ensure its less than 5
        if($FuncIDPrefGene eq 'exonic') {   # check field 1 and if exonic and conditon above met
            pj(\@f,'Likely Benign');    # set field 55 to Likely Benign
        } # end condition 1 block
        else {
            my $scored = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;  # use field 4 and split on the . and +/1 and read value into variable
            pj(\@f, 'Likely Benign') if $scored > 10;    # if variable greater then 10 then field 55 is Likely Benign
        }  # end condition 2 block
    } 
    else {
        pj(\@f, 'VUS');  # if niether condition is meet set field 55 to VUS
    }
}  # end while block
sub pj     # define subroutine
{    # start sub block
    my $fr = shift;  # define variable 
    $fr->[55] = shift;  # use field 55 as variable
    print join "\t", @{$fr};   # print value in field
}   # end sub block

Aia · April 9, 2017, 10:35am

# Rule 1. Set classification to Likely benign based on score less than 5 for exons
What would you like to happen if it is an exon but the score is more than 5?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Would they be disregarded otherwise?

# Rule 2. Reclassify intronic variants (with c.) less than 10 based on score
What would you like to happen if it is intronic but the value is more than 10?
Your logic places these into rule #3 only if PopFreqMax is less than 0.011. Do you disregard them otherwise?

# Rule 3. If none of the above tests succeeded, and the PopFreqMax < 0.011
What if the PopFreqMax is more than 0.011? Where would those go?

Can $FuncIDPrefGene be anything other than exonic, splicing, or intronic?

Would $Score ever contain a value with a plus (+12) or minus (-12) sign?
Would $Score ever contain a value other than a dot (.) that has no numeric interpretation?

cmccabe · April 9, 2017, 10:58am

Rule 3 was meant to be a catch-all rule, but maybe it is better not to have that. If Rule 1 is an exon with a score of more than 5, then the classification is VUS. So is it better to have an else statement in Rule 1, or just remove the PopFreqMax condition from Rule 3?

I think this follows the same logic as Rule 1, in that I need an else to capture the other condition, or I need to redo Rule 3.

If PopFreqMax is greater than 0.011, the classification is Likely Benign.

Yes, these are just three of the more common values, but there are several others. However, even though there are many possible values, they can all be grouped into exonic, for exons, or not exonic, for everything else.

The number in $Score should always be a positive number such as 1, 2, 15, or 20. I used abs() just in case the format ever changed to include a + or some other symbol.

No, a dot is only used for a null value and is always zero.

Thank you very much :).

Aia · April 9, 2017, 11:29am

cmccabe:

[...]

#!/usr/bin/perl    # call perl

# disables certain Perl expressions that could behave unexpectedly
#  or are difficult to debug, turning them into errors
use strict;     # use exactdefined criteria

use warnings;   # display warning messages

# Read and display the first line of the file passed at command line.
print scalar <>;  # skip header line

# Read the file given at the command line, line by line.
# It could be stdin if no file is given as an argument.
while (<>)    # start conditional checks
{
   # Split the line into tokens, using the tab as the separator.
   my @f = split /\t/;      # split on tabs

   # Select 6 tokens from @f for convenience.
   my ( 
   $FuncIDPrefGene,    # field 1
   # Not used; possibly unnecessary.
   $AAChangeIDPrefGene, # field 2
   # Not used.
   $PopFreqMax,         # field 3
   $GeneDetailIDPrefGene, # field 4
   # Not used; possibly unnecessary.
   $ClinSig,              # field 5
   $Score ) = @f[6,11,13,8,46,54];   # field 6 and define field locations using 0 coordinate

   if (abs($Score) < 5) {      # check field 6 for value and ensure its less than 5
   if($FuncIDPrefGene eq 'exonic') {   # check field 1 and if exonic and conditon above met
   pj(\@f,'Likely Benign');    # set field 55 to Likely Benign
   } # end condition 1 block
   else {
   my $scored = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;  # use field 4 and split on the . and +/1 and read value into variable
   pj(\@f, 'Likely Benign') if $scored > 10;    # if variable greater then 10 then field 55 is Likely Benign
   }  # end condition 2 block
   } 
   else {
   pj(\@f, 'VUS');  # if niether condition is meet set field 55 to VUS
   }
}  # end while block
sub pj     # define subroutine
{    # start sub block
   my $fr = shift;  # define variable 
   $fr->[55] = shift;  # use field 55 as variable
   print join "\t", @{$fr};   # print value in field
}   # end sub block

This was a rearrangement of the code you posted in post #1, with the and operators and next commands removed, so you could view your logic a bit more clearly.

---------- Post updated at 09:29 AM ---------- Previous update was at 09:22 AM ----------

Could these statements encapsulate an accurate logic?

Everything is VUS by default.
Conditions that could change it to Likely Benign
score less than 5
PopFreqMax more than 0.011

If this is not accurate, still, I think you should build upon the idea that it appears that everything is VUS and you are trying to find reasons to change it to Likely Benign.

cmccabe · April 9, 2017, 12:46pm

Yes, these statements are accurate and true. Thank you very much :).

Aia · April 9, 2017, 1:16pm

In that case this might do it.

#!/usr/bin/perl
use strict;
use warnings;

# display header
print scalar <>;
while (<>)
{
    # tokenization of line, splitting by tab.
    my @f = split /\t/;
    # tokens to check on.
    my ($PopFreqMax, $Score) = @f[13,54];

    # Default classification.
    my $classification = 'VUS';

    # map to 0 if it doesn't have numeric meaning.
    $Score = 0 if $Score eq '.';

    # Change to Likely Benign if either of these two
    # conditions occurs.
    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }
    # token 55 is classification.
    $f[55] = $classification;
    # display results.
    print join "\t", @f;
}

cmccabe · April 9, 2017, 3:49pm

very close but I am getting a syntax error:

#!/usr/bin/perl
use strict;
use warnings;

print scalar <>;
while (<>)
{
    my @f = split /\t/;
    my ( $FuncIDPrefGene,
         $AAChangeIDPrefGene,
         $PopFreqMax,
         $GeneDetailIDPrefGene,
         $ClinSig,
         $Score ) = @f[6,11,13,8,46,54];
# map to 0 if it doesn't have numeric meaning.
         $Score = 0 if $Score eq '.';

    if (abs($Score) < 5) {
        if($FuncIDPrefGene eq 'exonic') {
            pj(\@f,'Likely Benign');
        }
        else {
            my $scored = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;
            pj(\@f, 'Likely Benign') if $scored < 10;
        }
        else {
            my $scored = $GeneDetailIDPrefGene=~/^\D(\d+)$/;
            pj(\@f, 'Likely Benign') if $scored < 10;
        }
    else {
        pj(\@f, 'VUS');
    }
}
sub pj
{
    my $fr = shift;
    $fr->[55] = shift;
    print join "\t", @{$fr};
}
syntax error at /home/cmccabe/classify3.pl line 26, near "else"
Illegal declaration of subroutine main::pj at /home/cmccabe/classify3.pl

Thank you very much :).

Also,
adding a:

# Default classification.
    my $classification = 'VUS';

means:
sub pj looks like:

sub pj
{
    my $fr=shift;
       $fr->[55]=shift;
       print join("\t",@{$fr}); # add seperater ,"\n
  {
# token 55 is classification.
    $f[55] = $classification;
    # display results.
    print join "\t", @f;
   }
}

--- to define $my classification

Aia · April 9, 2017, 6:40pm

cmccabe:

very close but I am getting a syntax error:

#!/usr/bin/perl
use strict;
use warnings;

print scalar <>;
while (<>)
{
   my @f = split /\t/;
   my ( $FuncIDPrefGene,
   $AAChangeIDPrefGene,
   $PopFreqMax,
   $GeneDetailIDPrefGene,
   $ClinSig,
   $Score ) = @f[6,11,13,8,46,54];
# map to 0 if it doesn't have numeric meaning.
   $Score = 0 if $Score eq '.';

   if (abs($Score) < 5) {
   if($FuncIDPrefGene eq 'exonic') {
   pj(\@f,'Likely Benign');
   }
   else {
   my $scored = $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;
   pj(\@f, 'Likely Benign') if $scored < 10;
   }
#### You cannot have an else by itself ####
   else {
   my $scored = $GeneDetailIDPrefGene=~/^\D(\d+)$/;
   pj(\@f, 'Likely Benign') if $scored < 10;
   }
   else {
   pj(\@f, 'VUS');
   }
}
sub pj
{
   my $fr = shift;
   $fr->[55] = shift;
   print join "\t", @{$fr};
}
syntax error at /home/cmccabe/classify3.pl line 26, near "else"
Illegal declaration of subroutine main::pj at /home/cmccabe/classify3.pl

Thank you very much :).

Also,
adding a:

# Default classification.
   my $classification = 'VUS';

means:
sub pj looks like:

sub pj
{
   my $fr=shift;
   $fr->[55]=shift;
   print join("\t",@{$fr}); # add seperater ,"\n

### Not proper ####
  {
# token 55 is classification.
   $f[55] = $classification;
   # display results.
   print join "\t", @f;
   }
### End of not proper ####
}

--- to define $my classification

Hi, cmccabe

What you are trying to do now contradicts what you said was true in post #9.
In what way does the code I posted in #10 not meet your expectations?

cmccabe · April 9, 2017, 9:33pm

Hi Aia,

The code works great. I have several additional rules/conditions to apply, which I did not post in order to keep the thread shorter. I was trying to add them following your code, as that works much better. Thank you :).

Aia · April 10, 2017, 12:10am

I understand, then. However, going back to the flawed workflow will not guarantee that every line is kept.

If my suggestion works and you need to add more conditions, just follow the pattern. Ditch the subroutine pj; you really do not need it.

#!/usr/bin/perl
use strict;
use warnings;

print scalar <>;
while (<>)
{
    my @f = split /\t/;
    # Change this to hold more variables for checking.
    my ($PopFreqMax, $Score) = @f[13,54];

    # Default classification.
    my $classification = 'VUS';

    # map to 0 if it doesn't have numeric meaning.
    $Score = 0 if $Score eq '.';
    # if you must.
    $Score = abs($Score);

    # Change to Likely Benign if either of these two
    # conditions occurs.
    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }

    # Add here any other conditions that might change $classification,
    # following this pattern (fill in the condition and the values):
    # if (...) {
    #    $classification = '...';
    # }
    # else {
    #    $classification = '...';
    # }

   # When you get to this point you are ready to change $f[55] token
   # and to display the result.

    # token 55 is classification.
    $f[55] = $classification;
    # display results.
    print join "\t", @f;
}

cmccabe · April 10, 2017, 8:43am

I am adding the condition below to change the classification of line 3 to Likely Benign. If $Score were 20, with PopFreqMax at its current value of 0.003, the line would follow the default rule.
However, because the digits 50 stripped from the >50 in GeneDetailIDPrefGene are greater than 10, the classification should be Likely Benign. I know the code that strips off the 50 works, but am I doing something else wrong? Thank you :).

if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/;) {   # capture the digits after any non-digit into $1
        $1 > 10   # Reclassify intronic variants (with distance only) based on score less than 5 to Likely Benign
        $classification = 'Likely Benign';
    }
else {
             my $scored = $FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;   # capture the digits after . and (+/-) into $1
                if $scored < 5;    # Reclassify intronic variants (with c.) less than 5 based on score
       $classification =  'Likely Benign';
}
syntax error at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 45, near "}"

Execution of /home/cmccabe/Desktop/NGS/scripts/classifier.pl aborted due to compilation errors.

Aia · April 10, 2017, 10:32am

Let me remove all the extra around what you posted and highlight the syntax issues.

cmccabe:

[...]

# remove ;
if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/;) {  
   if ($1 > 10) {
   $classification = 'Likely Benign';
   }
   }
else {
   my $scored = $FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;   
   if ($scored < 5) {  
   $classification =  'Likely Benign';
   }
}

cmccabe · April 10, 2017, 12:53pm

Below is the updated code, along with my attempt to fix the error. The sections in bold were updated accordingly. However, the new version gives a different message but allows the script to run. I am a little confused, as this line seems to be important, but the script ignores or skips it? Thank you :).

# Change to Likely Benign if either of these two conditions occurs.
    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }
    # GeneDetail condition
    if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/) {
        $1 > 10
        $classification = 'Likely Benign';
    }
    else {
           if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/)
              $1 > 10
              $classification =  'Likely Benign';
    }

# token 55 is classification.
    $f[55] = $classification;

    # display results and update @f.
    print join "\t", @f;
}   # end conditional block
Scalar found where operator expected at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 34, near "$classification"
	(Missing semicolon on previous line?)
Scalar found where operator expected at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 38, near ")
              $1"
	(Missing operator before $1?)
Scalar found where operator expected at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 39, near "$classification"
	(Missing semicolon on previous line?)
syntax error at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 34, near "$classification "
syntax error at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 38, near ")
              $1 "
Execution of /home/cmccabe/Desktop/NGS/scripts/classifier.pl aborted due to compilation errors.

Adding the ; indicated by the message gives the warnings below, but the script does execute:

# Change to Likely Benign if either of these two conditions occurs.
    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }
    # GeneDetail condition
    if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/) {
        $1 > 10;
        $classification = 'Likely Benign';
    }
    else {
           if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/)
              $1 > 10;
              $classification =  'Likely Benign';
    }

# token 55 is classification.
    $f[55] = $classification;

    # display results and update @f.
    print join "\t", @f;
}   # end conditional block
Useless use of numeric gt (>) in void context at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 33.
Useless use of numeric gt (>) in void context at /home/cmccabe/Desktop/NGS/scripts/classifier.pl line 38.

Aia · April 10, 2017, 3:35pm

Hi cmccabe,
Please take another look at post #16. I highlighted there how it needs to be written, if that is what you mean.
$1 > 10; is useless, as the message says.
It is the equivalent of saying the sky is blue. So what? There is no flow control there.
If the code runs, it will always set $classification = 'Likely Benign' as soon as the if condition is met.
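
A small sketch of the difference, using a hypothetical helper classify_distance and invented sample values: the comparison only affects the result when it is the condition of an if (or is combined into the outer condition with &&).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a GeneDetail-like string by the digits
# that follow a single leading non-digit (for example ">50").
sub classify_distance {
    my ($detail) = @_;
    my $classification = 'VUS';
    if ($detail =~ /^\D(\d+)$/) {
        # The comparison is the condition of an if, so it controls the flow.
        if ($1 > 10) {
            $classification = 'Likely Benign';
        }
    }
    return $classification;
}

# Invented sample values.
for my $detail ('>50', '>5', '.') {
    printf "%-4s -> %s\n", $detail, classify_distance($detail);
}
```

Written as a bare statement, $1 > 10; computes a true or false value and throws it away, which is exactly what the "Useless use of numeric gt (>) in void context" warning is pointing out.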

cmccabe · April 10, 2017, 7:14pm

I apologize, I read the post incorrectly. Line 1 in the attached file.txt should be VUS, set by the default classification, and that part is correct. However, when the two conditions below are added, the first behaves as expected, but the second (after the else) changes the first line to Likely Benign. It should not be applied, since the condition requires that $FuncIDPrefGene is not exonic, and on line 1 it is exonic. Is there something wrong with my logic? Thank you for all your help :).

# GeneDetail condition
    if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/) {  
        if ($1 > 10) {
            $classification = 'Likely Benign';
        }
    }
        else {
             my $transcript = $FuncIDPrefGene !~/exonic/i && $GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;   
             if ($transcript > 10) {  
                 $classification =  'Likely Benign';
            }
    }

desired classification

VUS     ----- default classification
Likely Benign   -----  portion before the else $Score < 5
Likely Benign    ---- portion after the else >50 is used to be Likely Benign

Aia · April 10, 2017, 10:57pm

You decide.

This is not necessary,

    if ($FuncIDPrefGene !~/exonic/i && $Score < 5 && $GeneDetailIDPrefGene=~/^\D(\d+)$/) {  
        if ($1 > 10) {
            $classification = 'Likely Benign';
        }
    }

Its mission is to set $classification = 'Likely Benign'. However, the following condition, which comes earlier in the script, already does that job whenever $Score is less than 5, regardless of whether the field is exonic or whether the value from $GeneDetailIDPrefGene is more than 10:

    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }

        else {
             my $transcript = $FuncIDPrefGene !~/exonic/i &&$GeneDetailIDPrefGene=~/\.\d+[\+\-](\d+)/;   
             if ($transcript > 10) {  
                 $classification =  'Likely Benign';
            }
    }

The highlighted part does not work for >50 which is what the last line has.

Perhaps this might help instead:


    if ($Score < 5 || $PopFreqMax > 0.011) {
        $classification = 'Likely Benign';
    }

    if ($FuncIDPrefGene !~ /exonic/i) {
        # Get a numeric value if exist.
        my ($transcript) = ($GeneDetailIDPrefGene) =~ /(?:\.\d+[+-]|\D)(\d+)/;
        # Give it a value of zero if no numeric value was found.
        $transcript //= 0;
        $classification = 'Likely Benign' if $transcript > 10;
    }
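
One detail worth spelling out about the capture: a match assigned to a plain scalar, as in my $scored = $str =~ /.../, stores whether the match succeeded (1 or an empty string), not the captured digits. The parentheses in my ($transcript) = ... put the match in list context, so it returns the capture instead. A minimal sketch, with a hypothetical helper distance_from_detail and invented sample strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex from the snippet above.
sub distance_from_detail {
    my ($detail) = @_;
    # List context: the match returns its captures, so $n gets the digits.
    my ($n) = $detail =~ /(?:\.\d+[+-]|\D)(\d+)/;
    # No match leaves $n undefined; fall back to 0 as in the snippet above.
    return $n // 0;
}

# Scalar context for comparison: this stores match success, not the capture.
my $success = 'c.123+50' =~ /\.\d+[+-](\d+)/;
print "scalar context: $success\n";

# Invented sample values for the GeneDetail column.
for my $detail ('c.123+50', '>50', '.') {
    printf "%-9s -> %d\n", $detail, distance_from_detail($detail);
}
```

This is why the earlier versions that compared $scored or $transcript against 10 never saw the real distance: in scalar context the variable held at most 1.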