Perl to run different parser based on digit

cmccabe · March 7, 2017, 1:13pm

The Perl one-liner below works as expected as long as the number in the NC_ accession before the . is a single digit.

perl -ne 'next if $. == 1;   # skip the header line
    if (/.*del([A-Z]+)ins([A-Z]+).*NC_0{5}([0-9]+).*g\.([0-9]+)_([0-9]+)/)   # indel
    {
        # chromosome, start, end, deleted bases, inserted bases
        print join("\t", $3, $4, $5, $1, $2), "\n";
    }
' out_position.txt > out1.txt


out_position.txt

Input Variant	Errors	Chromosomal Variant	Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA		NC_000004.11:g.41747805_41747806delinsTT	LRG_513t1:c.*18_*19delinsAA	NM_003924.3:c.*18_*19delinsA

contents of out1.txt --- output is correct

4	41747805	41747806	GC	AA

However, I cannot work out how to adjust it for the case where the number in the NC_ accession before the . is not a single digit. It can also be 2 digits, as in the case below, and then I would need to skip 4 zeros instead of 5. So my question is: how do I make the NC_ condition in the perl command adapt to the number being 1 or 2 digits? Thank you :).

Input Variant	Errors	Chromosomal Variant	Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA		NC_000014.11:g.41747805_41747806delinsTT	LRG_513t1:c.*18_*19delinsAA	NM_003924.3:c.*18_*19delinsA

So in this case the desired output would be:

14     41747805     41747806     GC     AA

It is also possible for the NC_ to be a letter rather than a digit, but in that case it is always one letter, e.g. NC_00000X.11:g.41747805_41747806delinsTT. Would I change this:

.*NC_0{5}([0-9]+).

to this:

.*NC_0{5}([0-9]+[A-Z]+).

durden_tyler · March 7, 2017, 7:47pm

So the string is one of the following:

(1) NC_ + five zeros + 1 digit + "." character => you want that one digit before the "." character
(2) NC_ + four zeros + 2 digits + "." character => you want those two digits before the "." character
(3) NC_ + five zeros + 1 letter + "." character => you want that one letter before the "." character

One way to look at it is:
NC_ + a sequence of one or more zeros + a sequence of characters that are not zero + "." character

And you want to capture that sequence of non-zero characters before the "." character.

Here's a sample regex that does that:

$ cat input.txt
NC_000004.11
NC_000014.11
NC_00000X.11
$ perl -lne 's/NC_0+(.*?)\..*/$1/; print' input.txt
4
14
X

cmccabe · March 7, 2017, 8:13pm

I thought I understood, but not entirely :). You are correct that those are the 3 possible conditions. Thank you very much :).

perl -ne 'next if $. == 1;
    if(/.*del([A-Z]+)ins([A-Z]+).'s/NC_0+(.*?)\..*/$1/; print')   # indel
{
        print join("\t", $3, $4, $5, $1, $2), "\n";
}
           ' out_position.txt > out.txt
Unknown regexp modifier "/N" at -e line 2, at end of line
Unknown regexp modifier "/C" at -e line 2, at end of line
Unknown regexp modifier "/_" at -e line 2, at end of line
Unknown regexp modifier "/0" at -e line 2, at end of line
syntax error at -e line 2, near "(."
Execution of -e aborted due to compilation errors.

Chubler_XL · March 7, 2017, 8:28pm

Try this:

perl -ne 'next if $. == 1;
    if(/.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/)   # indel
    {
        print join("\t", $3, $4, $5, $1, $2), "\n";
    }
' out_position.txt > out.txt

cmccabe · March 7, 2017, 8:52pm

Thank you both very much :).

Chubler_XL

.*NC_0+([^.]+)\.

Does this look for the NC_ and extract all digits/strings up to the . that are not zero? Thank you :).

Chubler_XL · March 7, 2017, 9:42pm

.*NC_0+([^.]+)\.

Look for NC_ followed by 1-or-more zeros
then extract 1-or-more non . characters, when they are followed by a . character.

Aia · March 7, 2017, 10:14pm

perl -nle 'BEGIN{$,="\t"}/del([A-Z]{2})ins([A-Z]{2})\s+NC_0+(\w+)\.\d+:\w\.(\d+)_(\d+)/ and print $3,$4,$5,$1,$2' out_position.txt > out2.txt

cat out2.txt
14      41747805        41747806        GC      AA

durden_tyler · March 8, 2017, 11:27am

cmccabe:

I thought I understood, but not entirely ..., but you are correct those are the 3 conditions that are possible. ...

perl -ne 'next if $. == 1;
   if(/.*del([A-Z]+)ins([A-Z]+).'s/NC_0+(.*?)\..*/$1/; print')   # indel
{
   print join("\t", $3, $4, $5, $1, $2), "\n";
}
   ' out_position.txt > out.txt
Unknown regexp modifier "/N" at -e line 2, at end of line
Unknown regexp modifier "/C" at -e line 2, at end of line
Unknown regexp modifier "/_" at -e line 2, at end of line
Unknown regexp modifier "/0" at -e line 2, at end of line
syntax error at -e line 2, near "(."
Execution of -e aborted due to compilation errors.

My example was for illustrative purposes, so that you could adapt the regex to suit your code. Instead, you simply copy-pasted it into your code. It won't work that way.

1) Why put single-quotes within single-quotes? Your Perl code starts after "-ne" and goes up to "out_position.txt". Your Perl code is within single quotes. If you put something in single quotes inside it, how will the perl interpreter understand it?

2) Why use the s/// operator inside the "if" branch? What is Perl supposed to do if you use s/// operator inside the "if" branch? Check the documentation: s - perldoc.perl.org

3) Why use the "print" function inside the "if" branch? What is Perl supposed to do if you do that?

4) Where is the closing forward-slash ("/") character in the "if" branch? If you do not demarcate the pattern you want to search, how will Perl know it? Check the syntax of the "if" branch: if - perldoc.perl.org

cmccabe · March 13, 2017, 8:34am

Thank you all very much :).