Yes, from your other Perl-related posts, I do get the impression that you are trying to use regexes for too many things. That should be avoided.
For this particular piece of code, however, I think you may want to deepen your understanding of regexes.
You have two types of data in the F[8] column.
Type 1:
27
35
>50
and
Type 2:
NM_018328:exon12:c.3055-9T>C
NM_003042:c.*234C>A
So use regular expressions that work specifically with each type of data.
Your regex "\D\d+" is meant for Type 1, but it will actually match Type 2 as well.
Why?
Because "\D" means "non-digit character" and so it matches the "_" after "NM".
And then that is followed by "\d+" - "one or more digits". That's why the regex doesn't work the way you want.
Here's a demonstration:
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = _
\d+ or $2 = 018328
And for line # 5:
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = _
\d+ or $2 = 003042
As you can see, the regex meant for Type 1 data is matching Type 2 data as well.
So, work out exactly what differentiates Type 1 data from Type 2 data. Here are a few observations:
(1) Type 1 has "\d+" - "one or more digits"
(2) Type 1 may or may not have a non-digit at the front. This non-digit could be ">", "+" or "-". But nothing else.
(3) If there is a non-digit at the front, there is only one such non-digit. There cannot be more than one. So you need: "zero or one non-digit". For that, you could use "\D{0,1}" or "\D?".
Let's test this on the one-liner above.
First, notice that "\D\d+" works on ">50" but not on "50".
$
$ perl -le '$x = ">50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /(\D)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
That's because there is nothing before "50" in the second case, but the regex "\D\d+" demands exactly one non-digit at the beginning.
Since there was no non-digit, the match failed.
Now notice how "\D?\d+" works for both cases:
$
$ perl -le '$x = ">50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = >
\d+ or $2 = 50
$
$
$ perl -le '$x = "50"; if ($x =~ /(\D?)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 =
\d+ or $2 = 50
$
$
Now, we make the regex more robust. We know that the "non-digit" character at the beginning is one of ">", "+" or "-".
So we use the bracket notation: "[>+-]"
This will match exactly one of the characters inside the brackets.
And since there can be 0 or 1 of such characters, we use "?" after the brackets: "[>+-]?"
In other words, we simply replaced "\D" by "[>+-]"
"\D" matches any non-digit character; it could match "#" or "A" or ">" etc.
"[>+-]" matches only one of the characters inside the brackets.
Testing again:
$
$ perl -le '$x = ">50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = >
\d+ or $2 = 50
$
$ perl -le '$x = "50"; if ($x =~ /([>+-]?)(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 =
\d+ or $2 = 50
$
$
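The difference between "\D" and "[>+-]" can also be checked one character at a time. The following is only a small sketch; classify_char is a made-up helper name, and the sample characters are chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether a single character is accepted by
# \D (any non-digit) and by [>+-] (only ">", "+" or "-").
sub classify_char {
    my ($ch) = @_;
    my $non_digit = ($ch =~ /\D/)    ? 1 : 0;
    my $in_class  = ($ch =~ /[>+-]/) ? 1 : 0;
    return ($non_digit, $in_class);
}

for my $ch ('#', 'A', '>', '+', '-', '5') {
    my ($non_digit, $in_class) = classify_char($ch);
    printf "%s  \\D: %s  [>+-]: %s\n",
        $ch, $non_digit ? 'yes' : 'no', $in_class ? 'yes' : 'no';
}
```

"#" and "A" are accepted by "\D" but rejected by "[>+-]", which is why the bracket notation is the stricter choice here.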
Finally, we only want the sequence of digits at the end.
So we can remove the parentheses around the non-digits at the beginning.
We can also put the "beginning of string anchor", which is "^" to specify that the non-digits are at the beginning of the string.
The updated regex is "^[>+-]?(\d+)"
Testing again:
$
$ perl -le '$x = ">50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = 50
\d+ or $2 =
$
$ perl -le '$x = "50"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = 50
\d+ or $2 =
$
$
So that takes care of Type 1 data.
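One thing to keep in mind: "^[>+-]?(\d+)" anchors only the beginning, so a value like "50abc" would still match and give 50. If Type 1 values never contain anything after the digits (that is an assumption about your data), you could also anchor the end with "$". A minimal sketch, where type1_number is a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the digits of a Type 1 value, or undef if the
# whole string is not "optional >, + or -, followed only by digits".
sub type1_number {
    my ($value) = @_;
    return $value =~ /^[>+-]?(\d+)$/ ? $1 : undef;
}

for my $value ('27', '>50', '+50', '50abc', 'NM_003042:c.*234C>A') {
    my $n = type1_number($value);
    printf "%-22s => %s\n", $value, defined $n ? $n : 'no match';
}
```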
Now for Type 2 data.
Your regex "/(?:\.\d+[+*-])(\d+)/" looks for the following:
(1) A single dot character "." followed by
(2) One or more digits "\d+" followed by
(3) Exactly one of the characters "+", "*", "-" followed by
(4) One or more digits "\d+"
It matches (1), (2), (3) together but does not "group" them into $1 (due to "?:" at the beginning).
It matches (4) and groups the sequence of digits into $1.
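To see what "?:" changes, here is a small sketch that runs the regex on Line 1 with and without it. In list context a match returns its captured groups, so the two arrays show what ends up in $1 and $2. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $x = "NM_018328:exon12:c.3055-9T>C";

# Capturing prefix: the prefix lands in $1, the wanted digits in $2.
my @capturing = $x =~ /(\.\d+[+*-])(\d+)/;

# Non-capturing prefix: the prefix is matched but not stored,
# so the wanted digits are in $1.
my @non_capturing = $x =~ /(?:\.\d+[+*-])(\d+)/;

print "capturing:     @capturing\n";
print "non-capturing: @non_capturing\n";
```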
Now, if you look at your Line # 5:
NM_003042:c.*234C>A
the data has:
(1) Single dot character "."
(2) But no sequence of digits after the dot!! There is a "*" after the dot "."
Hence your regex fails.
Here's the demonstration:
$
$ # Matches Line # 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = .3055-
\d+ or $2 = 9
$
$ # But does not match Line # 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /(\.\d+[+*-])(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
So what are the special characteristics of Type 2 data that distinguish it from Type 1 data? And how do we create the regex to match Type 2 data?
Firstly, if all Type 2 data start with "NM_", you could use that in your regex. So we have "NM_"
Now, it has a dot "." at some point further on. So we get the regex "NM_.*\."
Here ".*" consumes as many characters as it can and then backs off to the right-most dot (.) from which the rest of the regex can still match. It's a greedy match.
The dot character may or may not have a sequence of digits after it (Line 1 has one, Line 5 does not). "\d*" matches "zero or more digits", so it covers both cases.
So, we get: "NM_.*\.\d*"
After that, we definitely have one of the following characters "+", "*", "-".
So we use "[+*-]" for that. The regex now becomes "NM_.*\.\d*[+*-]"
Finally, that is followed by a sequence of digits that we want to capture.
Sequence of digits is "\d+". So the final regex is: "NM_.*\.\d*[+*-](\d+)"
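The same regex can be written with the /x modifier so that each piece carries its own comment. This is only a sketch of how the parts fit together; type2_number is a made-up helper name, and the third sample string is invented to show which dot the greedy part ends up on when there is more than one candidate.

```perl
use strict;
use warnings;

my $type2 = qr{
    NM_        # literal prefix
    .*         # greedy: as many characters as possible ...
    \.         # ... then the last dot that still lets the rest match
    \d*        # zero or more digits
    [+*-]      # exactly one of "+", "*", "-"
    (\d+)      # the digits we want to capture
}x;

# Hypothetical helper: return the captured digits, or undef if no match.
sub type2_number {
    my ($value) = @_;
    return $value =~ $type2 ? $1 : undef;
}

for my $x ("NM_018328:exon12:c.3055-9T>C",
           "NM_003042:c.*234C>A",
           "NM_000001:c.10+5:c.20-7A>G") {   # invented value with two candidates
    my $n = type2_number($x);
    printf "%-30s => %s\n", $x, defined $n ? $n : 'no match';
}
```

For the invented value the result is 7, not 5, because the greedy part settles on the last dot. If your real data can look like that and you want the first number, the non-greedy ".*?" would be the thing to try.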
Let's test this on Line 1 and Line 5 data:
$
$ # Line 1
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = 9
\d+ or $2 =
$
$ # Line 5
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
It matches!
\D or $1 = 234
\d+ or $2 =
$
$
Because of the "NM_" at the beginning of the regex, we are guaranteed that it will not match Type 1 data.
But let's confirm that that is really the case:
$
$ # Line 2. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "27"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 3. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "35"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Line 4. This is Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = ">50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$ # Other Type 1 data. Regex is for Type 2. Must not match.
$ perl -le '$x = "+50"; if ($x =~ /NM_.*\.\d*[+*-](\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
Let's also confirm that the regex for Type 1 data does not match Type 2 data!
$
$
$ perl -le '$x = "NM_018328:exon12:c.3055-9T>C"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
$ perl -le '$x = "NM_003042:c.*234C>A"; if ($x =~ /^[>+-]?(\d+)/){printf("It matches!\n\\D or \$1 = %s\n\\d+ or \$2 = %s\n",$1,$2)} else {print "Does not match!"}'
Does not match!
$
$
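Putting the two regexes together could look something like the sketch below. I have not seen how your script fills F[8], so this just loops over the sample values from this post; extract_number is a made-up helper name, and you would call it on your own column value instead.

```perl
use strict;
use warnings;

# Hypothetical helper combining the two regexes from this post.
# Returns the captured digits, or undef when neither regex matches.
sub extract_number {
    my ($value) = @_;
    return $1 if $value =~ /^[>+-]?(\d+)/;            # Type 1: 27, 35, >50
    return $1 if $value =~ /NM_.*\.\d*[+*-](\d+)/;    # Type 2: NM_... strings
    return;
}

for my $value ('27', '35', '>50',
               'NM_018328:exon12:c.3055-9T>C',
               'NM_003042:c.*234C>A') {
    my $n = extract_number($value);
    printf "%-30s => %s\n", $value, defined $n ? $n : 'no match';
}
```

Since the two regexes do not match each other's data, the order of the two tests should not matter for the values shown here.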
Hope that helps.
If you are unable to incorporate the regexes in your script, do post the problem here.