Perl to parse

cmccabe · March 24, 2015, 4:34pm

The below code works great to parse out a file if the input is in the attached SNP format ">".

 perl -ne 'next if $.==1; while(/\t*NC_(\d+)\.\S+g\.(\d+)([A-Z])>([A-Z])/g){printf("%d\t%d\t%d\t%s\t%s\n",$1,$2,$2,$3,$4,$5)}' out_position.txt > out_parse.txt

My question is: if there is another format in the input, such as "del", can both be parsed at the same time?

SNP parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter (C) before the > 
4. letter (T) after the > 
Desired Output from parse:  13     20763438     20763438     C     G
Desired Output from parse: 13     20763642     20763642     C     G

DEL parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter after "del" (C) 
4.  hyphen "-" used in this spot   
      
Desired Output from parse:  13     20763438     20763438     C     G

durden_tyler · March 24, 2015, 5:33pm

cmccabe:

...

...
DEL parse rules (column 3 after the header row is skipped):
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ###   g.### (all digits)
3. letter after "del" (C) 
4.  hyphen "-" used in this spot   
   
Desired Output from parse:  13     20763438     20763438     C     G

(a)
Rule # 4 - there is no hyphen ("-") either in your attached file or in your output. So what exactly is that rule? Can you post input data on which that rule can be applied?

(b)
Desired output:
The token "20763438" is not seen in the last line of your attached file (the one where the "DEL" parse rule is to be applied apparently), but it is seen in your desired output.

The characters "C" and "G" are not seen together in any token in the last line of your attached file, but are seen together in the desired output.

Can you post the appropriate input data and its corresponding desired output that is obtained after applying all the rules?

cmccabe · March 24, 2015, 7:00pm

In the attached file, the last line is:
NM_004004.5:c.35delG NC_000013.10:g.20763686delC NM_004004.5:c.35delG XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
The third column (after the header row is skipped), NC_000013.10:g.20763686delC, is the field to be parsed into the desired output: 13 20763686 20763686 C -

There is no "-" in the file; the "-" appears only in the output, in the fifth column ($5), to signify a deletion.

The code in the post seems to work for the first and second case (where there is a >), but not for the third (del).

Thank you :).

durden_tyler · March 24, 2015, 11:04pm

$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.283G>C            NC_000013.10:g.20763438C>G      NM_004004.5:c.283G>C    XM_005266354.1:c.283G>C XM_005266355.1:c.283G>C XM_005266356.1:c.283G>C
NM_004004.5:c.79G>C             NC_000013.10:g.20763642C>G      NM_004004.5:c.79G>C     XM_005266354.1:c.79G>C  XM_005266355.1:c.79G>C  XM_005266356.1:c.79G>C
NM_004004.5:c.35delG            NC_000013.10:g.20763686delC     NM_004004.5:c.35delG    XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
$
$
$ # Method 1 : Using a non-capturing grouping in Perl regular expression
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(?:del)*([A-Z])>*([A-Z]*)/g) {
                printf ("%d\t%d\t%d\t%s\t%s\n", $1, $2, $2, $3, $4 || "-");
            }
           ' out_position.txt
13      20763438        20763438        C       G
13      20763642        20763642        C       G
13      20763686        20763686        C       -
$
$
$ # Method 2 : Using more elaborate but plain-vanilla regular expressions
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+[A-Z])/g) {
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {
                    ($ch1, $ch2) = ($1, "-")
                }
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num2, $ch1, $ch2);
                undef $_ for ($num1, $num2, $common, $ch1, $ch2);   # reset so no value leaks into the next match
            }
           ' out_position.txt
13      20763438        20763438        C       G
13      20763642        20763642        C       G
13      20763686        20763686        C       -
$
$

cmccabe · March 25, 2015, 10:58am

Thank you :). It works perfectly. I went with method 2. I am reading up on capturing groups and regular expressions, and they seem most useful in replacement operations on a named value. Is this correct? Thanks again :).

cmccabe · March 25, 2015, 11:59am

I have added a block of code to the script (in bold) to parse the attached file, but am getting a syntax error. Thank you :).

The field to parse is NC_000013.10:g.20763145_20763146delTG

Parse rules:
1. 4 zeros after the NC_  (not always the case) and the digits before the .
2. g. ### (before underscore)  _### (# after the _)
3. TG (all letters after del)
4. -  (hyphen used in this spot)    
Desired Output: 13     20763145     20763146     TG     -

 perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+[A-Z])/g) {     # 3 condition parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {             # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {             # deletion
                    ($ch1, $ch2) = ($1, "-")
				} elsif ($common =~ /ins([A-Z])/) {             # insertion
                    ($ch1, $ch2) = ("-", $1)
			while (/\t*NC_(\d+)\.\S+g\.\S+g\.(\d+)(\S+[A-Z])/g) {      # 2 condition parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /del([A-Z])/) {                        # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                   # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num2, $ch1, $ch2);    # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
            }
           ' out_position.txt > out_parse.txt

 
syntax error at -e line 10, near ") {"
Missing right curly or square bracket at -e line 20, at end of line
syntax error at -e line 20, at EOF
Execution of -e aborted due to compilation errors.

cmccabe · March 25, 2015, 12:31pm

I figured out the syntax error but the parse is not working correctly as it is following the same set of rules for deletion, not the new set in post 6.

I attached the file to be parsed as well. Thank you :).

Result as of now:
13	20763145	20763145	T	-


Should be:
13     20763145     20763146     TG     -

durden_tyler · March 25, 2015, 2:18pm

$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);
            }
           ' out_position.txt
13      20763145        20763146        TG      -
$
$

cmccabe · March 25, 2015, 2:31pm

The code below returns only 1 line (13 20763145 20763146 TG -), even though there are 3 lines to parse. What did I do wrong? The input file is attached. Thank you :).

I am trying to have the input file parsed for any of those conditions, no matter the input. It was working great, but then I changed something. Thanks.

 
parse() {
    printf "\n\n"
	cd 'C:\Users\cmccabe\Desktop\annovar'
    perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # conditional parse  
                ($num1, $num2, $common) = ($1, $2, $3);
                if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
                    ($ch1, $ch2) = ($1, $2)
                } elsif ($common =~ /del([A-Z])/) {                    # deletion
                    ($ch1, $ch2) = ($1, "-")
		} elsif ($common =~ /ins([A-Z])/) {                    # insertion
                    ($ch1, $ch2) = ("-", $1)
                } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
                    ($ch1, $ch2) = ($1, "-")             
                } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
                    ($ch1, $ch2) = ("-", $1)
                }
                printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
                map {undef} ($num1, $num2, $common, $ch1, $ch2);
				}
	           ' out_position.txt > out_parse.txt
	             annovar
}

durden_tyler · March 25, 2015, 9:16pm

cmccabe:

...what did I do wrong? ...
Trying to have the input file parsed (for any of those conditions) no matter the input, it was working great but I did something. ...

 
...
   perl -ne 'next if $. == 1;
   while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {     # conditional parse  
   ($num1, $num2, $common) = ($1, $2, $3);
   if ($common =~ /([A-Z])>([A-Z])/) {                    # SNP
   ($ch1, $ch2) = ($1, $2)
   } elsif ($common =~ /del([A-Z])/) {                    # deletion
   ($ch1, $ch2) = ($1, "-")
   } elsif ($common =~ /ins([A-Z])/) {                    # insertion
   ($ch1, $ch2) = ("-", $1)
   } elsif ($common =~ /del([A-Z])/) {                    # multi deletion
   ($ch1, $ch2) = ($1, "-")             
   } elsif ($common =~ /ins([A-Z])/) {                    # multi insertion
   ($ch1, $ch2) = ("-", $1)
   }
   printf ("%d\t%d\t%d\t%s\t-\n", $1, $2, $3, $4);        # output
   map {undef} ($num1, $num2, $common, $ch1, $ch2);
   }
   ' out_position.txt > out_parse.txt
...
}

The first step towards fixing something is understanding how that thing works. The deeper your understanding, the easier it is for you to fix it.
And understanding comes with practice - lots of practice.
I've highlighted a few problematic parts of your code in red color. But before that, you have to understand what that Perl one-liner does.

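It reads each line of your file (that is what the "-n" switch does) and runs the code within the single quotes. The same code is run against each line. Note that "-n" on its own leaves the EOL (end-of-line) character in place; adding "-l" would strip it. That makes no difference here, since the regex never matches a newline.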

The "next if ..." statement skips the first line of your file.

Then there is this loop:
"while (/<blah>/g) { <do_something> }"
It matches the regular expression <blah> against the line and, for each part of the line that matches that regular expression (regex), it runs the part within the braces i.e. <do_something>.
And it does this repeatedly (due to the "g"/global at the end) as long as there is another match further along the line.

In effect, the "while(/<blah>/g)" tokenizes the line i.e. it picks the tokens out of the line one by one. We could also have broken the line into fields with the "split" function and matched each field.

The regex <blah> is the most important part of the code. It has to be constructed in such a way so that you're able to pick up the most generic token in each line.

So if you have the following 5 tokens in your file:

NC_000013.10:g.20763642C>T
NC_000013.10:g.20763686delC
NC_000013.10:g.20763686insG
NC_000013.10:g.20763145_20763146delTG
NC_000013.10:g.20763145_20763146delAC

and you want a common regex to match as many common parts in each token,
then you'd want to break each token into the parts below (they were color-coded in the original post):

common   1st number   common    2nd number   rest
(black)  (red)        (black)   (orange)     (blue)
NC_      000013       .10:g.    20763642     C>T
NC_      000013       .10:g.    20763686     delC
NC_      000013       .10:g.    20763686     insG
NC_      000013       .10:g.    20763145     _20763146delTG
NC_      000013       .10:g.    20763145     _20763146delAC

The part in red is all numbers, so that's \d+
The part in orange is all numbers again, so that's \d+
The part in blue is some non-whitespace text, so we can use \S+
The part in black is common in all the tokens

With this knowledge, we could construct the regex as follows:

NC_(\d+)\.\S+g\.(\d+)(\S+)

I've added the color codes so you can understand what part of the regex matches what part of the token.

A token may be preceded by 0 or more tabs, so we need \t* at the beginning.
Note that the first token at the beginning of the line has 0 tabs before it. Every other token has 1 or more tabs in front of it. So now the regex becomes:

\t*NC_(\d+)\.\S+g\.(\d+)(\S+)

and that is what we should use in our "while" loop:

while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {

The stuff between the first parentheses goes into $1 and we assign it to $num1.
The stuff between the second parentheses goes into $2 and we assign it to $num2.
The stuff between the third parentheses goes into $3 and we assign it to $common.

while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {
    ($num1, $num2, $common) = ($1, $2, $3);
    ...
}

Now have a look at your code and especially the part in red:

while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {

The regex won't match this token:

NC_000013.10:g.20763642C>T

because the token does not have the "del" text in it. It does not have two numbers separated by underscore.

The regex won't match this token either:

NC_000013.10:g.20763686delC

since there are no two numbers separated by underscore.

The regex will match only this token in line 4 of your input file:

NC_000013.10:g.20763145_20763146delTG

I'll color code the parts of the regex and the parts they match in the token so it's clear:

while (/\t*NC_(\d+)\.\S+g\.(\d+)_(\d+)del([A-Z]+)/g) {

NC_000013.10:g.20763145_20763146delTG

So that was the issue.
Once your main regex is wrong, most of your regexes inside the "while" loop become redundant.
For example, the one with the ">" will never be true:

if ($common =~ /([A-Z])>([A-Z])/) {

because $common will never have the ">" character. With your regex, $common is $3, which is the number after the underscore.
And so on....

The second issue was in your "printf" statement.
Since we have assigned variables inside the while loop that contain our information, we should be printing the variables. Not $1, $2, $3, ... etc.
That is, we should print $num1, $num2, $ch1, ... etc.

Back to the correct code.
Once you understand that $common can contain the following different cases (the blue part of each token, repeated on the right):

NC_000013.10:g.20763642C>T                  C>T
NC_000013.10:g.20763686delC                 delC
NC_000013.10:g.20763686insG                 insG
NC_000013.10:g.20763145_20763146delTG       _20763146delTG
NC_000013.10:g.20763145_20763146delAC       _20763146delAC

you can then work with each of them individually to obtain the information you want.

Another point is about the $num2. You print it twice for the first three cases above. But in cases 4 and 5 above, you need the number after the underscore ("_") and before "del".

What I've done is, I've defined a new variable called $num3.

By default, $num3 equals $num2. And it is set as soon as we know the value of $num2.
In the cases 4 and 5, we extract the value of $num3 and overwrite the default value.

We can then print $num1, $num2, $num3, $ch1, $ch2.

All the ideas above are incorporated in the code below:

$
$ cat out_position.txt
Input Variant   Errors  Chromosomal Variant     Coding Variant(s)
NM_004004.5:c.79G>A             NC_000013.10:g.20763642C>T      NM_004004.5:c.79G>A     XM_005266354.1:c.79G>A  XM_005266355.1:c.79G>A  XM_005266356.1:c.79G>A
NM_004004.5:c.35delG            NC_000013.10:g.20763686delC     NM_004004.5:c.35delG    XM_005266354.1:c.35delG XM_005266355.1:c.35delG XM_005266356.1:c.35delG
NM_004004.5:c.575_576delCA              NC_000013.10:g.20763145_20763146delTG   NM_004004.5:c.575_576delCA      XM_005266354.1:c.575_576delCA   XM_005266355.1:c.575_576delCA   XM_005266356.1:c.575_576delCA
$
$
$ perl -ne 'next if $. == 1;
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {                                            # conditional parse
                ($num1, $num2, $common) = ($1, $2, $3);
                $num3 = $num2;
                if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
                elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
                elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
                elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
                elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, "-", $2) }  # multi insertion
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);                 # output
                undef $_ for ($num1, $num2, $num3, $common, $ch1, $ch2);                          # reset for the next match
            }
           ' out_position.txt
13      20763642        20763642        C       T
13      20763686        20763686        C       -
13      20763145        20763146        TG      -
$
$

Make sure you understand it thoroughly. If in doubt, ask.
Cheers.

cmccabe · March 26, 2015, 11:46am

Thank you for the explanations and color coding, that helps a lot. It's a lot to take in, but it definitely makes sense. I really appreciate your help and efforts.