Perl to parse a variety of formats

cmccabe · April 30, 2017, 9:32am

The Perl script below parses a variety of formats. With the numeric text file as input the script works correctly, but with the alpha text file as input the output file is blank. The conditional-parse portion (the while loop) matches f[2], e.g. NC_000023.10:g.153297761C>A, and captures the text after the position into a variable $common. Since the portion after the NC_ can be alpha or numeric, I think that is the issue. Currently it is set to numeric only (\d+), so maybe (\d+ || [Aa-Zz]) would solve this. I am still learning, so I wanted to check that it wasn't something else I overlooked. Thank you :).

numeric tab-delimited

Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_000023.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T

perl -ne 'next if $. == 1;
    if (/.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/) {   # indel
            print join("\t", $3, $4, $5, $1, $2), "\n";
    }
    else {
            while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {                                            # conditional parse
                ($num1, $num2, $common) = ($1, $2, $3);
                $num3 = $num2;
                if    ($common =~ /^([A-Z])>([A-Z])$/)   { ($ch1, $ch2) = ($1, $2) }              # SNP
                elsif ($common =~ /^del([A-Z])$/)        { ($ch1, $ch2) = ($1, "-") }             # deletion
                elsif ($common =~ /^ins([A-Z])$/)        { ($ch1, $ch2) = ("-", $1) }             # insertion
                elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") }  # multi deletion
                elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, "-", $2) }  # multi insertion
                printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);                 # output
                ($num1, $num2, $num3, $common, $ch1, $ch2) = ();                                  # reset for the next match
            }
}' numeric

output tab-delimited

23    153297761    153297761    C    A

alpha tab-delimited

Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_0000X.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T
The same script produces a blank output file after it executes.

desired output tab-delimited

X    153297761    153297761    C    A
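
A minimal sketch of why the alpha file produces nothing, assuming the only difference between the two inputs is the accession field. In the original pattern (\d+) has to be followed directly by a literal dot, so NC_0000X.10 can never match. Note also that regex alternation is a single |, not ||, and the letter range is written [A-Za-z]. Even with that syntax, an "all digits or all letters" alternation still fails here, because 0000X mixes digits and a letter; a character class that allows both does match. The names %pattern and first_capture are made up for this illustration.

```perl
use strict;
use warnings;

# Accession fields copied from the numeric and alpha inputs in this thread.
my @samples = ('NC_000023.10:g.153297761C>A', 'NC_0000X.10:g.153297761C>A');

# Hypothetical names for the three candidate patterns.
my %pattern = (
    digits_only => qr/NC_(\d+)\.\S+g\.(\d+)(\S+)/,             # original pattern
    alternation => qr/NC_(\d+|[A-Za-z]+)\.\S+g\.(\d+)(\S+)/,   # all digits OR all letters
    char_class  => qr/NC_([0-9A-Za-z]+)\.\S+g\.(\d+)(\S+)/,    # any mix of digits and letters
);

# Return the first capture, or 'no match' when the pattern fails.
sub first_capture {
    my ($re, $text) = @_;
    return $text =~ $re ? $1 : 'no match';
}

for my $name (sort keys %pattern) {
    for my $text (@samples) {
        printf "%-12s %-28s %s\n", $name, $text, first_capture($pattern{$name}, $text);
    }
}
```

Only char_class reports a capture for both samples; the other two report "no match" for the NC_0000X line.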

cmccabe · May 1, 2017, 10:45am

I have changed:

 while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) {

to

 while (/\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g) {

and

printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);

to

printf ("%s\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2);

and getting

using numeric as input:

000023    153297761    153297761    C    A

using alpha as input:

0000X    153297761    153297761    C    A

There are multiple lines of input in each file, but each line follows the same format.

Is the printf not correct, or should I not use \w+? Thank you :).
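
A short sketch of what is probably going on with the padding, under the assumption that the zeros after NC_ are only padding and that the label itself never starts with a zero. %s prints the capture exactly as matched, zeros included, while %d converts it to a number, which turns 000023 into 23 but turns anything containing X into 0. Moving the zeros outside the capture group with 0+ avoids both problems. The helper name chrom_label and the third sample accession are additions for illustration and do not come from the input files.

```perl
use strict;
use warnings;

# Return the chromosome label with the zero padding removed,
# or undef when the text does not look like an NC_ accession.
sub chrom_label {
    my ($text) = @_;
    return $text =~ /NC_0+(\w+)\./ ? $1 : undef;
}

# The first two samples come from the thread; the third is an extra illustration.
for my $acc ('NC_000023.10', 'NC_0000X.10', 'NC_000010.11') {
    # %s keeps the label as text, so "X" survives.
    printf "%s\t%s\n", $acc, chrom_label($acc);
}

{
    no warnings 'numeric';    # silence the "isn't numeric" warning for this demo
    # %d forces a numeric conversion, so a label with a letter becomes 0.
    printf "%%d applied to 0000X prints %d\n", '0000X';
}
```

Because 0+ is greedy it swallows every leading zero and leaves at least one character for (\w+), so NC_000010 yields 10 rather than 0010.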

Aia · May 2, 2017, 12:14am

While in Perl you can execute a lot of code at the command line, I would recommend keeping one-liners for testing concepts and for quick throw-away jobs.

For example, say I want to test that I can split the line into fields and extract information from the second field. I might do the following:

cat example.txt
Input Variant    Errors    Chromosomal Variant    Coding Variant(s)
NM_004992.3:c.274G>T        NC_000023.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>
NM_004992.3:c.274G>T        NC_0000X.10:g.153297761C>A    XM_005274683.1:c.-6G>T    XM_005274682.1:c.-6G>T    XM_005274681.1:c.274G>T    LRG_764t2:c.274G>T    NM_004992.3:c.274G>T    LRG_764t1:c.310G>T    NM_001110792.1:c.310G>T

perl -nale '@f = $F[1] =~ /NC_0+(\w+)\.\d+:g\.(\d+)(\w)>(\w)/; print join "\t", @f[0,1],$f[1],@f[2,3]' example.txt

23      153297761       153297761       C       A
X       153297761       153297761       C       A
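
One detail worth checking before relying on the field index: -a splits on whitespace, not strictly on tabs. The sketch below assumes the Errors column is empty in the data, so that two tabs sit next to each other; that is a guess from the spacing of the posted lines. Under that assumption, splitting on whitespace or on runs of tabs puts the NC_ field at index 1, while splitting on every single tab keeps the empty column and pushes it to index 2. The shortened record and the helper nc_index are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical shortened record: the Errors column is empty,
# so two tabs sit side by side after the first field.
my $line = "NM_004992.3:c.274G>T\t\tNC_000023.10:g.153297761C>A";

# Return the index of the first field starting with NC_, or -1 if none.
sub nc_index {
    my @fields = @_;
    for my $i (0 .. $#fields) {
        return $i if $fields[$i] =~ /^NC_/;
    }
    return -1;
}

my @by_space = split ' ',   $line;   # what -a does by default
my @by_tabs  = split /\t+/, $line;   # a run of tabs counts as one separator
my @by_tab   = split /\t/,  $line;   # every tab separates; the empty field is kept

printf "split ' '    -> NC_ field at index %d\n", nc_index(@by_space);
printf "split /\\t+/  -> NC_ field at index %d\n", nc_index(@by_tabs);
printf "split /\\t/   -> NC_ field at index %d\n", nc_index(@by_tab);
```

If a column could ever contain a space, the whitespace split would shift the indexes, so an explicit tab pattern is the safer choice for real data.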

Now that I know my regex is extracting what I want, let's put it into a script to keep:

#!/usr/bin/perl
# extract.pl
use strict;
use warnings;

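{
    local $, = "\t";    # output field separator
    local $\ = "\n";    # output record separator
    while (<>) {
        my @fields = split /\t+/;
        next unless defined $fields[1];    # skip blank or single-field lines
        my @u = $fields[1] =~ /NC_0+(\w+)\.\d+:g\.(\d+)(\w)>(\w)/;
        # The position is printed twice: start and end are the same for a SNP.
        # Appending "\n" with "." would force the slice into scalar context
        # and drop the reference base, so the newline comes from $\ instead.
        print @u[0,1], $u[1], @u[2,3] if @u;
    }
}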

perl extract.pl example.txt
23      153297761       153297761       C       A
X       153297761       153297761       C       A

cmccabe · May 2, 2017, 2:37am

Since my input can contain multiple format types, I use multiple regexes to capture them. I see your point, but I would need to redo it for each condition. Not every format appears in every file, but the format often differs.

For example, a SNP regex would not work for an insertion (see the comments on each line of my script). Thank you very much for the suggestion and help, it is working :).
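
One possible way to keep several formats in a script without repeating the whole parse for each condition is a small table of patterns, tried in order. This is only a sketch: the patterns are the ones from the first post (the separate indel case is left out), the leading zeros are stripped with NC_0+ as in the one-liner above, and the names @formats and parse_variant, along with the multi-base deletion used as a check, are made up for the illustration.

```perl
use strict;
use warnings;

# Each entry pairs a pattern for the text after the start position with a
# builder that receives the start and the captures, and returns (end, ref, alt).
my @formats = (
    [ qr/^([A-Z])>([A-Z])$/,   sub { my ($start, @m) = @_; ($start, $m[0], $m[1]) } ],  # SNP
    [ qr/^del([A-Z])$/,        sub { my ($start, @m) = @_; ($start, $m[0], '-') } ],    # deletion
    [ qr/^ins([A-Z])$/,        sub { my ($start, @m) = @_; ($start, '-', $m[0]) } ],    # insertion
    [ qr/^_(\d+)del([A-Z]+)$/, sub { my ($start, @m) = @_; ($m[0], $m[1], '-') } ],     # multi deletion
    [ qr/^_(\d+)ins([A-Z]+)$/, sub { my ($start, @m) = @_; ($m[0], '-', $m[1]) } ],     # multi insertion
);

# Parse one chromosomal-variant field into (chrom, start, end, ref, alt).
# Returns an empty list when the field or its format is not recognised.
sub parse_variant {
    my ($field) = @_;
    my ($chrom, $start, $rest) = $field =~ /NC_0+(\w+)\.\d+:g\.(\d+)(\S+)/
        or return;
    for my $format (@formats) {
        my ($re, $build) = @$format;
        if (my @m = $rest =~ $re) {
            return ($chrom, $start, $build->($start, @m));
        }
    }
    return;
}

# The two fields from the numeric and alpha inputs in this thread.
for my $field ('NC_000023.10:g.153297761C>A', 'NC_0000X.10:g.153297761C>A') {
    my @row = parse_variant($field);
    print join("\t", @row), "\n" if @row;
}
```

Adding a new format then means adding one line to the table rather than another elsif branch, and a field that matches none of the patterns is skipped instead of printing stale values.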