The Perl script below parses a variety of formats. If I use the numeric text file as input, the script works correctly. However, with the alpha text file as input, the output file is blank. The portion in bold splits the field to parse, f[2] or NC_000023.10:g.153297761C>A, into a variable $common, but since the portion after the NC_ can be alpha or numeric, I think that is the issue. Currently it is set to match only numeric (\d+), so maybe (\d+ || [Aa-Zz]) would solve this. I am still learning, so I wanted to check and make sure it wasn't something else I overlooked. Thank you :).
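A minimal sketch for checking this hypothesis against the two sample accessions from this post. The alternative pattern and the helper name chrom_of are assumptions made for illustration, not part of the original script. Note that inside a Perl regex, || is alternation with an empty branch rather than a logical "or", and [Aa-Zz] contains the range a-Z, which Perl rejects as an invalid range, so the sketch uses the character class [0-9A-Za-z] instead.

```perl
use strict;
use warnings;

# Original pattern: only digits are allowed between "NC_" and the first dot.
my $digits_only = qr/NC_(\d+)\.\S+g\.(\d+)(\S+)/;

# Assumed alternative: skip the zero padding, then accept digits or letters.
my $alnum = qr/NC_0*([0-9A-Za-z]+)\.\S+g\.(\d+)(\S+)/;

# chrom_of is a hypothetical helper: first capture on a match, undef otherwise.
sub chrom_of {
    my ($re, $field) = @_;
    return $field =~ $re ? $1 : undef;
}

# Try both patterns on the numeric and the alpha accession.
for my $field ('NC_000023.10:g.153297761C>A', 'NC_0000X.10:g.153297761C>A') {
    for my $re ($digits_only, $alnum) {
        my $chrom = chrom_of($re, $field);
        print "$field\t", defined $chrom ? "captured $chrom" : "no match", "\n";
    }
}
```

With the digits-only pattern, \d+ consumes 0000 and then needs a literal dot, but the next character is X, so the alpha accession never matches and the while loop body never runs. That would be consistent with the blank output file.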
numeric tab-delimited
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_004992.3:c.274G>T NC_000023.10:g.153297761C>A XM_005274683.1:c.-6G>T XM_005274682.1:c.-6G>T XM_005274681.1:c.274G>T LRG_764t2:c.274G>T NM_004992.3:c.274G>T LRG_764t1:c.310G>T NM_001110792.1:c.310G>T
perl -ne 'next if $. == 1;
if (/.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/) # indel
{
    print join("\t", $3, $4, $5, $1, $2), "\n";
}
else
{
    while (/\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g) { # conditional parse
        ($num1, $num2, $common) = ($1, $2, $3);
        $num3 = $num2;
        if ($common =~ /^([A-Z])>([A-Z])$/) { ($ch1, $ch2) = ($1, $2) } # SNP
        elsif ($common =~ /^del([A-Z])$/) { ($ch1, $ch2) = ($1, "-") } # deletion
        elsif ($common =~ /^ins([A-Z])$/) { ($ch1, $ch2) = ("-", $1) } # insertion
        elsif ($common =~ /^_(\d+)del([A-Z]+)$/) { ($num3, $ch1, $ch2) = ($1, $2, "-") } # multi deletion
        elsif ($common =~ /^_(\d+)ins([A-Z]+)$/) { ($num3, $ch1, $ch2) = ("-", $1, $2) } # multi insertion
        printf ("%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2); # output
        map {undef} ($num1, $num2, $num3, $common, $ch1, $ch2);
    }
}' numeric
output tab-delimited
23 153297761 153297761 C A
alpha tab-delimited
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_004992.3:c.274G>T NC_0000X.10:g.153297761C>A XM_005274683.1:c.-6G>T XM_005274682.1:c.-6G>T XM_005274681.1:c.274G>T LRG_764t2:c.274G>T NM_004992.3:c.274G>T LRG_764t1:c.310G>T NM_001110792.1:c.310G>T
The same script produces a blank output file after it executes.
desired output tab-delimited
X 153297761 153297761 C A
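Getting X in the first column likely also depends on the printf format: %d turns a non-numeric string such as X into 0, so the chromosome would need %s. The sketch below shows one way the SNP branch could produce the desired row. The helper name parse_snps and the widened pattern are assumptions for illustration, and the deletion and insertion branches are left out for brevity.

```perl
use strict;
use warnings;

# parse_snps is a hypothetical helper name, not part of the original script.
# It returns one tab-delimited row per NC_ accession that carries a SNP.
sub parse_snps {
    my ($line) = @_;
    my @rows;
    # NC_0* skips the zero padding; ([0-9A-Za-z]+) then captures 23, X, and so on.
    while ($line =~ /NC_0*([0-9A-Za-z]+)\.\S+g\.(\d+)(\S+)/g) {
        my ($chrom, $start, $common) = ($1, $2, $3);
        # Only the SNP branch is shown; other variant types are skipped here.
        next unless $common =~ /^([A-Z])>([A-Z])$/;
        # %s rather than %d for the chromosome, because X is not a number.
        push @rows, sprintf("%s\t%d\t%d\t%s\t%s", $chrom, $start, $start, $1, $2);
    }
    return @rows;
}

# A shortened version of the alpha input line, joined with tabs.
my $alpha = join "\t",
    'NM_004992.3:c.274G>T', '', 'NC_0000X.10:g.153297761C>A', 'XM_005274683.1:c.-6G>T';
print "$_\n" for parse_snps($alpha);
```

Declaring the variables with my inside the loop gives each iteration fresh values, so no explicit reset is needed at the end of the loop body.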