Add static text in perl

cmccabe · February 3, 2016, 9:44am

I am trying to add a static value to a field in Perl. A file is created with "Null" in the added fields; then, after some manipulation, a field (AB) is split and the text from it is parsed into the desired fields. All of that works. What doesn't work is the last line below, where I try to add a static value of "VUS" to field 46: as of now that field is still "Null". Thank you :).

my @colsleft = map "Null",(1..$#left);	  
	   my @colsright = map "Null",(0..$#right);	 
	  	  
	  while(<FH>)  {  # puts row of input file into $_ 
	  chomp;
	  my @vals = split/\t/; # this splits the line at tabs
	  my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
	  my ($gene,$transcript,$exon,$coding,$aa);
	  for (@mutations)
	  {	
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
			grep {$transcript eq $_} keys %nms or next;
		}
		# warn join ("\t",$gene,$transcript,$exon,$coding,$aa);
		my @out=($.,@colsleft,$_,@colsright);
		$out[2]=$gene;
		$out[3]=$nms{$transcript};
		$out[4]=$transcript;
		$out[15]=$coding;
		$out[17]=$aa;
		$out[45]="VUS";

Corona688 · February 3, 2016, 12:25pm

Please post your complete program. We can't tell from this why it's not being printed.

cmccabe · February 3, 2016, 1:05pm

The complete program is below; the only thing not working is the static text. Thank you :).

 #!/bin/perl
   use strict;
   my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");  # match NM and inheritance with gene
    
 # Accept the input and output files as parameters
       my $input_file = $ARGV[0];
       my $output_file = $ARGV[1];
     
       # Set the header columns to be added to the left
       # and to the right of the header in the input file
      my @left =  (
                       "Index",
                       "Chromosome Position",
                       "Gene",
                       "Inheritance",
                       "RNA Accession",
                       "Chr",
                       "Coverage",
                       "Score",
                       "A(#F,#R)",
                       "C(#F,#R)",
                       "G(#F,#R)",
                       "T(#F,#R)",
                       "Ins(#F,#R)",
                       "Del(#F,#R)",
                       "SNP db_xref",
                       "Mutation Call",
                       "Mutant Allele Frequency",
                       "Amino Acid Change"
                  );
      my @right = (
                      "HP",
                      "SPLICE",
                      "Pseudogene",
                      "Classification",
                      "HGMD",
                      "Disease",
                      "Sanger",
                      "References"
                  );
    
      # open the input file, read the header line and sandwich it
      # between @left and @right arrays
      my $final_header;
      open (FH, "<", $input_file) or die "Can't open $input_file: $!";
   chomp(my $hdr=<FH>);
       $final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));
       # final header is set, print it to the output file
      open (OF, ">", $output_file) or die "Can't open $output_file: $!";
      print OF "$final_header";
     # close (FH) or die "Can't close $output_file: $!";
   
    my @colsleft = map "Null",(1..$#left);   
    my @colsright = map "Null",(0..$#right);  
      
   while(<FH>)  {  # puts row of input file into $_ 
   chomp;
   my @vals = split/\t/; # this splits the line at tabs
   my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
   my ($gene,$transcript,$exon,$coding,$aa);
   for (@mutations)
   { 
   ($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
    grep {$transcript eq $_} keys %nms or next;
  }
  # warn join ("\t",$gene,$transcript,$exon,$coding,$aa);
  my $classification = "VUS";
  my @out=($.,@colsleft,$_,@colsright);
  $out[2]=$gene;
  $out[3]=$nms{$transcript};
  $out[4]=$transcript;
  $out[15]=$coding;
  $out[17]=$aa;
  $out[45]=$classification;
  
  #print OF join("\t",$.,@colsleft,$_,@colsright),"\n";   # row data is set, print it to the output file  
  print OF join("\t",@out),"\n";   # row data is set, print it to the output file   
     }

MadeInGermany · February 3, 2016, 4:52pm

In general, with push you can add an element to the end of an array.

push @out, "VUS";

Because you assemble @out anyway, you can add it just there:

my @out=($., @colsleft, $_, @colsright, "VUS");

cmccabe · February 3, 2016, 6:22pm

How does Perl know that VUS goes in [45]? Thank you :).

Aia · February 3, 2016, 7:56pm

You are doing it correctly:
$out[45] = "VUS";
if you want to guarantee that the 46th element of @out holds the string "VUS". push only adds to the end of the array, whatever index that happens to be.

$out[45] = "VUS"; assigns it, and there is no way you get "Null" after that unless the element is changed later or you are not looking at the element you think you are.
Do you want to test it?

$out[45] = "VUS";
print "$out[45]\n";
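
To make the difference concrete, here is a minimal sketch (the three-element array is made up for illustration, it is not your data). push lands at the next free index, while assigning to an index puts the value in that exact slot and pads any gap with undef.

```perl
use strict;
use warnings;

# Hypothetical three-element row, just for illustration.
my @row = ("a", "b", "c");

push @row, "VUS";            # lands at index 3, the next free slot
my $pushed_at = $#row;

$row[45] = "VUS";            # lands at index 45; indexes 4..44 become undef
my $size = scalar @row;

print "push put VUS at index $pushed_at\n";
print "array now has $size elements\n";
```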

Anything I say from here on is not a criticism of your posted code, but an attempt to understand why you do it that way.

my @colsleft = map "null",(1..$#left);
my @colsright = map "null",(0..$#right);

I think you are trying to create two arrays of a certain size holding some empty value. What you actually create is two arrays in which each element holds the string "null", which has no special meaning as "empty" in Perl. The Perl equivalent would be undef.
Nevertheless, you do not need to worry about that. If you create an array and assign elements out of order, the elements in between are created with the value undef.

Example:
I am going to create an array named @a and populate only the third element $a[2] and the eleventh element $a[10]

perl -MData::Dumper -e '$a[2] = "Third element of a"; $a[10]="Eleventh element of a"; END{print Dumper \@a}'

Take a look at the representation of that array courtesy of the Data::Dumper module:

$VAR1 = [
          undef,
          undef,
          'Third element of a',
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          undef,
          'Eleventh element of a'
        ];
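
One practical consequence, sketched below with made-up values: when such an array is joined for output, every undef slot becomes an empty string (and triggers a warning under use warnings). If the output file should show a visible placeholder, the gaps can be filled at output time. The placeholder text "Null" is taken from your code; the rest is only an illustration.

```perl
use strict;
use warnings;

my @a;
$a[2] = "third";
$a[5] = "sixth";

# Replace each undefined slot with the placeholder text before joining.
my $line = join "\t", map { defined $_ ? $_ : "Null" } @a;

print "$line\n";
```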

Let me point to a few issues with your posted code.

    for (@mutations) {
            ($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            grep {$transcript eq $_} keys %nms or next;
    }

I do not know how many times the for loop is executed, because that depends on the size of @mutations, set earlier by my @mutations=split/,/,$vals[9]. However, I know that most of its work is in vain, since $gene,$transcript,$exon,$coding,$aa only keep the values from the last iteration. The earlier values are overwritten on each pass through the loop.

grep {$transcript eq $_} keys %nms or next;

This is not doing much either: the result of grep is not saved anywhere, and next as the last statement of the loop body changes nothing.
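
If the goal is to keep the record whose transcript is listed in %nms (an assumption on my part, based on the comments in the code), a hash lookup with exists plus last does that directly. The two-record field below is invented for the example.

```perl
use strict;
use warnings;

my %nms = ("NM_004004.5" => "AR", "NM_004992.3" => "XLD", "NM_003924.3" => "AD");

# Invented example of column AB holding two comma-separated records.
my $ab = "PHOX2B:NM_000000.1:exon1:c.A1G:p.M1V,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G";

my ($gene, $transcript, $exon, $coding, $aa);
for my $record (split /,/, $ab) {
    my @parts = split /:/, $record;
    next unless defined $parts[1] && exists $nms{ $parts[1] };
    ($gene, $transcript, $exon, $coding, $aa) = @parts;
    last;    # stop at the first record with a known transcript
}

if (defined $transcript) { print "$gene $transcript $nms{$transcript}\n" }
else                     { print "no match\n" }
```

With last in place, the variables hold the matching record and not simply whatever came last in the field.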

   $out[2]=$gene;
    $out[3]=$nms{$transcript};
    $out[4]=$transcript;
    $out[15]=$coding;
    $out[17]=$aa;
    $out[45]=$classification;

Remember, many of these will only contain the last iteration from the for loop.

Hopefully, I have given you something to consider.

cmccabe · February 3, 2016, 8:41pm

@Aia, please feel free to make any improvements/suggestions to any code posted by me. I am a scientist learning programming and so this is still new to me. I learn from each post and try to improve each time. Thank you very much :). I will try again tomorrow and post back.

Aia · February 3, 2016, 11:45pm

Unfortunately, I can tell you what the code is doing, but I can only guess at your intentions, and that's the hard part. If you post a few representative lines of $input_file and an example of the expected result to be saved in $output_file, I am sure many people would be happy to help. And we both might learn something along the way.


This is an example of how you might be able to insert an extra member at the header line and later output with tabs

#!/usr/bin/env perl
my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

# to be inserted after "Amino Acid Change" header member
my $nineteenth_member = "Heredity";

# insert into @header members
splice @header, 18, 0, $nineteenth_member;

$" = "\t"; # output separator for print when interpolating

# every element of @header will be separated by tab
print "@header\n";

splice works as well for inserting an array in the middle of another array

#!/usr/bin/env perl

my @numbers = qw( 1 2 3 4 );

my @fractions = qw( 3.1 3.2 3.3 3.4 );

splice @numbers, 3, 0, @fractions;

{
    local $" = ",";
    print "@numbers\n";
}

print "@numbers\n";

perl example2.pl
1,2,3,3.1,3.2,3.3,3.4,4
1 2 3 3.1 3.2 3.3 3.4 4

cmccabe · February 4, 2016, 10:12am

The basic idea of my program is that it combines multiple steps/processes into one. A set of data is input that is, for lack of a better term, not useful, so I use a SOAP API to connect to a Python tool that verifies that the input data is found in a database and converts the data into something that is useful. This data is a set of coordinates; 15 25653864 25653864 G C is an example. Those coordinates are saved as a text file that is piped into a Perl program (not the one posted) to apply meaning to the coordinates. That data is then reformatted using the Perl posted, and the process is complete.
The science behind this makes a lot of sense to me, but I am still learning about the programming aspect. Science, especially in my field of molecular genetics and genomics, is advancing very quickly and using more and more programming. Thank you for all your help :).

edit ( perl update).

#!/bin/perl
   use strict;
   my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");  # match NM and inheritance with gene
    
	# Accept the input and output files as parameters
       my $input_file = $ARGV[0];
       my $output_file = $ARGV[1];
     
       # Set the header columns to be added to the left
       # and to the right of the header in the input file
      my @left =  (
                       "Index",
                       "Chromosome Position",
                       "Gene",
                       "Inheritance",
                       "RNA Accession",
                       "Chr",
                       "Coverage",
                       "Score",
                       "A(#F,#R)",
                       "C(#F,#R)",
                       "G(#F,#R)",
                       "T(#F,#R)",
                       "Ins(#F,#R)",
                       "Del(#F,#R)",
                       "SNP db_xref",
                       "Mutation Call",
                       "Mutant Allele Frequency",
                       "Amino Acid Change"
                  );
      my @right = (
                      "HP",
                      "SPLICE",
                      "Pseudogene",
                      "Classification",
                      "HGMD",
                      "Disease",
                      "Sanger",
                      "References"
                  );
    
      # open the input file, read the header line and sandwich it
      # between @left and @right arrays
      my $final_header;
      open (FH, "<", $input_file) or die "Can't open $input_file: $!";
	  chomp(my $hdr=<FH>);

      $final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

      # final header is set, print it to the output file
      open (OF, ">", $output_file) or die "Can't open $output_file: $!";
      print OF "$final_header";
     # close (FH) or die "Can't close $output_file: $!";
	  
	   my @colsleft = map "Null",(1..$#left);	  
	   my @colsright = map "Null",(0..$#right);	 
	  	  
	  while(<FH>)  {  # puts row of input file into $_ 
	  chomp;
	  my @vals = split/\t/; # this splits the line at tabs
	  my @mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
	  my ($gene,$transcript,$exon,$coding,$aa);
	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}
		# warn join ("\t",$gene,$transcript,$exon,$coding,$aa);
		my @out=($.,@colsleft,$_,@colsright);
		$out[2]=$gene;
		$out[3]=$nms{$transcript};
		$out[4]=$transcript;
		$out[15]=$coding;
		$out[17]=$aa;
		$out[45] = "VUS";
                    print "$out[45]\n";
				
		#print OF join("\t",$.,@colsleft,$_,@colsright),"\n";   # row data is set, print it to the output file		
		print OF join("\t",@out),"\n";   # row data is set, print it to the output file			
   	 }

Output

Classification
Null

All the other $out fields populate from the split. I am not sure why "VUS" isn't populating; that string is hardcoded and does not come from the split.

Aia · February 4, 2016, 11:58pm

Hi cmccabe,

I wish you had posted at least four lines of the input file and how you expect those four lines to be reformatted as output. Without that information, the situation has not changed much.

If you are unable to do that, please take a look at these parts:

my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

What do you think is happening to @colsleft and @colsright at each iteration of the while loop?

Those are created once, outside the while loop, and the loop never modifies them, so every line of input gets the same "Null" values; @out receives a fresh copy of them on each iteration. I do not know if that's what you want.

	  for (@mutations)
	  {	
	  $_ or next; # skip if AB empty
			($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
            
			grep {$transcript eq $_} keys %nms or next;
		}

$_ or next; # skip if AB empty

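$_ will nearly always hold something inside the loop: when $vals[9] is empty, split returns an empty list and the loop body never runs. An empty $_ would need a leading or doubled comma, as in a,,b. So I do not know what your intention is, even with the comment skip if AB empty.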

split/\:/;

Remove the \; the : is not special in any way there.

Please, explain this part. What do you think that this part is doing for you?

grep {$transcript eq $_} keys %nms or next;

cmccabe · February 5, 2016, 6:05pm

The input that goes into the perl is:

Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

my @colsleft = map "Null",(1..$#left);	  
my @colsright = map "Null",(0..$#right);

provided the additional columns to the left and right. Basically, the file that is used is only 24 columns (represented by $_) and @colsleft put 18 columns before them with values of Null and @colsright put 8 additional columns after with values of Null. The total is now 50 and that is what is expected. So basically it is my @out=($.,@colsleft,$_,@colsright); unique entry, followed by 18 columns, then 24, then 8.

The split of column 10 [9] provides the values that are used in $out. grep {$transcript eq $_} keys %nms or next; is a special case that matches $transcript with the %nms defined at the beginning. This actually all seems to work.

The loop doesn't get refreshed because if there are multiple entries then they are on separate rows.

I can not seem to figure out why VUS doesn't populate in $out[45] as that is the only part that doesn't seem to work.

I hope this helps a bit and thank you for all your help :).

Aia · February 5, 2016, 8:59pm

Earlier, in post #6, I mentioned the following:

I still believe that what you expect to see is misleading you about what is really there.

$out[45] = "VUS";

At this point, "VUS" is still the 46th element of array @out.

print OF join("\t",@out),"\n";

At that point, array @out has been flattened into a single string separated by tabs.
You can no longer tell which field an element will end up in, because the string now contains additional tab-separated parts.
In fact, since $out[45] is the last element, it would actually end up at index 64 if you were to split the string again by tab.
One of the elements you put into the array contains tabs itself.
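
Here is the same effect in miniature, as a sketch with made-up values: the array has six elements, but because one element carries two tabs of its own, the joined line splits back into eight columns and "VUS" moves from index 5 to index 7.

```perl
use strict;
use warnings;

# Three array elements, but the middle one is a whole tab-separated line.
my @out = ("1", "x\ty\tz", "Null");
$out[5] = "VUS";                      # sixth element of the array

my @cols;
{
    no warnings 'uninitialized';      # slots 3 and 4 are undef on purpose
    @cols = split /\t/, join("\t", @out), -1;
}

print scalar(@out), " elements, ", scalar(@cols), " columns\n";
```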

You want to test it?
Here's your original code, with some prints to show some details. Run it with just the two lines of input you posted.

#!/bin/perl

use strict;

my %nms=("NM_004004.5"=>"AR","NM_004992.3"=>"XLD","NM_003924.3"=>"AD");
my $input_file = $ARGV[0];
my $output_file = $ARGV[1];

my @left =  (
     "Index",
     "Chromosome Position",
     "Gene",
     "Inheritance",
     "RNA Accession",
     "Chr",
     "Coverage",
     "Score",
     "A(#F,#R)",
     "C(#F,#R)",
     "G(#F,#R)",
     "T(#F,#R)",
     "Ins(#F,#R)",
     "Del(#F,#R)",
     "SNP db_xref",
     "Mutation Call",
     "Mutant Allele Frequency",
     "Amino Acid Change"
 );
my @right = (
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References"
);

my $final_header;
open (FH, "<", $input_file) or die "Can't open $input_file: $!";
chomp(my $hdr=<FH>);
$final_header = sprintf("%s\t%s\t%s\n", join("\t", @left), $hdr, join("\t",@right));

open (OF, ">", $output_file) or die "Can't open $output_file: $!";
print OF "$final_header";

my @colsleft = map "Null",(1..$#left);
print "\@colsleft = ", scalar @colsleft, ": @colsleft\n";

my @colsright = map "Null",(0..$#right);
print "\@colsright = ", scalar @colsright, ": @colsright\n";

while(<FH>)  {
chomp;
my @vals = split/\t/;
my @mutations=split/,/,$vals[9];
my ($gene,$transcript,$exon,$coding,$aa);
for (@mutations)
{
    $_ or next;
    ($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons

    grep {$transcript eq $_} keys %nms or next;
}
my @out=($.,@colsleft,$_,@colsright);
print "\n";
print 'After this evaluation: my @out=($.,@colsleft,$_,@colsright);', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

$out[2]=$gene;
$out[3]=$nms{$transcript};
$out[4]=$transcript;
$out[15]=$coding;
$out[17]=$aa;
$out[45] = "VUS";

print 'After this evaluation:  $out[45] = "VUS";', "\n";
print "\@out = ", scalar @out, ": @out\n\n";

print OF join("\t",@out),"\n";

my @after = split( "\t", join("\t", @out) );
print "\@after = ", scalar @after, ": @after\n";
}

Run as:

perl debug.pl input.file /dev/null

Output:

@colsleft = 17: Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null
@colsright = 8: Null Null Null Null Null Null Null Null

After this evaluation: my @out=($.,@colsleft,$_,@colsright);
@out = 27: 2 Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null Null 4     41748130        41748130        G       C       exonic  PHOX2B      synonymous SNV   PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null

After this evaluation:  $out[45] = "VUS";
@out = 46: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4        41748130        41748130        G       C       exonic       PHOX2B          synonymous SNV  PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G        0.0007  .       .       .       .       .       0.0005  0.0002  0.0007  . Null Null Null Null Null Null Null Null                   VUS

@after = 65: 2 Null PHOX2B AD NM_003924.3 Null Null Null Null Null Null Null Null Null Null c.C639G Null p.G213G 4 41748130 41748130 G C exonic PHOX2B  synonymous SNV PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G 0.0007 . . . . . 0.0005 0.0002 0.0007 . Null Null Null Null Null Null Null Null                   VUS

Please scroll all the way to the right to see VUS.

Again, knowing what the code does is not the problem; knowing what you expect is the hard part.
If you were to post an example of how those 2 lines are supposed to look after the process, that might help.

cmccabe · February 6, 2016, 9:12am

desired output: column [45] or classification is VUS

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs															Null	Null	Null	VUS	Null	Null	Null	Null

So if I am understanding (please correct me if I am wrong)

print OF join("\t",@out),"\n";

introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help :).

Aia · February 6, 2016, 3:37pm

cmccabe:

So if I am understanding (please correct me if I am wrong)
print OF join("\t",@out),"\n";
introduced 16 additional tabs, presumably from the @out split [9] (5 additional fields) and the @nms (3 additional fields) . Thank you for all your help :).

Yes. One element of the array contains tabs itself, so when you convert the array to a string, that element turns into several fields if you split the string by tab again.
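
One way to avoid this, sketched here under the assumption that every data row should occupy exactly the 24 input columns named in the header: put the individual fields into the array, not the whole line $_. The -1 limit keeps trailing empty fields, and the padding step is my addition for short rows. build_row and $n_input are hypothetical names; the sample row is the one posted earlier in the thread.

```perl
use strict;
use warnings;

my $n_input = 24;   # assumed number of input columns (taken from the header)

# Build one 50-element output row: 18 left + 24 input + 8 right columns.
sub build_row {
    my ($index, $line) = @_;
    my @vals = split /\t/, $line, -1;          # -1 keeps trailing empty fields
    push @vals, "" while @vals < $n_input;     # pad short rows
    my @out = ($index, ("Null") x 17, @vals[0 .. $n_input - 1], ("Null") x 8);
    $out[45] = "VUS";                          # Classification column
    return @out;
}

my $line = join "\t", 4, 41748130, 41748130, "G", "C", "exonic", "PHOX2B", "",
    "synonymous SNV", "PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G",
    "0.0007", (".") x 5, "0.0005", "0.0002", "0.0007", ".";

my @row = build_row(2, $line);
print scalar(@row), " columns, column 46 is $row[45]\n";
```

Because every element now holds exactly one field, index 45 in the array is also column 46 in the joined line.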

Here's the input you posted in #11

4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

Here's the desired output posted in #13

1	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748071	41748078	GGCCCGGG	-	exonic	PHOX2B		frameshift deletion	PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs

I have to assume you did not check one against the other, since it is not possible to produce this output from that input unless other information is missing.
Please explain where the fields highlighted in red come from, since they are not found anywhere in your posted input or code.
If the output was simply posted against the wrong input, please provide a corrected set of input and output to remove the ambiguity.

Also, please explain the extra tabs in your output file. Every ^I identifies a tab in the line.

cat -T desired_output

1^INull^IPHOX2B^IAD^INM_003924.3^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^INull^I4^I41748071^I41748078^IGGCCCGGG^I-^Iexonic^IPHOX2B^I^Iframeshift deletion^IPHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs^I^I^I^I^I^I^I^I^I^I^I^I^I^I^INull^INull^INull^IVUS^INull^INull^INull^INull

Or if you substitute tab for bar

perl -pe 's/\t/\|/g' desired_output

1|Null|PHOX2B|AD|NM_003924.3|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|Null|4|41748071|41748078|GGCCCGGG|-|exonic|PHOX2B||frameshift deletion|PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs|||||||||||||||Null|Null|Null|VUS|Null|Null|Null|Null

As it stands, your example output has 50 fields. Could you, please, confirm you want an output of 50 fields, all the time?
Here's a breakout of it:

perl -nalF"\t" -e 'for(@F){print $n, ": ", $F[$n++]}' desired_output

1: 1
2: Null
3: PHOX2B
4: AD
5: NM_003924.3
6: Null
7: Null
8: Null
9: Null
10: Null
11: Null
12: Null
13: Null
14: Null
15: Null
16: Null
17: Null
18: Null
19: 4
20: 41748071
21: 41748078
22: GGCCCGGG
23: -
24: exonic
25: PHOX2B
26:
27: frameshift deletion
28: PHOX2B:NM_003924.3:exon3:c.691_698del:p.G231fs
29:
30:
31:
32:
33:
34:
35:
36:
37:
38:
39:
40:
41:
42:
43: Null
44: Null
45: Null
46: VUS
47: Null
48: Null
49: Null
50: Null
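
The same numbering can be written as a small standalone script with an explicit index, which may be easier to adapt than the one-liner. This is only a sketch: number_fields is a hypothetical helper name and the sample line is shortened.

```perl
use strict;
use warnings;

# Return "N: value" strings, numbering fields from 1.
sub number_fields {
    my ($line) = @_;
    my @fields = split /\t/, $line, -1;   # -1 keeps trailing empty fields
    return map { sprintf "%d: %s", $_ + 1, $fields[$_] } 0 .. $#fields;
}

print "$_\n" for number_fields("1\tNull\tPHOX2B\t\tVUS");
```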

Please, explain what would produce fields number 22, 23 and 27 and 26, 29 to 42.
Would you like those tab-empty fields to be Null?

Another question, concerning your code:

my @mutations=split/,/,$vals[9];

$vals[9] contains PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G according to your input. It contains no commas, so there is nothing to split.
Can you explain that? Are there any lines that would have something like:

PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G,PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G

And if so how would you like to handle them?

Thank you.

cmccabe · February 8, 2016, 10:46am

I apologize, I put the wrong output file for the input previously posted.

input

Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference
4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.

output

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
2	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.					Null	Null	Null	Null	Null	Null	Null	Null

field headers where info comes from

1: 1                   (Index)
2: Null                (Chromosome)
3: PHOX2B          (Gene)
4: AD                 (Inheritence)
5: NM_003924.3   (RNA Accession)
6: Null                (Chr)
7: Null                (Coverage)
8: Null                (Score)
9: Null                (A(#F,#R)
10: Null              (C(#F,#R)
11: Null              (G(#F,#R)
12: Null              (T(#F,#R)
13: Null              (Ins(#F,#R)
14: Null              (Del(#F,#R)
15: Null              (SNP db_xref)
16: c.C639G        (Mutation Call)
17: Null              (Mutant Allele Frequency)
18: G213G          (Amino Acid Change)
19: 4                 (Chr)
20: 41748130      (Start)
21: 41748130      (Stop)
22: G                 (Ref)
23: C                 (Alt)
24: exonic          (Func.refGene)
25: PHOX2B        (Gene.refGene)
26:                    (GeneDetail.refGene)
27: synonymous   (ExonicFunc.refGene)
28: PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G (AAChange.refGene) - used for the split to get values in 3,4,5,16, and 18) - this  split uses the @nms to only use the record in this field that starts with the same NM_ as the @nms (this field can have multiple records in it, so to ensure I get the correct one I use @nms and only return that matching value)

29:    (PopFreqMax)
30:    (1000G2012APR_ALL)
31:    (1000G2012APR_AFR)
32:    (1000G2012APR_AMR)
33:    (1000G2012APR_ASN)
34:    (1000G2012APR_EUR)
35:    (ESP6500si_ALL)
36:    (ESP6500si_AA)
37:    (ESP6500si_EA)
38:    (CG46)
39:    (common)
40:    (clinvar)
41:    (clinvarsubmit)
42:    (clinvarreference)
43: Null   (HP)
44: Null   (Splice)
45: Null   (Pseudogene)
46: VUS   (Classification) - currently not showing up (Null is)
47: Null   (HGMD)
48: Null   (Disease)
49: Null   (Sanger)
50: Null   (References)

I did not intend the extra tabs, nor do I know why they are there.

The Perl that is used to populate this column only allows the format with : in it, so commas should not show up.

Thank you for all your help :).

Aia · February 8, 2016, 9:56pm

Please give it a try.
You can modify it to your liking.

#!/usr/bin/env perl
# reformat.pl
use strict;
use warnings;

my %nms = (
    "NM_004004.5" => "AR",
    "NM_004992.3" => "XLD",
    "NM_003924.3" => "AD",
);

my $readf = shift || die "Missing input file\n";
my $writef = shift || die "Missing output file\n";

my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

open my $in, '<', $readf or die "Cannot open $readf: $!\n";
open my $out, '>', $writef or die "Cannot create $writef: $!\n";

my $add2header;
chomp( $add2header = <$in> );
splice @header, 18, 0, $add2header;
save(@header);
$.= 0; # reset lines count to remove header
while( <$in> ) {
    chomp;
    my @ruler = (("Null")x17, ("")x25, ("Null")x8);
    my @fields = split "\t";
    my $len = @fields;
    splice @ruler, 17, $len, @fields;
    my ($gene, $transcript, $exon, $coding, $aa) = split ":", $fields[9];
    $ruler[0] = $.;
    $ruler[2] = $gene;
    $ruler[3] = $nms{$transcript};
    $ruler[4] = $transcript;
    $ruler[15] = $coding;
    $ruler[17] = $aa;
    $ruler[45] = "VUS";
    save(@ruler);
}

sub save {
    local $" = "\t";
    print $out "@_\n";
}

close $in;
close $out;
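
A note on the offsets in the listing above: the header has 18 left-hand columns, so the input fields arguably belong at index 18. Splicing at 17 means that $ruler[17] = $aa later overwrites the first input field (Chr). Below is a hedged sketch of the row-building step with the offset moved, assuming 24 input columns. make_row is a hypothetical helper name, and the guard for a missing tenth field is my addition so that short rows do not fail.

```perl
use strict;
use warnings;

my %nms = ("NM_004004.5" => "AR", "NM_004992.3" => "XLD", "NM_003924.3" => "AD");

# Hypothetical helper: build one 50-column row from a line number and a raw line.
sub make_row {
    my ($lineno, $line) = @_;
    my @ruler  = (("Null") x 18, ("") x 24, ("Null") x 8);
    my @fields = split /\t/, $line, -1;
    splice @ruler, 18, scalar(@fields), @fields;   # input starts at column 19

    # The tenth field may be missing on short rows, so fall back to "".
    my ($gene, $transcript, $exon, $coding, $aa) =
        split /:/, (defined $fields[9] ? $fields[9] : "");

    $ruler[0] = $lineno;
    if (defined $transcript && exists $nms{$transcript}) {
        @ruler[2, 3, 4, 15, 17] = ($gene, $nms{$transcript}, $transcript, $coding, $aa);
    }
    $ruler[45] = "VUS";
    return @ruler;
}

# A short five-field row still yields 50 columns with VUS in column 46.
my @row = make_row(1, join("\t", 4, 41748130, 41748130, "G", "C"));
print join("\t", @row), "\n";
```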

cmccabe · February 13, 2016, 11:43am

I apologize for the delay; I only just got to test the Perl using the input from post 15. The results look the same as before, with VUS appearing after the "Null" values:
Thank you for all your help :).

Index	Chromosome Position	Gene	Inheritance	RNA Accession	Chr	Coverage	Score	A(#F,#R)	C(#F,#R)	G(#F,#R)	T(#F,#R)	Ins(#F,#R)	Del(#F,#R)	SNP db_xref	Mutation Call	Mutant Allele Frequency	Amino Acid Change	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	ExonicFunc.refGene	AAChange.refGene	PopFreqMax	1000G2012APR_ALL	1000G2012APR_AFR	1000G2012APR_AMR	1000G2012APR_ASN	1000G2012APR_EUR	ESP6500si_ALL	ESP6500si_AA	ESP6500si_EA	CG46	common	clinvar	clinvarsubmit	clinvarreference	HP	SPLICE	Pseudogene	Classification	HGMD	Disease	Sanger	References
2	Null	PHOX2B	AD	NM_003924.3	Null	Null	Null	Null	Null	Null	Null	Null	Null	Null	c.C639G	Null	p.G213G	4	41748130	41748130	G	C	exonic	PHOX2B		synonymous SNV	PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G	0.0007	.	.	.	.	.	0.0005	0.0002	0.0007	.					Null	Null	Null	Null	Null	Null	Null	Null																																						VUS

Aia · February 13, 2016, 4:57pm

This is the result I get when I run the code I posted in #16 against the input you posted in #15.

Index   Chromosome Position Gene    Inheritance RNA Accession   Chr Coverage    Score   A(#F,#R)    C(#F,#R)    G(#F,#R)    T(#F,#R)    Ins(#F,#R)  Del(#F,#R)  SNP db_xref Mutation Call   Mutant Allele Frequency Amino Acid Change   Chr Start   End Ref Alt Func.refGene    Gene.refGene    GeneDetail.refGene  ExonicFunc.refGene  AAChange.refGene    PopFreqMax  1000G2012APR_ALL    1000G2012APR_AFR    1000G2012APR_AMR    1000G2012APR_ASN    1000G2012APR_EUR    ESP6500si_ALL   ESP6500si_AA    ESP6500si_EA    CG46    common  clinvar clinvarsubmit   clinvarreference    HP  SPLICE  Pseudogene  Classification  HGMD    Disease Sanger  References
1   Null    PHOX2B  AD  NM_003924.3 Null    Null    Null    Null    Null    Null    Null    Null    Null    Null    c.C639G Null    p.G213G 41748130    41748130    G   C   exonic  PHOX2B      synonymous SNV  PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G    0.0007  .   .   .   .   .   0.0005  0.0002  0.0007  .                       Null    Null    Null    VUS Null    Null    Null    Null

As you can see, VUS is in the right place, which leads me to believe that there is a discrepancy between what you posted and what you used for input.
There is also a discrepancy in the first field of the second line: my code would output a 1 (meaning line 1), while your output shows a 2.
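
One pattern that would produce a row with a run of empty fields followed by VUS is sketched below. It assumes the row is built the way the script in the first post builds it, with the whole input line stored as a single array element; the variable contents are made-up stand-ins, not the real data. Assigning to index 45 of an array that is shorter than that makes Perl pad the gap with undefined elements, and those print as empty fields.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the arrays in the first post's script.
my @colsleft  = ("Null") x 17;
my @colsright = ("Null") x 8;
my $line      = join "\t", 1 .. 24;    # one input row, still a single string

# The whole line is ONE element here, so @out has 1 + 17 + 1 + 8 = 27 elements.
my @out    = (1, @colsleft, $line, @colsright);
my $before = @out;

# Index 45 is past the end: Perl extends the array and fills the gap with undef.
$out[45] = "VUS";
my $padding = grep { !defined $_ } @out;

# Splitting the line first gives one element per column, so index 45 is in range.
my @flat       = (1, @colsleft, split(/\t/, $line, -1), @colsright);
my $flat_count = @flat;
$flat[45] = "VUS";

print "single-element row: $before elements, $padding padded with undef\n";
print "flattened row: $flat_count elements, column 45 is $flat[45]\n";
```

If that is what happened, it would also be consistent with the 2 in the first field, since a script that never resets $. counts the header line as line 1.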

If you would like to continue troubleshooting it, all I can offer is what your input looks like when reformatted to show tabs.

perl -pe 's/\t/\[TAB\]/g' new_cmccabe_input

Chr[TAB]Start[TAB]End[TAB]Ref[TAB]Alt[TAB]Func.refGene[TAB]Gene.refGene[TAB]GeneDetail.refGene[TAB]ExonicFunc.refGene[TAB]AAChange.refGene[TAB]PopFreqMax[TAB]1000G2012APR_ALL[TAB]1000G2012APR_AFR[TAB]1000G2012APR_AMR[TAB]1000G2012APR_ASN[TAB]1000G2012APR_EUR[TAB]ESP6500si_ALL[TAB]ESP6500si_AA[TAB]ESP6500si_EA[TAB]CG46[TAB]common[TAB]clinvar[TAB]clinvarsubmit[TAB]clinvarreference
4[TAB]41748130[TAB]41748130[TAB]G[TAB]C[TAB]exonic[TAB]PHOX2B[TAB][TAB]synonymous SNV[TAB]PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G[TAB]0.0007[TAB].[TAB].[TAB].[TAB].[TAB].[TAB]0.0005[TAB]0.0002[TAB]0.0007[TAB].

Please run the same command on the two input lines you used and compare the result against these. They should be the same, since I am using what you posted.

Note:
I am assuming that you have made sure this input comes from a file with proper Unix line endings and not MS-DOS ones.
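
A minimal sketch of how the line endings could be checked, in case it helps. With the default input record separator, chomp removes only the trailing newline, so a file saved with DOS line endings leaves a carriage return attached to the last field. The helper names has_crlf and strip_eol are made up for this example, and the sample lines are shortened.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line ends with a DOS-style CR LF pair.
sub has_crlf {
    my ($line) = @_;
    return $line =~ /\r\n\z/ ? 1 : 0;
}

# Hypothetical helper: remove either kind of line ending.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

# Shortened sample lines, one with each kind of ending.
my $dos  = "4\t41748130\tC\r\n";
my $unix = "4\t41748130\tC\n";

printf "dos: crlf=%d, unix: crlf=%d\n", has_crlf($dos), has_crlf($unix);
print "same after stripping\n" if strip_eol($dos) eq strip_eol($unix);
```

Using something like strip_eol in place of chomp would make the reformatting script indifferent to where the file was saved.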

cmccabe · February 15, 2016, 10:56am

The input has proper Unix line endings but is slightly different than what I posted: it has only 5 fields. I apologize for the oversight.

input

4	41748130	41748130	G	C

perl -pe 's/\t/\[TAB\]/g' input

4[TAB]41748130[TAB]41748130[TAB]G[TAB]C

The additional information is populated from those 5 fields most of the time. A small percentage of the time [9] will be Null and needs to be skipped; that is what "$_ or next;" was supposed to do in the original code. [45] is still "VUS", however. Thank you :).

Aia · February 15, 2016, 7:11pm

The code would have complained with "Use of uninitialized value in split" if it encountered such a short input.
Here's the previous code, modified to accommodate the small percentage of cases where the input does not have a "PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G" string:

#!/usr/bin/env perl
# reformat.pl
use strict;
use warnings;

my %nms = (
    "NM_004004.5" => "AR",
    "NM_004992.3" => "XLD",
    "NM_003924.3" => "AD"
);

# both file names are required command-line arguments
my $readf = shift || die "Missing input file\n";
my $writef = shift || die "Missing output file\n";

my @header = (
    "Index",
    "Chromosome Position",
    "Gene",
    "Inheritance",
    "RNA Accession",
    "Chr",
    "Coverage",
    "Score",
    "A(#F,#R)",
    "C(#F,#R)",
    "G(#F,#R)",
    "T(#F,#R)",
    "Ins(#F,#R)",
    "Del(#F,#R)",
    "SNP db_xref",
    "Mutation Call",
    "Mutant Allele Frequency",
    "Amino Acid Change",
    "HP",
    "SPLICE",
    "Pseudogene",
    "Classification",
    "HGMD",
    "Disease",
    "Sanger",
    "References",
);

open my $in, '<', $readf or die "Cannot open $readf: $!\n";
open my $out, '>', $writef or die "Cannot create $writef: $!\n";

my $add2header;
chomp( $add2header = <$in> );
splice @header, 18, 0, $add2header;
save(@header);

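$. = 0;    # restart the line counter so the first data row is numbered 1
while( <$in> ) {
    chomp;
    # 50 columns: 18 left header columns, 24 input columns, 8 right header columns
    my @ruler = (("Null")x18, ("")x24, ("Null")x8);
    my @fields = split /\t/;
    # skip rows that lack the gene:transcript:exon:coding:aa string in field 9
    if($fields[9]) {
        my $len = @fields;
        # the input columns start at index 18, right after "Amino Acid Change"
        splice @ruler, 18, $len, @fields;
        my ($gene, $transcript, $exon, $coding, $aa) = split /:/, $fields[9];
        $ruler[0] = $.;
        $ruler[2] = $gene;
        $ruler[3] = $nms{$transcript};
        $ruler[4] = $transcript;
        $ruler[15] = $coding;
        $ruler[17] = $aa;
        $ruler[45] = "VUS";    # "Classification" column
        save(@ruler);
    }
}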

sub save {
    local $" = "\t";
    print $out "@_\n";
}

close $in;
close $out;

Nevertheless, that will not do anything to resolve your input discrepancy.
Did you compare the input that produced the defective reformatted output with the one you posted previously?
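
A quick way to make that comparison is to count the tab-separated fields on each line. The sketch below is only an illustration: field_count is a made-up helper, and the sample rows are abbreviated copies of the rows posted in this thread rather than a real input file.

```perl
use strict;
use warnings;

# Hypothetical helper: number of tab-separated fields in one line.
# The -1 limit keeps trailing empty fields, which a plain split would drop.
sub field_count {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

# Abbreviated sample rows: an annotated row and the short 5-field row.
my $full  = "4\t41748130\t41748130\tG\tC\texonic\tPHOX2B\t\tsynonymous SNV\t"
          . "PHOX2B:NM_003924.3:exon3:c.C639G:p.G213G\t0.0007";
my $short = "4\t41748130\t41748130\tG\tC";

for my $row ($full, $short) {
    printf "%2d fields: %s\n", field_count($row), substr($row, 0, 30);
}
```

Running the same count over both files would show right away whether the rows that reach the script carry field 9 at all.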