Need help comparing Base Pairs within PERL

drossy · June 5, 2012, 9:46am

Hi I have a multi-step project I am working on and have been finding it difficult to come up with the correct approach.
The data I have been given resembles:

 
Index      Chr       Genotype   Mutation Type
1           Chr1           TT            Intronic
2           Chr1           AA            Exonic
3           Chr1           AG            Exonic
4           Chr1           CC            Frameshift
5           Chr1           CA            Intronic
...         ...              ....

My goal here is to compare a large file such as this one to a set of references that are a single letter (T,A,C,G). I need to split the genotype in the given file into two individual letters and then compare this to the reference. If either matches the reference, then the program should move on to the next reference. If, however, neither letter is consistant, then the program should give an output of the index number and mutation type that corresponds to that data.
I am still fairly new to this so all help would be greatly appreciated.
Thanks!

bartus11 · June 5, 2012, 10:04am

Can you post sample data and desired output?

drossy · June 5, 2012, 10:23am

Yes sorry for not specifying more.
Say the given reference was:

 
Index 
1)T
2)A
3)C
4)G
5)C

and that the matix in my first post remains the same, indexes 3 and 4 would not have an allele in their genotype that matches that of the reference for that position. My desired output in this situation would be:

 
3   Chr1   Exonic
4   Chr1   Frameshift

I hope this explanation is more useful.
Thanks again

ddreggors · June 5, 2012, 10:32am

If I understand correctly, you will want something like this:

#!/usr/bin/perl


use strict;
use warnings;

my ($index, $chr, $geno, $mutation);
my $ref = "A";
open(FILE,"<","file.txt") or die $!;
while (<FILE>) {
	next unless $_ !~ /Index/;
	next unless $_ =~ /^(\d*)\s*([a-z]*\d)\s*([a-z]*)\s*([a-z]*)/i;
	$index = $1;
	$geno = $3;
	next unless $geno !~ /$ref/;
	print "Genotype '" . $geno . "' (Index: " . $index . ") does not match the reference type '" . $ref . "'.\n";
}

Which will result in this:

Genotype 'TT' (Index: 1) does not match the reference type 'A'.
Genotype 'CC' (Index: 4) does not match the reference type 'A'.

Obviously, I have printed my own formatted line, however you could simply use the '$' to leave as is ( print $ . "\n"; ).

drossy · June 5, 2012, 10:41am

Thank you ddreggors for your help. It is similar to that but the reference is not the same for each index. For example, for index 1, 'T' is the reference and thus the genotype, 'TT', for that index fits the requirement of having one or more alleles that matches the reference. In index 3 and 4, however, the reference alleles are 'C' and 'G' respectively, and and both of these cases, the coordinating genotype does not have an allele that matches the reference. It is in these cases that I wish for output.
Sorry if I had not specified that well

ddreggors · June 5, 2012, 1:51pm

@drossy

I am not exactly sure I follow but that's ok. Fortunately for us, I do not have to understand genome reference indexes or alleles for that matter to understand logic.

Using the example I have given you, you should be able see that while I set $ref as a static value (A), you can follow the exact logic inside the while loop to pull a value or array of values from another file. If all needed values exist in "THIS" file, then I have already given you what you need.

Example:

next unless $_ =~ /^(\d*)\s*([a-z]*\d)\s*([a-z]*)\s*([a-z]*)/i;
$index = $1;
$geno = $3;

these lines do something very nice for you, namely it grabs all text and splits it into separate variables delimiting on white space (or multiple white space) characters.All text surrounded in parenthesis are "kept".

You can see that I have given friendly names to the index and genotype columns, but $2 would contain the "Chr" column and $4 would contain the "Mutation Type" column with that regex match. You can easily give them friendly names to reuse as well...

Example:

next unless $_ =~ /^(\d*)\s*([a-z]*\d)\s*([a-z]*)\s*([a-z]*)/i;
$index = $1;
$chr = $2;
$geno = $3;
$mutation = $4;

Going a bit further, if you need to split geno into 2 separate characters you can now take the $geno variable and do something like this:

Example:

my ($geno1,$geno2) = split(undef,$geno);

The framework is all here, for you to do everything you want now, but for me (or others) to give you a better solution it would require a more logical approach in explaining the problem I fear.

Maybe I am slow, but I do not see (based on your explanation) the correlation you are trying to make with these references/alleles.

You say:

What is considered a "coordinating genotype?
What are the reference alleles?
Last letter in the pair of 2 is reference?

More precisely it would be easier to phrase as:

The second letter in that column must match the first.

cheers

drossy · June 6, 2012, 10:09am

I still don't believe I am explaining it correctly:
The index number for the given data must match that of the reference.
Data:

Index      Chr       Genotype   Mutation Type
1           Chr1           TT            Intronic
2           Chr1           AA            Exonic
3           Chr1           AG            Exonic
4           Chr1           CC            Frameshift
5           Chr1           CA            Intronic

Reference:

1)T
2)A
3)C
4)G
5)C

For example: The genotype at position 1 is TT and the reference is T. Because atleast one of the letters in the genotype matches the reference, then the program should move on.
At position 3, the genotype is AG and the reference is C. Neither of the letters that make up the genotype match the reference, so the program should take note of this and display '3 Chr1 AG Exonic' in the output.

Thanks again

ddreggors · June 6, 2012, 10:35am

OK, so the reference will always be static?

As in this is ALWAYS the reference:

Reference:
1)T
2)A
3)C
4)G
5)C

drossy · June 6, 2012, 10:47am

Yes but the numbers continue into the millions so itd be easier to set the reference as the entire filename instead I believe.

ddreggors · June 6, 2012, 11:03am

OK getting closer, but I am still unsure of *where* the references come from.

If they were static like you posted that is easy. Now you mention a filename:

What file name?

I cannot find anywhere in this post where you have mentioned getting the references from filenames...

If there is more than one file involved please specify that, if the *names* of the file are important then also point that out and specify the names.

We need a sample of *ALL* data you are working with (even if bogus but representative), and a description of what columns, fields, or characters you want to match on.

Beyond that we can only work with what we are given and your results may not be helpful.

EDIT:
Going on the static reference you gave I come up with this:

#!/usr/bin/perl


use strict;
use warnings;

my ($idx, $chr, $geno, $mutation);
my @ref = ('T', 'A', 'C', 'G', 'C');
open(FILE,"<","file.txt") or die $!;
while (<FILE>) {
        next unless $_ !~ /Index/;
        next unless $_ =~ /^(\d*)\s*([a-z]*\d)\s*([a-z]*)\s*([a-z]*)/i;
        $idx = $1 - 1;
        $geno = $3;
        next unless $geno !~ /$ref[$idx]/;
        print $_;
}

RESULTS:

> ./test.pl   
3           Chr1           AG            Exonic
4           Chr1           CC            Frameshift

Maybe that will help you get further in your solution...

drossy · June 6, 2012, 11:29am

Okay I'll start over and try to be more exact.
I am looking at a file that has about 3 million rows and about 100 columns. The row number is given by the index number, in the 1st column, as previously shown. The genotypes that I need to look at, 'TT' or 'GG', for example, are located in the 88th column. The last column of importance is the 6th in which it gives the name of the mutation, such as 'intronic'.
In a seperate file, there are two columns. The first being the index number that ranges from 1 to about 3 million, and the second has static values for the references, i.e 'T' or 'G'.
I would like to compare the letters in index 1 of the first file with index 1 of the second file, and so on and so forth.
Now I am changing one aspect of this so I apologize.
If either one of the letters from the first file at index 1 matches the letter from the second file at index 1, I would like for the 1st column and 6th column of the first file to be printed out.

For example:

 
File #1
Column 1     ....        Column 6     .....   Column 88
(Index)                   (mutation)            (Genotype)
1                             Intronic                 TT
2                             Frameshift             GT
3                             Exonic                   AT
4                             Exonic                   AA
5                             Intronic                 GC
 
File #2
Column 1      Column 2
(index)          (reference letter)
1                      A
2                      C
3                      C
4                      A
5                      G

Output: Since at index 4 and 5, one of the letters in the genotype in file 1 match the letter from file 2, i would like the following to be displayed:

 
4 Exonic 
5 Intronic

I hope this is more helpful

ddreggors · June 6, 2012, 1:16pm

OK, now I can do that. I understand now and will give you an example in a minute...

---------- Post updated at 01:05 PM ---------- Previous update was at 11:34 AM ----------

OK given the following files:

FILE1

Column 1     ....        Column 6     .....   Column 88
(Index)                   (mutation)            (Genotype)
1                             Intronic                 TT
2                             Frameshift             GT
3                             Exonic                   AT
4                             Exonic                   AA
5                             Intronic                 GC

FILE2

Column 1      Column 2
(index)          (reference letter)
1                      A
2                      C
3                      C
4                      A
5                      G

I have written this:

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);


# file1.text is the actual data file
# We want to match data lines to reference lines
# So first we place the data lines (not columns) into an array.

open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        # If the line does not start with a number
        # we skip this line
        next unless $_ =~ /^\d/;
        # split the line into index (idx), and all else is placed in tmp
        ($idx, $tmp) = split(/\s+/,$_,2);
        # Populate the data array with the line minus the index column
        $data[$idx] = $tmp;
}


# file2.txt is the referece file with only 2 columns
# We parse the file and split it into index and value pairs.
# Then we can use the index to match the data index and
# once we have that data we can begin to break it down to
# it's column components and match as needed/
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        # Split the index and data
        /^(\d*)\s*([a-z]*)/i;
        # Split the data line columns by spaces and 
        # place these columns into a new temp array  
        my @tmparr = split(/\s+/,$data[$1]);
        # Now we can look directly at 1 (or other) column for testing
        # column 1 in my case but column 86 in yours 
        # Column 88 becomes column 86 because we -1 for removed index in first loop above and we -1 for 0 based array
        # My file1.txt had 3 colmns, removing the index leaves 2 columns, and a zero based array means we have column 0 and 1.
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}

and the result is:

> ./test.pl   
1       Intronic                 TT
2       Frameshift             GT
3       Exonic                   AT

---------- Post updated at 01:16 PM ---------- Previous update was at 01:05 PM ----------

Just a quick note, without all my comments this is not a large script either:

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);
open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[1] !~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}

drossy · June 6, 2012, 2:28pm

Thank you so much its running flawlessly for me.
Is there any quick alteration that could be made so that instead of reporting back the cases in which neither letter is the same as the reference, it would instead report all cases where at least one of the letters DOES match the reference?
Sorry for changing up, and i appreciate your help a ton.

ddreggors · June 6, 2012, 2:51pm

As in the opposite of what it is doing now?

If so then the next to last line in the second loop can be changed as follows...

Original:

next unless $tmparr[1] !~ /$2/;

Changed:

next unless $tmparr[1] =~ /$2/;

BTW my original code was only really good for 3 columns in the data file.
Here is the same code working for 88 columns as you stated...

#!/usr/bin/perl

use strict;
use warnings;
my ($idx, $tmp, @data);
open(FILE1,"<","file3.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
close FILE1;

open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[86] !~ /$2/;
        print $1 . "\t" . $tmparr[4] . "\t" . $tmparr[86] . "\n";
}
close FILE2;

again changing the next to last line of the second loop from "!~" to "=~" to get the opposite behavior.

NOTE:
I cleaned up the code a bit, I removed the unused variables ($geno & @ref) and also now I close the file handles (FILE1 & FILE2).

drossy · June 6, 2012, 3:10pm

when I make the change to '=~' i get the following errors:

 
Use of uninitialized value $1 in concatenation (.) or string at ddreggors.pl line 26, <FILE2> line 5.
Use of uninitialized value $1 in array element at ddreggors.pl line 26, <FILE2> line 5.
Use of uninitialized value in concatenation (.) or string at ddreggors.pl line 26, <FILE2> line 5.
Use of uninitialized value $1 in concatenation (.) or string at ddreggors.pl line 26, <FILE2> line 6.
Use of uninitialized value $1 in array element at ddreggors.pl line 26, <FILE2> line 6.
Use of uninitialized value in concatenation (.) or string at ddreggors.pl line 26, <FILE2> line 6.

Do you know what could be causing this?
Thanks again

ddreggors · June 6, 2012, 3:27pm

can you send me your code, or at minimum line 26 of your code?

---------- Post updated at 03:21 PM ---------- Previous update was at 03:19 PM ----------

Is that all you have changed?
When you change it back it works?

---------- Post updated at 03:27 PM ---------- Previous update was at 03:21 PM ----------

Never mind I see it....

I had to assign the 2 reference file columns to a variable and use the variable then it it works. Somehow after the match line $1 was getting unset, so the print line was failing on $1.

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, @data);
open(FILE1,"<","file3.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        my $tmpref = $2;
        my $tmpidx = $1;
        next unless $tmparr[86] =~ /$tmpref/;
        print $tmpidx . "\t" . $tmparr[4] . "\t" . $tmparr[86] . "\n";
}

drossy · June 6, 2012, 3:29pm

This is what it currently is:

#!/usr/bin/perl


use strict;
use warnings;
my ($idx, $tmp, $geno, @ref, @data);
open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        ($idx, $tmp) = split(/\s+/,$_,2);
        $data[$idx] = $tmp;
}
open(FILE2,"<","file2.txt") or die $!;
while (<FILE2>) {
        chomp;
        next unless $_ =~ /^\d/;
        /^(\d*)\s*([a-z]*)/i;
        my @tmparr = split(/\s+/,$data[$1]);
        next unless $tmparr[1] =~ /$2/;
        print $1 . "\t" . $data[$1] . "\n";
}

ddreggors · June 7, 2012, 9:24am

I found the issue.
See my above comment... I updated it to show what I did.

---------- Post updated at 04:09 PM ---------- Previous update was at 03:31 PM ----------

Actually, now that we have solved the original code... there is a better way to do this.
Here is a way smaller piece code that removes the need for second array (@ref) and I nest the loops.

#!/usr/bin/perl


use strict;
use warnings;
my (@data, $tmpidx, $tmpref);
open(FILE1,"<","file1.txt") or die $!;
while (<FILE1>) {
        chomp;
        next unless $_ =~ /^\d/;
        @data =  split(/\s+/,$_);
        open(FILE2,"<","file2.txt") or die $!;
        while (<FILE2>) {
                /^(\d*)\s*([a-z]*)/i;
                $tmpidx = $1;
                $tmpref = $2;
                next unless $_ =~ /^\d/;
                next unless $data[0] == $tmpidx;
                next unless $data[87] =~ /$tmpref/i;
                print $data[0] . "\t" . $data[5] . "\t" . $data[87] . "\n";
        }
        close FILE2;
}
close FILE1;

Result:

[user@host ~]$ ./test3.pl    
4       Exonic  AA
5       Intronic        GC

---------- Post updated 06-07-12 at 09:24 AM ---------- Previous update was 06-06-12 at 04:09 PM ----------

OK, after a bit more tweaking, here is the smallest and cleanest I could get the code.

#!/usr/bin/perl

use strict;
use warnings;
my (@ref, @data, $refidx, $ref, $row, @dataline, $refline);
open(FILE1,"<","file1.txt") or die $!; @data = <FILE1>; close(FILE1);
open(FILE2,"<","file2.txt") or die $!; @ref = <FILE2>; close(FILE2);
foreach $row (@data) {
        next unless $row =~ /^\d/;
        foreach $refline (@ref) {
                next unless $refline =~ /^\d/;
                ($refidx, $ref) = split('\s*', $refline);
                @dataline = split('\s',$row);
                next unless $dataline[0] == $refidx;
                next unless $dataline[87] =~ /$ref/i;
                print $dataline[0] . "\t" . $dataline[5] . "\t" . $dataline[87] . "\n";
        }
}

It weighs in at 18 lines, the files are read into arrays and closed immediately.

drossy · June 7, 2012, 12:00pm

Thank you so much that is working great for me and I really appreciate the help.
Hopefully one my last questions but I now have a similar situation but the data and reference are both in the same file. For example:

Column 1     ....        Column 6     .....   Column 88      Column 89
(Index)                   (mutation)            (Genotype)      Reference
1                             Intronic                 TT                 A
2                             Frameshift             GT                 C
3                             Exonic                   AT                 C
4                             Exonic                   AA                 A
5                             Intronic                 GC                 G

So now rather than comparing the data to column 2 in another file, I would like to compare it to column 89 of the same file.
Is there a quick fix for this?
Thanks again!

ddreggors · June 7, 2012, 1:54pm

That is actually easier and least memory/disk intensive.

Here is a quick modification (untested) that should work...

#!/usr/bin/perl

use strict;
use warnings;
my (@data, $row, @dataline);
open(FILE1,"<","file1.txt") or die $!; @data = <FILE1>; close(FILE1);
foreach $row (@data) {
        next unless $row =~ /^\d/;
        @dataline = split('\s',$row);
        next unless $dataline[87] =~ /$dataline[88]/i;
        print $dataline[0] . "\t" . $dataline[5] . "\t" . $dataline[87] . "\n";
}

If you have any issues let me know and I will change as needed.