Determining number of overlaps between two files using Hashes?

Hi there,

I have a doubt about how to set this up. This is the situation.

I have two files, one that is ~31,000 in length and has the following information (7 fields):
file1

1    +       100208127       100261594       6       100208127,100231680,100237404,100245177,100249508,100260529,    100208306,100231885,100237559,100245300,100249677,100261594,
1    +       100217082       100217185       1       100217082,      100217185,
1    +       100276376       100321515       12      100276376,100288052,100296809,100298021,100299978,100306120,100306616,100307757,100315308,100316594,100318639,100320146,        100276460,100288148,100296872,100298149,100300093,100306339,100306730,100307829,100315421,100316692,100318803,100321515,

the 5th field is important and it explains the number of segments represented in fields 6 and 7. So for example, the first line shows 6, so if you took the first number of field 6 this would represent the start of the first segment and the first number of field 7 would represent the end of the first segment, and so on till you have the total 6 segments. The second line for example shows only 1 in field 5 and hence there's only one segment starting at 100217082 and ending at 100217185.

the second file I have is variable in length and can be from 3,000,000 to 10,000,000 lines. The format contains 4 fields:
file2

1    100208130       100208166       +
1    100208310       100208346       +
1    100217090       100217126       +
1    100231689       100231725       +

As you can see, field 2 and 3 is just a difference of 36 numbers and I want to know how many times each line in file2 is contained within file1 specifically when looking at the segments (remember each line in file1 has different numbers of segments above, e.g. 6, 1, and 12 as represented in field 5).

So if I use these two files to generate my output, my output would tell me:

There are 3 lines from file2 that matches or overlaps segments in file1 and 1 line from file2 that DOESNOT match or overlap segments in file1.

YES 1    100208130       100208166       +
NO 1    100208310       100208346       +
YES 1    100217090       100217126       +
YES 1    100231689       100231725       +

To get this kind of computation, do you think it's important to use hashes for the first file or second file and if so, how would I set this up? Can someone assist here? Thanks!

does anyone know if this is doable? I'm really at a lost here. this is what i have so far:

#!/usr/bin/perl -w
use strict;


my %hash;


my $file = 'file1';
open my $fh, "<", $file or die "Can't open $file: $!";
while (<$fh>) {
	chomp;
	my @field = split /\t/;
	my @start = split /,/, $field[5];
	my @end = split /,/, $field[6];
	
}

once I have these starts/ends stored, how would you compare file2 with it?

Are there any other comparisons or output expectations you have not mentioned - because you maybe thought they did not matter?

Hashing is meant for a lookup for a match, not necessarily finding something numerically in a range or something that falls in between two values.

The first thing you must do is to undo the complexity of the line structures and then sort them if you want the process to complete in reasonable time.

I don't want to try anything until I'm sure we won't get into a feedback loop: 'Now I need this...'

I once wrote a script that would do a comparison of two files for reporting on changes to a file (like diff, but with output for phb's)
It would open each file into an array.
As it stepped through the first array, it would do any specific line analysis required, then it would go through the second array line by line - looking for matches (or !matches).
It would write that data to an output array.

Next, it would step through the second array, it would do any specific line analysis required, then step through the first array line by line - again looking for matches (or !matches). It would write that data to a different output array.

These two arrays had the smaller subset of results that I required, and could be cross matched via a third function.

It's not the nicest way to do this sort of thing, but it did allow me to resolve my requirement in the shortest amount of time, without killing the system.

In this case, you have an array that is composed of 5 or 12 or 1 number pairs.

while (@file1Array)  #open the file
{
   chomp;
   @lineFile1Array = split (/\t/,@_);   #split the line into temporary array elements
   @tempStart = split (/,/,$lineArray[5]);
   @tempEnd = split (/,/,$lineArray[6]);
   while $count >= $lineArray[4]
   #create a new array composed of $lineArray[5]:$lineArray[6]
   # I didn't put this code in, as the syntax escapes me this early in the morning...

   while (@newLineArray)
   {
      $start,$end = split (/:/, $_)
      while $line(@file2Array)
      {
         @lineFile2Array = split (/\t/,$line);
          if (($lineFile2Array[1] >= $start) && $lineFile2Array[2] <= $end)
         {
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that element), move on to the next element of the line
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   } 
# If no match found, move on to the next line in file2
}

This is a rough idea - not a complete code snippet (obviously!)

I didn't include code for opening either file - open both files before entering the loop

Thank you for the post. I can't think of anything else but I know that knowing if they do fall in the range or not would be informative for my analysis as I indicated in the expected output.

thank you for your post. I think I understand the logic here but I'm not sure where you get

while $count >= $lineArray[4]

You'll need to create a counter $count
While $lineArray[4] (your number of elements), is less than $count, you'll loop through creating the new array of (element0start:element0end element1start:element1end etc.)

Then increment the counter so that you don't continue trying to create new elements in the new array after you've mapped the sets

For 10 million records your output will be 31000 * 10 million lines of YES NO or 310 million lines. Is this what you really want? And does your filesystem support filesizes over 4GB (so-called large files)?

That's a good point Jim... it would clobber a small server, and take a long time to run.
My requirements were for about 1 million yes/no - a much smaller footprint.

It is one way to do the task, I'd be happy to learn a more elegant (and less memory intensive) method, too! :slight_smile:

Hi Jim,

Yes unfortunately this is what I need. I'm analyzing genetic data and working with large number of data is quite common. I'm not sure what you mean by filesystem, but the hardware I have is pretty sufficient (RAM is 8GB). If worst case, I can break them up into 24 sections (for each human chromosome - thus making them each anywhere from 20,000 to 1,000,000).

Would you be breaking BOTH files into 24 files each?
If so, this would reduce the overhead by a large amount, as each section only needs to be compared against 1/24th of the entire set, rather than each section being compared against the entire set.

yes I can. currently each file contains all the chrom, but I can extract out each chrom from each file into a separate file and use it when running the script.

When you run your script, preface it with time so you can see how long it took to complete

time scriptname.pl variable1 variable2

This *should* show you how long it's taken to run the comparison.

I'd test it first with the smallest pair, then again with the largest pair. Prepare to go home at the end of the day before starting the largest pair to run...

If the smallest pair takes FAR too long to run, then I wouldn't recommend running the largest pair! :wink:

thanks for the tip. i'm still not sure how to organize the script to test it :(.

Well, here's the start - I don't have time to work on it at the moment, but it should certainly get you started...

#!/usr/bin/perl -w

################################################################################
################################################################################
# What this script does:                                                       #
# This script can be used to compare data overlap                              #
#                                                                              #
# How this script works:                                                       #
# It parses two external files and identifies which line entries fall within   #
# the match statement.                                                         #
#                                                                              #
# Where this script is run (and by whom):                                      #
# This script should be run from a host where the files can be accessed for    #
# comparison                                                                   #
#                                                                              #
# Revision history:                                                            #
# September 15, 2008                                                           #
#    AKG - File creation                                                       #
################################################################################
################################################################################


################################################################################
################################################################################
# Define Pragma                                                                #
################################################################################
################################################################################
use strict;
use Getopt::Std;
use vars qw/ %opt /;
use Time::HiRes qw(gettimeofday);

################################################################################
# Define Variables                                                             #
################################################################################
#my $dir        = "/usr/local/overlap"; # base directory
my $dir        = "/opt/home/agray/scripting/overlap"; # base directory
my $DEBUG;
my @tm;
my $logfile;
my $timeStamp;
my @fileA;
my @fileB;

################################################################################
# Define Prerequisites  (Require / Include statements go here)                 #
################################################################################

################################################################################
# Forward declaration of subroutines                                           #
################################################################################
sub do_init();              # This manages the command line options
sub do_usage();             # This is the usage message for do_init()

################################################################################
################################################################################
# MAIN                                                                         #
################################################################################
################################################################################

# Parse command line variables                                                 #
do_init();

# set DEBUG flag according to command line options                             #
if ( $opt{d} )
{
   $DEBUG = 1;
}
else
{
   $DEBUG = 0;
}

# Next, create a timestamp and logfile to store information                    #
# getTimestamp
@tm = (localtime($^T))[0..5];
++$tm[4];
$tm[5] += 1900;
$timeStamp = sprintf("%04d%02d%02d.%02d%02d%02d", reverse @tm);
$logfile = "$dir/log/$0.$timeStamp.log";

open LOG, ">> $logfile" or die "Can't open $logfile for write: $!";
print LOG "$timeStamp: running $0\n";


################################################################################
# Open files into array                                                        #
################################################################################

open(FILEA, "<$opt{a}") or die "Cannot open $opt{a} for read :$!";
@fileA = <FILEA>;
close( FILEA );

open(FILEB, "<$opt{b}") or die "Cannot open $opt{b} for read :$!";
@fileB = <FILEB>;
close( FILEB );

my @lineFileArray;
my @tempStart;
my @temEnd;
my $count;
my @neLineArray;
my $start;
my $end;

while (@fileA)  #open the file
{
   chomp;
   @lineFile1Array = split (/\t/,@_);   #split the line into temporary array elements
   @tempStart = split (/,/,$lineArray[5]);
   @tempEnd = split (/,/,$lineArray[6]);
   while ($count >= $lineArray[4])
   {
      #create a new array composed of $lineArray[5]:$lineArray[6]
      # I didn't put this code in, as the syntax escapes me this early in the morning...
   }
   while (@newLineArray)
   {
      $start,$end = split (/:/, $_)
      while $line(@fileB)
      {
         @lineFile2Array = split (/\t/,$line);
          if (($lineFile2Array[1] >= $start) && $lineFile2Array[2] <= $end)
         {
            #Match found = write to yourfile
         }
      # If no match found (or when done evaluating that element), move on to the next element of the line
      }
   # If no match found, (or when done evaluating that line) move on to the next line in file2
   }
# If no match found, move on to the next line in file2
}

################################################################################
################################################################################
# Subroutines                                                                  #
################################################################################
################################################################################
sub do_init()
{
   my $opt_string = 'hda:b:';
   getopts( "$opt_string", \%opt ) or do_usage();
   do_usage() if $opt{h};
   do_usage() unless ($opt{a} && $opt{b});
}


sub do_usage()
{
   print "\nusage: $0 [-h] [-d] [-a file1] [-b file2]\n\n";
   #############################################################################
   print "\n\n";
   exit;
}

IT's called a cartesian product. All rows in file1 * all rows in file2.
My suggestion would be to use a database.

Failing that - can you not just record positives (YES in your example). Based on your information I would guess that at a minimum - 23/24 of the time or 98.53% of the tests will result in NO. Why record a NO count when you can infallibly infer it?

yes that makes sense, if I know how many lines I start with and then infer the yes or nos, then I can simply do a subtraction and get the answers for both.

Thank you for this and i've been working on it since you posted it. however, i get compilation errors which i tried to correct but can't seem to get pass this:

Whoops - that should be $_ not @_ (hadn't had my coffee...)

For the record, when a script is fairly simply using $_ is fine:

while (@array)
{
   something $_;
}

However, when it get's complicated, I'd recommend assigning a variable that makes sense to you such that you can follow it through the loop:

while $element(@array)
{
   something $element;
}