Perl: Need help comparing huge files

What do I need to do to have the Perl program below load files of 205 million records into a hash? It currently works on smaller files, but not on huge files. Any idea what I need to modify to make it work with huge files?

#!/usr/bin/perl
use strict;

my $ot1 = $ARGV[2];
my $ot2 = $ARGV[3];
open(MFILEOT1, ">$ot1") or die "Can't open file $ot1";
open(MFILEOT2, ">$ot2") or die "Can't open file $ot2";
#----------------
# Hash Definition
#----------------
my %HashArray;   # two buckets: $HashArray{file1} and $HashArray{file2}
my @file1Line;
my @file2Line;
#--------------------
# Subroutine
#--------------------
sub comp_file{
  my ($FILE1, $FILE2) = @_;
  open (R, $FILE1) or die ("Can't open file $FILE1");
  while (my $FP1 = <R>){
    chomp($FP1);
    my ($k, $l) = split(/\s+/, $FP1);
    push @{$HashArray{file1}{$k}}, $l;
  }
  close (R);
  open (P, $FILE2) or die ("Can't open file $FILE2");
  while (my $FP2 = <P>){
    chomp($FP2);
    my ($k, $l) = split(/\s+/, $FP2);
    push @{$HashArray{file2}{$k}}, $l;
  }
  close (P);
  # keys present in file1 but missing from file2
  foreach my $key (keys %{$HashArray{file1}}){
    if (!exists $HashArray{file2}{$key}){
      foreach my $last (@{$HashArray{file1}{$key}}){
        push (@file1Line, "$key$last");
      }
    }
  }
  print MFILEOT1 "$_\n" for (sort @file1Line);
  close(MFILEOT1);
  # keys present in file2 but missing from file1
  foreach my $key (keys %{$HashArray{file2}}){
    if (!exists $HashArray{file1}{$key}){
      foreach my $last (@{$HashArray{file2}{$key}}){
        push (@file2Line, "$key$last");
      }
    }
  }
  print MFILEOT2 "$_\n" for (sort @file2Line);
  close(MFILEOT2);
}
############MAIN MENU####################################
# Pre-check Condition
# if the input doesn't contain four (4) file names, return help
# USAGE: hash2files.pl FILE1 FILE2 FILE3 FILE4
#########################################################

if ($#ARGV != 3){
  print "USAGE: $0 <FILE1> <FILE2> <FILE3> <FILE4>\n";
  exit;
}
else {
  my ($FILE1, $FILE2, $OT1, $OT2)= @ARGV;
  comp_file($FILE1, $FILE2);
}

What exactly does your program do? Show a sample of input and output.

Basically to run it: hash2files.pl inputfile1 inputfile2 outputfile1 outputfile2

Inputfile1 contains numeric IDs:

1233
2345
3456
4444
7777

It is compared against inputfile2, which also has IDs:

1244
2345
3456
9898
9999

Outputfile1 will contain all the IDs in inputfile1 that are not found in inputfile2.
In this case the result would be:

1233
4444
7777

Outputfile2 will have all the IDs in inputfile2 not found in inputfile1. In this case:

9898
9999

It works really well with average-size files, but it cannot handle loading the two huge files (inputfile1 and inputfile2) into the hash in memory. It stops after a while without any error messages and never produces the results; it basically just terminates.

How can I make this work for huge files? Inputfile1 is about 204 million records, and inputfile2 has almost the same number of records. I know it needs to be modified to load only one of them, such as inputfile2, into the hash instead of both, and then compare by reading inputfile1 one line at a time: if the ID is found in the hash, just delete it from the hash, since we do not care about the matched ones at this point. What should remain in the hash is all the not-found IDs, which can then be written to a file. But I do not know how to do that!

I hope this helps explain my issue.

Hi mrn6430,

Value 1233 isn't found in inputfile2, and there is a similar issue for 1244. Did you forget them, or did I miss something?

Yes, I updated my reply to include it. That aside, I need a way to deal with such huge files. That is the main issue. Thanks

Try:

$ cat inputfile1
1233
2345
3456
4444
7777
$ cat inputfile2
1244
2345
3456
9898
9999
$ cat script.pl
use warnings;
use strict;

my (%hash);

die qq|Usage: $0 <inputfile-1> <inputfile-2> <outputfile-1> <outputfile-2>\n| 
        unless @ARGV == 4;

open my $ifh1, q|<|, shift or die;
open my $ifh2, q|<|, shift or die;
open my $ofh1, q|>|, shift or die;
open my $ofh2, q|>|, shift or die;

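# load only the first input file into the hash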
while ( <$ifh1> ) {
        chomp;
        $hash{ $_ } = 1;
}

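# stream the second file: an ID found in the hash is common to both files,
# so drop it from the hash; anything not found belongs only to the second file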
while ( <$ifh2> ) {
        chomp;
        if ( exists $hash{ $_ } ) {
                delete $hash{ $_ };
                next;
        }

        printf $ofh2 qq|%d\n|, $_;
}

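# whatever is still left in the hash was never seen in the second file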
for ( sort { $a <=> $b } keys %hash ) {
        printf $ofh1 qq|%d\n|, $_;
}
$ perl script.pl inputfile1 inputfile2 outputfile1 outputfile2
$ cat outputfile1
1233
4444
7777
$ cat outputfile2
1244
9898
9999

Thank you so much. I will test it. Do you know if there is any limit on how many records can be loaded into a hash in Perl? I have 205 million records to load.

Thanks

Perl's only limitation is the amount of memory in your system.

If these 205 million records are 4 or 5 bytes each like you've shown, that might amount to close to a gig of memory. If they're much larger, they probably won't fit into memory on a 32-bit system and other approaches would need to be tried, such as sorting them, so you can tell when a record's absent without having to load every possible record into memory at once...

If your records aren't as you've shown, then nothing we've written for you is likely to work at all anyway. We need to see what you're really dealing with.
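If you do end up sorting the data and want to stay in Perl, here is a rough sketch of that merge-style comparison. It is only an illustration, not tested against your data: it assumes one ID per line, that both files have already been sorted with plain sort (so the lt/gt string comparisons below match the file order), and that the file names come from the command line.

#!/usr/bin/perl
# Sketch only: compare two PRE-SORTED files line by line in constant memory,
# writing the lines unique to each file. Assumes one ID per line and a plain
# (lexical) sort, so the string comparisons below agree with the file order.
use strict;
use warnings;

my ($f1, $f2, $o1, $o2) = @ARGV;
open my $in1,  '<', $f1 or die "Can't open $f1: $!";
open my $in2,  '<', $f2 or die "Can't open $f2: $!";
open my $out1, '>', $o1 or die "Can't open $o1: $!";
open my $out2, '>', $o2 or die "Can't open $o2: $!";

my $line1 = <$in1>;
my $line2 = <$in2>;
while ( defined $line1 and defined $line2 ) {
    chomp( my $k1 = $line1 );
    chomp( my $k2 = $line2 );
    if    ( $k1 lt $k2 ) { print $out1 $line1; $line1 = <$in1>; }  # only in file 1
    elsif ( $k1 gt $k2 ) { print $out2 $line2; $line2 = <$in2>; }  # only in file 2
    else                 { $line1 = <$in1>; $line2 = <$in2>; }     # in both, skip
}
# whatever is left over in either file has no match in the other
while ( defined $line1 ) { print $out1 $line1; $line1 = <$in1>; }
while ( defined $line2 ) { print $out2 $line2; $line2 = <$in2>; }

Memory use stays flat no matter how big the files are, since only the current line from each file is held at any time.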

I don't know if your memory will be enough. Try it. Otherwise you will need another approach.

They are about 20 bytes each.

That's about 3.8 gigs of memory. Not going to fit in a 32-bit process.

If you sort your data, however, you can use the comm utility, which does not need to completely load either file into memory. Since the lines are in sorted order, it can tell when a line appears in only one of the files by whether the next line from each file is greater, less, or equal...

sort should be smart enough to process in blocks and not run out of memory. Be sure you have enough /tmp space, or redirect it to use another folder for temporary files where you have the room. See man sort for details.

$ sort data1 > data1-s
$ sort data2 > data2-s
$ comm -2 -3 data1-s data2-s > only-data1
$ comm -1 -3 data1-s data2-s > only-data2
$ cat only-data1
1233
4444
7777
$ cat only-data2
1244
9898
9999
$

Note that it might be possible to run comm once to get both sets of data, if only I knew what your data looks like -- which I still don't, after asking several times...
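For example, just as a sketch with made-up file names: comm -3 prints the lines unique to the first file flush left and the lines unique to the second file preceded by a single tab, so one pass over its output can split the two sets (assuming your IDs never contain a tab):

# Sketch: one comm pass, split its two-column output on the leading tab.
open my $cmp,   '-|', 'comm', '-3', 'data1-s', 'data2-s' or die "comm: $!";
open my $only1, '>', 'only-data1' or die "only-data1: $!";
open my $only2, '>', 'only-data2' or die "only-data2: $!";
while ( <$cmp> ) {
    if ( s/^\t// ) { print $only2 $_; }   # leading tab    => unique to data2-s
    else           { print $only1 $_; }   # no leading tab => unique to data1-s
}

Whether that is worth doing over your real data, I can't say.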

Thank you. Your version worked.