Here is my sample.
I repeat, it is only sample code and needs a lot of tweaking before you actually use it. That does not mean it won't work; it will, but the aim should always be better code, not code that merely works.
I have designed it as a master-slave pair of scripts.
Master code
The master splits the big file into chunks and spawns one slave per chunk to process it. Once every slave has finished, the master merges the part files into the final output and deletes the split files and part files.
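To make the chunking concrete, here is a minimal sketch of the arithmetic behind the split, assuming the same chunk size and the same 1-based split file numbering as the master. The helper names chunk_for_line and chunk_count are hypothetical and do not appear in the master itself.

```perl
use strict;
use warnings;

# Same chunk size as NUM_OF_LINES in the master.
use constant NUM_OF_LINES => 1000000;

# Hypothetical helper: 0-based line index -> 1-based split file number.
sub chunk_for_line {
    my ($line_index) = @_;
    return int( $line_index / NUM_OF_LINES ) + 1;
}

# Hypothetical helper: number of split files needed for a given line count.
sub chunk_count {
    my ($total_lines) = @_;
    return int( ( $total_lines + NUM_OF_LINES - 1 ) / NUM_OF_LINES );
}

printf "line %d -> split file %d\n", $_, chunk_for_line($_)
    for ( 0, 999_999, 1_000_000, 2_500_000 );
printf "%d lines -> %d split files\n", 2_500_001, chunk_count(2_500_001);
```

So an input of 2,500,001 lines would give three split files and therefore three slave instances.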
Here is the master code
#! /opt/third-party/bin/perl
use strict;
#Either the number of instances or the number of lines per chunk can be used for configuration
#This example uses the number of lines per chunk
use constant NUM_OF_LINES => 1000000;
use constant SLAVE_NAME => 'slave.pl';
use constant END_MARKER => '_END_PROCESSED_';
use constant SLAVE_FILE_PART_NAME => 'part';
use constant FINAL_OUTPUT_FILE => 'final.output';
my $line_counter = 0;
my $split_file_counter = 0;
my %splitFileHash;
my $file_name = $split_file_counter;
my $file_handle = undef;
my $command = "./" . +SLAVE_NAME;
die "[MASTER] Please provide filename as input\n" if ( ! defined $ARGV[0] );
sub mergeOutput {
    #Constants do not interpolate inside strings, so concatenate the file name
    open(FOFILE, ">", +FINAL_OUTPUT_FILE)
        or die "[MASTER] Unable to open final output file : " . FINAL_OUTPUT_FILE . " <$!>\n";
    foreach my $file ( keys %splitFileHash ) {
        my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);
        open(PFILE, "<", $modified_file) or die "[MASTER] Unable to open part file : $modified_file <$!>\n";
        #Test for end of file rather than chomp's return value,
        #so a last line without a trailing newline is not dropped
        while ( defined( my $data = <PFILE> ) ) {
            chomp($data);
            next if ( $data eq +END_MARKER );
            print FOFILE "$data\n";
        }
        close(PFILE);
        unlink($modified_file) or die "[MASTER] Unable to delete part file : $modified_file <$!>\n";
        unlink($file) or die "[MASTER] Unable to delete split file : $file <$!>\n";
    }
    close(FOFILE);
}
sub checkFileHashStatus {
    foreach my $file ( keys %splitFileHash ) {
        return 0 if ( $splitFileHash{$file} eq "N" );
    }
    return 1; #This means all the files have been processed
}
sub checkForJobsCompletion {
    foreach my $file ( keys %splitFileHash ) {
        next if ( $splitFileHash{$file} eq "Y" );
        my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);
        #The slave may not have created its part file yet; skip it and retry on the next pass
        if ( ! open(LFILE, "<", $modified_file) ) {
            warn "[MASTER] Unable to open file : $modified_file for checking <$!>\n";
            next;
        }
        while ( defined( my $data = <LFILE> ) ) {
            chomp($data);
            if ( $data eq +END_MARKER ) {
                #File processing is completed, mark it
                $splitFileHash{$file} = "Y";
                print "[MASTER] File:$file processing completed\n";
                last;
            }
        }
        #Close the handle that was actually opened (LFILE, not FILE)
        close(LFILE);
    }
}
sub closeLastFile {
    close($file_handle);
    my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
    print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
    system("$local_command");
}
sub getNewFile {
    close($file_handle) if defined ( $file_handle );
    #The previous split file is complete, so hand it to a slave in the background
    if ( $split_file_counter != 0 ) {
        my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
        print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
        system("$local_command");
    }
    $split_file_counter++;
    my $file_name = $split_file_counter;
    $splitFileHash{$file_name} = "N";
    open($file_handle, ">", $file_name) or die "[MASTER] Unable to open file for writing : $file_name <$!>\n";
}
open(FILE, "<", $ARGV[0]) or die "[MASTER] Unable to open file : $ARGV[0] <$!>\n";
while(<FILE>) {
    #Start a new split file at line 0 and after every NUM_OF_LINES lines
    getNewFile() if ( $line_counter % +NUM_OF_LINES == 0 );
    print $file_handle "$_";
    $line_counter++;
}
close(FILE);
die "[MASTER] Input file is empty, nothing to process\n" if ( $split_file_counter == 0 );
closeLastFile();
my $iteration_counter = 1;
while ( 1 ) {
    print "[MASTER] FileCheck Iteration Counter:$iteration_counter\n";
    checkForJobsCompletion();
    last if ( checkFileHashStatus() == 1 );
    $iteration_counter++;
    #Pause between checks instead of spinning at full speed
    sleep(1);
}
print "[MASTER] Merging output\n";
mergeOutput();
exit (0);
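The master above detects completion by re-reading each part file until it finds END_MARKER. One alternative, which is not what the code above does, is to fork the workers and block in wait() until they exit. The following is only a sketch under that assumption: the chunk list is hypothetical, and the child exits immediately where a real master would exec the slave.

```perl
use strict;
use warnings;

# Hypothetical chunk names; the master would use its split file names (1, 2, ...).
my @chunks = ( 1, 2, 3 );
my %pid_for;

for my $chunk (@chunks) {
    my $pid = fork();
    die "[MASTER] fork failed: $!\n" unless defined $pid;
    if ( $pid == 0 ) {
        # Child process. A real master would replace this with something like
        #   exec( './slave.pl', $chunk, $chunk ) or die "exec failed: $!\n";
        exit 0;
    }
    $pid_for{$pid} = $chunk;
}

# Block until every child has exited; wait() returns -1 when none are left.
my $failed = 0;
while ( ( my $pid = wait() ) != -1 ) {
    my $status = $? >> 8;
    $failed++ if $status != 0;
    print "[MASTER] chunk $pid_for{$pid} finished with status $status\n";
}
print "[MASTER] all chunks done, $failed failed\n";
```

With this approach the end marker would no longer be needed, and a slave that crashes shows up as a non-zero exit status instead of a part file that never completes.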
Slave code
For demonstration purposes, I have used simple logic to split data of the form
abcd;efgh
and form an output like
abcd-efgh#efgh-abcd
Only the per-line logic in the slave code needs to change; the master code is generic. It can be used for computations over huge data sets whenever each line can be processed independently and the order of the output is not important, since the master merges the part files in no particular order.
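As a minimal sketch of that per-line logic in isolation, the transformation can be written as a single function. The name transform_line is hypothetical; the output format, including the '#' separator, is taken from the print statement in the slave code.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the slave's per-line logic.
sub transform_line {
    my ($line) = @_;
    chomp $line;
    # Split "first;second" on the semicolon.
    my ( $first, $second ) = split /;/, $line;
    # Same format string as the slave's print statement.
    return "$first-$second#$second-$first";
}

print transform_line("abcd;efgh\n"), "\n";
```

Swapping in a different computation means changing only the body of this function, or the equivalent lines inside the slave's while loop.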
Here is the slave code
#! /opt/third-party/bin/perl
use strict;
my $outputfilename = $ARGV[1] . ".part";
open(OFILE, ">", $outputfilename) or die "[SLAVE-$ARGV[1]] Unable to open file : $outputfilename <$!>\n";
open(FILE, "<", $ARGV[0]) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[0] <$!>\n";
while(<FILE>) {
    chomp;
    #Replace the next two lines with the real per-line processing
    my($first, $second) = split(';');
    print OFILE "$first-$second#$second-$first\n";
}
close(FILE);
print OFILE "_END_PROCESSED_\n";
close(OFILE);
exit (0);