Making things run faster

I am processing a few terabytes of information on a computer with 8 processors (4 cores each), 16GB of RAM, and a 5TB hard drive set up as a RAID. The processing doesn't seem to be blazingly fast, perhaps because of an I/O limitation.

I am basically running a Perl script that reads some data, either modifies it a little or greps something out of it, and writes it back to disk. Could someone please tell me if there is a better method I could use to improve performance?

There is no "warp 9" button :slight_smile: At least I haven't seen one yet. :wink:

That depends on your code efficiency and your OS settings. Different OSes have different tuning options. No offense, but I guess it's primarily the code you use, since the hardware sounds fairly powerful. There are a lot of people here in the forum who are good at Perl - maybe post your code here (if it isn't tons of pages) with the fancy [ code ] and [ /code ] tags so they can give a small hint.

You could also give a snippet of the input file and the desired output. Maybe people can suggest an alternative.

For your current setup, write down the runtime (put the "time" command in front of the line when you start the script) so you can compare it after tuning or trying alternatives.
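For example, if the script is normally started like this (the script and file names here are just placeholders):

time perl myscript.pl input.txt > output.txt

The wall-clock ("real") figure is the one to compare between runs.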

Sure. Thank you so much. I am open to any advice as I am more interested in learning :slight_smile: Please let me know if there is some obvious mistake I am making.

My perl code is:

open (FILE, $ARGV[0]) or die "Unable to open $ARGV[0]: $!\n";
my %hTmp;
my $flag = 0;

while (my $fileLine = <FILE>) {

        if($fileLine =~ /PREFIX/) {
                if(!($fileLine =~ /[:]{2}/)) {
                        $flag = 1;
                }
        }

        if($flag == 1) {
                if($fileLine =~ /ASPATH/) {
                        $fileLine =~ s/\n//;
                        @myarray = ($fileLine =~ m/([0-9]{3,5}\s)/g);

                        #Following removes prepending. Remove if you do not want it

                        undef %saw;
                        @out = grep(!$saw{$_}++, @myarray);

                        $temp = join("", @out);
                        $temp =~ s/^\s+//;
                        print $temp."\n" unless ($hTmp{$temp}++);
                        $flag = 0;
                }
        }

}

And a sample from the input file is:

TIME: 12/01/07 00:40:57
TYPE: TABLE_DUMP/INET
VIEW: 0
SEQUENCE: 1
PREFIX: 0.0.0.0/0
FROM:213.140.32.148 AS12956
ORIGINATED: 11/28/07 09:12:40
ORIGIN: IGP
ASPATH: 12956
NEXT_HOP: 213.140.32.148
STATUS: 0x1

TIME: 12/01/07 00:40:57
TYPE: TABLE_DUMP/INET
VIEW: 0
SEQUENCE: 2
PREFIX: 3.0.0.0/8
FROM:208.51.134.246 AS3549
ORIGINATED: 11/30/07 17:06:53
ORIGIN: IGP
ASPATH: 3549 701 703 80
NEXT_HOP: 208.51.134.246
MULTI_EXIT_DISC: 12653
COMMUNITY: 3549:2355 3549:30840
STATUS: 0x1

TIME: 12/01/07 00:40:57
TYPE: TABLE_DUMP/INET
VIEW: 0
SEQUENCE: 3
PREFIX: 3.0.0.0/8
FROM:209.161.175.4 AS14608
ORIGINATED: 11/30/07 13:43:49
ORIGIN: IGP
ASPATH: 14608 19029 3356 701 703 80
NEXT_HOP: 209.161.175.4
COMMUNITY: no-export
STATUS: 0x1

I want the ASPATHs corresponding to the IPv4 addresses in the input data. Please let me know of any obvious improvements if possible.

 if($fileLine =~ /PREFIX/) {
                if($fileLine =~ /ASPATH/) {

This is something that I noticed while going through the code:

 if($fileLine =~ /^PREFIX/) {
                if($fileLine =~ /^ASPATH/) {

Help the regex to help us :slight_smile:

In the input file, both the literals PREFIX and ASPATH are at the start of the line (at least in the examples provided), so hint the Perl regex that the match is always at the start.

Though it's trivial, this should definitely improve performance.

If it can appear anywhere in the line, please ignore the tip.

And I am interested to see by what percentage the computation time comes down, if it does :wink:
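If you want to measure just the effect of the anchor in isolation, here is a rough micro-benchmark sketch using the core Benchmark module (the sample lines are simply lifted from the data posted above - adjust to taste):

#! /opt/third-party/bin/perl
# Rough micro-benchmark: anchored vs. unanchored match.
# Sketch only - sample lines are taken from the data posted above.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @lines = (
    "TIME: 12/01/07 00:40:57",
    "NEXT_HOP: 208.51.134.246",
    "COMMUNITY: 3549:2355 3549:30840",
    "ASPATH: 3549 701 703 80",
);

cmpthese(-2, {
    unanchored => sub { my $count = grep { /ASPATH/  } @lines },
    anchored   => sub { my $count = grep { /^ASPATH/ } @lines },
});

The gain shows up mostly on the lines that do not match, since the anchored pattern can give up at the first character instead of scanning the whole line.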


I have one more suggestion: since there is plenty of processing power, the best thing is to exploit it.

What could be done is this - a master script whose only job is to read through the file, split it into chunks, and hand each chunk to the script you have written.

With this, multiple processes would be doing the work instead of one job having to finish the whole task on its own.

I assume that there is no dependency forcing the file to be processed in sequential order, and that the only aim is to process the file quickly.

If possible, I will post some sample code tonight :slight_smile:
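In the meantime, just to make the idea concrete, here is a bare fork-based sketch (the chunk names and the parse.pl script are placeholders standing in for chunks that already exist on disk and for your current parsing script):

#! /opt/third-party/bin/perl
# Bare sketch only: run the existing parser on several chunks in parallel.
# "parse.pl" and the chunk file names are placeholders.

use strict;
use warnings;

my @chunks = ('chunk.0', 'chunk.1', 'chunk.2', 'chunk.3');
my @pids;

for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: give this chunk its own output file, then run the parser on it
        open(STDOUT, '>', "$chunk.out") or die "cannot redirect stdout: $!";
        exec('perl', 'parse.pl', $chunk) or die "exec parse.pl failed: $!";
    }
    push @pids, $pid;
}

# parent: wait for every child to finish before merging/cleaning up
waitpid($_, 0) for @pids;

A fuller sample would still need to handle the splitting, the bookkeeping and the merging around this.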

Hmm... that's an interesting idea! :slight_smile: I'd love to try that out... Actually, by the time I woke up, it had processed about 1 TB (so that makes it 7 hours). I don't know how to write the whole thing down formally, but I'll try:

Total Data Size: 2.2 TB (currently handling around 1TB though)
Special Info: The formats of the data sets were slightly different. There were a total of four data sets (let's call them DS):
DS1 & DS2: Format 1

  • Size of DS1: 556G RAW
  • Size of DS2: 105G Gzip Compressed

DS3 & DS4: Format 2

  • Size of DS3: 157G RAW
  • Size of DS4: 109G RAW

Further, there were two other tasks (extracting a 36G archive and copying some 10G worth of data) running at the same time, handled by a different processor (another computer, in fact) against the disk of this main computer.

Tasks running Simultaneously:

  • Parsing DS1, DS2, DS3, DS4 and writing the result onto disk again
  • Extracting an archive on the same disk using a different computer on which the disk is mounted as a remote drive
  • Copying the gzipped files back onto the main computer (if anyone has seen my other threads - yes, in fact these were the huge archives I was talking about converting into individual smaller archives :slight_smile: )

As of now, I have finished parsing DS1, DS2, DS3 and DS4, but I am still left with extracting the huge archives and then parsing the last data set, DS5, which will be around 1.7TB uncompressed. I will perhaps run the optimization then.

Thanks for the advice, and I'm looking forward to a post from you.

Added to that, I have a small question (not sure whether it's silly, but I can't seem to understand it completely)...

If I have four datasets, like in the problem above, and all I have to do is grep some text out of them, does it really make a difference whether I run the jobs on all the datasets in parallel or one after another? In fact, to be more precise, the argument goes something like this:

Four datasets are stored on the disk. The CPU has to keep fetching data for the four processes to work on and write back to the disk. Now, if it has to feed data to all four processes, shouldn't the disk head keep moving around to serve them, as opposed to just one process where the head simply keeps reading sequentially (provided there is no fragmentation)? As I said, I'm sorry if my question seems silly, but I just want to clear up some basic concepts.

So here is my sample.

I repeat, it's only sample code and needs to be tweaked a lot before actually using it. That doesn't mean it won't work - it will - but we are not aiming at merely working code, always at better code :wink:

I have designed it as master-slave code.

Master code
The master will split the big file into chunks and the slaves will process them. Finally, the master will delete the part files and other intermediate files and merge the final output.

Here is the master code

#! /opt/third-party/bin/perl

use strict;

#Either number of instances or number_of_line can be used for configuration
#For example am using number_of_lines as configuration

use constant NUM_OF_LINES => 1000000;
use constant SLAVE_NAME => 'slave.pl';
use constant END_MARKER => '_END_PROCESSED_';
use constant SLAVE_FILE_PART_NAME => 'part';
use constant FINAL_OUTPUT_FILE => 'final.output';

my $line_counter = 0;
my $split_file_counter = 0;
my %splitFileHash;
my $file_name = $split_file_counter;
my $file_handle = undef;
my $command = "./" . +SLAVE_NAME;

die "[MASTER] Please provide filename as input\n" if ( ! defined $ARGV[0] );

sub mergeOutput {

  open(FOFILE, ">", +FINAL_OUTPUT_FILE)
  or die "[MASTER] Unable to open final output file : " . +FINAL_OUTPUT_FILE . " <$!>\n";

  foreach my $file ( keys %splitFileHash ) {

    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);
    open(PFILE, "<", $modified_file) or die "[MASTER] Unable to open part file : $modified_file <$!>\n";
    while ( my $data = <PFILE> ) {
      chomp($data);
      next if ( $data eq +END_MARKER );
      print FOFILE "$data\n";
    }
    close(PFILE);

    unlink($modified_file) or die "[MASTER] Unable to delete part file : $modified_file <$!>\n";
    unlink($file) or die "[MASTER] Unable to delete split file : $file <$!>\n";
  }

  close(FOFILE);
}

sub checkFileHashStatus {

  foreach my $file ( keys %splitFileHash ) {
    return 0 if ( $splitFileHash{$file} eq "N" );
  }

  return 1; #This means all the files have been processed
}

sub checkForJobsCompletion {

  foreach my $file ( keys %splitFileHash ) {

    next if ( $splitFileHash{$file} eq "Y" );
    my $modified_file = ($file . "." . +SLAVE_FILE_PART_NAME);

    unless ( open(LFILE, "<", $modified_file) ) {
      warn "[MASTER] Unable to open file : $modified_file for checking <$!>\n";
      next;
    }

    while ( my $data = <LFILE> ) {

      chomp($data);

      if ( $data eq +END_MARKER ) {

        #File processing is completed, mark it
        $splitFileHash{$file} = "Y";
        print "[MASTER] File:$file processing completed\n";
        last;
      }
    }

    close(LFILE);
  }
}

sub closeLastFile {

  close($file_handle);
  my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
  print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
  system("$local_command");
}

sub getNewFile {

  close($file_handle) if defined ( $file_handle );

  if ( $split_file_counter != 0 ) {
    my $local_command = $command . " " . $split_file_counter . " " . $split_file_counter . " &";
    print "[MASTER] Spawning instance $split_file_counter : $local_command\n";
    system("$local_command");
  }

  $split_file_counter++;
  my $file_name = $split_file_counter;
  $splitFileHash{$file_name} = "N";
  open($file_handle, ">", $file_name) or die "[MASTER] Unable to open file for writing : <$!>\n";

}

open(FILE, "<", $ARGV[0]) or die "[MASTER] Unable to open file : $ARGV[0]\n";

while(<FILE>) {

  getNewFile if( ( ! defined $file_handle && $line_counter == 0 ) || $line_counter % +NUM_OF_LINES == 0 );
  print $file_handle "$_";
  $line_counter++;

}

close(FILE);

closeLastFile;

my $iteration_counter = 1;
while ( 1 ) {
  print "[MASTER] FileCheck Iteration Counter:$iteration_counter\n";
  checkForJobsCompletion;
  last if ( checkFileHashStatus == 1 );
  $iteration_counter++;
  sleep(5); #sleep between polls instead of busy-waiting
}

print "[MASTER] Merging output\n";
mergeOutput;

exit (0);

Slave code

For demonstration purposes, I have used simple logic that splits data of the form
abcd;efgh

and forms output like
abcd-efgh#efgh-abcd

Only the logic needs to be changed in the slave code; the master code is generic. It will work for all such cases and can be used for computations involving huge data where sequence is not important (see the adapted sketch after the slave code below).

Here is the slave code

#! /opt/third-party/bin/perl

use strict;

my $outputfilename = $ARGV[1] . ".part";

open(OFILE, ">", $outputfilename) or die "[SLAVE-$ARGV[1]] Unable to open file : $outputfilename\n";

open(FILE, "<", $ARGV[0]) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[0]\n";

while(<FILE>) {
  chomp;
  my($first, $second) = split(';');
  print OFILE "$first-$second#$second-$first\n";
}

close(FILE);

print OFILE "_END_PROCESSED_\n";

close(OFILE);

exit (0);
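For the ASPATH task from the original post, only the parsing logic above would change; here is a sketch of the adapted slave, lifting the logic from the script posted earlier in the thread (untested against the full data). Two caveats: a plain line-count split can cut a PREFIX/ASPATH record in half at a chunk boundary, and the duplicate check in %hTmp becomes per-chunk rather than global, so both would need handling for real use.

#! /opt/third-party/bin/perl
# Slave variant for the PREFIX/ASPATH extraction from the original post.
# Sketch only - the parsing logic is lifted from the earlier script.

use strict;
use warnings;

my $outputfilename = $ARGV[1] . ".part";

open(OFILE, ">", $outputfilename) or die "[SLAVE-$ARGV[1]] Unable to open file : $outputfilename\n";
open(FILE, "<", $ARGV[0]) or die "[SLAVE-$ARGV[1]] Unable to open file : $ARGV[0]\n";

my (%hTmp, $flag);

while (my $fileLine = <FILE>) {

  # IPv4 prefix lines only (skip anything containing "::")
  $flag = 1 if ($fileLine =~ /^PREFIX/ && $fileLine !~ /[:]{2}/);

  if ($flag && $fileLine =~ /^ASPATH/) {
    chomp $fileLine;
    my @myarray = ($fileLine =~ m/([0-9]{3,5}\s)/g);

    # drop prepended (repeated) AS numbers
    my %saw;
    my @out = grep { !$saw{$_}++ } @myarray;

    my $temp = join("", @out);
    $temp =~ s/^\s+//;
    print OFILE "$temp\n" unless $hTmp{$temp}++;
    $flag = 0;
  }
}

close(FILE);

print OFILE "_END_PROCESSED_\n";

close(OFILE);

exit (0);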

Currently I am running some tests to verify that this approach reduces overall computation time with multiple processes.

Will post the results, once they are done :stuck_out_tongue:

And that is the reason for caching data and striping it over multiple disks: to reduce disk arm contention. That way reads and writes are done in parallel, and with caching in play most reads/writes are logical instead of physical. Since you have terabytes of data, I am assuming that it isn't all on a single drive or a JBOD of some sort, and that it is on a high-end storage array with significant intelligence and caching built into it, striped for performance and mirrored for availability.

It might be that Perl is not the right tool for you. My experience with bigger datasets (not nearly as big as yours) is that sed and awk are much faster than Perl, with sed having a slight edge over awk performance-wise. So you might try to implement your program as a sed script and compare runtimes (maybe on a smaller sample).

I hope this helps.

bakunin

@matrixmadhan: Thanks a lot... I have used an approach very similar to your script, just slightly adapted to my own datasets. I will try timing both approaches and will paste the results here.

And one more thing: I have found this really cool package called xjobs. Would you mind taking a look at it? It basically handles the master part of your logic and is very useful. Thought you might find some use for it too. You can access it here: xjobs

@shamrock: Again, thank you for clarifying the issue. I just didn't know whether it was really RAID and not JBOD, because the CPU is spending 88% of its time waiting (according to mpstat), which seemed really weird to me.

@bakunin: Thank you for the advice. I actually agree with you, as that was my experience too. I switched to Perl after a really bad experience with awk - blame it on my lack of expertise. Other than that, I still use awk and sed whenever things can be done easily with them.

Hello Legend,

Thanks for the xjobs link. I am going through it but am not done yet. I revised my Perl code and frankly I had to slap myself, for there are so many points that I missed, and the design could have been much better.

Anyway, I had cowardly escaped :wink: by saying that it was just a sample and not of production quality.

If I find time, maybe I should start thinking about that for the next, improved version.

Cheers

How many CPUs are there in your machine?
The reason the CPU is spending so much time waiting is the terabytes of data being processed... I/O wait.