Switching over to C++

Hi,

We've been using a Perl script to extract data from several logs and generate a report. I've been asked to rewrite the code in C++. I want to know whether it is wise to move the code to C++ and whether it will be faster than Perl.

Most perl scripts are portable to a wide selection of operating systems. A C++ program would have to be rebuilt to move it to a different machine architecture and might need to be rebuilt to move it to a different operating system even if they both run on the same hardware.

If you are a competent C++ programmer, you should be able to write a C++ program that performs the same job faster than an interpreted language like Perl. (However, that doesn't mean that writing the C++ code to get an improved level of performance over a Perl script will necessarily be easy.)

It may be a good idea to find out what the bottleneck in the perl program is, before you get carried away. Reimplementing a flawed algorithm in a faster language may give you little to no benefit.

Make a decision only after profiling your program, using either one of the profilers in the standard distribution or one from CPAN.
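
For instance, a quick way to see where the time goes is to benchmark a suspect subroutine, or better, run the whole script under a profiler. A minimal sketch follows; the log line, field names, and the parse_line subroutine are made up for illustration, and Devel::NYTProf is just one common CPAN choice:

use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical hot spot: pulling a few fields out of a log line with a regex.
sub parse_line {
    my ($line) = @_;
    return $line =~ /folder=(\d+)\s+files=(\d+)\s+ok=(\d+)\s+failed=(\d+)/;
}

my $sample = "2013-04-22 10:15:01 folder=42 files=100 ok=98 failed=2";

# Time 100,000 calls; compare alternative implementations the same way.
timethese( 100_000, { parse_line => sub { parse_line($sample) } } );

# For a whole-script picture, run it under a profiler instead, e.g.
# Devel::NYTProf from CPAN:
#   perl -d:NYTProf report.pl server.log && nytprofhtml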

I think it's a fine idea to rewrite the program in C++. Yes, a C++ program implementing the same logic as a Perl program will be faster. But I agree there may be a bottleneck in the Perl program that could be fixed.

Normally, I do a short task using bash. But once it gets to a certain level of complexity, I move to C. It runs faster, but that's not the main point. It's so much easier to maintain a complex program in C than a complex program in a scripting language. The same logic applies to perl, which is kind of infamous for somehow leading to undecipherable code. On the other hand, there are large applications written in perl.

For the most part, I disagree with this assertion. If there is a graphic front end, it is often tricky writing multi-platform code. But for processing a log file, the program should compile on any platform, assuming the programmer is competent.

Why were you asked to rewrite in C++? Was any justification given? Or is it just a blind assumption that it will be better?

I am going to assume that the perl script is a complicated beast.

I would seriously advise against a rewrite, except as a last resort. Complex code that works well is usually the result of a lot of testing and debugging. I would not cast it aside lightly.

I concur with Corona688 and elixir_sinari. Profile your current script and see where it spends most of its time. Then, have a look at those subroutines and see if they can be optimized in perl. Even if that's insufficient, a complete rewrite is still not your only alternative. The critical perl sections can be re-implemented in C and loaded by the perl interpreter like any other module: perlxs.
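
For example, Inline::C from CPAN hides most of the XS boilerplate and is an easy way to start down that road. This is just a rough sketch with a toy function, not your actual hot spot:

use strict;
use warnings;

# Inline::C compiles the embedded C the first time the script runs and
# caches the result; the C function is then callable like any Perl sub.
use Inline C => <<'END_C';
int count_commas(char *s) {
    int n = 0;
    for (; *s; s++)
        if (*s == ',')
            n++;
    return n;
}
END_C

print count_commas("field1,field2,field3,field4"), "\n";   # prints 3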

One thing is certain: advice on whether to rewrite code from people who do not know what that code does isn't worth much. Perhaps not even the proverbial 2 cents.

Regards,
Alister

That's a good point. If the Perl script is a complex beast, any kind of "fix" could be problematic. If you can't figure out the logic, it will be difficult to rewrite it in some other language. It's hard to say more without knowing what the problem is or seeing the code. My main point is that there are differences between languages: some languages are better suited to complex, large-scale systems, others to medium-sized tasks, and others to tiny ones. There can come a point where a program is poorly served by its existing language -- too slow, lacking functionality, or too hard to maintain.

The other point is that Perl is written in C/C++ too... If the Perl code spends most of its time using external modules, or just doing I/O, its performance may be quite comparable to a pure C++ program. If it spends most of its time doing complex data processing via perl statements, perhaps not so much.

So there's one easy benchmark for you. If this Perl program uses lots and lots of CPU power while running, it may benefit from optimization. If it doesn't, it's spending most of its time waiting for data to arrive or be sent... A C++ version wouldn't be able to wait much faster :wink:
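
A rough way to check that from inside the script itself (a sketch; process_logs() is just a stand-in for the real report code):

use strict;
use warnings;
use Time::HiRes qw(time);

# Stand-in for the real log-processing code.
sub process_logs {
    my $sum = 0;
    $sum += $_ for 1 .. 5_000_000;
    return $sum;
}

my $wall_start = time;
my ($u0, $s0)  = times;             # CPU seconds consumed so far (user, system)

process_logs();

my ($u1, $s1) = times;
my $cpu  = ($u1 - $u0) + ($s1 - $s0);
my $wall = time - $wall_start;

printf "wall %.2fs, cpu %.2fs\n", $wall, $cpu;
# cpu close to wall      => CPU-bound; a C/C++ rewrite might pay off
# cpu far less than wall => mostly waiting on I/O; it probably won't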

You know what -- there was a Perl script I had gotten from CPAN to generate SHA1 checksums of the files in a directory. I modified it to suit my needs for the project I was working on, and the script was taking around 21-22 minutes just to generate the SHA1 sums of the required files across the system disk (which was most of the files on the disk, except for certain categories of files filtered out by extension); the sums would later be verified at boot time.

I showed this to my boss, but the moment I told him that I had started out by modifying a public-domain script from CPAN, he refused to let me use it for a commercial project, even though what he originally wanted was just a script of some kind. I was made to write a C version of the same thing -- and was given just 48 hours to do it, as I had already eaten up much of the scheduled time with my R&D and the modification work on the Perl script (and learning Perl too :slight_smile: ).

I somehow managed to get myself out of that soup by writing a C program from scratch.

The pleasant part was that the job which had taken approximately 22 minutes now took only around 3 seconds.

That took away all the pain I had gone through in those 48 hours of writing the C version of the program.

No one expected this much of a performance improvement (not even my boss or me), but there it was!!!

I really do expect that your C++ program, if you do create it, will show you the same kind of performance improvement.

The porting issues with your C++ program would be immaterial if your project has a build system that creates images for the different supported platforms.

Please do let us know here if you choose to create the C/C++ version, and how the performance improvement turns out.

---------- Post updated at 03:20 PM ---------- Previous update was at 03:16 PM ----------

I'd be eager to know your experience too!!!

Happy coding!! :slight_smile:

---------- Post updated at 03:31 PM ---------- Previous update was at 03:20 PM ----------

That's absolutely true!!

However, a CPU-intensive program implemented in C/C++ (even with exactly the same algorithm, however bad that might be) runs not just 2x or 3x faster but much, much faster than the equivalent Perl script (subject, of course, to the bottleneck you mentioned: the percentage of I/O wait in the overall run time).


Hey, thank you all for enlightening me!

I've not worked much with Perl, but in every C++ project I've worked on, the reporting scripts were written in Perl, and that's why I was confused.
The task here is to read some log files, extract data such as the folder id, the number of files it contains, the processing time, the number of successfully processed files, the number of failed files, etc., and put them into an Excel file.
The script is good and doesn't seem to need any fixes, but it takes some time to process millions of records. So the only concern here is whether to redo it in another language for the sake of optimization.
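
Roughly speaking, the script boils down to something like this (a much simplified sketch -- the real log format and field names are different, and the real script writes an Excel file rather than CSV):

use strict;
use warnings;

# Simplified: pull per-folder stats out of each log line and emit CSV,
# which Excel can open directly.
print "folder_id,total_files,processed,failed,seconds\n";
while (my $line = <STDIN>) {
    next unless $line =~ /folder=(\d+)\s+files=(\d+)\s+ok=(\d+)\s+failed=(\d+)\s+time=([\d.]+)/;
    print join(",", $1, $2, $3, $4, $5), "\n";
}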

I do not intend any personal offense, but I am compelled to say that it is absurd to tell someone that they will realize a 440x runtime improvement without any knowledge of their task, their code, or their hardware.

If the code isn't CPU-bound, the performance gain will be negligible.

Your example, by the way, of cryptographic hashing, is just the type of task that does benefit greatly from using a lower-level language. However, switching to C won't yield the massive improvements you describe because the folks who implement perl's widely-used crypto modules are already using C.

To demonstrate, on an ancient Pentium II system running at 350 MHz, let's hash all the executables in my /usr/bin directory.

Version of sha1sum (a C program):

$ sha1sum --version | head -n1
sha1sum (GNU coreutils) 5.97

Perl script:

$ cat sha1.pl
use Digest::SHA1;
binmode(STDIN);
my $s = Digest::SHA1->new;
while (read(STDIN, $buf, 1024)) {
    $s->add($buf);
}
print $s->hexdigest . "\n"

Creating data file:

$ for f in /usr/bin/*; do cat "$f"; done > usr.bin 2> /dev/null
$ stat --printf '%s\n' usr.bin  # ~ 181 MB
189537836

Comparison:

$ time sha1sum < usr.bin
bb02b0a0c08a1b4070f97035f0ad5f7f2c491c78  -

real    0m13.622s
user    0m10.217s
sys     0m1.868s
$ time perl sha1.pl < usr.bin
bb02b0a0c08a1b4070f97035f0ad5f7f2c491c78

real    0m17.131s
user    0m12.733s
sys     0m1.604s

The tests were repeated several times in different orders. The times never varied significantly. 1.25x faster is a far, far cry from 440x. The small improvement is expected since both are hashing in C.

It is reasonable to conclude that whatever perl script you were using (or the modifications made to it) was seriously deficient.

Regards,
Alister

---------- Post updated at 12:43 PM ---------- Previous update was at 12:34 PM ----------

For I/O and string comparisons and the like, the improvement ratio (perl-time/c-time) will be small. However, if the dataset is massive, even a small ratio can amount to a substantial absolute saving. The 1.25x improvement in the SHA1 hashing in my previous post isn't much for 181 MB of data -- about 3.5 seconds -- but it works out to roughly 20 seconds per gigabyte, or 5.5 hours per terabyte.

Regards,
Alister

Implementing crypto in raw Perl is absolutely ludicrous; it's like trying to accomplish serious work in INTERCAL... Perl is just not made for that kind of work. It's bound to have a module that does it natively instead, which they should have used.

Or perhaps they did use it, but badly -- reading and processing single characters at a time, or consuming huge gobs of RAM to slurp in entire huge files... Perl lends itself to certain abuses.

In any case, it must be possible to write faster perl code for even that task. Perl is no excuse for a poor algorithm.
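
For what it's worth, a sane Perl version needn't be much more than this (a sketch using the core Digest::SHA module, which does the hashing in C; the usage line and file handling are simplified):

use strict;
use warnings;
use Digest::SHA;    # ships with perl; the hashing itself is done in C

# Hash a file without slurping it into memory or reading it byte by byte;
# addfile() reads the file in large binary chunks internally.
my $file = shift @ARGV or die "usage: $0 FILE\n";
print Digest::SHA->new(1)->addfile($file, "b")->hexdigest, "  $file\n";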

It's been done, but it is only intended for situations where a C compiler is unavailable: Digest::SHA::PurePerl
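
If anyone really needed it, it would only make sense as a fallback, roughly like this (a sketch):

use strict;
use warnings;

# Prefer the C-backed Digest::SHA; fall back to the pure-Perl version only
# if it isn't available (e.g. no C compiler on the target machine).
my $sha = eval { require Digest::SHA; Digest::SHA->new(1) }
       || do { require Digest::SHA::PurePerl; Digest::SHA::PurePerl->new(1) };

print ref($sha), "\n";    # shows which implementation was loaded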

Regards,
Alister

I didn't say it hadn't, only that it was ludicrous to do so. :wink:

People have also done things like entire X servers in native Perl. It baffles.

So it takes a while. So what? Is there any reason you need the data faster?

You're probably costing your employer $100-200 an hour when everything is added up. If not more.

There really needs to be a good reason to spend lots of time simply speeding up something that's already working.

it takes some time to process millions of records

It depends on how "some time" is defined. Minutes? Hours? Days?

Well, the username, ribosome, suggests that the data may involve bioinformatics. If that's so, then very large datasets may be involved and a small improvement could yield significant monetary savings over its lifetime.

Regards,
Alister

---------- Post updated at 03:24 PM ---------- Previous update was at 03:04 PM ----------

By the way, Ribosome, if you would like some concrete advice, it's about time you posted a sample of the logs that you are processing, a description of how they ought to be processed, a sample of the processed output, the perl code that you are currently using to process them, and how that script is called. Only with that information in hand will a competent perl coder be able to make useful recommendations. If this is too much to reasonably accommodate in a forum post, attach it in a file (archived if necessary).

Given the paucity of details, it's possible that your bottleneck is an inefficient shell script wrapper. Who knows.

Regards,
Alister

Now we're getting somewhere. Rebuilding a log scanner in C sounds much more doable than rebuilding an entire application. Can you show this script?

Also, tell us the size of the dataset. How large are the files (file size in bytes) and how many are there?

Regards,
Alister

Here is your own code and its run on a different platform:

$ ll
total 31856
-rw-r--r--  1 pkumarpr  ipg       144 Apr 21 15:27 sha1_.pl
-rw-r--r--  1 pkumarpr  ipg  32544556 Apr 21 15:36 usr.bin
$ cat sha1_.pl
use Digest::SHA1;
binmode(STDIN);
my $s = Digest::SHA1->new;
while (read(STDIN, $buf, 1024)) {
    $s->add($buf);
}
print $s->hexdigest . "\n"
$ time perl sha1_.pl < usr.bin
bad603823684e49866e5c744d560992d34743338

real    0m1.520s
user    0m0.368s
sys     0m0.196s

$ time sha1 < usr.bin
bad603823684e49866e5c744d560992d34743338

real    0m0.181s
user    0m0.128s
sys     0m0.053s

That run by itself shows around an 8x performance difference compared with the standard sha1 utility.

The above was run on a FreeBSD 7.1 machine -- a very busy build machine we have in the lab!!!

One should not draw conclusions about the overall scenario from a few runs of simple tests on a single type of hardware and a single platform; the results can differ drastically in other environments, especially controlled ones like appliances/devices, where a process might be configured to run at 100% CPU utilization and at high priority (in a multi-core/multi-CPU environment). I think you should have considered this too!!!

Now, about the 440x performance: you calculated it from the numbers I described, great!!! But you still don't know the environment in which my C program ran and in which I got that kind of performance difference.

Fact:

IBM ISS Proventia GX-series IPS devices, under FIPS mode, run a boot-time integrity check (an SHA1 hash) of the entire disk in just around 3 seconds (around 5 seconds on the low-end models).

The devices are publicly available, and you should have a look at them.

Under similar conditions, that Perl-script-based utility took a 22-minute run!!!