Perl script for file processing

Aim:

To scan a file, ignore all characters that have an ASCII value from 0 to 31 or from 127 to 255, and accept only those characters with an ASCII value between 32 and 126.

Script:

#!/usr/local/bin/perl
$filename = "$ARGV[0]";
if (-e $filename)
{
    # read the file line by line and keep only printable ASCII characters
    open(OUT, "${filename}") || die "can't open $filename\n";
    while (<OUT>) {
        $found = "";
        $stat  = 0;
        chomp $_;
        my @charArray = split(//, $_);
        my $ref = \@charArray;
        foreach (@charArray) {
            $val = ord($$ref[$stat]);
            if (($val > 31) && ($val < 127)) {
                $found = "$found$$ref[$stat]";
            }
            $stat++;
        }
        $found = "$found\n";
        print $found;
    }
    close(OUT);
}

Problem:
The code above runs for 20-25 minutes on a 500 MB file, which is very slow.

Can someone let me know if this can be done in a more efficient way, so as to reduce the processing time?

Try this:

#! /opt/third-party/bin/perl

open(FILE, "<", $ARGV[0]) || die ("unable to open <$!>\n");

while( read(FILE, $data, 1) == 1 ) {
  $ordVal = ord($data);
  print "$ordVal"  if( $ordVal >= 32 && $ordVal <= 126 );
}

close(FILE);

exit(0);

Hi Madhan,

Corrected code:

#!/usr/local/bin/perl
open(FILE, "<", $ARGV[0]) || die("unable to open <$!>\n");
while (read(FILE, $data, 1) == 1) {
    if ((ord($data) >= 32) && (ord($data) <= 126)) {
        print "$data";
    }
    if (ord($data) == 10) {
        print "\n";
    }
}
close(FILE);

It's great, it takes just 10 mins now. Is there anything else that can be done to reduce the duration further?

A minor change, but this will make a difference.

Change the two if blocks above to statement modifiers:

print "$data" if((ord($data)>=32)&&(ord($data)<=126));

print "\n" if(ord($data)==10);

Hi Madhan,

Thanks. It still takes 10 mins. One question here: if we are reading the entire file and moving through it character by character, won't that consume valuable memory? For example, in C we can take a certain number of bytes (as a first batch) from the file, process them, and then move on to the next batch of the file.
Can anything be done here?

Please correct me if I am wrong.

Are you testing with the same input file?
And presumably with the same system load each time?

Yes.

You could probably try something like

while( read(FILE, $data, 1000) == 1000 ) {

and then split $data and process it in memory;
I think it will definitely reduce the I/O here.
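
Roughly along these lines, as a sketch (the 4 KB buffer and the regex used for the per-chunk filtering are just my own assumptions, not something I have timed on your data; the plain truth test on read() also means the final partial chunk is handled inside the loop):

#!/usr/local/bin/perl
use strict;
use warnings;

open(my $fh, "<", $ARGV[0]) || die("unable to open <$!>\n");

my $data;
# read in 4 KB blocks instead of one byte at a time
while (read($fh, $data, 4096)) {
    # keep printable ASCII (32-126) and newlines, drop everything else
    $data =~ s/[^\x20-\x7e\n]//g;
    print $data;
}
close($fh);

The filtering is the same as before; the difference is far fewer read() calls and no per-character Perl loop.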

Let me know how it goes! :)

Hi Madhan,

No. It's taking double the time...

Script:

#!/usr/local/bin/perl
open(FILE, "<", $ARGV[0]) || die("unable to open <$!>\n");
while (read(FILE, $data, 1000) == 1000) {
    $stat = 0;
    @char = split(//, $data);
    foreach (@char) {
        print "$char[$stat]" if ((ord($char[$stat]) >= 32) && (ord($char[$stat]) <= 126));
        print "\n" if (ord($char[$stat]) == 10);
        $stat++;
    }
}
# process the last (partial) batch left over after the loop
$stat = 0;
@char = split(//, $data);
foreach (@char) {
    print "$char[$stat]" if ((ord($char[$stat]) >= 32) && (ord($char[$stat]) <= 126));
    print "\n" if (ord($char[$stat]) == 10);
    $stat++;
}
close(FILE);

So I am reverting to the old logic, which took 10 mins for a 300 MB file.

Please let me know if anything else can be done.

Thank you for your support and help on this. :)

#!/usr/local/bin/perl
use warnings;
use strict;
$ARGV[0] or die "Need a filename\n";
while(<>) {
   foreach my $t (split(//)){
      my $ord = ord $t;
      print $t if ( $ord >= 32 && $ord <= 126);
   }
   print "\n";
}

Perl actually has pretty efficient internal optimizations for this kind of stuff.

If you are really into optimization, you can establish a baseline by running just perl -ne 1 on the file and then see how much time your additional processing takes. Add more steps piecemeal and see if there are any really big jumps in the timings. If there are, figure out whether you are disabling some internal optimization and whether rephrasing the code can get it back.
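
For example, something like this on the shell (bigfile and filter.pl are just placeholders for your input file and your script):

time perl -ne '1' bigfile > /dev/null        # baseline: just read the file line by line
time perl -ne 'print' bigfile > /dev/null    # baseline plus output
time perl filter.pl bigfile > /dev/null      # the full filtering script

Big jumps between these numbers tell you which step is actually costing the time.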

Can you split the processing, like tr -d '\000-\037\200-\377' <file | perl ... and get away with it?

(Or '\000-\011\013-\037\200-\377' if you want to preserve the newlines, like matrixmadhan observed.)

Duh, actually, newline is '\012', sorry for the brain fart. ('\011' is tab, might want to keep that too, though.)

// Edited the posting to correct it there.
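
For what it's worth, the same range deletion can also be done inside Perl with the tr/// operator, which avoids the per-character loop entirely. A minimal sketch, assuming you want to keep newlines as in the corrected script above (I have not timed it on a 500 MB file):

#!/usr/local/bin/perl
use strict;
use warnings;

while (<>) {
    # delete every character outside printable ASCII (\x20-\x7e), keeping newlines
    tr/\x20-\x7e\n//cd;
    print;
}

The same thing as a one-liner would be perl -pe 'tr/\x20-\x7e\n//cd' file.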