Perl script for file processing

Aim:

To scan a file, ignore all characters that have an ASCII value from 0 to 31 or from 127 to 255, and accept only those characters with an ASCII value between 32 and 126.

Script:

#!/usr/local/bin/perl
$filename = "$ARGV[0]";
if (-e $filename)
{
    # read the file line by line and keep only printable ASCII characters
    open(OUT, "${filename}") || die "can't open $filename\n";
    while (<OUT>) {
        $found = "";
        $stat  = 0;
        chomp $_;
        my @charArray = split(//, $_);
        my $ref = \@charArray;
        foreach (@charArray) {
            $val = ord($$ref[$stat]);
            if (($val > 31) && ($val < 127)) {
                $found = "$found$$ref[$stat]";
            }
            $stat++;
        }
        $found = "$found\n";
        print $found;
    }
    close(OUT);
}

Problem:
The code above runs for 20-25 minutes on a 500 MB file, which is very slow.

Can someone let me know if this can be done in a more efficient way, so as to reduce the processing time?

Try this:

#! /opt/third-party/bin/perl

open(FILE, "<", $ARGV[0]) || die ("unable to open <$!>\n");

while( read(FILE, $data, 1) == 1 ) {
  $ordVal = ord($data);
  print "$ordVal"  if( $ordVal >= 32 && $ordVal <= 126 );
}

close(FILE);

exit(0);

Hi Madhan,

Corrected code:

#!/usr/local/bin/perl
open(FILE, "<", $ARGV[0]) || die("unable to open <$!>\n");
while (read(FILE, $data, 1) == 1) {
    if ((ord($data) >= 32) && (ord($data) <= 126)) {
        print "$data";
    }
    if (ord($data) == 10) {
        print "\n";
    }
}
close(FILE);

It's great, it takes just 10 mins now. Is there anything else that can be done to reduce the duration further?

A minor change, but this will make a difference.

Change the two if blocks above to statement modifiers:

print "$data" if((ord($data)>=32)&&(ord($data)<=126));

print "\n" if(ord($data)==10);

Hi Madhan,

Thanks. It still takes 10 mins. One question here: if we are reading the entire file and moving through it character by character, won't that consume valuable memory? For example, in C we can take a certain number of bytes (as a first batch) from the file, process them, and then move on to the next batch of the file.
Can anything be done here?

Please correct me if I am wrong.

Are you testing with the same input file?
And presumably with the same system load each time?

Yes.

You could probably try something like

while( read(FILE, $data, 1000) == 1000 ) {

and then split $data and process it in memory;
I think it will definitely reduce the I/O here.
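
Roughly along these lines, as a sketch (the 4 KB buffer and the regex used for the per-chunk filtering are just my own assumptions, not something I have timed on your data; the plain truth test on read() also means the final partial chunk is handled inside the loop):

#!/usr/local/bin/perl
use strict;
use warnings;

open(my $fh, "<", $ARGV[0]) || die("unable to open <$!>\n");

my $data;
# read in 4 KB blocks instead of one byte at a time
while (read($fh, $data, 4096)) {
    # keep printable ASCII (32-126) and newlines, drop everything else
    $data =~ s/[^\x20-\x7e\n]//g;
    print $data;
}
close($fh);

The filtering is the same as before; the difference is far fewer read() calls and no per-character Perl loop.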

Let me know how it goes! :)

Hi Madhan,

No. It's taking double the time...

Script:

#!/usr/local/bin/perl
open(FILE, "<", $ARGV[0]) || die("unable to open <$!>\n");
while (read(FILE, $data, 1000) == 1000) {
    $stat = 0;
    @char = split(//, $data);
    foreach (@char) {
        print "$char[$stat]" if ((ord($char[$stat]) >= 32) && (ord($char[$stat]) <= 126));
        print "\n" if (ord($char[$stat]) == 10);
        $stat++;
    }
}
# process the last (partial) batch left over after the loop
$stat = 0;
@char = split(//, $data);
foreach (@char) {
    print "$char[$stat]" if ((ord($char[$stat]) >= 32) && (ord($char[$stat]) <= 126));
    print "\n" if (ord($char[$stat]) == 10);
    $stat++;
}
close(FILE);

So I am reverting to the old logic, which took 10 mins for a 300 MB file.

Please let me know if anything else can be done.

Thank you for your support and help on this. :)

#!/usr/local/bin/perl
use warnings;
use strict;
$ARGV[0] or die "Need a filename\n";
while(<>) {
   foreach my $t (split(//)){
      my $ord = ord $t;
      print $t if ( $ord >= 32 && $ord <= 126);
   }
   print "\n";
}

Perl actually has pretty efficient internal optimizations for this kind of stuff.

If you are really into optimization, you can establish a baseline by running just perl -ne 1 on the file and then see how much time your additional processing takes. Add more steps piecemeal and see if there are any really big jumps in the timings. If there are, figure out whether you are disabling some internal optimization and whether rephrasing the code can get it back.
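
For example, something like this on the shell (bigfile and filter.pl are just placeholders for your input file and your script):

time perl -ne '1' bigfile > /dev/null        # baseline: just read the file line by line
time perl -ne 'print' bigfile > /dev/null    # baseline plus output
time perl filter.pl bigfile > /dev/null      # the full filtering script

Big jumps between these numbers tell you which step is actually costing the time.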

Can you split the processing, like tr -d '\000-\037\200-\377' <file | perl ... and get away with it?

(Or '\000-\011\013-\037\200-\377' if you want to preserve the newlines, like matrixmadhan observed.)

Duh, actually, newline is '\012', sorry for the brain fart. ('\011' is tab, might want to keep that too, though.)

// Edited the posting to correct it there.
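
For what it's worth, the same range deletion can also be done inside Perl with the tr/// operator, which avoids the per-character loop entirely. A minimal sketch, assuming you want to keep newlines as in the corrected script above (I have not timed it on a 500 MB file):

#!/usr/local/bin/perl
use strict;
use warnings;

while (<>) {
    # delete every character outside printable ASCII (\x20-\x7e), keeping newlines
    tr/\x20-\x7e\n//cd;
    print;
}

The same thing as a one-liner would be perl -pe 'tr/\x20-\x7e\n//cd' file.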