To scan a file and ignore all characters that have an ASCII value from 0 to 31 or from 127 to 255, accepting only those characters with an ASCII value between 32 and 126.
Script:
#!/usr/local/bin/perl
$filename = "$ARGV[0]";
if (-e $filename)
{
    open(OUT, "$filename") || die "can't open $filename\n";
    while (<OUT>) {
        $found = "";
        $stat  = 0;
        chomp $_;
        my @charArray = split(//, $_);
        my $ref = \@charArray;
        foreach (@charArray) {
            $val = ord($$ref[$stat]);
            if (($val > 31) && ($val < 127)) {
                $found = "$found$$ref[$stat]";
            }
            $stat++;
        }
        $found = "$found\n";
        print $found;
    }
    close(OUT);
}
Problem:
The code above takes 20-25 minutes to process a 500 MB file, which is very slow.
Can someone suggest a more efficient approach that would reduce the processing time?
Thanks. It still takes 10 minutes. One question: if we read the entire file and move through it character by character, won't that consume valuable memory? In C, for example, we can read a certain number of bytes from the file (as a first batch), process them, and then continue with the next batch.
Can anything like that be done here?
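First, note that while (<>) already reads one line at a time, so it never holds the whole file in memory. If you want C-style fixed-size batches anyway, here is a minimal sketch: Perl's read() fills a buffer of a chosen size, so memory use is bounded by the block size regardless of file size. The tr///cd filter, the strip_unprintable name, and the 64 KB block size are my choices for illustration, not something from this thread.

```perl
use strict;
use warnings;

# Keep only printable ASCII (32-126) plus newline. tr///cd deletes
# the complement (c) of the listed ranges (d), in one pass over the
# buffer - much faster than one ord() call per character.
sub strip_unprintable {
    my ($buf) = @_;
    $buf =~ tr/\x0a\x20-\x7e//cd;
    return $buf;
}

# Process the file in fixed-size blocks, like fread() in C, so
# memory use is bounded by the block size, not the file size.
if (my $file = shift @ARGV) {
    open my $fh, '<', $file or die "can't open $file: $!\n";
    binmode $fh;
    my $buf;
    while (read $fh, $buf, 64 * 1024) {
        print strip_unprintable($buf);
    }
    close $fh;
}
```

Since the filter works byte-wise, it is safe to apply it to each block independently; no character can be split across a block boundary.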
#!/usr/local/bin/perl
use warnings;
use strict;

$ARGV[0] or die "Need a filename\n";
while (<>) {
    foreach my $t (split //) {
        my $ord = ord $t;
        print $t if $ord >= 32 && $ord <= 126;
    }
    print "\n";
}
Perl actually has pretty efficient internal optimizations for this kind of stuff.
If you are really into optimization, you can establish a baseline by running just perl -ne 1 on the file, then measure how much time your additional processing adds. Add more steps piecemeal and watch for any really big jumps in the timings. If there are, figure out whether you are disabling some internal optimization and whether rephrasing the code can get it back.
Can you split the processing, like tr -d '\000-\037\177-\377' <file | perl ... and get away with it? (Octal \177 is DEL, ASCII 127, which your spec also excludes.)
(Or '\000-\011\013-\037\177-\377' if you want to preserve the newlines, like matrixmadhan observed.)