Perl script to read string from file#1 and find/replace in file#2

Hello Forum.

I have a file called abc.sed with the following commands:

s/1/one/g
s/2/two/g
...

I also have a second file called abc.dat, and I would like to substitute all occurrences of "1" with "one", "2" with "two", etc., and create a new file called abc_new.dat:

sed -f abc.sed abc.dat > abc_new.dat

For small files, this command works fine, but for large files it's very slow.

I read that Perl might be faster in doing this kind of operation.

Can you please help me write the Perl code if you think it can work faster?

I am a newbie in Perl scripting.

Thanks and appreciate all the help you can provide.

How long is that list of substitutions in abc.sed?

---------- Post updated at 11:36 PM ---------- Previous update was at 10:17 PM ----------

Give it a try and tell me if it does make a difference.

#!/usr/bin/perl

use strict;
use warnings;

my %replace;

# Read the s/from/to/g lines and build a from => to lookup table
open my $fh, '<', "abc.sed" or die "$!\n";
my @gsubs = <$fh>;
close $fh;
@gsubs = map { s{^s/|/g$|\n}{}g; split "/" } @gsubs;   # strip leading "s/", trailing "/g", newline
%replace = @gsubs;

# One alternation of all keys, so each line is scanned in a single pass
my $search = join '|', keys %replace;

open $fh, '<', "abc.dat" or die "$!\n";
while (<$fh>) {
    s/($search)/$replace{$1}/ge;
    print;    # prints to stdout; redirect to create abc_new.dat
}
close $fh;

Thanks, Aia, for your suggestion.

abc.sed could vary in size from 1 to maybe 10,000 records.

I'm trying to find a solution that will work faster than sed. It doesn't have to be Perl specifically. If you have some other ideas to process the file faster, that would be great.

I tried your code but am getting the following errors:

more test.pl
#!/usr/bin/perl

use strict;
use warnings;

my %replace;

open my $fh, '<', "abc.sed" or die "$!\n";
my @gsubs = <$fh>;
close $fh;
@gsubs = map{s/^s\/|\/g|\n//g; split "/"} @gsubs;
%replace = @gsubs;

my $search = join '|', keys %replace;

open $fh, '<', "abc.dat" or die "$!\n";
while(<$fh>) {
    s/($search)/$replace{$1}/ge;
    print;
}
close $fh;
sh test.pl

test.pl: line 3: use: command not found
test.pl: line 4: use: command not found
test.pl: line 6: my: command not found
Couldn't get a file descriptor referring to the console
test.pl: line 9: syntax error near unexpected token `;'
test.pl: line 9: `my @gsubs = <$fh>;'

I made test.pl executable, and the /usr/bin/perl executable exists on the Linux box.

Thanks.

---------- Post updated at 10:08 AM ---------- Previous update was at 07:26 AM ----------

I'm thinking of another option to use instead of sed or Perl: how about tr?

Do you think that this will work faster?

Thanks.

If you could modify your abc.sed file from the format:

s/1/one/g
s/2/two/g
...

to instead be in the format:

g/1/s//one/g
g/2/s//two/g
...
w abc_new.dat
q

and save the modified contents in a file named abc.ed
and then try the command:

ed -s abc.dat < abc.ed

or:

ex -s abc.dat < abc.ed

instead of:

sed -f abc.sed abc.dat > abc_new.dat

it would be interesting to know if either of these makes any difference in how long it takes.

I would imagine the difference between ed or ex and sed could be significant, depending on the number of substitutions being made and on the number of lines in the file being modified. But there is no way to verify my imagination without a benchmark to test it against. Since I have no way to guess at the real substitutions you're performing, nor at the data being processed, it is hard for me to create data on my system that would be a reasonable benchmark simulating your data running these commands on your OS on your hardware.

And depending on your OS, there might or might not be a significant difference between the ed and ex utilities for your data.

Hope this helps...

PS: The reason I imagine that ed and ex would be faster than sed is that each substitution command is run once for the entire file with these utilities, while sed runs each substitution command once for each input line.

I doubt there will be a significantly faster solution; sed itself is lightweight and fast. It may be outperformed by a few percent by something else, but not by orders of magnitude. The task itself is heavy duty: it must compare the data file line by line, character by character, against 10,000 to-be-substituted patterns - that will take its time no matter what tool you use.
Be aware that tr can only substitute character by character, not character by word, and thus will not solve your problem.
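A quick demonstration of that limitation (the input here is just toy data): tr maps individual characters one-for-one, so it can never turn a digit into a word:

```shell
# tr performs character-for-character mapping: every '1' becomes 'o',
# every '2' becomes 't' -- there is no way to produce "one" or "two"
printf '1 and 2\n' | tr '12' 'ot'
# prints: o and t
```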

It looks like the code is not being interpreted by Perl, but rather by sh.
Executing it as /usr/bin/perl test.pl might fix that.
However, I doubt that it would produce a satisfactory result with that number of substitutions.

The substitution process is as good as it gets in sed or Perl. All I was trying to do was transfer the burden of the task to the regex engine, via one combined alternation, instead of applying the substitutions one at a time in a loop.

Thanks, guys, for all your suggestions.

I'm starting to think outside the Unix utilities and maybe a Java program will speed things up? What do you think?

Unfortunately, I don't have any experience with Java coding.

Could you tell us how the run time using sed compared when you tried my suggestion in post #4 in this thread using ed and/or ex?

I take it that my intuition was off, but for future reference, I'd appreciate hearing how the run times compared.

Hi Don.

When I have some time, I will test out your suggestion.

Right now, we are just trying to find a quick solution - we are going live in a few weeks.

Thanks.

Never mind. My intuition was off. At least on OS X 10.10.4 on a two-year-old MacBook Pro, with abc.txt containing about 4k lines and abc.sed containing 144 substitutions, sed averaged about .07 seconds, ex averaged about .11 seconds, and ed took about .45 seconds. Your mileage may vary with different data, different hardware, and/or a different OS. If you want to try the script I used to convert abc.sed to an equivalent abc.ed and time sed and ed, it is at the end of this post. The abc.ed script it produces will work with both ed and ex. (Note that it was only tested with alphanumeric BREs in the sed substitute commands. Further work may be required if BRE special characters (especially /) or commas are included in the BREs in abc.sed.)

#!/bin/ksh
rm -f abc.ed
# Time the original sed approach
time sed -f abc.sed abc.txt > new_abc.txt
# Convert abc.sed (s/from/to/g lines) into an equivalent ed script, abc.ed
ed abc.sed <<-"EOF"
g/s\/\([^\/]*\)/s,,g/\1/s/,
$a
w _new_abc.txt
q
.
w abc.ed
q
EOF
# Time the ed approach and compare the two outputs
time ed -s abc.txt < abc.ed
echo comparing results
diff _new_abc.txt new_abc.txt

I did some benchmarks, too.
With sed, the runtime grows exponentially with the size of abc.sed.
GNU sed, when abc.sed is >90 lines, is already slower than the Perl solution in post #2.
For my tests I have embedded the Perl code in a shell script, so it can read from a pipe or from an argument, just like sed:

#!/bin/sh

perl -pe '
BEGIN {
open $fh, "<", "abc.sed" or die "$!\n";
@gsubs = <$fh>;
close $fh;
@gsubs = map{s#^s/|/g$|\n##g; split "/"} @gsubs;
%replace = @gsubs;

$search = join "|", keys %replace;
}

# main loop
{
    s/($search)/$replace{$1}/g;
}
' "$@"