speed issues with pattern matching and substitution

Arturi · Apr 4, 2005

Hello there,

I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

cheers

here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;
foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}
}
print OUTPUT2 $line;
}

---------------------------------------------------------------------------

A. Sinan Unur · Apr 4, 2005

(e-mail address removed) (Arturi) wrote in

The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

%HASH #is being defined previously

use strict;
use warnings;

missing;

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

my ($line, $tmp);

You should declare variables in the smallest applicable scope.

open(FILE,"< in.txt") || die "Can't open ";

open my $file, '<', 'in.txt' or die "Can't open in.txt: $!";

open(OUTPUT2,"< out.txt") || die "Can't open ";

Why are you opening the output file for reading only?

Also, see above.

while (<FILE>) {
$line = $_;

while(my $line = said:
foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}

What is the role of the while there?

}
}
print OUTPUT2 $line;
}

I have a sneaking suspicion you have not posted real code.

Please read the posting guidelines for this group for help on how to
help others help you.

Please post real code.

Sinan.

xhoster · Apr 4, 2005

Hello there,

I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

Yes. Post real code. For example, opening an output file for reading only
is probably not real code. Describe the input or better give some short
code which generates acceptable fake input for testing. Tell us how long
it took to run. Tell us how much improvement you are looking for.

Xho

Greg K · Apr 4, 2005

I am more of a lurker here than a Perl guru(someday I hope), but my
guess would be the three comparisons for a match of your key are what
is slowing you down:

if (/(\W)$k(\W)/){
$line =~ /(\W)($k)(\W)/
$line =~ s/(\W)($k)(\W)/$1$tmp$3/;

I use a similar program for updating a bunch of files in a directory. I
wrote it so you could use two arrays which contain old and new values,
but I just modified it to use a hash instead. I know my code probably
isn't as concise as some people here can write, but you're welcome to
use if you like. It actually scans a directory listing for files which
match a regexp, then processes those.

[CodeA - Arrays]
#!/usr/local/bin/perl
use strict;
use warnings;

#directory info and file reg exp
my $dir = opendir DIR,"../old";
my @files = readdir(DIR);
my $fileregexp = "txt";

# number of files processed
my $count=0;

#arrays of stuff you want to change
my @old;
my @new;

## old will get interpreted as a regexp so make sure you escape chars
as needed
$old[0] = 'C';
$new[0] = 'Perl';

foreach my $filename (@files)
{
if( $filename =~ /$fileregexp/ )
{
$count++;
open IN, "<../old/$filename" || die "Unable to open files for
reading";
open OUT, ">../new/$filename"|| die "Unable to open files for
writing";
while ( <IN> )
{
for( my $t = 0; $t < ($#old + 1); $t++ )
{
s/$old[$t]/$new[$t]/g;
}
print OUT $_;
}
close IN;
close OUT;
print "../new/$filename created\n";
}
}
print "$count files modified\n";
[/CodeA]

[CodeB - Hash]
#!/usr/local/bin/perl
use strict;
use warnings;

#directory info and file reg exp
my $dir = opendir DIR,"../old";
my @files = readdir(DIR);
my $fileregexp = "txt";

# number of files processed
my $count=0;

## the key will be interpreted as a regexp so escape anything you need
to here, such as parens
my %changes = (
'C' => 'Perl'
);

my @keys = keys( %changes );

foreach my $filename ( @files )
{
if( $filename =~ /$fileregexp/ )
{
$count++;
open IN, "<../old/$filename" || die "Unable to open files for
reading";
open OUT, ">../new/$filename" || die "Unable to open files for
writing";
while( <IN> )
{
foreach my $k ( @keys )
{
s/$k/$changes{$k}/g;
}
print OUT $_;
}
close IN;
close OUT;
print "../new/$filename created\n";
}
}
print "$count files modified\n";
[/CodeB]

Data input file(essay.txt):
C is the best programming language.
There is more than one way to do it in C.
I like all the premade data structures in C.

Data output file(essay.txt):
Perl is the best programming language.
There is more than one way to do it in Perl.
I like all the premade data structures in Perl.

-Greg

John W. Krahn · Apr 4, 2005

Arturi said:
I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.

Use a profiling module or the Benchmark module to determine which parts of the
code are too slow.

here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;
foreach $k (@keys) {
if (/(\W)$k(\W)/){

Why use \W which requires a character at either end of $k instead of \b which
doesn't? Why use capturing parentheses?

#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}

More simply written as:

1 while $line =~ s/(\W)($k)(\W)/$1$tmp$3/;

But the only reason to do it like that is if the keys/values are overlapping
or recursive. Is there any reason that you can't just use the /g global option?

}
print OUTPUT2 $line;
}

---------------------------------------------------------------------------

open FILE, '<', 'in.txt' or die "Can't open 'in.txt' $!";
open OUTPUT2, '>', 'out.txt' or die "Can't open 'out.txt' $!";

while ( my $line = <FILE> ) {
for my $k ( keys %HASH ) {
$line =~ s/\b($k)\b/$HASH{$k}/g;

# if you REALLY need it
# 1 while $line =~ s/\b($k)\b/$HASH{$k}/g;
}
print OUTPUT2 $line;
}

John

Brian McCauley · Apr 4, 2005

Arturi said:
I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

You are writing something of the form:

if( some_condition() ) {
do_something();
}

Where do_something() will not do anything anyhow if some_condition() is
not met. As such it would be simpler just to write:

do_something();

Actually you've gone one step further and written:

if( some_condition() ) {
while( some_condition() ) {
do_something();
}
}

Where do_something() will not only not do anything if some_condition()
is not met but will also return false. As such it would be simpler just
to write:

while( do_something() ) {}

Now if we actually look a the something, in quesion it's a s///. Do you
really want to to be able to seach and replace again within the previous
replacements? Looking at your code I'd guess that wouldn't make sense.
And if that's the case then you could simply use s///g.

here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

Nasty case of premature declaration you've got there.

open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;

If you are worried abot speed why are you doing that?

foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};

Awful name for a variable, $tmp.

while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}
}

If you are concerned about speed you should take the comoposition and
compilatiion of the regex outside the loop.

my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;

local *_; # Wise not to stomp on someone else's $_

while (<FILE>) {
my @values = values %HASH;
for my $pattern ( @patterns ) {
my $value = shift @values;
s/$pattern/$1$value$3/g;
}
# print or whatever
}

But my spider-sense is telling me that perhaps the keys of %HASH really
arbirary regexes but are just words and all you want is word
substitution. This is a classic bit of Perl.

while (<FILE>) {
s/(\w+)/$HASH{$1}||$1/eg;
# print or whatever
}

Tad McClellan · Apr 5, 2005

Greg K said:
I am more of a lurker here than a Perl guru(someday I hope),

Use the FAQ Luke!

perldoc -q match

How do I efficiently match many regular expressions at once?

but my
guess

There is no need to guess:

perldoc -q profile

How do I profile my Perl programs?

I use a similar program

I sure hope that does not mean the the code below is what
you have actually used.

If it is, then you just haven't found the bugs that are in it yet...

#!/usr/local/bin/perl
use strict;
use warnings;

A most excellent start, BTW.

my $dir = opendir DIR,"../old";

You should check the return value to see if you actually got
what you asked for, just like with open() below.

$old[0] = 'C';
$new[0] = 'Perl';

open IN, "<../old/$filename" || die "Unable to open files for
reading";

You should include the $! variable in your die message, it
contains the reason for the dying.

for( my $t = 0; $t < ($#old + 1); $t++ )

Gak!

foreach my $t ( 0 .. $#old )

or at least:

for( my $t = 0; $t < @old; $t++ )

or even:

for( my $t = 0; $t <= $#old; $t++ )

s/$old[$t]/$new[$t]/g;

If

$_ = 'Chuck likes C';

Then you _want_ it to be changed to:

Perlhuck likes Perl

??

Probably not, so maybe this instead:

s/\b$old[$t]\b/$new[$t]/g;

Greg K · Apr 5, 2005

Use the FAQ Luke!

perldoc -q match
How do I efficiently match many regular expressions at once?

Nice, didn't know about that. Thanks for the tip!

I sure hope that does not mean the the code below is what
you have actually used.
If it is, then you just haven't found the bugs that are in it yet...

Actually, yes, I do use the following.

for( my $t = 0; $t < @old; $t++ )
for( my $t = 0; $t <= $#old; $t++ )

I would use either of these this one. Unfortunately I still have to
switch back and forth between languages frequently so I probably won't
switch to the ( 0 .. $#old ) notation any time soon.

s/$old[$t]/$new[$t]/g;

Click to expand...

If
$_ = 'Chuck likes C';
Then you _want_ it to be changed to:
Perlhuck likes Perl

Actually, for my case it does what I want. I am matching long unique
strings so I don't have to worry about accidental matches(strings such
as "../group_name/participants.html", etc). I figured the OP can write
the regexp for their specific need.

Anyway, thx for the constructive criticism. I am going to keep reading
through the PerlFAQ.

Anno Siegel · Apr 5, 2005

Brian McCauley said:
Arturi said:

I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.

Click to expand...

[...]

If you are concerned about speed you should take the comoposition and
compilatiion of the regex outside the loop.

my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;

local *_; # Wise not to stomp on someone else's $_

while (<FILE>) {
my @values = values %HASH;
for my $pattern ( @patterns ) {
my $value = shift @values;
s/$pattern/$1$value$3/g;
}
# print or whatever
}

But my spider-sense is telling me that perhaps the keys of %HASH really
arbirary regexes but are just words and all you want is word
substitution. This is a classic bit of Perl.

while (<FILE>) {
s/(\w+)/$HASH{$1}||$1/eg;
# print or whatever
}

At first sight that looks rather inefficient because *every* word
is captured and replaced (often by itself). Out of interest I
benchmarked a few variants, and it turns out that it is just as
fast as more selective solutions.

I'm appending the benchmark script, in case anyone cares. The four
variants tested are (least efficient first, the difference between
the last two is insignificant):

direct -- every regex is compiled from the string at run time
regex -- all alternatives compiled into a big regex
brian -- replace every word, like the second solution above
precomp -- regular expressions precompiled via qr//, like the
first solution above

Anno

--

#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;

use Benchmark qw( cmpthese);

our @lines = <DATA>;

my %repla; # string replacements (master)
@repla{
qw( one two three four five six seven eight nine ten eleven twelve)
} = qw( ein zwei drei vier fuenf sechs sieben acht neun zehn elf zwoelf);

my @repla = map [ qr/\b$_\b/ => $repla{ $_}], keys %repla; # precompiled

my $repla = do { # single regex
my $re = join '|', map "\\b$_\\b", keys %repla;
qr/($re)/;
};

goto bench;

check:
print brian( @lines);
print "\n", @lines;
exit;

bench:
@lines = ( @lines) x 10; # vary to check linearity
cmpthese -1, {
direct => 'direct( @lines)',
precomp => 'precomp( @lines)',
brian => 'brian( @lines)',
regex => 'regex( @lines)',

};
#######################################################################

sub direct {
my @new = @_;
for ( @new ) {
my ( $str, $rpl);
s/\b$str\b/$rpl/g while ( $str, $rpl) = each %repla;
}
@new;
}

sub precomp {
my @new = @_;
for ( @new ) {
for my $r ( @repla ) {
s/$r->[ 0]/$r->[ 1]/g;
}
}
@new ;
}

sub brian {
my @new = @_;
s/(\w+)/$repla{ $1} || $1/eg for @new;
@new;
}

sub regex {
my @new = @_;
s/$repla/$repla{ $1}/g for @new;
@new;
}

__DATA__
one spot was detected on each of the two blades
three jolly coachmen, three jolly coachmen
twelve to a dozen
a display of oneupmanship
three o'clock four o'clock five o'clock ROCK

Help with pattern matching	20	Apr 11, 2012
Need help with pattern matching/substitution.	6	Sep 26, 2006
Push regex search result into hash with multiple values	14	May 19, 2014
Regular expression segmentation Fault with in-place substitution	1	Jul 29, 2009
possible ActivePerl substitution optimisation issue	0	Jun 12, 2007
Pyautogui, cv2 and cannot find image	0	Feb 7, 2023
find a matching pattern in file and find it in another file too	2	Mar 13, 2008
Php combine identical lines in text file	4	Oct 11, 2023

speed issues with pattern matching and substitution

Arturi

A. Sinan Unur

xhoster

Greg K

John W. Krahn

Brian McCauley

Tad McClellan

Greg K

Anno Siegel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads