speed issues with pattern matching and substitution

A

Arturi

Hello there,

I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

cheers

here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;
foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}
}
print OUTPUT2 $line;
}

---------------------------------------------------------------------------
 
A

A. Sinan Unur

(e-mail address removed) (Arturi) wrote in
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?
%HASH #is being defined previously

use strict;
use warnings;

missing;
@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

my ($line, $tmp);

You should declare variables in the smallest applicable scope.
open(FILE,"< in.txt") || die "Can't open ";

open my $file, '<', 'in.txt' or die "Can't open in.txt: $!";
open(OUTPUT2,"< out.txt") || die "Can't open ";

Why are you opening the output file for reading only?

Also, see above.
while (<FILE>) {
$line = $_;

while(my $line = said:
foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}

What is the role of the while there?
}
}
print OUTPUT2 $line;
}

I have a sneaking suspicion you have not posted real code.

Please read the posting guidelines for this group for help on how to
help others help you.

Please post real code.

Sinan.
 
X

xhoster

Hello there,

I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

Yes. Post real code. For example, opening an output file for reading only
is probably not real code. Describe the input or better give some short
code which generates acceptable fake input for testing. Tell us how long
it took to run. Tell us how much improvement you are looking for.

Xho
 
G

Greg K

I am more of a lurker here than a Perl guru(someday I hope), but my
guess would be the three comparisons for a match of your key are what
is slowing you down:
if (/(\W)$k(\W)/){
$line =~ /(\W)($k)(\W)/
$line =~ s/(\W)($k)(\W)/$1$tmp$3/;

I use a similar program for updating a bunch of files in a directory. I
wrote it so you could use two arrays which contain old and new values,
but I just modified it to use a hash instead. I know my code probably
isn't as concise as some people here can write, but you're welcome to
use if you like. It actually scans a directory listing for files which
match a regexp, then processes those.

[CodeA - Arrays]
#!/usr/local/bin/perl
use strict;
use warnings;

#directory info and file reg exp
my $dir = opendir DIR,"../old";
my @files = readdir(DIR);
my $fileregexp = "txt";

# number of files processed
my $count=0;

#arrays of stuff you want to change
my @old;
my @new;

## old will get interpreted as a regexp so make sure you escape chars
as needed
$old[0] = 'C';
$new[0] = 'Perl';

foreach my $filename (@files)
{
if( $filename =~ /$fileregexp/ )
{
$count++;
open IN, "<../old/$filename" || die "Unable to open files for
reading";
open OUT, ">../new/$filename"|| die "Unable to open files for
writing";
while ( <IN> )
{
for( my $t = 0; $t < ($#old + 1); $t++ )
{
s/$old[$t]/$new[$t]/g;
}
print OUT $_;
}
close IN;
close OUT;
print "../new/$filename created\n";
}
}
print "$count files modified\n";
[/CodeA]


[CodeB - Hash]
#!/usr/local/bin/perl
use strict;
use warnings;

#directory info and file reg exp
my $dir = opendir DIR,"../old";
my @files = readdir(DIR);
my $fileregexp = "txt";

# number of files processed
my $count=0;

## the key will be interpreted as a regexp so escape anything you need
to here, such as parens
my %changes = (
'C' => 'Perl'
);

my @keys = keys( %changes );

foreach my $filename ( @files )
{
if( $filename =~ /$fileregexp/ )
{
$count++;
open IN, "<../old/$filename" || die "Unable to open files for
reading";
open OUT, ">../new/$filename" || die "Unable to open files for
writing";
while( <IN> )
{
foreach my $k ( @keys )
{
s/$k/$changes{$k}/g;
}
print OUT $_;
}
close IN;
close OUT;
print "../new/$filename created\n";
}
}
print "$count files modified\n";
[/CodeB]

Data input file(essay.txt):
C is the best programming language.
There is more than one way to do it in C.
I like all the premade data structures in C.

Data output file(essay.txt):
Perl is the best programming language.
There is more than one way to do it in Perl.
I like all the premade data structures in Perl.

-Greg
 
J

John W. Krahn

Arturi said:
I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.

Use a profiling module or the Benchmark module to determine which parts of the
code are too slow.

here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;
foreach $k (@keys) {
if (/(\W)$k(\W)/){

Why use \W which requires a character at either end of $k instead of \b which
doesn't? Why use capturing parentheses?

#print "$_\n";
$tmp = $HASH{$k};
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}

More simply written as:

1 while $line =~ s/(\W)($k)(\W)/$1$tmp$3/;

But the only reason to do it like that is if the keys/values are overlapping
or recursive. Is there any reason that you can't just use the /g global option?

}
print OUTPUT2 $line;
}

---------------------------------------------------------------------------


open FILE, '<', 'in.txt' or die "Can't open 'in.txt' $!";
open OUTPUT2, '>', 'out.txt' or die "Can't open 'out.txt' $!";

while ( my $line = <FILE> ) {
for my $k ( keys %HASH ) {
$line =~ s/\b($k)\b/$HASH{$k}/g;

# if you REALLY need it
# 1 while $line =~ s/\b($k)\b/$HASH{$k}/g;
}
print OUTPUT2 $line;
}



John
 
B

Brian McCauley

Arturi said:
I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
The code is working fine but I'm wondering what could I do to speed it
up.
Does anybody have a hint?

You are writing something of the form:

if( some_condition() ) {
do_something();
}

Where do_something() will not do anything anyhow if some_condition() is
not met. As such it would be simpler just to write:

do_something();

Actually you've gone one step further and written:

if( some_condition() ) {
while( some_condition() ) {
do_something();
}
}

Where do_something() will not only not do anything if some_condition()
is not met but will also return false. As such it would be simpler just
to write:

while( do_something() ) {}

Now if we actually look a the something, in quesion it's a s///. Do you
really want to to be able to seach and replace again within the previous
replacements? Looking at your code I'd guess that wouldn't make sense.
And if that's the case then you could simply use s///g.
here the code extract:
----------------------------------------------------------------------
%HASH #is being defined previously

@keys= keys(%HASH);
my $line = undef;
my $tmp = undef;

Nasty case of premature declaration you've got there.
open(FILE,"< in.txt") || die "Can't open ";
open(OUTPUT2,"< out.txt") || die "Can't open ";

while (<FILE>) {
$line = $_;

If you are worried abot speed why are you doing that?
foreach $k (@keys) {
if (/(\W)$k(\W)/){
#print "$_\n";
$tmp = $HASH{$k};

Awful name for a variable, $tmp.
while ($line =~/(\W)($k)(\W)/){$line =~
s/(\W)($k)(\W)/$1$tmp$3/;}
}
}

If you are concerned about speed you should take the comoposition and
compilatiion of the regex outside the loop.

my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;

local *_; # Wise not to stomp on someone else's $_

while (<FILE>) {
my @values = values %HASH;
for my $pattern ( @patterns ) {
my $value = shift @values;
s/$pattern/$1$value$3/g;
}
# print or whatever
}

But my spider-sense is telling me that perhaps the keys of %HASH really
arbirary regexes but are just words and all you want is word
substitution. This is a classic bit of Perl.

while (<FILE>) {
s/(\w+)/$HASH{$1}||$1/eg;
# print or whatever
}
 
T

Tad McClellan

Greg K said:
I am more of a lurker here than a Perl guru(someday I hope),


Use the FAQ Luke!

perldoc -q match

How do I efficiently match many regular expressions at once?

but my
guess


There is no need to guess:

perldoc -q profile

How do I profile my Perl programs?

I use a similar program


I sure hope that does not mean the the code below is what
you have actually used.

If it is, then you just haven't found the bugs that are in it yet...

#!/usr/local/bin/perl
use strict;
use warnings;


A most excellent start, BTW.

my $dir = opendir DIR,"../old";


You should check the return value to see if you actually got
what you asked for, just like with open() below.

$old[0] = 'C';
$new[0] = 'Perl';

open IN, "<../old/$filename" || die "Unable to open files for
reading";


You should include the $! variable in your die message, it
contains the reason for the dying.

for( my $t = 0; $t < ($#old + 1); $t++ )


Gak!

foreach my $t ( 0 .. $#old )

or at least:

for( my $t = 0; $t < @old; $t++ )

or even:

for( my $t = 0; $t <= $#old; $t++ )

s/$old[$t]/$new[$t]/g;


If

$_ = 'Chuck likes C';

Then you _want_ it to be changed to:

Perlhuck likes Perl

??


Probably not, so maybe this instead:

s/\b$old[$t]\b/$new[$t]/g;
 
G

Greg K

Use the FAQ Luke!
perldoc -q match
How do I efficiently match many regular expressions at once?

Nice, didn't know about that. Thanks for the tip!
I sure hope that does not mean the the code below is what
you have actually used.
If it is, then you just haven't found the bugs that are in it yet...

Actually, yes, I do use the following.
for( my $t = 0; $t < @old; $t++ )
for( my $t = 0; $t <= $#old; $t++ )

I would use either of these this one. Unfortunately I still have to
switch back and forth between languages frequently so I probably won't
switch to the ( 0 .. $#old ) notation any time soon.
s/$old[$t]/$new[$t]/g;
If
$_ = 'Chuck likes C';
Then you _want_ it to be changed to:
Perlhuck likes Perl

Actually, for my case it does what I want. I am matching long unique
strings so I don't have to worry about accidental matches(strings such
as "../group_name/participants.html", etc). I figured the OP can write
the regexp for their specific need.

Anyway, thx for the constructive criticism. I am going to keep reading
through the PerlFAQ.
 
A

Anno Siegel

Brian McCauley said:
Arturi said:
I wrote some perl lines that scans a file looking for some words and
substitute them for their corresponding pairs, which are previously
saved in a HASH.
[...]

If you are concerned about speed you should take the comoposition and
compilatiion of the regex outside the loop.

my @patterns = map { qr/(\W)($_)(\W)/ } keys %HASH;

local *_; # Wise not to stomp on someone else's $_

while (<FILE>) {
my @values = values %HASH;
for my $pattern ( @patterns ) {
my $value = shift @values;
s/$pattern/$1$value$3/g;
}
# print or whatever
}

But my spider-sense is telling me that perhaps the keys of %HASH really
arbirary regexes but are just words and all you want is word
substitution. This is a classic bit of Perl.

while (<FILE>) {
s/(\w+)/$HASH{$1}||$1/eg;
# print or whatever
}

At first sight that looks rather inefficient because *every* word
is captured and replaced (often by itself). Out of interest I
benchmarked a few variants, and it turns out that it is just as
fast as more selective solutions.

I'm appending the benchmark script, in case anyone cares. The four
variants tested are (least efficient first, the difference between
the last two is insignificant):

direct -- every regex is compiled from the string at run time
regex -- all alternatives compiled into a big regex
brian -- replace every word, like the second solution above
precomp -- regular expressions precompiled via qr//, like the
first solution above

Anno

--

#!/usr/local/bin/perl
use strict; use warnings; $| = 1;
use Vi::QuickFix;

use Benchmark qw( cmpthese);

our @lines = <DATA>;

my %repla; # string replacements (master)
@repla{
qw( one two three four five six seven eight nine ten eleven twelve)
} = qw( ein zwei drei vier fuenf sechs sieben acht neun zehn elf zwoelf);

my @repla = map [ qr/\b$_\b/ => $repla{ $_}], keys %repla; # precompiled

my $repla = do { # single regex
my $re = join '|', map "\\b$_\\b", keys %repla;
qr/($re)/;
};

goto bench;

check:
print brian( @lines);
print "\n", @lines;
exit;

bench:
@lines = ( @lines) x 10; # vary to check linearity
cmpthese -1, {
direct => 'direct( @lines)',
precomp => 'precomp( @lines)',
brian => 'brian( @lines)',
regex => 'regex( @lines)',

};
#######################################################################

sub direct {
my @new = @_;
for ( @new ) {
my ( $str, $rpl);
s/\b$str\b/$rpl/g while ( $str, $rpl) = each %repla;
}
@new;
}

sub precomp {
my @new = @_;
for ( @new ) {
for my $r ( @repla ) {
s/$r->[ 0]/$r->[ 1]/g;
}
}
@new ;
}

sub brian {
my @new = @_;
s/(\w+)/$repla{ $1} || $1/eg for @new;
@new;
}

sub regex {
my @new = @_;
s/$repla/$repla{ $1}/g for @new;
@new;
}

__DATA__
one spot was detected on each of the two blades
three jolly coachmen, three jolly coachmen
twelve to a dozen
a display of oneupmanship
three o'clock four o'clock five o'clock ROCK
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,578
Members
45,052
Latest member
LucyCarper

Latest Threads

Top