Basic Regular Expressions question...

Will

Hi,
I have a longer program that finds and recursively replaces text in
many html files that works beautifully for most cases, but I think I'm
getting hung up on a s/// and regular expressions glitch. I wrote a
very short program that gets to the heart of the matter...

################################################################################
use strict;
use warnings;


my $find = "https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";
my $replace = "http://www.excite.com";

my $thisPage = "https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";

$thisPage =~ s#$find#$replace#g;

print $thisPage;
################################################################################

To my understanding, this program should take the long string in $find
and then replace it with $replace and the output should be
"http://www.excite.com". I think the "?" in the $find variable is
being treated as a Regular Expression but I can't figure out a way to
nullify that effect. I'm a librarian not a programmer! Somebody please
help! I'm working for a worthy non-profit that is strapped for cash, so
I have to figure this out! It will bring you good karma! Thanks a
bunch!

Will Jiang
 
A. Sinan Unur

Will said:
To my understanding, this program should take the long string in $find
and then replace it with $replace and the output should be
"http://www.excite.com". I think the "?" in the $find variable is
being treated as a Regular Expression but I can't figure out a way to
nullify that effect.

To put it correctly, ? is special in a regular expression.

perldoc perlreref

? Matches the preceding element 0 or 1 times

also from the same document

\Q Disable pattern metacharacters until \E

$thisPage =~ s{\Q$find\E}{$replace};

should work.
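
For anyone who wants to see the difference side by side, here is a minimal sketch that reuses the URL from the original post. The names $page, $unquoted and $quoted are only for illustration and are not part of the original program.

```perl
use strict;
use warnings;

my $find    = 'https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no';
my $replace = 'http://www.excite.com';
my $page    = $find;    # assumption: the page text is just the URL itself

# Without \Q the "?" makes the preceding "w" optional, so the literal "?"
# in the page text is never matched and nothing is replaced.
(my $unquoted = $page) =~ s{$find}{$replace}g;

# With \Q...\E every metacharacter in $find is escaped and matched literally.
(my $quoted = $page) =~ s{\Q$find\E}{$replace}g;

print "without \\Q: $unquoted\n";
print "with \\Q:    $quoted\n";
```

The same escaping is available as the quotemeta() function if the pattern has to be built up in a separate step.
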

Will said:
I'm a librarian not a programmer! Somebody
please help! I'm working for a worthy non-profit that is strapped for
cash, so I have to figure this out! It will bring you good karma!

None of this increases your chances of getting help. Describing your
problem accurately, as you did, is the crucial part.

For further information on how to help others help you, please see the
posting guidelines for this group if you haven't already done so.

Sinan
 
Gunnar Hjalmarsson

A. Sinan Unur said:
\Q Disable pattern metacharacters until \E

$thisPage =~ s{\Q$find\E}{$replace};

Since escaping all the characters in PATTERN makes it a non-regex
problem, I played with using index() and substr() instead:

substr $thisPage, index($thisPage, $find), length $find, $replace;
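
One thing worth guarding against in the one-liner above: index() returns -1 when $find is not present, and substr() treats a negative offset as counting from the end of the string. A minimal sketch of a guarded version, with made-up sample strings:

```perl
use strict;
use warnings;

my $thisPage = 'no match in here';        # hypothetical page text
my $find     = 'view?trial_id';           # hypothetical search text
my $replace  = 'http://www.excite.com';

# index() returns -1 when $find is absent; a negative offset passed to
# substr() counts from the end of the string, so check before replacing.
my $pos = index $thisPage, $find;
substr( $thisPage, $pos, length($find), $replace ) if $pos >= 0;

print "$thisPage\n";
```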

However, to take the /g modifier into consideration (which the OP
originally used), you seem to need something like:

my ($i, $length) = (0,0);
while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    $length = length $find;
    substr $thisPage, $i, $length, $replace;
}

That's much typing to 'emulate'

$thisPage =~ s/\Q$find/$replace/g;

Assuming that using index() and substr() is more efficient than using
the s/// operator, is there any easier way to combine them to achieve
the same result?
 
Anno Siegel

Gunnar Hjalmarsson said:
Since escaping all the characters in PATTERN makes it a non-regex
problem, I played with using index() and substr() instead:

substr $thisPage, index($thisPage, $find), length $find, $replace;

However, to take the /g modifier into consideration (which the OP
originally used), you seem to need something like:

my ($i, $length) = (0,0);
while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    $length = length $find;
    substr $thisPage, $i, $length, $replace;
}

That's much typing to 'emulate'

$thisPage =~ s/\Q$find/$replace/g;

A bit tighter:

my $i = -1;
substr $thisPage, $i, length $find, $replace while
( $i = index $thisPage, $find, $i + 1) >= 0;

Anno
 
Gunnar Hjalmarsson

Anno said:
A bit tighter:

my $i = -1;
substr $thisPage, $i, length $find, $replace while
( $i = index $thisPage, $find, $i + 1) >= 0;

Yeah, but I just realized that neither of the above index() + substr()
solutions would work on e.g. this set of input:

my $thisPage = "It's a ball. The ball is brown.";
my $find = 'ball';
my $replace = 'football';

Isn't it something like this that's needed:

my $repl_length = length $replace;
my $i = -$repl_length;
while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    substr $thisPage, $i, length $find, $replace;
}

Maybe no wonder that the s/// operator is frequently used also for
replacing non-regex patterns when efficiency is not a restriction.
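
The forward loop above can be wrapped up roughly like this. It is only a sketch: the sub name replace_literal and the empty-string guard are additions for illustration, not something from the earlier posts.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $find in $text.
sub replace_literal {
    my ($text, $find, $replace) = @_;
    return $text if $find eq '';          # an empty needle would loop forever
    my $pos = 0;
    while ( ( $pos = index $text, $find, $pos ) >= 0 ) {
        substr( $text, $pos, length($find), $replace );
        $pos += length $replace;          # resume after the inserted text
    }
    return $text;
}

my $out = replace_literal( "It's a ball. The ball is brown.", 'ball', 'football' );
print "$out\n";

# The regex version, for comparison.
( my $regex_out = "It's a ball. The ball is brown." ) =~ s/\Qball\E/football/g;
print "$regex_out\n";
```

Skipping ahead by the length of $replace rather than the length of $find is what keeps the search from finding "ball" again inside the freshly inserted "football".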
 
Anno Siegel

Gunnar Hjalmarsson said:
Yeah, but I just realized that neither of the above index() + substr()
solutions would work on e.g. this set of input:

my $thisPage = "It's a ball. The ball is brown.";
my $find = 'ball';
my $replace = 'football';

Isn't it something like this that's needed:

my $repl_length = length $replace;
my $i = -$repl_length;
while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    substr $thisPage, $i, length $find, $replace;
}

You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.

Gunnar Hjalmarsson said:
Maybe no wonder that the s/// operator is frequently used also for
replacing non-regex patterns when efficiency is not a restriction.

I don't think I've used index for anything but simple location or just
presence/absence. Replacement is too much hassle with the substr() for
my taste.

Anno
 
John W. Krahn

Anno said:
You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.

Also, using the four argument substr() should be faster.

my $i = length $thisPage;
substr $thisPage, $i, length $find, $replace
while ( $i = rindex $thisPage, $find, $i ) >= 0;



John
 
Anno Siegel

[using index() instead of s///]
Also, using the four argument substr() should be faster.

How so? I never heard of that.

I use "=" with substr() assignments because it reads better. Four argument
substr is for when I need the old value of the substring too.

John W. Krahn said:
my $i = length $thisPage;
substr $thisPage, $i, length $find, $replace
    while ( $i = rindex $thisPage, $find, $i ) >= 0;

Anno
 
Tassilo v. Parseval

Anno Siegel said:
[using index() instead of s///]
Also, using the four argument substr() should be faster.

How so? I never heard of that.

John is right according to a benchmark:

#!/usr/bin/perl -w

use strict;
use Benchmark qw/cmpthese/;

my $string = "0" x 100;

cmpthese(-2, {
    arg4 => sub {
        substr $string, rand length $string, 1, "0";
    },
    arg3 => sub {
        substr($string, rand length $string, 1) = "0";
    },
});
__END__
         Rate arg3 arg4
arg3 512250/s   -- -43%
arg4 903253/s  76%   --

The reason for arg4 being faster is the fact that perl needs to attach
magic to the return value of the 3-argument substr(). In the 4-argument
case this is not the case.

Anno Siegel said:
I use "=" with substr() assignments because it reads better. Four argument
substr is for when I need the old value of the substring too.

In most cases though the argument of readability supersedes speed, so
now you're right. :)

Tassilo
 
Anno Siegel

Tassilo v. Parseval said:
John is right according to a benchmark:

Yup. I ran one too. The difference is significant.

Tassilo v. Parseval said:
The reason for arg4 being faster is the fact that perl needs to attach
magic to the return value of the 3-argument substr(). In the 4-argument
case this is not the case.

Ah... the return value from 4-arg substr is indeed not magic:

my $x = 'aaaaabbbbbcccc';
my $ref = \ substr( $x, 5, 5, 'ZZZZZ');

$$ref = 'XXXXX'; # has no effect on $x
print "$x\n"; # aaaaaZZZZZcccc

After removing the fourth argument from substr(), $x changes to
"aaaaaXXXXXcccc".
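
Put side by side as one script, the two cases look like this. It is a sketch of the same experiment; the names $x3, $x4, $ref3 and $ref4 are made up here to keep the two variants apart.

```perl
use strict;
use warnings;

# 3-argument form: the reference points at a magical lvalue tied to $x3.
my $x3   = 'aaaaabbbbbcccc';
my $ref3 = \ substr( $x3, 5, 5 );
$$ref3   = 'XXXXX';                         # writes through to $x3

# 4-argument form: the replacement happens immediately and an ordinary
# copy of the old substring is returned.
my $x4   = 'aaaaabbbbbcccc';
my $ref4 = \ substr( $x4, 5, 5, 'ZZZZZ' );
$$ref4   = 'XXXXX';                         # changes only the returned copy

print "$x3\n";
print "$x4\n";
```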

Live and learn.

Anno
 
Bart Lateur

Anno said:
You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.

I'm sure you're aware that the behaviour of this version is not the same
as the original one for overlapping matches. Try replacing "papa" in
"papapaya", for example.
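
A minimal sketch of that case, assuming a replacement of "X" (chosen arbitrarily, only to make the difference visible):

```perl
use strict;
use warnings;

my $find    = 'papa';
my $replace = 'X';      # arbitrary replacement, for illustration only

# Forward, as s///g does it: the first match starts at offset 0.
my $forward = 'papapaya';
$forward =~ s/\Q$find\E/$replace/g;

# Backward, with rindex(): the first match found starts at offset 2.
my $backward = 'papapaya';
my $i = length $backward;
substr( $backward, $i, length $find ) = $replace
    while ( $i = rindex $backward, $find, $i ) >= 0;

print "forward:  $forward\n";    # Xpaya
print "backward: $backward\n";   # paXya
```

The two overlapping candidates start at offsets 0 and 2, and each direction consumes the one it reaches first, which rules out the other.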
 
Anno Siegel

Bart Lateur said:
I'm sure you're aware that the behaviour of this version is not the same
as the original one for overlapping matches. Try replacing "papa" in
"papapaya", for example.

Frankly, no, I didn't notice. Thanks for pointing it out.

Anno
 