Basic Regular Expressions question...

Will · Apr 6, 2005

Hi,
I have a longer program that finds and recursively replaces text in
many html files that works beautifully for most cases, but I think I'm
getting hung up on a s/// and regular expressions glitch. I wrote a
very short program that gets to the heart of the matter...

################################################################################
use strict;
use warnings;

my $find="https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";
my $replace="http://www.excite.com";

my $thisPage="https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";

$thisPage =~ s#$find#$replace#g;

print $thisPage;
################################################################################

To my understanding, this program should take the long string in $find
and then replace it with $replace and the output should be
"http://www.excite.com". I think the "?" in the $find variable is
being treated as a Regular Expression but I can't figure out a way to
nullify that effect. I'm a librarian not a programmer! Somebody please
help! I'm working for a worthy non-profit that is strapped for cash, so
I have to figure this out! It will bring you good karma! Thanks a
bunch!

Will Jiang

A. Sinan Unur · Apr 6, 2005

Will said:
To my understanding, this program should take the long string in $find
and then replace it with $replace and the output should be
"http://www.excite.com". I think the "?" in the $find variable is
being treated as a Regular Expression but I can't figure out a way to
nullify that effect.

To put it correctly, ? is special in a regular expression.

perldoc perlreref

? Matches the preceding element 0 or 1 times

also from the same document

\Q Disable pattern metacharacters until \E

$thisPage =~ s{\Q$find\E}{$replace};

should work.
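
To see why the unescaped pattern never matches, here is a minimal sketch using the URL from the original post. The names $unescaped, $escaped, $count_raw and $count_quoted are illustrative and not from the thread. In the raw pattern, "view?" means "vie" followed by an optional "w", so the literal "?" in the page text is never consumed.

```perl
use strict;
use warnings;

my $find    = "https://sinaicentral.mssm.edu/intranet/intranet/ct_public/view?trial_id=MSM03204&searchNow=no";
my $replace = "http://www.excite.com";

# Unescaped: "w?" is read as "an optional w", so the literal "?" in the
# text is never matched and nothing is replaced.
my $unescaped = $find;
my $count_raw = ( $unescaped =~ s#$find#$replace#g ) || 0;

# Escaped: \Q...\E makes every character in $find literal.
my $escaped = $find;
my $count_quoted = ( $escaped =~ s#\Q$find\E#$replace#g ) || 0;

print "unescaped: $count_raw replacement(s)\n";
print "escaped:   $count_quoted replacement(s), result: $escaped\n";
```

The /g modifier is kept here because the original program used it; quotemeta($find) is the function form of \Q...\E if the escaped pattern needs to be stored in a variable.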

Will said:
I'm a librarian not a programmer! Somebody
please help! I'm working for a worthy non-profit that is strapped for
cash, so I have to figure this out! It will bring you good karma!

None of this increases your chances of getting help. Describing your
problem accurately, as you did, is the crucial part.

For further information on how to help others help you, please see the
posting guidelines for this group if you haven't already done so.

Sinan

Will · Apr 6, 2005

THANKS SO MUCH! I really appreciate the help! Have a wonderful day!

Will Jiang

Gunnar Hjalmarsson · Apr 6, 2005

A. Sinan Unur said:
\Q Disable pattern metacharacters until \E

$thisPage =~ s{\Q$find\E}{$replace};

Since escaping all the characters in PATTERN makes it a non-regex
problem, I played with using index() and substr() instead:

substr $thisPage, index($thisPage, $find), length $find, $replace;

However, to take the /g modifier into consideration (which the OP
originally used), you seem to need something like:

my ($i, $length) = (0,0);
while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    $length = length $find;
    substr $thisPage, $i, $length, $replace;
}

That's much typing to 'emulate'

$thisPage =~ s/\Q$find/$replace/g;

Assuming that using index() and substr() is more efficient than
using the s/// operator, is there an easier way to combine them to
achieve the same result?

Tad McClellan · Apr 7, 2005

Will said:
I have a longer program that finds and recursively replaces text in

There is no recursion in what you are doing.

Anno Siegel · Apr 7, 2005

Gunnar Hjalmarsson said:
Since escaping all the characters in PATTERN makes it a non-regex
problem, I played with using index() and substr() instead:

substr $thisPage, index($thisPage, $find), length $find, $replace;

However, to take the /g modifier into consideration (which the OP
originally used), you seem to need something like:

my ($i, $length) = (0,0);
while ( ( $i = index $thisPage, $find, $i+$length ) >= 0 ) {
    $length = length $find;
    substr $thisPage, $i, $length, $replace;
}

That's much typing to 'emulate'

$thisPage =~ s/\Q$find/$replace/g;

A bit tighter:

my $i = -1;
substr $thisPage, $i, length $find, $replace while
( $i = index $thisPage, $find, $i + 1) >= 0;

Anno

Gunnar Hjalmarsson · Apr 7, 2005

Anno said:
A bit tighter:

my $i = -1;
substr $thisPage, $i, length $find, $replace while
( $i = index $thisPage, $find, $i + 1) >= 0;

Yeah, but I just realized that neither of the above index() + substr()
solutions would work on e.g. this set of input:

my $thisPage = "It's a ball. The ball is brown.";
my $find = 'ball';
my $replace = 'football';

Isn't it something like this that's needed:

my $repl_length = length $replace;
my $i = -$repl_length;
while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    substr $thisPage, $i, length $find, $replace;
}

Maybe no wonder that the s/// operator is frequently used also for
replacing non-regex patterns when efficiency is not a restriction.
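
For reference, the forward loop above can be wrapped in a small runnable sub. This is only a sketch: the name replace_literal_forward and the empty-string guard are assumptions added here, not part of the post. Resuming the search at $i + length $replace is what keeps the inserted "football" from being scanned again.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every literal occurrence of $find in
# $text, scanning left to right with index() and 4-argument substr().
sub replace_literal_forward {
    my ($text, $find, $replace) = @_;
    return $text if $find eq '';    # assumed guard: an empty $find would never terminate

    my $repl_length = length $replace;
    my $i = -$repl_length;          # so the first search starts at offset 0

    # Resume just past the inserted text, so a replacement that
    # contains $find is not matched again.
    while ( ( $i = index( $text, $find, $i + $repl_length ) ) >= 0 ) {
        substr( $text, $i, length($find), $replace );
    }
    return $text;
}

print replace_literal_forward( "It's a ball. The ball is brown.", 'ball', 'football' ), "\n";
# prints: It's a football. The football is brown.
```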

Anno Siegel · Apr 7, 2005

Gunnar Hjalmarsson said:
Yeah, but I just realized that neither of the above index() + substr()
solutions would work on e.g. this set of input:

my $thisPage = "It's a ball. The ball is brown.";
my $find = 'ball';
my $replace = 'football';

Isn't it something like this that's needed:

my $repl_length = length $replace;
my $i = -$repl_length;
while ( ( $i = index $thisPage, $find, $i + $repl_length ) >= 0 ) {
    substr $thisPage, $i, length $find, $replace;
}

You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.
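
A runnable sketch of the backward scan, applied to the "ball"/"football" input. The sub name replace_literal_backward and both guards are assumptions added for illustration. One caveat that follows from how rindex() works: it returns the last occurrence at or before the given position, so a replacement that itself begins with $find would be found again at the same offset and the loop would not terminate.

```perl
use strict;
use warnings;

# Hypothetical helper: replace literal occurrences of $find in $text,
# scanning right to left with rindex().
sub replace_literal_backward {
    my ($text, $find, $replace) = @_;
    return $text if $find eq '';    # assumed guard against an endless loop

    # Assumed guard: rindex() would find the same offset again if the
    # replacement starts with the search string.
    die "replacement must not begin with the search string\n"
        if index( $replace, $find ) == 0;

    my $i = length $text;
    while ( ( $i = rindex( $text, $find, $i ) ) >= 0 ) {
        substr( $text, $i, length($find) ) = $replace;
    }
    return $text;
}

print replace_literal_backward( "It's a ball. The ball is brown.", 'ball', 'football' ), "\n";
# prints: It's a football. The football is brown.
```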

Maybe no wonder that the s/// operator is frequently used also for
replacing non-regex patterns when efficiency is not a restriction.

I don't think I've used index for anything but simple location or just
presence/absence. Replacement is too much hassle with the substr() for
my taste.

Anno

John W. Krahn · Apr 8, 2005

Anno said:
You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.

Also, using the four argument substr() should be faster.

my $i = length $thisPage;
substr $thisPage, $i, length $find, $replace
while ( $i = rindex $thisPage, $find, $i ) >= 0;

John

Anno Siegel · Apr 8, 2005

[using index() instead of s///]

John W. Krahn said:
Also, using the four argument substr() should be faster.

How so? I never heard of that.

I use "=" with substr() assignments because it reads better. Four argument
substr is for when I need the old value of the substring too.

my $i = length $thisPage;
substr $thisPage, $i, length $find, $replace
while ( $i = rindex $thisPage, $find, $i ) >= 0;

Anno

Tassilo v. Parseval · Apr 8, 2005

Also sprach Anno Siegel:

[using index() instead of s///]

Also, using the four argument substr() should be faster.

How so? I never heard of that.

John is right according to a benchmark:

#!/usr/bin/perl -w

use strict;
use Benchmark qw/cmpthese/;

my $string = "0" x 100;

cmpthese(-2, {
    arg4 => sub {
        substr $string, rand length $string, 1, "0";
    },
    arg3 => sub {
        substr($string, rand length $string, 1) = "0";
    },
});
__END__
       Rate arg3 arg4
arg3 512250/s   -- -43%
arg4 903253/s  76%   --

The reason for arg4 being faster is the fact that perl needs to attach
magic to the return value of the 3-argument substr(). In the 4-argument
case this is not the case.

I use "=" with substr() assignments because it reads better. Four argument
substr is for when I need the old value of the substring too.

In most cases though the argument of readability supersedes speed, so
now you're right.

Tassilo

Anno Siegel · Apr 8, 2005

Tassilo v. Parseval said:
Also sprach Anno Siegel:

John is right according to a benchmark:

Yup. I ran one too. The difference is significant.

The reason for arg4 being faster is the fact that perl needs to attach
magic to the return value of the 3-argument substr(). In the 4-argument
case this is not the case.

Ah... the return value from 4-arg substr is indeed not magic:

my $x = 'aaaaabbbbbcccc';
my $ref = \ substr( $x, 5, 5, 'ZZZZZ');

$$ref = 'XXXXX'; # has no effect on $x
print "$x\n"; # aaaaaZZZZZcccc

After removing the fourth argument from substr(), $x changes to
"aaaaaXXXXXcccc".
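
The two cases can be put side by side in one script. This is a small sketch; $y and $old are names introduced here so that each form starts from the same string.

```perl
use strict;
use warnings;

# 4-argument form: the replacement happens immediately, and the return
# value is an ordinary copy of the old substring.
my $x   = 'aaaaabbbbbcccc';
my $old = substr( $x, 5, 5, 'ZZZZZ' );
print "$old $x\n";    # bbbbb aaaaaZZZZZcccc

# 3-argument form as an lvalue: a reference to it stays tied to the
# original string, so assigning through the reference changes $y.
my $y   = 'aaaaabbbbbcccc';
my $ref = \ substr( $y, 5, 5 );
$$ref = 'XXXXX';
print "$y\n";         # aaaaaXXXXXcccc
```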

Live and learn.

Anno

Bart Lateur · Apr 11, 2005

Anno said:
You're right, though I wouldn't bother with storing the length of
anything. Working backwards runs smoother:

my $i = length $thisPage;
substr( $thisPage, $i, length $find) = $replace while
( $i = rindex $thisPage, $find, $i) >= 0;

That way only the unchanged part of the string is ever searched.

I'm sure you're aware that the behaviour of this version is not the same
as the original one for overlapping matches. Try replacing "papa" in
"papapaya", for example.
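
A short sketch of that case, with "X" as an arbitrary replacement chosen here for illustration. s///g scans left to right and takes the "papa" at offset 0; the rindex loop scans right to left and takes the one at offset 2.

```perl
use strict;
use warnings;

my ( $find, $replace ) = ( 'papa', 'X' );

# s///g matches at offset 0, then resumes after the match ("paya").
my $forward = 'papapaya';
$forward =~ s/\Q$find\E/$replace/g;

# The rindex loop finds the last occurrence first, at offset 2.
my $backward = 'papapaya';
my $i = length $backward;
substr( $backward, $i, length($find) ) = $replace
    while ( $i = rindex( $backward, $find, $i ) ) >= 0;

print "s///g:  $forward\n";     # Xpaya
print "rindex: $backward\n";    # paXya
```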

Anno Siegel · Apr 11, 2005

Bart Lateur said:
I'm sure you're aware that the behaviour of this version is not the same
as the original one for overlapping matches. Try replacing "papa" in
"papapaya", for example.

Frankly, no, I didn't notice. Thanks for pointing it out.

Anno
