Need more efficient use of the substitution operator

Niall Macpherson

I don't use regexp / substitution handling very often, and although I
think I have a basic grasp, I am having problems understanding how
to make multiple substitutions of different characters within a
string. I understand the use of appending a 'g' to the command for
multiple substitutions of the same pattern, but the following code
looks as if it could be improved.

I am trying to find the first occurrence of anything between a '[' and
a ']' and return that string,

i.e. the following code should print 'STRING'. It appears to work but
seems a bit long-winded. Is there a better way of doing it?

use strict;
use warnings;
use diagnostics;

sub GetString
{
    my ($teststring) = @_;

    if ($teststring =~ /\[.*\]/)
    {
        my $match = $&;
        $match =~ s/\[//;
        $match =~ s/\]//;
        return($match);
    }
    else
    {
        return("");
    }
}

my $input = " foo [STRING] bar ";
my $output = GetString($input);
print "Result = '$output'";

Thanks
 
Gunnar Hjalmarsson

Niall said:
> I don't use regexp / substitution handling very often and although
> I think I have a basic grasp I am having problems with
> understanding how to make multiple substitutions of different
> characters within a string. I understand the use of appending a 'g'
> to the command for multiple substitutions of the same pattern, but
> the following code looks as if it could be improved.
>
> I am trying to find the first occurrence of anything between a '['
> and a ']' and return that string

If you are trying to *find* something, it's not substitution you
should do, but you'd rather use the m// (matching) operator with
capturing parentheses (see "perldoc perlop").

> i.e. the following code should print 'STRING'. It appears to work
> but seems a bit long-winded. Is there a better way of doing it?

<code snipped>

Indeed.

my $input = " foo [STRING] bar ";
print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
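
The one-liner works because a match in list context returns its captures, and an empty list when it fails (so the print above would show '' for a string with no brackets). A minimal sketch of that behaviour; the helper name show is hypothetical and not part of the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: format whatever a list-context match returns.
sub show {
    my ($input) = @_;
    # List context: the match returns its captures, or () on failure.
    my @captured = $input =~ /\[(.*?)\]/;
    return @captured ? "'$captured[0]'" : "no match";
}

print "Result = ", show(" foo [STRING] bar "), "\n";
print "Result = ", show(" foo bar "), "\n";
```

Testing the returned list, rather than printing it directly, makes the "no brackets at all" case visible.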
 
Anno Siegel

Niall Macpherson said:
> I don't use regexp / substitution handling very often and although I
> think I have a basic grasp I am having problems with understanding how
> to make multiple substitutions of different characters within a
> string. I understand the use of appending a 'g' to the command for
> multiple substitutions of the same pattern, but the following code
> looks as if it could be improved.
>
> I am trying to find the first occurrence of anything between a '[' and
> a ']' and return that string

That is, you want to match part of a string and return the result.
That is what capturing parentheses are for.

> i.e. the following code should print 'STRING'. It appears to work but
> seems a bit long-winded. Is there a better way of doing it?

It doesn't even do exactly what you want. Test it with
" foo [STRING] [A-LING] bar ".

> use strict;
> use warnings;
> use diagnostics;
>
> sub GetString
> {
>     my ($teststring) = @_;
>
>     if ($teststring =~ /\[.*\]/)

This matches everything from the first opening "[" to the last closing
"]". To catch only the first pair, make the /.*/ non-greedy:

    /\[.*?\]/

>     {
>         my $match = $&;
>         $match =~ s/\[//;
>         $match =~ s/\]//;
>         return($match);

You could have returned the substring of $match from the second to
the next-to-last character, instead of deleting the brackets:

    return substr( $match, 1, -1);

But see below.

>     }
>     else
>     {
>         return("");

It would be wiser to return nothing instead of an empty string in
case of failure. An empty string is a legitimate return value
for an empty "[]". Just

    return;

>     }
> }
>
> my $input = " foo [STRING] bar ";
> my $output = GetString($input);
> print "Result = '$output'";

The use of $& to capture the match is still supported, but there are
better ways. Use capturing parentheses to extract exactly the part
of the match you want. That way, you get the content of the "[...]"
directly:

my ( $match ) = $teststring =~ /\[(.*?)\]/;

That is all. Putting it together:

sub GetString {
    my $teststring = shift;
    my ( $match) = $teststring =~ /\[(.*?)\]/ or return;
    $match;
}

or even

sub GetString { ( shift =~ /\[(.*?)\]/)[ 0] }
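
A small sketch of the two points made in this post, run against the suggested test string: greedy versus non-greedy matching, and a bare return versus an empty string. The names get_greedy and get_lazy are made up for the comparison; they are not from the thread.

```perl
use strict;
use warnings;

# Greedy: like the original pattern, but with a capture group.
sub get_greedy { my ($m) = $_[0] =~ /\[(.*)\]/  or return; $m }

# Non-greedy: same idea as the rewritten GetString.
sub get_lazy   { my ($m) = $_[0] =~ /\[(.*?)\]/ or return; $m }

my $test = " foo [STRING] [A-LING] bar ";
print "greedy: '", get_greedy($test), "'\n";
print "lazy:   '", get_lazy($test),   "'\n";

# A bare return lets the caller tell "[]" (found, but empty) from no match.
for my $s ("x [] y", "x y") {
    my $r = get_lazy($s);
    print "'$s' -> ", defined $r ? "found '$r'" : "not found", "\n";
}
```

The greedy version runs from the first "[" to the last "]", while the non-greedy one stops at the first "]". Had the subs returned "" on failure, the two strings in the loop would be indistinguishable to the caller.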

Anno
 
A. Sinan Unur

Niall Macpherson wrote:
> I am trying to find the first occurrence of anything between a '[' and
> a ']' and return that string

In addition to the useful responses by others, consider reading the FAQ
entry

perldoc -q match

Also, for simple string matches, keep in mind the index function:

perldoc -f index

> use strict;
> use warnings;
> use diagnostics;
>
> sub GetString
> {
>     my ($teststring) = @_;
>
>     if ($teststring =~ /\[.*\]/)
>     {
>         my $match = $&;

Have you read perldoc perlvar?

$& The string matched by the last successful pattern match
....
The use of this variable anywhere in a program imposes a
considerable performance penalty on all regular expression
matches. See "BUGS".
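
For readers on newer perls: as far as I know, the penalty quoted above applies to older releases, and perlvar for Perl 5.20 and later describes a copy-on-write mechanism that removes most of it. Where it still matters, a minimal sketch (assuming Perl 5.10 or later) uses the /p modifier with ${^MATCH}, which perlvar documents as the counterpart of $& without the program-wide cost:

```perl
use strict;
use warnings;

my $input = " foo [STRING] bar ";

my $whole   = '';
my $content = '';

# /p asks perl to preserve the matched text for this match only,
# so ${^MATCH} can be read without mentioning $& anywhere.
if ($input =~ /\[(.*?)\]/p) {
    $whole   = ${^MATCH};   # whole match, brackets included
    $content = $1;          # captured text only
}

print "whole = '$whole', content = '$content'\n";
```

With a capture group in place, the whole-match variable is rarely needed at all, which is the approach the other replies take.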

If you wanted to do what you are doing above in a better way, you could
do this:

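#! perl

use strict;
use warnings;

my $s = 'Hello [ insert planet name here ]';

print scalar find_bracketed_string($s), "\n";

# Returns the text between the first '[' and the next ']'.
# In list context, also returns the offset just past the ']'.
sub find_bracketed_string {
    my ($s) = @_;

    my ($l, $r);

    # index returns -1 when nothing is found, so $l is 0 on failure.
    # (The literal 0 stands in for the discouraged $[ variable.)
    if (($l = 1 + index $s, '[') > 0
        and ($r = index $s, ']', $l) >= 0) {
        my $rs = substr $s, $l, $r - $l;
        return wantarray ? ($rs, $r + 1) : $rs;
    }

    return;
}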

Sinan.
 
Niall Macpherson

Gunnar Hjalmarsson said:
> If you are trying to *find* something, it's not substitution you
> should do, but you'd rather use the m// (matching) operator with
> capturing parentheses (see "perldoc perlop").

Thanks, Gunnar. The reason that I was doing the substitution was that
I didn't fully understand the concept of capturing parentheses in
a regexp.

Therefore all I had to work with was the string [STRING] returned
via the $& variable, which needed the '[' and ']' removed.

In your example you use the return value from the expression. Am I
right in thinking that this value will also be in $1?

And if I have multiple regexps inside my expression then the matches
will be in $1, $2, $3?
 
Gunnar Hjalmarsson

Niall said:
> Gunnar said:
>> my $input = " foo [STRING] bar ";
>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
>
> In your example you use the return value from the expression. Am I
> right in thinking that this value will also be in $1?

If there is a match: yes, otherwise: no. Consequently, if you want to
work with $1, $2 etc., you need to first check if the match succeeded,
and only use those variables if it did.

> And if I have multiple regexps inside my expression then the matches
> will be in $1, $2, $3?

No. The dollar-digit variables contain what was captured by the last
successful match.

Or did you mean multiple pairs of capturing parentheses inside the
regex? If you had asked that, the answer would have been yes. (Again
provided that the match succeeded.)
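
To illustrate the "multiple pairs of capturing parentheses" case, here is a short sketch. The input string with two bracketed fields is invented for the example.

```perl
use strict;
use warnings;

# Hypothetical input with two bracketed fields.
my $input = " foo [KEY] = [VALUE] bar ";

# Two pairs of capturing parentheses fill $1 and $2, but only on success,
# so the match is tested before either variable is read.
my ($key, $value) = ('', '');
if ($input =~ /\[(.*?)\].*?\[(.*?)\]/) {
    ($key, $value) = ($1, $2);
}
print "key = '$key', value = '$value'\n";

# In list context the match returns the same captures directly.
my @captures = $input =~ /\[(.*?)\].*?\[(.*?)\]/;
print "captures: @captures\n";
```

Both forms give the same two strings; the list-context form avoids the dollar-digit variables altogether.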
 
Michael Slass

Gunnar Hjalmarsson said:
> Niall said:
>> I am trying to find the first occurrence of anything between a '['
>> and a ']' and return that string
>
> If you are trying to *find* something, it's not substitution you
> should do, but you'd rather use the m// (matching) operator with
> capturing parentheses (see "perldoc perlop").
>
> Indeed.
>
> my $input = " foo [STRING] bar ";
> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

Is there a difference in regex efficiency between the non-greedy ".*?" as
used above, and the more specific "[^]]*"? I can't remember the
backtracking rules for NFA non-greedy quantifiers, and my Mastering
Regular Expressions is out on loan.
 
Gunnar Hjalmarsson

Michael said:
> Gunnar Hjalmarsson said:
>> my $input = " foo [STRING] bar ";
>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
>
> Is there a difference in regex efficiency between the non-greedy
> ".*?" as used above, and the more specific "[^]]*"?

Not sure, but I believe the latter is more efficient (but two more
characters to type...).

> I can't remember the backtracking rules for NFA non-greedy
> quantifiers, and my Mastering Regular Expressions is out on loan.

Do a benchmark! ;-)
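
One way such a benchmark could look, using the core Benchmark module. This is only a sketch: the labels lazy and negated are arbitrary, and the timings depend on the perl version and on the input, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $input = " foo [STRING] bar ";

# Both patterns should capture the same text for this input.
my ($lazy)    = $input =~ /\[(.*?)\]/;
my ($negated) = $input =~ /\[([^]]*)\]/;
print "lazy = '$lazy', negated = '$negated'\n";

# Run each pattern for at least one CPU second and print a comparison table.
cmpthese(-1, {
    lazy    => sub { my ($m) = $input =~ /\[(.*?)\]/ },
    negated => sub { my ($m) = $input =~ /\[([^]]*)\]/ },
});
```

The two patterns are not strictly equivalent: without /s, "." does not match a newline, so only the negated class can capture bracketed text that spans lines.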
 
Michael Slass

Gunnar Hjalmarsson said:
> Michael said:
>> Gunnar Hjalmarsson said:
>>> my $input = " foo [STRING] bar ";
>>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
>>
>> Is there a difference in regex efficiency between the non-greedy
>> ".*?" as used above, and the more specific "[^]]*"?
>
> Do a benchmark! ;-)


:) Yup, that's the true engineer's answer; I'm more interested in the
professor's answer -- *why* the faster one is faster. A rule from
Mastering Regular Expressions, "Say what you mean", seems to come to
mind --- in this case, we mean "anything that's not ]" --- so "[^]]*"
is more exact.

I'll try to dig up the Dragon book for the regex discussion on NFA
backtracking and *.
 
Eric Bohlman

> If there is a match: yes, otherwise: no. Consequently, if you want to
> work with $1, $2 etc., you need to first check if the match succeeded,
> and only use those variables if it did.

Just to amplify on this (I'm sure you know it, but many newbies won't): if
the match failed, the $digit variables will be *untouched*. Not set to ""
or undef or anything like that. In particular, if a regex succeeds once
and then fails on subsequent input, the $digit variables will still have
the values *left over from the successful match*. Failing to take this
into account can lead to extremely puzzling bugs (which often result in
plausible-looking but incorrect output).
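
A minimal demonstration of the leftover-value problem, with two made-up input strings matched one after the other in the same scope:

```perl
use strict;
use warnings;

my $first  = " foo [STRING] bar ";
my $second = " no brackets here ";

$first =~ /\[(.*?)\]/;            # succeeds and sets $1
my $after_success = $1;

$second =~ /\[(.*?)\]/;           # fails; $1 is left alone
my $after_failure = $1;           # still the value from the first match

print "after success: '$after_success'\n";
print "after failure: '$after_failure'\n";

# Safer: take the capture from the match's own return value.
my ($checked) = $second =~ /\[(.*?)\]/;
print "checked: ", defined $checked ? "'$checked'" : "no match", "\n";
```

Both of the first two lines print 'STRING', although the second string contains no brackets at all. Only the list-assignment form reports the failure.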
 
Niall Macpherson

Michael Slass said:
> Is there a difference in regex efficiency between the non-greedy ".*?" as
> used above, and the more specific "[^]]*"? I can't remember the
> backtracking rules for NFA non-greedy quantifiers, and my Mastering
> Regular Expressions is out on loan.

This 'Mastering Regular Expressions' book sounds useful. This is
presumably the O'Reilly book by Jeffrey Friedl? I think I had better
get myself a copy. Is there much Perl-related material in this book?
 
Niall Macpherson

Abigail said:
> Well, it isn't clear what you want to return from:
>
>     one [two [three] four] five.
>
> Should it be
>   a) two [three] four
>   b) two [three
>   c) three

Sorry, I should have made this clearer. I always want the text between
the first '[' and the first ']' (since anything inside the '[]' which
is non-alpha is invalid in my case), so the answer would be b).
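
For the nested example, the non-greedy pattern from earlier in the thread gives answer b). A short sketch follows; the letters-only check at the end is an assumed reading of "non-alpha is invalid", not something spelled out in the thread.

```perl
use strict;
use warnings;

my $input = "one [two [three] four] five.";

# Non-greedy: stop at the first ']' after the first '['.
my ($first_pair) = $input =~ /\[(.*?)\]/;
print "first '[' to first ']': '$first_pair'\n";

# Greedy, for contrast: runs on to the last ']'.
my ($widest) = $input =~ /\[(.*)\]/;
print "first '[' to last ']':  '$widest'\n";

# Assumed validity rule: letters only between the brackets.
my $valid = $first_pair =~ /^[A-Za-z]*$/ ? "valid" : "invalid";
print "'$first_pair' is $valid\n";
```

The extracted text here contains a space and a "[", so under the assumed rule it would be rejected after extraction rather than silently skipped.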
 
