Need more efficient use of the substitution operator

Discussion in 'Perl Misc' started by Niall Macpherson, Sep 17, 2004.

  1. I don't use regexp / substitution handling very often, and although I
    think I have a basic grasp, I am having problems understanding how
    to make multiple substitutions of different characters within a
    string. I understand the use of appending a 'g' to the command for
    multiple substitutions of the same pattern, but the following code
    looks as if it could be improved.

    I am trying to find the first occurrence of anything between a '[' and
    a ']' and return that string,

    i.e. the following code should print 'STRING'. It appears to work but
    seems a bit long-winded. Is there a better way of doing it?

    use strict;
    use warnings;
    use diagnostics;

    sub GetString
    {
        my ($teststring) = @_;

        if ($teststring =~ /\[.*\]/)
        {
            my $match = $&;
            $match =~ s/\[//;
            $match =~ s/\]//;
            return($match);
        }
        else
        {
            return("");
        }
    }

    my $input = " foo [STRING] bar ";
    my $output = GetString($input);
    print "Result = '$output'";

    Thanks
    Niall Macpherson, Sep 17, 2004
    #1

  2. Niall Macpherson wrote:
    > I don't use regexp / substitution handling very often and although
    > I think I have a basic grasp I am having problems with
    > understanding how to make multiple substitutions of different
    > characters within a string. I understand the use of appending a 'g'
    > to the command for multiple substitutions of the same pattern , but
    > the following code looks as if it could be improved.
    >
    > I am trying to find the first occurrence of anything between a '['
    > and a ']' and return that string


    If you are trying to *find* something, it's not substitution you
    should do, but you'd rather use the m// (matching) operator with
    capturing parentheses (see "perldoc perlop").

    > i.e. the following code should print 'STRING'. It appears to work
    > but seems a bit long-winded. Is there a better way of doing it?


    <code snipped>

    Indeed.

    my $input = " foo [STRING] bar ";
    print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Sep 17, 2004
    #2

  3. Niall Macpherson wrote in comp.lang.perl.misc:
    > I don't use regexp / substitution handling very often and although I
    > think I have a basic grasp I am having problems with understanding how
    > to make multiple substitutions of different characters within a
    > string. I understand the use of appending a 'g' to the command for
    > multiple substitutions of the same pattern , but the following code
    > looks as if it could be improved.
    >
    > I am trying to find the first occurrence of anything between a '[' and
    > a ']' and return that string


    That is, you want to match part of a string and return the result.
    That is what capturing parentheses are for.

    > i.e. the following code should print 'STRING'. It appears to work but
    > seems a bit long-winded. Is there a better way of doing it?


    It doesn't even do exactly what you want. Test it with
    " foo [STRING] [A-LING] bar ".

    > use strict;
    > use warnings;
    > use diagnostics;
    >
    > sub GetString
    > {
    > my ($teststring) = @_;
    >
    > if ($teststring =~ /\[.*\]/)


    This matches everything from the first opening "[" to the last closing
    "]". To catch only the first pair, make the /.*/ non-greedy:

    /\[.*?\]/
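
A minimal sketch of the difference, run against the test string suggested above. The variable names are illustrative only and not part of the original code.

```perl
use strict;
use warnings;

# Test string from the thread: two bracketed groups.
my $input = " foo [STRING] [A-LING] bar ";

# Greedy: .* runs on to the LAST ']' in the string.
my ($greedy) = $input =~ /\[(.*)\]/;

# Non-greedy: .*? stops at the FIRST ']' after the opening '['.
my ($lazy) = $input =~ /\[(.*?)\]/;

print "greedy = '$greedy'\n";   # greedy = 'STRING] [A-LING'
print "lazy   = '$lazy'\n";     # lazy   = 'STRING'
```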

    > {
    > my $match = $&;
    > $match =~ s/\[//;
    > $match =~ s/\]//;
    > return($match);


    You could have returned the substring of $match from the second to
    the next-to-last character, instead of deleting the brackets:

    return substr( $match, 1, -1);

    But see below.

    > }
    > else
    > {
    > return("");


    It would be wiser to return nothing instead of an empty string in
    case of failure. An empty string is a legitimate return value
    for an empty "[]". Just

    return;
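
To illustrate why a bare return is easier on the caller, here is a small sketch. The sub name first_bracketed and the sample strings are made up for the example. With a bare return the caller can tell an empty "[]" from no brackets at all.

```perl
use strict;
use warnings;

# Hypothetical helper: text inside the first [...] pair, or nothing
# (undef in scalar context, empty list in list context) on failure.
sub first_bracketed {
    my ($text) = @_;
    if ($text =~ /\[(.*?)\]/) {
        return $1;
    }
    return;
}

for my $text ('a [] b', 'no brackets here') {
    my $found = first_bracketed($text);
    print defined($found) ? "matched: '$found'\n" : "no match\n";
}
```

The first string reports a match with an empty value; the second reports no match.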

    > }
    > }
    >
    > my $input = " foo [STRING] bar ";
    > my $output = GetString($input);
    > print "Result = '$output'";


    The use of $& to capture the match is still supported, but there are
    better ways. Use capturing parentheses to extract exactly the part
    of the match you want. That way, you get the content of the "[...]"
    directly:

    my ( $match ) = $teststring =~ /\[(.*?)\]/;

    That is all. Putting it together:

    sub GetString {
        my $teststring = shift;
        my ( $match) = $teststring =~ /\[(.*?)\]/ or return;
        $match;
    }

    or even

    sub GetString { ( shift =~ /\[(.*?)\]/)[ 0] }

    Anno
    Anno Siegel, Sep 17, 2004
    #3
  4. Niall Macpherson wrote:

    > I am trying to find the first occurrence of anything between a '[' and
    > a ']' and return that string


    In addition to the useful responses by others, consider reading the faq
    entry

    perldoc -q match

    Also, for simple string matches, keep in mind the index function:

    perldoc -f index

    > use strict;
    > use warnings;
    > use diagnostics;
    >
    > sub GetString
    > {
    > my ($teststring) = @_;
    >
    > if ($teststring =~ /\[.*\]/)
    > {
    > my $match = $&;


    Have you read perldoc perlvar?

    $& The string matched by the last successful pattern match
    ....
    The use of this variable anywhere in a program imposes a
    considerable performance penalty on all regular expression
    matches. See "BUGS".
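
If the whole matched text really is needed, one alternative that should be available on Perl 5.10 and later is the /p modifier with ${^MATCH}, which is documented as avoiding the program-wide cost of $&. A small sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $input = " foo [STRING] bar ";
my $whole;

# /p asks Perl to keep the matched text available in ${^MATCH}
# for this match only (assumes Perl 5.10 or later).
if ($input =~ /\[.*?\]/p) {
    $whole = ${^MATCH};    # the full match, brackets included
}

print "whole match = '$whole'\n";   # whole match = '[STRING]'
```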

    If you wanted to do what you are doing above in a better way, you could
    do this:

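    #! perl

    use strict;
    use warnings;

    my $s = 'Hello [ insert planet name here ]';

    print scalar find_bracketed_string($s), "\n";

    sub find_bracketed_string {
        my ($s) = @_;

        my ($l, $r);

        # $[ is the (deprecated) array base, normally 0; index returns
        # -1 (one less than the base) when nothing is found.
        if (($l = 1 + index $s, '[') > $[
            and ($r = index $s, ']', $l) >= $[) {
            my $rs = substr $s, $l, $r - $l;
            # In list context, also return the offset just past the ']'.
            return wantarray ? ($rs, $r + 1) : $rs;
        }

        return;
    }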

    Sinan.
    A. Sinan Unur, Sep 17, 2004
    #4
  5. Gunnar Hjalmarsson wrote:
    >
    > If you are trying to *find* something, it's not substitution you
    > should do, but you'd rather use the m// (matching) operator with
    > capturing parentheses (see "perldoc perlop").
    >


    Thanks, Gunnar. The reason I was doing the substitution was that
    I didn't fully understand the concept of capturing parentheses in
    a regexp.

    Therefore all I had to work with was the string [STRING] returned
    via the $& variable, which needed the '[' and ']' removed.

    In your example you use the return value from the expression. Am I
    right in thinking that this value will also be in $1?

    And if I have multiple regexps inside my expression then the matches
    will be in $1, $2, $3?
    Niall Macpherson, Sep 17, 2004
    #5
  6. Niall Macpherson wrote:
    > Gunnar Hjalmarsson wrote:
    >>
    >> my $input = " foo [STRING] bar ";
    >> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

    >
    > In your example you use the return value from the expression. Am I
    > right in thinking that this value will also be in $1 ?


    If there is a match: yes, otherwise: no. Consequently, if you want to
    work with $1, $2 etc., you need to first check if the match succeeded,
    and only use those variables if it did.

    > And if I have multiple regexps inside my expression then the matches
    > will be in $1, $2, $3 ?


    No. The dollar-digit variables contain what was captured by the last
    successful match.

    Or did you mean multiple pairs of capturing parentheses inside the
    regex? If you had asked that, the answer would have been yes. (Again
    provided that the match succeeded.)
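
A short sketch of the "multiple pairs of capturing parentheses" case, with the success check in place. The sample input and variable names are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input containing two bracketed fields.
my $line = "key [alpha] value [beta]";

# Two pairs of capturing parentheses in ONE regex fill $1 and $2,
# but only if the match as a whole succeeds.
if ($line =~ /\[(.*?)\].*?\[(.*?)\]/) {
    print "first = '$1', second = '$2'\n";
}
else {
    print "no match, so \$1 and \$2 must not be used\n";
}

# In list context the same match returns the captures directly.
my ($first, $second) = $line =~ /\[(.*?)\].*?\[(.*?)\]/;
print "first = '$first', second = '$second'\n";
```

Both print statements report 'alpha' and 'beta' for this input.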

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Sep 17, 2004
    #6
  7. Gunnar Hjalmarsson writes:

    >Niall Macpherson wrote:
    >> I am trying to find the first occurrence of anything between a '['
    >> and a ']' and return that string

    >
    >If you are trying to *find* something, it's not substitution you
    >should do, but you'd rather use the m// (matching) operator with
    >capturing parentheses (see "perldoc perlop").
    >
    >
    >Indeed.
    >
    > my $input = " foo [STRING] bar ";
    > print "Result = '", $input =~ /\[(.*?)\]/, "'\n";
    >
    >--

    Is there a difference in regex efficiency between the non-greedy ".*?" as
    used above, and the more specific "[^]]*"? I can't remember the
    backtracking rules for NFA non-greedy quantifiers, and my Mastering
    Regular Expressions is out on loan.

    --
    Mike Slass
    Michael Slass, Sep 17, 2004
    #7
  8. Michael Slass wrote:
    > Gunnar Hjalmarsson writes:
    >>
    >> my $input = " foo [STRING] bar ";
    >> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

    >
    > Is there a difference in regex efficiency between the non-greedy
    > ".*?" as used above, and the more specific "[^]]*"?


    Not sure, but I believe the latter is more efficient (but two more
    characters to type...).

    > I can't remember the backtracking rules for NFA non-greedy
    > quantifiers, and my Mastering Regular Expressions is out on loan.


    Do a benchmark! ;-)
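
One way such a benchmark might look, using the core Benchmark module. The sample input is an assumption chosen so that the quantifier has some work to do; results will vary with the Perl version and the data, so no timings are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample: a long bracketed field.
my $input = 'foo [' . ('x' x 200) . '] bar';

# Sanity check first: both patterns should capture the same text.
my ($lazy)    = $input =~ /\[(.*?)\]/;
my ($negated) = $input =~ /\[([^]]*)\]/;
print "same result: ", ($lazy eq $negated ? "yes" : "no"), "\n";

# A negative count means "run each candidate for at least that many
# CPU seconds"; cmpthese then prints a comparison table.
cmpthese(-1, {
    lazy    => sub { my ($m) = $input =~ /\[(.*?)\]/ },
    negated => sub { my ($m) = $input =~ /\[([^]]*)\]/ },
});
```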

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Sep 17, 2004
    #8
  9. Gunnar Hjalmarsson writes:

    >Michael Slass wrote:
    >> Gunnar Hjalmarsson writes:
    >>> my $input = " foo [STRING] bar ";
    >>> print "Result = '", $input =~ /\[(.*?)\]/, "'\n";

    >> Is there a difference in regex efficiency between the non-greedy
    >> ".*?" as used above, and the more specific "[^]]*"?

    >
    >Do a benchmark! ;-)



    :) Yup, that's the true engineer's answer; I'm more interested in the
    professor's answer -- *why* the faster one is faster. A rule from
    Mastering Regular Expressions, "Say what you mean", seems to come to
    mind --- in this case, we mean "anything that's not ]" --- so "[^]]*"
    is more exact.

    I'll try to dig up the Dragon book for the regex discussion on NFA
    backtracking and *.
    --
    Mike Slass
    Michael Slass, Sep 17, 2004
    #9
  10. Gunnar Hjalmarsson wrote in news:2r0guiF144m9kU1@uni-berlin.de:

    >> In your example you use the return value from the expression. Am I
    >> right in thinking that this value will also be in $1 ?

    >
    > If there is a match: yes, otherwise: no. Consequently, if you want to
    > work with $1, $2 etc., you need to first check if the match succeeded,
    > and only use those variables if it did.


    Just to amplify on this (I'm sure you know it, but many newbies won't): if
    the match failed, the $digit variables will be *untouched*. Not set to ""
    or undef or anything like that. In particular, if a regex succeeds once
    and then fails on subsequent input, the $digit variables will still have
    the values *left over from the successful match*. Failing to take this
    into account can lead to extremely puzzling bugs (which often result in
    plausible-looking but incorrect output).
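
A minimal sketch of the leftover-value problem described above. The sample strings and variable names are invented for the example, and both matches sit in the same scope on purpose.

```perl
use strict;
use warnings;

my $ok_line  = 'first [ONE]';
my $bad_line = 'second has no brackets';

$ok_line  =~ /\[(.*?)\]/;    # succeeds and sets $1 to 'ONE'
$bad_line =~ /\[(.*?)\]/;    # fails and leaves $1 untouched

# Wrong: reading $1 without checking the match picks up the old value.
my $stale = $1;
print "stale   = '$stale'\n";    # stale   = 'ONE'

# Right: only read the capture when this particular match succeeded.
my $checked = $bad_line =~ /\[(.*?)\]/ ? $1 : undef;
print "checked = ", (defined $checked ? "'$checked'" : "undef"), "\n";
```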
    Eric Bohlman, Sep 18, 2004
    #10
  11. Michael Slass wrote:
    > Is there a difference in regex efficiency between the non-greedy ".*?" as
    > used above, and the more specific "[^]]*"? I can't remember the
    > backtracking rules for NFA non-greedy quantifiers, and my Mastering
    > Regular Expressions is out on loan.


    This 'Mastering Regular Expressions' book sounds useful - this is
    presumably the O'Reilly book by Jeffrey Friedl? Think I had better
    get myself a copy. Is there much Perl-related stuff in this book?
    Niall Macpherson, Sep 20, 2004
    #11
  12. Abigail wrote:
    >
    > Well, it isn't clear what you want to return from:
    >
    > one [two [three] four] five.
    >
    > Should it be
    > a) two [three] four
    > b) two [three
    > c) three
    >
    >


    Sorry - should have made this clearer. I always want the text between
    the first '[' and the first ']' (since anything inside the '[]' which
    is non-alpha is invalid in my case), so the answer would be b).
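
For reference, a sketch of patterns that would produce each of the three candidate answers for the nested example quoted above. The variable names are illustrative; the non-greedy pattern from earlier in the thread gives answer b).

```perl
use strict;
use warnings;

my $input = 'one [two [three] four] five.';

# a) first '[' to the LAST ']' (greedy)
my ($outer) = $input =~ /\[(.*)\]/;

# b) first '[' to the FIRST ']' (non-greedy)
my ($first) = $input =~ /\[(.*?)\]/;

# c) innermost pair: no brackets of either kind allowed inside
my ($inner) = $input =~ /\[([^\[\]]*)\]/;

print "a) '$outer'\n";   # a) 'two [three] four'
print "b) '$first'\n";   # b) 'two [three'
print "c) '$inner'\n";   # c) 'three'
```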
    Niall Macpherson, Sep 20, 2004
    #12
