Strange behavior by regex with variable

Discussion in 'Perl Misc' started by DJ Stunks, Apr 5, 2006.

  1. DJ Stunks

    DJ Stunks Guest

    Hello all,

    In order to not have to perpetually be escaping a vertical bar field
    separator in my regular expressions I tried to use a variable instead
    (as in _PBP_).

    However, I'm getting some strange behavior. Specifically, perl
    believes the match has succeeded, but the capture is not performed.

    Here's a short but complete script which demonstrates the issue.
    Please let me know if you have any ideas.

    -jp

    PS: witnessed on perl v5.8.7 Binary build 813 [148120] for
    MSWin32-x86-multi-thread

    C:\tmp>cat tmp2.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    my $BAR = q{|};
    my $string = q{|this.string_contains-some-2342:stuff|};

    if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {
        print "'$stuff' matched\n";
    }

    if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {
        print "'$stuff' matched using half \$BARs\n";
    }

    if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {
        print "'$stuff' matched using all \$BARs\n";
    }

    __END__

    C:\tmp>tmp2.pl
    'this.string_contains-some-2342:stuff' matched
    Use of uninitialized value in concatenation (.) or string at
    C:\tmp\tmp2.pl line 14.
    '' matched using half $BARs
    Use of uninitialized value in concatenation (.) or string at
    C:\tmp\tmp2.pl line 18.
    '' matched using all $BARs
     
    DJ Stunks, Apr 5, 2006
    #1

  2. Paul Lalli

    Paul Lalli Guest

    DJ Stunks wrote:
    > In order to not have to perpetually be escaping a vertical bar field
    > separator in my regular expressions I tried to use a variable instead
    > (as in _PBP_).


    I can only assume you misread the relevant part of PBP. Simply putting
    a regular expression character in a variable does not prevent it from
    being interpreted as a regular expression character.

    > However, I'm getting some strange behavior. Specifically, perl
    > believes the match has succeeded, but the capture is not performed.
    >
    > Here's a short but complete script which demonstrates the issue.
    > Please let me know if you have any ideas.
    >
    > -jp
    >
    > PS: witnessed on perl v5.8.7 Binary build 813 [148120] for
    > MSWin32-x86-multi-thread
    >
    > C:\tmp>cat tmp2.pl
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > my $BAR = q{|};
    > my $string = q{|this.string_contains-some-2342:stuff|};
    >
    > if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {


    Here, you escaped the | characters

    > print "'$stuff' matched\n";
    > }
    >
    > if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {


    Here, the regexp is sent through double-quotish interpolation first,
    meaning $BAR gets replaced with |. Then the result is sent through
    the regexp parser, making | be interpreted as the regexp alternation
    character.
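
    To see why the match "succeeds" without capturing anything, it can help
    to look at the pattern text after interpolation. The following is a
    minimal sketch rather than the original script; $pattern and @captures
    are names introduced here purely for illustration.

```perl
use strict;
use warnings;

my $BAR    = q{|};
my $string = q{|this.string_contains-some-2342:stuff|};

# Build the pattern text the way m{...} sees it after interpolation.
my $pattern = "^$BAR ([^|]+) $BAR\$";
print "pattern after interpolation: $pattern\n";

# Under /x the unescaped | splits the pattern into three alternatives:
#     ^     or     ([^|]+)     or     $
# The first alternative matches the empty string at position 0,
# so the match succeeds and group 1 never takes part in it.
my @captures = $string =~ m{$pattern}x;

printf "elements returned: %d\n", scalar @captures;
printf "first capture is %s\n",
    defined $captures[0] ? "'$captures[0]'" : 'undef';
```

    In list context a successful match returns one element per capture
    group, with undef for a group that did not participate. That
    one-element list is why the if condition is true while $stuff is
    undefined.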

    > print "'$stuff' matched using half \$BARs\n";
    > }
    >
    > if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {


    Same thing here... just more of them. :)

    > print "'$stuff' matched using all \$BARs\n";
    > }
    >
    > __END__


    If your variables are going to contain regular expression characters,
    you need to quote-meta them:

    if (/\Q$BAR\E ... /) { ... }

    I don't have my copy of PBP on me, so I can't help explain what part
    of it led you to this faulty belief...
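
    Applied to the test string from the original post, the \Q...\E
    suggestion might look like the sketch below. It assumes $BAR still
    holds an unescaped pipe, as in the first script.

```perl
use strict;
use warnings;

my $BAR    = q{|};
my $string = q{|this.string_contains-some-2342:stuff|};

# \Q...\E applies quotemeta to the interpolated text, so the |
# is matched literally instead of acting as alternation.
my ($stuff) = $string =~ m{^\Q$BAR\E ([^\Q$BAR\E]+) \Q$BAR\E$}x;

print defined $stuff ? "'$stuff' matched\n" : "no match\n";
```

    Calling quotemeta once when the variable is assigned has the same
    effect and keeps the pattern itself shorter.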

    Paul Lalli
     
    Paul Lalli, Apr 5, 2006
    #2

  3. DJ Stunks

    DJ Stunks Guest

    Paul Lalli wrote:
    > DJ Stunks wrote:
    > > In order to not have to perpetually be escaping a vertical bar field
    > > separator in my regular expressions I tried to use a variable instead
    > > (as in _PBP_).

    >
    > I can only assume you misread the relevant part of PBP. Simply putting
    > a regular expression character in a variable does not prevent it from
    > being interpreted as a regular expression character.


    of course.

    Thanks. I must have misunderstood Pastor Conway :p

    -jp
     
    DJ Stunks, Apr 5, 2006
    #3
  4. Paul Lalli wrote:
    > DJ Stunks wrote:
    > > In order to not have to perpetually be escaping a vertical bar field
    > > separator in my regular expressions I tried to use a variable instead
    > > (as in _PBP_).

    >
    > I can only assume you misread the relevant part of PBP. Simply putting
    > a regular expression character in a variable does not prevent it from
    > being interpreted as a regular expression character.
    >
    > > However, I'm getting some strange behavior. Specifically, perl
    > > believes the match has succeeded, but the capture is not performed.
    > >
    > > Here's a short but complete script which demonstrates the issue.
    > > Please let me know if you have any ideas.
    > >
    > > -jp
    > >
    > > PS: witnessed on perl v5.8.7 Binary build 813 [148120] for
    > > MSWin32-x86-multi-thread
    > >
    > > C:\tmp>cat tmp2.pl
    > > #!/usr/bin/perl
    > >
    > > use strict;
    > > use warnings;
    > >
    > > my $BAR = q{|};
    > > my $string = q{|this.string_contains-some-2342:stuff|};
    > >
    > > if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {

    >
    > Here, you escaped the | characters
    >
    > > print "'$stuff' matched\n";
    > > }
    > >
    > > if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {

    >
    > Here, the regexp is sent through double-quotish interpolation first,
    > meaning $BAR gets replaced withe |. Then the result is sent through
    > the regexp parser, making | be interpreted as the regexp alternation
    > character.
    >
    > > print "'$stuff' matched using half \$BARs\n";
    > > }
    > >
    > > if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {

    >
    > Same thing here... just more of them. :)
    >
    > > print "'$stuff' matched using all \$BARs\n";
    > > }
    > >
    > > __END__

    >
    > If your variables are going to contain regular expression characters,
    > you need to quote-meta them:
    >
    > if (/\Q$BAR\E ... /) { ... }
    >
    > I don't have my copy of PBP on me, so I can't help to explain what part
    > of it led you to this faulty belief...



    yeah, earlier this week, someone was getting double parens because they
    didn't realize that after variable interpolation, the regex engine
    still treats parens as capturing. i tried the following code:

    use strict;
    use warnings;

    my $BAR = qr{\|};
    my $string = q{|this.string_contains-some-2342:stuff|};


    my ($stuff) = $string =~ m/^\| ([^|]+) \|$/x;
    print "'$stuff' matched\n" if $stuff;


    my ($stuff2) = $string =~ m/^$BAR ([^|]+) $BAR$/xo;
    print "'$stuff2' matched using half \$BARs\n" if $stuff2;


    my ($stuff3) = $string =~ m/^$BAR ([^$BAR]+) $BAR$/xo;
    print "'$stuff3' matched using all \$BARs\n" if $stuff3;

    __END__

    and got:
    [nagano 23] ~/1-perl > try_varpipe.pl
    'this.string_contains-some-2342:stuff' matched
    'this.string_contains-some-2342:stuff' matched using half $BARs
    [nagano 24] ~/1-perl >

    it looks like character classes behave in such a way as to prevent the
    last regex from matching.
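
    A likely explanation is that a qr// object, when interpolated into a
    larger pattern, contributes its full text including the wrapper around
    it, and inside [...] every character of that wrapper becomes a member
    of the class. The sketch below compares a qr// object with a plain
    string; $bar_qr, $bar_str, $with_qr and $with_str are illustrative
    names, and the exact wrapper text depends on the perl version.

```perl
use strict;
use warnings;

my $string = q{|this.string_contains-some-2342:stuff|};

my $bar_qr  = qr{\|};   # compiled regex object
my $bar_str = q{\|};    # plain string holding an escaped pipe

# A qr// object stringifies with a wrapper such as (?^:...) or
# (?-xism:...), depending on the perl version.
print "qr object as text: $bar_qr\n";

# Inside [...] the wrapper characters are treated as class members,
# so the negated class excludes far more than the pipe.
my ($with_qr)  = $string =~ m{^$bar_qr ([^$bar_qr]+) $bar_qr$}x;
my ($with_str) = $string =~ m{^$bar_str ([^$bar_str]+) $bar_str$}x;

print "qr in class:     ", defined $with_qr  ? "'$with_qr'"  : 'no match', "\n";
print "string in class: ", defined $with_str ? "'$with_str'" : 'no match', "\n";
```

    With the newer (?^:...) wrapper the colon ends up excluded, and the
    test string contains a colon. With the older (?-xism:...) wrapper the
    sequence ?-x is read as a character range. Either way the negated
    class can no longer span the whole field.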
     
    it_says_BALLS_on_your forehead, Apr 5, 2006
    #4
  5. it_says_BALLS_on_your forehead, Apr 5, 2006
    #5
  6. DJ Stunks

    DJ Stunks Guest

    it_says_BALLS_on_your forehead wrote:
    > i tried the following code:
    >
    > use strict;
    > use warnings;
    >
    > my $BAR = qr{\|};
    > my $string = q{|this.string_contains-some-2342:stuff|};
    >
    >
    > my ($stuff) = $string =~ m/^\| ([^|]+) \|$/x;
    > print "'$stuff' matched\n" if $stuff;
    >
    >
    > my ($stuff2) = $string =~ m/^$BAR ([^|]+) $BAR$/xo;
    > print "'$stuff2' matched using half \$BARs\n" if $stuff2;
    >
    >
    > my ($stuff3) = $string =~ m/^$BAR ([^$BAR]+) $BAR$/xo;
    > print "'$stuff3' matched using all \$BARs\n" if $stuff3;
    >
    > __END__
    >
    > and got:
    > [nagano 23] ~/1-perl > try_varpipe.pl
    > 'this.string_contains-some-2342:stuff' matched
    > 'this.string_contains-some-2342:stuff' matched using half $BARs
    > [nagano 24] ~/1-perl >
    >
    > it looks like character classes behave in such a way as to prevent the
    > last regex to match.


    I modified my test script as suggested by Paul and now it works fine,
    including the negated character class.

    I think it's the qr// that caused your third test to fail, BALLS.

    C:\tmp>cat tmp2.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    my $BAR = quotemeta q{|};
    my $string = q{|this.string_contains-some-2342:stuff|};

    if (my ($stuff) = $string =~ m{^\| ([^|]+) \|$}x) {
        print "'$stuff' matched\n";
    }

    if (my ($stuff) = $string =~ m{^$BAR ([^|]+) $BAR$}xo) {
        print "'$stuff' matched using half \$BARs\n";
    }

    if (my ($stuff) = $string =~ m{^$BAR ([^$BAR]+) $BAR$}xo) {
        print "'$stuff' matched using all \$BARs\n";
    }

    __END__

    C:\tmp>tmp2.pl
    'this.string_contains-some-2342:stuff' matched
    'this.string_contains-some-2342:stuff' matched using half $BARs
    'this.string_contains-some-2342:stuff' matched using all $BARs

    -jp
     
    DJ Stunks, Apr 5, 2006
    #6
  7. DJ Stunks wrote:
    > it_says_BALLS_on_your forehead wrote:
    > > i tried the following code:
    > >
    > > use strict;
    > > use warnings;
    > >
    > > my $BAR = qr{\|};
    > > my $string = q{|this.string_contains-some-2342:stuff|};
    > >
    > >
    > > my ($stuff) = $string =~ m/^\| ([^|]+) \|$/x;
    > > print "'$stuff' matched\n" if $stuff;
    > >
    > >
    > > my ($stuff2) = $string =~ m/^$BAR ([^|]+) $BAR$/xo;
    > > print "'$stuff2' matched using half \$BARs\n" if $stuff2;
    > >
    > >
    > > my ($stuff3) = $string =~ m/^$BAR ([^$BAR]+) $BAR$/xo;
    > > print "'$stuff3' matched using all \$BARs\n" if $stuff3;
    > >
    > > __END__
    > >
    > > and got:
    > > [nagano 23] ~/1-perl > try_varpipe.pl
    > > 'this.string_contains-some-2342:stuff' matched
    > > 'this.string_contains-some-2342:stuff' matched using half $BARs
    > > [nagano 24] ~/1-perl >
    > >
    > > it looks like character classes behave in such a way as to prevent the
    > > last regex to match.

    >
    > I modified my test script as suggested by Paul and now it works fine,
    > including the negated character class.
    >
    > I think it's the qr// that caused your third test to fail, BALLS.


    ahh, i think you're right. i replaced that line with:
    my $BAR = q{\|};

    and all three worked :).
     
    it_says_BALLS_on_your forehead, Apr 5, 2006
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    it_says_BALLS_on_your forehead schreef:
    > DJ Stunks:
    >> it_says_BALLS_on_your forehead:


    >>> my $BAR = qr{\|};

    >>
    >> I think it's the qr// that caused your third test to fail, BALLS.

    >
    > ahh, i think you're right. i replaced that line with:
    > my $BAR = q{\|};
    >
    > and all three worked :).


    Alternative:

    qr/[|]/

    Test:

    echo '-|-' \
    | perl -ne 'chomp; $r=qr/[|]/; \
    print "bar\n" if /\A(.$r.)\z/ and $1 eq $_'

    I assume that the regex-engine optimizes away single-character-sets.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 5, 2006
    #8
  9. robic0

    robic0 Guest

    On 5 Apr 2006 12:03:59 -0700, "DJ Stunks" wrote:

    >Hello all,
    >
    >In order to not have to perpetually be escaping a vertical bar field
    >separator in my regular expressions I tried to use a variable instead
    >(as in _PBP_).
    >

    Jeez, you may have to go back to /|/. But you higher order folks shouldn't
    be pissed off in your qualified environments and narrow escopements then,
    should you?
     
    robic0, Apr 6, 2006
    #9
  10. Dr.Ruud wrote:
    > it_says_BALLS_on_your forehead schreef:
    >>DJ Stunks:
    >>>it_says_BALLS_on_your forehead:

    >
    >>>>my $BAR = qr{\|};
    >>>I think it's the qr// that caused your third test to fail, BALLS.

    >>ahh, i think you're right. i replaced that line with:
    >>my $BAR = q{\|};
    >>
    >>and all three worked :).

    >
    > Alternative:
    >
    > qr/[|]/
    >
    > Test:
    >
    > echo '-|-' \
    > | perl -ne 'chomp; $r=qr/[|]/; \
    > print "bar\n" if /\A(.$r.)\z/ and $1 eq $_'
    >
    > I assume that the regex-engine optimizes away single-character-sets.


    Well, let's see what Benchmark says:

    $ perl -MBenchmark=cmpthese -e'
    cmpthese -10, {
        lit => q{ q/abcdefghijklmn/ =~ /klmn/ },
        cc  => q{ q/abcdefghijklmn/ =~ /[k][l][m][n]/ } }
    '
               Rate   cc  lit
    cc  2192887/s   -- -41%
    lit 3741724/s  71%   --




    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 6, 2006
    #10
  11. Dr.Ruud

    Dr.Ruud Guest

    single-character-sets (was: Re: Strange behavior by regex with variable)

    John W. Krahn schreef:
    > Dr.Ruud:


    >> I assume that the regex-engine optimizes away single-character-sets.

    >
    > Well, let's see what Benchmark says:
    >
    > $ perl -MBenchmark=cmpthese -e'
    > cmpthese -10, {
    > lit => q{ q/abcdefghijklmn/ =~ /klmn/ },
    > cc => q{ q/abcdefghijklmn/ =~ /[k][l][m][n]/ } }
    > '
    > Rate cc lit
    > cc 2192887/s -- -41%
    > lit 3741724/s 71% --


    That doesn't show whether the regex-engine optimizes away
    single-character-sets or not.

    I tried to get closer to the metal and rewrote your test to this:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use Benchmark qw/cmpthese/;

    sub say {
        local $\ = "\n";
        print '';
        if (@_) {
            print "<$_>" for @_;
            print '';
        }
    }

    my $s = 'abcdefghijklmn_jklmn_jklmnopquvwxyz';
    my $qr_lit = qr/klmn/;
    my $qr_cs = qr/[k][l][m][n]/;
    my $do_lit = sub { scalar( () = $_[0] =~ /$qr_lit/g ) };
    my $do_cs = sub { scalar( () = $_[0] =~ /$qr_cs/g ) };

    say $s, $qr_lit, $qr_cs, &$do_lit($s), &$do_cs($s);

    if (1) {
        cmpthese -5, {
            lit => q{ $s =~ $qr_lit; },
            cs  => q{ $s =~ $qr_cs ; },
        };
    }

    say $s, $qr_lit, $qr_cs, &$do_lit($s), &$do_cs($s);

    __END__

    and then they both win sometimes.
    :)

    The iterations-per-second are higher with //o:
    lit => q{ $s =~ /$qr_lit/o; }


    (perl v5.8.6, i386-freebsd-64int)
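
    One possible reason for the tie: Benchmark evaluates string snippets
    outside the caller's lexical scope, so variables declared with my
    (such as $s and $qr_lit) are not visible inside q{...}, and the
    snippets probably ran against undefined package variables. A hedged
    sketch of the same comparison using code references, which close over
    the lexicals, follows. The %tests hash is a name introduced here, and
    no timings are claimed for it.

```perl
use strict;
use warnings;

use Benchmark qw(cmpthese);

my $s      = 'abcdefghijklmn_jklmn_jklmnopquvwxyz';
my $qr_lit = qr/klmn/;
my $qr_cs  = qr/[k][l][m][n]/;

# Code references are closures, so they can see the lexicals above.
my %tests = (
    lit => sub { $s =~ $qr_lit },
    cs  => sub { $s =~ $qr_cs  },
);

# Run each test for at least 1 CPU second and print a comparison table.
cmpthese(-1, \%tests);
```

    Code references add a little call overhead to both sides, so the
    absolute rates are not comparable with those from string snippets,
    but the relative ordering of the two patterns should be meaningful.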

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Apr 6, 2006
    #11