Can someone 'splain why this regex won't work both ways?

Discussion in 'Perl Misc' started by spydox@gmail.com, Apr 14, 2008.

  1. Guest

    I'm trying to find a repeated number in a string, like 122345 finds
    22.

    This works:

    /(\d)\1/

    This doesn't:

    /\1(\d)/

    I guess LLR parsing is to blame, but shouldn't the second example
    first try to FIND a $1 then check to see if there is a \1, and repeat
    that process moving L to R?

    I though Perl sort of went to and fro trying to do matching. To me,
    there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
    preceeding it.

    I was a little surprized this didn't work although I can sort of see
    why in a way too. In some ways it seems to me that regexes should be
    *disconnected* from parsing - just answer the question does this
    match?
     
    , Apr 14, 2008
    #1
    1. Advertising

  2. Ben Morrow Guest

    Quoth :
    >
    > I'm trying to find a repeated number in a string, like 122345 finds
    > 22.
    >
    > This works:
    >
    > /(\d)\1/
    >
    > This doesn't:
    >
    > /\1(\d)/
    >
    > I guess LLR parsing is to blame, but shouldn't the second example
    > first try to FIND a $1 then check to see if there is a \1, and repeat
    > that process moving L to R?
    >
    > I though Perl sort of went to and fro trying to do matching. To me,
    > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
    > preceeding it.


    There are two separate operations here which you are confusing. First
    perl parses the regex itself, and compiles it into an internal form.
    Then it matches that regex against the string you provide. The second
    will backtrack, under some circumstances; the first won't.

    Ben
     
    Ben Morrow, Apr 14, 2008
    #2
    1. Advertising

  3. wrote in
    news:093bf887-729d-4400-8750-

    :

    > I'm trying to find a repeated number in a string, like 122345
    > finds 22.
    >
    > This works:
    >
    > /(\d)\1/
    >
    > This doesn't:
    >
    > /\1(\d)/
    >
    > I guess LLR parsing is to blame,


    ....

    > I was a little surprized this didn't work although I can sort of
    > see why in a way too. In some ways it seems to me that regexes
    > should be *disconnected* from parsing - just answer the question
    > does this match?


    I don't look at this as a parsing issue. Rather, it is a "the
    universe must make sense" kind of issue: The first match does not
    exist before the first match. That makes sense to me. It may not
    make sense to you.

    Sinan
    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Apr 14, 2008
    #3
  4. Guest

    On Apr 14, 2:31 pm, Ben Morrow <> wrote:
    > Quoth :
    >
    >
    >
    >
    >
    > > I'm trying to find a repeated number in a string, like 122345 finds
    > > 22.

    >
    > > This works:

    >
    > > /(\d)\1/

    >
    > > This doesn't:

    >
    > > /\1(\d)/

    >
    > > I guess LLR parsing is to blame, but shouldn't the second example
    > > first try to FIND a $1 then check to see if there is a \1, and repeat
    > > that process moving L to R?

    >
    > > I though Perl sort of went to and fro trying to do matching. To me,
    > > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
    > > preceeding it.

    >
    > There are two separate operations here which you are confusing. First
    > perl parses the regex itself, and compiles it into an internal form.
    > Then it matches that regex against the string you provide. The second
    > will backtrack, under some circumstances; the first won't.
    >
    > Ben


    Understood, and I appreciate the insight. It makes sense.
    Yet, when all else apparently *fails*, in my experience, and I've
    heard MJD and others say this, Perl will "do its best" to match. To
    me, unless it *also* tried backtracking, it gave up too soon..
     
    , Apr 14, 2008
    #4
  5. Guest

    ..
    ..
    ..
    >
    > > I guess LLR parsing is to blame,

    >

    ..
    ..
    >
    > I don't look at this as a parsing issue. Rather, it is a "the
    > universe must make sense" kind of issue: The first match does not
    > exist before the first match. That makes sense to me. It may not
    > make sense to you.
    >


    To me, like conventional pattern-recognition, of say two tanks next to
    each other, the system should accept it whether the match is described
    either way:

    find a tank with another identical tank to it's left

    *or*

    find a tank with another identical tank to it's right


    The system should have no *context-sensitivity* where only one of the
    two matches. Sure, internally an algorithm may be scanning L to R or R
    to L or whatever, but the user should not even be concerned with that,
    at least in this case. I still think it gave up too soon- it should
    have tried R to L (backtracking) when L to R failed.

    Just IMHO, thank-you for your thoughts. This area seems just a bit
    gray to me I'd be very interested in Damain or Mark's thoughts.
     
    , Apr 14, 2008
    #5
  6. Ben Morrow Guest

    Quoth :
    > On Apr 14, 2:31 pm, Ben Morrow <> wrote:
    > > Quoth :
    > > >
    > > > I'm trying to find a repeated number in a string, like 122345 finds
    > > > 22.

    > >
    > > > This works:

    > >
    > > > /(\d)\1/

    > >
    > > > This doesn't:

    > >
    > > > /\1(\d)/

    > >
    > > > I guess LLR parsing is to blame, but shouldn't the second example
    > > > first try to FIND a $1 then check to see if there is a \1, and repeat
    > > > that process moving L to R?

    > >
    > > > I though Perl sort of went to and fro trying to do matching. To me,
    > > > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
    > > > preceeding it.

    > >
    > > There are two separate operations here which you are confusing. First
    > > perl parses the regex itself, and compiles it into an internal form.
    > > Then it matches that regex against the string you provide. The second
    > > will backtrack, under some circumstances; the first won't.

    >
    > Understood, and I appreciate the insight. It makes sense.
    > Yet, when all else apparently *fails*, in my experience, and I've
    > heard MJD and others say this, Perl will "do its best" to match. To
    > me, unless it *also* tried backtracking, it gave up too soon..


    No, you're still not understanding. Perl will only backtrack *while
    trying to match*. Compiling the regex comes long before that.

    Ben
     
    Ben Morrow, Apr 14, 2008
    #6
  7. Willem Guest

    wrote:
    ) Understood, and I appreciate the insight. It makes sense.
    ) Yet, when all else apparently *fails*, in my experience, and I've
    ) heard MJD and others say this, Perl will "do its best" to match. To
    ) me, unless it *also* tried backtracking, it gave up too soon..

    That's not what backtracking means.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Apr 14, 2008
    #7
  8. J.D. Baldwin Guest

    In the previous article, <> wrote:
    > > > I guess LLR parsing is to blame,

    > >

    > .
    > .
    > >
    > > I don't look at this as a parsing issue. Rather, it is a "the
    > > universe must make sense" kind of issue: The first match does not
    > > exist before the first match. That makes sense to me. It may not
    > > make sense to you.
    > >

    >
    > To me, like conventional pattern-recognition, of say two tanks next to
    > each other, the system should accept it whether the match is described
    > either way:
    >
    > find a tank with another identical tank to it's left
    >
    > *or*
    >
    > find a tank with another identical tank to it's right


    A better phrasing:

    find a tank, then find another one to its right

    *or*

    find another one to its left, then find a tank

    One of these phrasings makes sense; the other does not. Or, rather,
    the other doesn't and one of the phrasings makes sense.

    If you want a more formal justification, here's what the Camel Book
    says about these. Note the two instances of the word "later":

    1.7.4. Backreferences

    [...] A pair of parentheses around a part of a regular
    expression causes whatever was matched by that part to be
    remembered for later use. It doesn't change what the part
    matches, so /\d+/ and /(\d+)/ will still match as many digits
    as possible, but in the latter case they will be remembered in
    a special variable to be backreferenced later.
    --
    _+_ From the catapult of |If anyone disagrees with any statement I make, I
    _|70|___:)=}- J.D. Baldwin |am quite prepared not only to retract it, but also
    \ / |to deny under oath that I ever made it. -T. Lehrer
    ***~~~~-----------------------------------------------------------------------
     
    J.D. Baldwin, Apr 14, 2008
    #8
  9. wrote in
    news:e6278092-e663-4ea6-8f07-40d65faeb551
    @f63g2000hsf.googlegroups.co
    m:

    [ please do not snip attributions ]

    >> > I guess LLR parsing is to blame,

    >>
    >> I don't look at this as a parsing issue. Rather, it is a "the
    >> universe must make sense" kind of issue: The first match does not
    >> exist before the first match. That makes sense to me. It may not
    >> make sense to you.
    >>

    >
    > To me, like conventional pattern-recognition, of say two tanks
    > next to each other, the system should accept it whether the match
    > is described either way:
    >
    > find a tank with another identical tank to it's left
    >
    > *or*
    >
    > find a tank with another identical tank to it's right
    >
    >
    > The system should have no *context-sensitivity* where only one of
    > the two matches. Sure, internally an algorithm may be scanning L
    > to R or R to L or whatever, but the user should not even be
    > concerned with that, at least in this case. I still think it gave
    > up too soon- it should have tried R to L (backtracking) when L to
    > R failed.


    What you seem to want is a "match two identical characters"
    operator. For this particular case, you can achieve that by using:

    =for example

    my @strings = qw( 1222345 1233345 );

    s/00|11|22|33|44|55|66|77|88|99// for @strings;

    print "$_\n" for @strings;

    =cut

    When you use a character class, every element of that class is
    considered equivalent to every other one. So, for example, when you
    write

    /\d{2}/

    that does find two characters that are in the same equivalence
    class.

    The tank analogy works perftectly here because there are no two
    identical tanks in the world. Instead, there are equivalence classes
    of tanks. Tanks that are the same model, tanks in the same unit etc.

    If what you want is to say,

    find a tank, then find another tank that is the same
    model as the one you just found

    well, that is equivalent to /(\d)\1/

    J. D. Baldwin gives perfect examples of why /\1(\d)/ does not make
    sense: Finding another tank in the same equivalence class as the one
    you first found comes after first finding a tank.

    > Just IMHO, thank-you for your thoughts. This area seems just a bit
    > gray to me I'd be very interested in Damain or Mark's thoughts.


    s/Damain/Damian/

    My feeble mind looks at the following:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use 5.010;

    for ( my @a = qw( 1222345 1233345 ) ) {
    s/(?<tank>\d)\K\k<tank>// and print "$_\n";
    }

    for ( my @a = qw( 1222345 1233345 ) ) {
    s/(?<tank>\d)\K\k<tank>+// and print "$_\n";
    }

    for ( my @a = qw( 1222345 1233345 ) ) {
    s/(?<tank>\d)\k<tank>// and print "$_\n";
    }

    __END__

    thinks that the third one is the most natural (that is, find a tank,
    then find another tank in the same equivalence class) to the other
    ones.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Apr 14, 2008
    #9
  10. Guest

    Ben Morrow <> wrote:
    > Quoth :
    > > On Apr 14, 2:31 pm, Ben Morrow <> wrote:
    > > > Quoth :
    > > > >
    > > > > I'm trying to find a repeated number in a string, like 122345 finds
    > > > > 22.
    > > >
    > > > > This works:
    > > >
    > > > > /(\d)\1/
    > > >
    > > > > This doesn't:
    > > >
    > > > > /\1(\d)/
    > > >
    > > > > I guess LLR parsing is to blame, but shouldn't the second example
    > > > > first try to FIND a $1 then check to see if there is a \1, and
    > > > > repeat that process moving L to R?
    > > >
    > > > > I though Perl sort of went to and fro trying to do matching. To me,
    > > > > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1
    > > > > = 2 preceeding it.
    > > >
    > > > There are two separate operations here which you are confusing. First
    > > > perl parses the regex itself, and compiles it into an internal form.
    > > > Then it matches that regex against the string you provide. The second
    > > > will backtrack, under some circumstances; the first won't.

    > >
    > > Understood, and I appreciate the insight. It makes sense.
    > > Yet, when all else apparently *fails*, in my experience, and I've
    > > heard MJD and others say this, Perl will "do its best" to match. To
    > > me, unless it *also* tried backtracking, it gave up too soon..

    >
    > No, you're still not understanding. Perl will only backtrack *while
    > trying to match*. Compiling the regex comes long before that.


    I think that that is what he is talking about, the when trying to match
    part. His use of "parsing" in the original question was ill-advised, but I
    think you latched onto the the bad phrasing and rather than the intended
    question, and now won't let him correct his poor phrasing. First perl
    parses and compiles the regular expression, then it uses that compiled
    expression to match (or loosely speaking "parse") the target string.

    Perl parses and compiles /\1(.)/ without error or warning (which surprised
    me). But then what does it do with it?

    Conceptually, it could temporarily treat the \1 as ".*", and then when/if
    the capture is matched go back and verify that it is the same as the thing
    previously matched by the tentative .* cum \1. I don't know that I would
    call this backtracking (as the OP seems to be doing), but I can't think of
    anything obviously better to call it. Or it could reorder things give
    an identical compiled regex as /(.)\1/. I don't know if these two things
    would give the same answer as each other in all cases (if so, the latter
    would surely be faster).

    I think that that is what the OP thought it should do. Obviously, Perl
    doesn't do either of those thing. I can't figure out what it does do. I
    thought it would treat \1 preceding any capture as the empty string, but
    apparently it doesn't do that, either. It seems to act as something
    unmatchable.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Apr 14, 2008
    #10
  11. On 2008-04-14 18:57, <> wrote:
    >> I don't look at this as a parsing issue. Rather, it is a "the
    >> universe must make sense" kind of issue: The first match does not
    >> exist before the first match. That makes sense to me. It may not
    >> make sense to you.
    >>

    >
    > To me, like conventional pattern-recognition, of say two tanks next to
    > each other, the system should accept it whether the match is described
    > either way:
    >
    > find a tank with another identical tank to it's left
    >
    > *or*
    >
    > find a tank with another identical tank to it's right
    >
    >
    > The system should have no *context-sensitivity* where only one of the
    > two matches. Sure, internally an algorithm may be scanning L to R or R
    > to L or whatever, but the user should not even be concerned with that,
    > at least in this case. I still think it gave up too soon- it should
    > have tried R to L (backtracking) when L to R failed.


    Backtracking doesn't mean scanning right to left. Backtracking means to
    go back to the last point where you had a choice and try the other
    alternative(s).

    So, for example if you have a pattern /foo(bar|baz)/, after matching
    "foo", you have a choice between trying to match "bar" or "baz". The
    regex engine will try to match "bar" first, and if that fails, it will
    backtrack to the point before it tried that and then try to match "baz"
    instead.

    But in a pattern like /\1(a)/ there is no choice: It needs to start by
    matching the string in the first capture group, but that hasn't been
    defined yet, so it must fail. (Well, it could try all possible strings,
    but that would be extremely inefficient).

    hp
     
    Peter J. Holzer, Apr 14, 2008
    #11
  12. Guest

    wrote:
    > In the previous article, <> wrote:
    > >
    > > find a tank with another identical tank to it's left
    > >
    > > *or*
    > >
    > > find a tank with another identical tank to it's right

    >
    > A better phrasing:
    >
    > find a tank, then find another one to its right
    >
    > *or*
    >
    > find another one to its left, then find a tank
    >
    > One of these phrasings makes sense; the other does not. Or, rather,
    > the other doesn't and one of the phrasings makes sense.
    >
    > If you want a more formal justification, here's what the Camel Book
    > says about these. Note the two instances of the word "later":


    I think that that is his point, an objection to the notion that
    left and right typographically equate to "earlier" and "later"
    chronologically.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Apr 14, 2008
    #12
  13. [A complimentary Cc of this posting was sent to

    <>], who wrote in article <>:
    >
    > I'm trying to find a repeated number in a string, like 122345 finds
    > 22.
    >
    > This works:
    >
    > /(\d)\1/
    >
    > This doesn't:
    >
    > /\1(\d)/


    This depends on what you mean by "works". It works in the sense that
    it does not match (as it should not). I do not find it documented in
    perlre, but \3 will fail to match if group 3 did not match "yet".

    Hope this helps,
    Ilya

    P.S. perl -Mre=debugcolor -wle "q(aa) =~ /\1(a)/"
     
    Ilya Zakharevich, Apr 14, 2008
    #13
  14. Ben Morrow Guest

    Quoth :
    > Ben Morrow <> wrote:
    > >
    > > No, you're still not understanding. Perl will only backtrack *while
    > > trying to match*. Compiling the regex comes long before that.

    >
    > I think that that is what he is talking about, the when trying to match
    > part. His use of "parsing" in the original question was ill-advised, but I
    > think you latched onto the the bad phrasing and rather than the intended
    > question, and now won't let him correct his poor phrasing. First perl
    > parses and compiles the regular expression, then it uses that compiled
    > expression to match (or loosely speaking "parse") the target string.


    You're right, I was misunderstanding the OP's misunderstanding. :)

    > Perl parses and compiles /\1(.)/ without error or warning (which surprised
    > me). But then what does it do with it?


    I was assuming (without having tried it) that the regex was failing at
    the compile stage. It seems I was wrong... :(

    > Conceptually, it could temporarily treat the \1 as ".*", and then when/if
    > the capture is matched go back and verify that it is the same as the thing
    > previously matched by the tentative .* cum \1.


    Something like this can be done with

    /(.*)(a)(??{ $1 eq $2 ? "(?:)" : "(?!)" })/

    using a code assertion to insert either a 'succeed' or a 'fail and
    backtrack' item into the regex at runtime. Not that I'd recommend this,
    of course... :)

    > I don't know that I would
    > call this backtracking (as the OP seems to be doing), but I can't think of
    > anything obviously better to call it. Or it could reorder things give
    > an identical compiled regex as /(.)\1/. I don't know if these two things
    > would give the same answer as each other in all cases (if so, the latter
    > would surely be faster).
    >
    > I think that that is what the OP thought it should do. Obviously, Perl
    > doesn't do either of those thing. I can't figure out what it does do. I
    > thought it would treat \1 preceding any capture as the empty string, but
    > apparently it doesn't do that, either. It seems to act as something
    > unmatchable.


    All the $N start as undef, which is unmatchable as you found (-Mre=debug
    is useful for finding out what is going on). Whenever perl backtracks to
    retry part of a match, it clears any $N set by the part of the match it
    is backtracking over, so /\1(.)/ couldn't match even if it did
    backtrack, as $1 would be undef again by the time it got to retry the
    \1. (Perl doesn't in fact backtrack, as it knows nothing has changed so
    this would be an infinite loop.)

    It is, however, possible to get \1 to match when it appears earlier in
    the expression than the first brackets (which is why it's not a syntax
    error); you just have to make sure it gets set first. For instance,

    "abac" =~ /^(?:\1c|(a)b)+$/

    matches. The first time through the +, $1 is undef so the \1c part fails;
    but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
    again, and this time $1 is 'a' so the first branch can succeed.

    Ben
     
    Ben Morrow, Apr 15, 2008
    #14
  15. On Apr 14, 4:28 pm, Ben Morrow <> wrote:
    > ...
    >
    > It is, however, possible to get \1 to match when it appears earlier in
    > the expression than the first brackets (which is why it's not a syntax
    > error); you just have to make sure it gets set first. For instance,
    >
    > "abac" =~ /^(?:\1c|(a)b)+$/
    >
    > matches. The first time through the +, $1 is undef so the \1c part fails;
    > but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
    > again, and this time $1 is 'a' so the first branch can succeed.
    >


    Slightly different example sans alternation could work too if $N is
    optional:

    "abab" =~ /(?:(\2)?(b))+/

    I'd have the same enthusiasm for a root canal though...

    --
    Charles DeRykus
     
    comp.llang.perl.moderated, Apr 16, 2008
    #15
  16. [A complimentary Cc of this posting was sent to
    comp.llang.perl.moderated
    <>], who wrote in article <>:
    > Slightly different example sans alternation could work too if $N is
    > optional:
    >
    > "abab" =~ /(?:(\2)?(b))+/
    >
    > I'd have the same enthusiasm for a root canal though...


    Extra parens are considered harmfull:

    perl -wle "q(abab) =~ /(?: \1? (b) )+/x or die"

    (not much beauty gained or lost, of course...).

    Yours,
    Ilya
     
    Ilya Zakharevich, Apr 17, 2008
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Mr. SweatyFinger
    Replies:
    2
    Views:
    2,031
    Smokey Grindel
    Dec 2, 2006
  2. =?Utf-8?B?RGF2aWQgVGhpZWxlbg==?=

    bind data both ways in a repeater control

    =?Utf-8?B?RGF2aWQgVGhpZWxlbg==?=, Jan 17, 2007, in forum: ASP .Net
    Replies:
    3
    Views:
    593
    Walter Wang [MSFT]
    Jan 18, 2007
  3. Chris
    Replies:
    1
    Views:
    368
    Chris
    Jun 20, 2007
  4. bpatton

    please splain dis scoping issure

    bpatton, Jun 22, 2007, in forum: Perl Misc
    Replies:
    2
    Views:
    120
    J├╝rgen Exner
    Jun 25, 2007
  5. Trev

    regex - why won't this work?

    Trev, Oct 26, 2006, in forum: Javascript
    Replies:
    7
    Views:
    122
    mick white
    Oct 26, 2006
Loading...

Share This Page