Interpolation of qr-regexes containing backreferences

Discussion in 'Perl Misc' started by Haakon Riiser, Jan 22, 2004.

  1. I just noticed that backreferences in qr-regexes behave differently
    from what I expected when they are interpolated into a new regex.
    I expected that the meaning of the backreference shouldn't change
    when interpolated into a new regex. I.e., one should be able to
    do things like:

    $re1 = qr{(.)\1};
    $re2 = qr{($re1$re1)};

    which I would expect to be equivalent to

    $re2 = qr{((.)\2(.)\3)};

    Perl 5.8.3 instead does this:

    $re2 = qr{((.)\1(.)\1)};
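
    A minimal sketch of the effect, assuming a reasonably recent Perl 5.
    The name $by_hand and the test strings are illustrative only. It
    contrasts the interpolated regex with the hand-numbered one:

```perl
use strict;
use warnings;

my $re1     = qr{(.)\1};          # alone: a character followed by itself
my $re2     = qr{($re1$re1)};     # interpolated: the text \1 is copied verbatim
my $by_hand = qr{((.)\2(.)\3)};   # backreferences renumbered manually

# Interpolation stringifies $re1, so printing $re2 shows \1 unchanged.
print "re2 is: $re2\n";

for my $s ('aabb', 'abab') {
    printf "%-4s  re2: %-3s  by hand: %s\n", $s,
        ($s =~ /^$re2$/     ? 'yes' : 'no'),
        ($s =~ /^$by_hand$/ ? 'yes' : 'no');
}
```

    Inside $re2 the \1 now points at the outer group, which is still
    open when the backreference is tried, so the doubled-character
    meaning of $re1 is lost.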

    I searched for the problem on Google, and found that it has been
    known for at least three years. Since it's still here, does that
    mean that there's another solution that does not require me to
    drop the interpolation and write the entire regex as one big chunk?

    Thanks in advance for any replies.

    --
    Haakon
     
    Haakon Riiser, Jan 22, 2004
    #1

  2. Ben Morrow

    Haakon Riiser <> wrote:
    > I just noticed that backreferences in qr-regexes behave differently
    > from what I expected when they are interpolated into a new regex.
    > I expected that the meaning of the backreference shouldn't change
    > when interpolated into a new regex. I.e., one should be able to
    > do things like:
    >
    > $re1 = qr{(.)\1};
    > $re2 = qr{($re1$re1)};
    >
    > which I would expect to be equivalent to
    >
    > $re2 = qr{((.)\2(.)\3)};
    >
    > Perl 5.8.3 instead does this:
    >
    > $re2 = qr{((.)\1(.)\1)};


    You could try (untested):

    my $re1 = qr[(.)(??{$^N})];
    my $re2 = qr[($re1$re1)];
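
    One caveat, sketched under the assumption of Perl 5.18 or later:
    the text returned by (??{ ... }) is compiled as a pattern, so a
    captured "." or "(" would be read as regex syntax. Wrapping $^N
    in quotemeta is one way to guard against that (the test strings
    are made up for illustration):

```perl
use strict;
use warnings;

# $^N holds the text of the most recently closed capture group.
# quotemeta keeps a captured "." or "(" from being read as regex syntax.
my $re1 = qr[(.)(??{ quotemeta $^N })];
my $re2 = qr[($re1$re1)];

for my $s ('aabb', '..((', 'aabc') {
    printf "%s: %s\n", $s, ($s =~ /^$re2$/ ? 'match' : 'no match');
}
```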

    Ben

    --
    perl -e'print map {/.(.)/s} sort unpack "a2"x26, pack "N"x13,
    qw/1632265075 1651865445 1685354798 1696626283 1752131169 1769237618
    1801808488 1830841936 1886550130 1914728293 1936225377 1969451372
    2047502190/' #
     
    Ben Morrow, Jan 22, 2004
    #2

  3. [Ben Morrow]

    > Haakon Riiser <> wrote:
    >> [...] I.e., one should be able to do things like:
    >>
    >> $re1 = qr{(.)\1};
    >> $re2 = qr{($re1$re1)};
    >>
    >> which I would expect to be equivalent to
    >>
    >> $re2 = qr{((.)\2(.)\3)};
    >>

    >
    > You could try (untested):
    >
    > my $re1 = qr[(.)(??{$^N})];
    > my $re2 = qr[($re1$re1)];


    Thanks, this works great! I've usually tried to avoid "highly
    experimental" regex features such as (??{ ... }), but it's been
    marked highly experimental for a few years now, so how dangerous
    could it be?

    I should probably reread that section of the regex manual since
    I didn't pay too much attention to it the first time, it being
    experimental and all. :)

    --
    Haakon
     
    Haakon Riiser, Jan 22, 2004
    #3
  4. [Ben Morrow]

    > You could try (untested):
    >
    > my $re1 = qr[(.)(??{$^N})];
    > my $re2 = qr[($re1$re1)];


    One question regarding the behavior of (??{ ... }):
    Take the following code: (Notice that there are two versions of
    the $quoted_literal regex. The first one uses (??{ ... }) and $^N
    and the other one uses the delimiter directly.)

    use warnings;

    $quoted_literal = qr/
    (")
    (??{ "[^$^N]*$^N" })
    /x;

    $quoted_literal = qr/
    "
    [^"]*
    "
    /x;

    $data = 'this is "hello" world';
    @list = $data =~ /($quoted_literal|[^"]*)/g;
    for ($i = 0; $i < @list; $i++) {
    printf "[$i] '\%s'\n", defined $list[$i] ? $list[$i] : "UNDEFINED";
    }

    If I run this program as it is (using the simple direct version of
    $quoted_literal) the output is

    [0] 'this is '
    [1] '"hello"'
    [2] ' world'
    [3] ''

    If the simple version of $quoted_literal is removed, i.e. making the
    script use the (??{ ... }) / $^N version, the result is completely
    different:

    [0] 'this is '
    [1] 'UNDEFINED'
    [2] '"hello"'
    [3] '"'
    [4] ' world'
    [5] 'UNDEFINED'
    [6] ''
    [7] 'UNDEFINED'

    As I understood it, the two versions of $quoted_literal should
    match exactly the same text, so I can't figure out why the results
    aren't the same. Any help in understanding why this happens,
    and preferably fixing it, is greatly appreciated.

    --
    Haakon
     
    Haakon Riiser, Jan 22, 2004
    #4
  5. Ben Morrow

    Haakon Riiser <> wrote:
    > [Ben Morrow]
    >
    > > You could try (untested):
    > >
    > > my $re1 = qr[(.)(??{$^N})];
    > > my $re2 = qr[($re1$re1)];

    >
    > One question regarding the behavior of (??{ ... }):
    > Take the following code: (Notice that there are two versions of
    > the $quoted_literal regex. The first one uses (??{ ... }) and $^N
    > and the other one uses the delimiter directly.)
    >
    > use warnings;
    >
    > $quoted_literal = qr/
    > (")
    > (??{ "[^$^N]*$^N" })
    > /x;
    >
    > $quoted_literal = qr/
    > "
    > [^"]*
    > "
    > /x;
    >
    > $data = 'this is "hello" world';
    > @list = $data =~ /($quoted_literal|[^"]*)/g;
    > for ($i = 0; $i < @list; $i++) {
    > printf "[$i] '\%s'\n", defined $list[$i] ? $list[$i] : "UNDEFINED";
    > }
    >
    > If I run this program as it is (using the simple direct version of
    > $quoted_literal) the output is
    >
    > [0] 'this is '
    > [1] '"hello"'
    > [2] ' world'
    > [3] ''
    >
    > If the simple version of $quoted_literal is removed, i.e. making the
    > script use the (??{ ... }) / $^N version, the result is completely
    > different:
    >
    > [0] 'this is '
    > [1] 'UNDEFINED'
    > [2] '"hello"'
    > [3] '"'
    > [4] ' world'
    > [5] 'UNDEFINED'
    > [6] ''
    > [7] 'UNDEFINED'
    >
    > As I understood it, the two versions of $quoted_literal should
    > match exactly the same text, so I can't figure out why the results
    > aren't the same. Any help in understanding why this happens,
    > and preferably fixing it, is greatly appreciated.


    The regex with (??{}) in it has an extra set of parentheses. If you
    take the second output again, and number the rows:

    > [0] 'this is ' $1
    > [1] 'UNDEFINED' $2
    > [2] '"hello"' $1
    > [3] '"' $2
    > [4] ' world' $1
    > [5] 'UNDEFINED' $2
    > [6] '' $1
    > [7] 'UNDEFINED' $2


    it should be clear. BTW, you would almost certainly be better off
    using Text::Balanced for this sort of thing.
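
    The effect can be reproduced without (??{}) at all. In list
    context, m//g returns every capture group for every match, so
    one extra pair of parentheses doubles the list. A small sketch
    (the names @one and @two are illustrative, and [^"]+ is used
    instead of [^"]* only to leave out the empty matches):

```perl
use strict;
use warnings;

my $data = 'this is "hello" world';

# One capture group: one list element per match.
my @one = $data =~ /("[^"]*"|[^"]+)/g;

# Two capture groups: two elements per match, with undef wherever
# the inner group did not take part in the match.
my @two = $data =~ /((")[^"]*"|[^"]+)/g;

printf "%d elements vs %d elements\n", scalar @one, scalar @two;
```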

    Ben

    --
    EAT
    KIDS (...er, whoops...)
    FOR
    99p
     
    Ben Morrow, Jan 22, 2004
    #5
  6. [Ben Morrow]

    > The regex with (??{}) in it has an extra set of parentheses. If
    > you take the second output again, and number the rows:
    >
    >> [0] 'this is ' $1
    >> [1] 'UNDEFINED' $2
    >> [2] '"hello"' $1
    >> [3] '"' $2
    >> [4] ' world' $1
    >> [5] 'UNDEFINED' $2
    >> [6] '' $1
    >> [7] 'UNDEFINED' $2

    >
    > it should be clear.


    Argh, I can't believe I didn't spot that one. Time to take a
    break I guess. :)

    > BTW, you would almost certainly be better off using
    > Text::Balanced for this sort of thing.


    That would require me to totally rewrite my tokenizer. I was
    working on a small parser (using the wonderful Parse::Yapp),
    and did the entire tokenizing with a single regex-match.

    @tokens = $raw_data =~ m{
    $comment | ( $quoted_literal | $special | $op | $unquoted_literal )
    }gx;

    The language is quite simple, so it is possible to do every regex
    without using internal capturing. The only construct that would
    be simplified with backreferences was $quoted_literal, which
    supports three types of strings: double quoted, single quoted,
    and user-defined delimiter.

    " ... "
    ' ... '
    ^c ... c

    where c can be any character, and the delimiters can be escaped
    by putting two of them next to each other:

    'foo ''bar'' baz' == foo 'bar' baz

    Since the third string type supports any character as a delimiter,
    it would be nice if I could use backreferences. Now that that's
    out of the question, I chose instead to generate a bunch of regexes
    (one for each ASCII character) using sprintf. Not as elegant,
    but it works, and it's probably faster than the equivalent solution
    with backreferences would have been.
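
    A rough sketch of that approach, not the actual tokenizer code.
    The helper name delimited, the printable range 33 .. 126, and the
    literal "^" prefix for user-defined delimiters are assumptions
    made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: builds the pattern text for one delimiter character.
sub delimited {
    my $c = quotemeta(shift);   # escape the delimiter if it is a metacharacter
    # opening delimiter, then non-delimiters or doubled delimiters, then closing
    return sprintf '%s(?:[^%s]|%s%s)*%s', ($c) x 5;
}

my @alts = (delimited('"'), delimited("'"));

# ^c ... c for every printable, non-space ASCII character c
push @alts, '\^' . delimited(chr $_) for 33 .. 126;

my $alternation    = join '|', @alts;
my $quoted_literal = qr/$alternation/;   # no capturing groups inside

for my $s (q{'foo ''bar'' baz'}, '^|a||b|', '^xabc') {
    printf "%-20s %s\n", $s, ($s =~ /^$quoted_literal$/ ? 'ok' : 'no match');
}
```

    Since every group is non-capturing, the pattern can sit inside the
    single capturing group of the tokenizer without adding elements to
    the list returned by m//g.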

    --
    Haakon
     
    Haakon Riiser, Jan 23, 2004
    #6
  7. [A complimentary Cc of this posting was sent to
    Haakon Riiser
    <>], who wrote in article <>:
    > $re1 = qr{(.)\1};
    > $re2 = qr{($re1$re1)};
    >
    > which I would expect to be equivalent to
    >
    > $re2 = qr{((.)\2(.)\3)};


    What makes you expect this? qr() is an analogue of qq() etc...

    > Perl 5.8.3 instead does this:
    >
    > $re2 = qr{((.)\1(.)\1)};


    As designed...

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Jan 23, 2004
    #7
  8. [Ilya Zakharevich]

    >>> What makes you expect this? qr() is an analogue of qq() etc...

    >
    >> That's not how I would design it.

    >
    > Who cares? What is important is how it *is* designed.


    Who cares how it is designed? You asked me what made *me* expect
    that qr regexes can be interpolated with predictable behavior.
    The answer was, of course, that this would make sense to me,
    while the current design makes no sense since you can accomplish
    the same thing by interpolating a string representation of the
    regex, while the more useful case of localized regex scope w/o
    capturing side effects is impossible to achieve.

    >> I think that if you need the
    >> regex to be interpolated exactly as written, use q// or qq//.

    >
    > "Exactly as written"??? And what do you think it would be, q// or qq//?
    > (One cannot replace qr() by qq(), any more than replace qq() by q().)


    I shouldn't have to explain what I mean by "exactly as written".
    In the case with q, that means character-by-character. With qq,
    it means that the result of the string processing (translation
    of character escapes such as \n and \t, and variable interpolation)
    is interpolated directly.

    >> Interpolation of qr// should rewrite the regex, if necessary,
    >> so that it matches the same text as it would match when used on
    >> its own.

    >
    > That's (??{}). Why do you want to merge two different cases into one?


    As I said in the previous post,

    Interpolation of qr// should rewrite the regex, if necessary,
    so that it matches the same text as it would match when used on
    its own. This is much more useful, since you can then build
    up a large regex from several small qr chunks, without having
    to worry that modifications to one of the building blocks will
    suddenly break regexes interpolated after it.

    I think that the string type interpolation of qr that you think
    is so well designed is an ugly kludge that makes big regexes
    hard to maintain. I can't see *any* reason as to why you can't
    simply create the regex as a regular string and interpolate that,
    if you so desperately need separate regex building blocks that
    can refer to each other. qr regexes could then be used when you
    need the regexes to be completely shielded from each other (which
    in my experience is *much* more common than wanting spaghetti
    code regexes), and we wouldn't have to resort to (??{}) to get
    something as common as backreferences.

    I sure hope Ben Morrow was right when he said that qr interpolation
    works the way I like it in Perl 6.

    --
    Haakon
     
    Haakon Riiser, Jan 25, 2004
    #8
  9. [A complimentary Cc of this posting was sent to
    Haakon Riiser
    <>], who wrote in article <>:
    > >>> What makes you expect this? qr() is an analogue of qq() etc...


    > >> That's not how I would design it.


    > > Who cares? What is important is how it *is* designed.


    > Who cares how it is designed? You asked me what made *me* expect
    > that qr regexes can be interpolated with predictable behavior.


    Do not put words in my mouth, please.

    > The answer was, of course, that this would make sense to me,


    So what the documentation says does not matter, right?

    > while the current design makes no sense since you can accomplish
    > the same thing by interpolating a string representation of the
    > regex, while the more useful case of localized regex scope w/o
    > capturing side effects is impossible to achieve.


    I see that you not only do not read the docs, but also do not read the
    answers to your questions on this newsgroup.

    [Omitting meaningless suggestions already refuted in the preceding
    discussion.]

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Jan 25, 2004
    #9
  10. gnari

    "Ilya Zakharevich" <> wrote in message
    news:bv13d0$18g2$...
    > [A complimentary Cc of this posting was sent to
    > Haakon Riiser
    > <>], who wrote in article <>:
    > > >>> What makes you expect this? qr() is an analogue of qq() etc...

    >
    > > >> That's not how I would design it.

    >
    > > > Who cares? What is important is how it *is* designed.

    >
    > > Who cares how it is designed? You asked me what made *me* expect
    > > that qr regexes can be interpolated with predictable behavior.

    >
    > Do not put words in my mouth, please.
    >
    > > The answer was, of course, that this would make sense to me,

    >
    > So what the documentation says does not matter, right?
    >
    > > while the current design makes no sense since you can accomplish
    > > the same thing by interpolating a string representation of the
    > > regex, while the more useful case of localized regex scope w/o
    > > capturing side effects is impossible to achieve.

    >
    > I see that you not only do not read the docs, but also do not read the
    > answers to your questions on this newsgroup.


    hey. no need to let this degenerate into a flame war.

    looked to me like the OP was familiar with the way it works,
    but was expressing his view that he would have expected it to
    be implemented differently than it is. some of the follow-ups
    have been interesting, actually, and the original question was not
    without merit.

    gnari.
     
    gnari, Jan 25, 2004
    #10
  11. [A complimentary Cc of this posting was sent to
    gnari
    <>], who wrote in article <bv197i$95f$>:
    > > I see that you not only do not read the docs, but also do not read the
    > > answers to your questions on this newsgroup.

    >
    > hey. no need to let this degenerate into a flame war.
    >
    > looked to me like the OP was familiar with the way it works,


    Except for the knowledge of the bug that qr(\2) does not work, I did not
    observe any familiarity. He claims that the result of qr(whatever) is
    the same as qq(whatever); he claims that some things cannot be done,
    etc.

    > but was expressing his view that he would have expected it to
    > be implemented differently than it is.


    I noticed this. But *why* do you think this view deserves to be
    shared? Different people have different expectations. But the only
    place this matters (after the initial design stage is behind) is: if
    the docs do not clear the ambiguities, the docs must be corrected.

    But it does not look like this is the topic of this discussion...

    Yours,
    Ilya
     
    Ilya Zakharevich, Jan 26, 2004
    #11
  12. [Ilya Zakharevich]

    > [...]


    What are we really discussing here? In my last posts, I have
    merely been stating how I would have designed qr interpolation,
    and I have tried to describe the reasons for it. I now know that
    it wasn't intended to work that way in Perl 5, and of course I
    accept that. There's really nothing to argue over, unless you're
    offended that I don't agree with the current implementation.

    --
    Haakon
     
    Haakon Riiser, Jan 26, 2004
    #12
  13. [Ilya Zakharevich]

    > He claims that the result of qr(whatever) is the same as
    > qq(whatever);


    Yes, I believed it was. If you could give me a simple example of
    the potential differences in $re2 in the following two examples,
    I would appreciate it. (Really, I'm not being sarcastic. :)

    # Example 1: Interpolating a qr-regex
    $re1 = qr(whatever);
    $re2 = qr($re1);

    # Example 2: Interpolating a regex stored as a qq-string
    $re1 = qq(whatever);
    $re2 = qr($re1);
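
    One difference that can be checked directly is that a qr-regex
    carries its own flags into the enclosing pattern, while a string
    is only text. A small sketch, with the /i flag and the variable
    names added purely for illustration:

```perl
use strict;
use warnings;

# Example 1: a qr-regex keeps its own flags when interpolated.
my $re_obj  = qr(whatever)i;
my $from_qr = qr/^$re_obj\z/;

# Example 2: a plain string is just text; no flags travel with it.
my $re_str  = qq(whatever);
my $from_qq = qr/^$re_str\z/;

printf "qr: %s\n", ('WHATEVER' =~ $from_qr ? 'match' : 'no match');
printf "qq: %s\n", ('WHATEVER' =~ $from_qq ? 'match' : 'no match');
```

    A qq string also has its escapes processed by the string rules
    before the regex compiler sees it (\b in a double-quoted string
    is a backspace, for instance), which qr does not do.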

    > he claims that some things cannot be done, etc.


    Yes, I claimed that there was no way to use capturing in an
    interpolated regex without causing some side effects. E.g.,
    if you say

    $re2 = qr($re1);

    then

    $data =~ $re2;

    will capture into $1, $2, ... if you use capturing parentheses in
    $re1. I claimed that it was impossible to use capturing locally in
    $re1 without causing this side effect. If you can prove me wrong,
    I'd be grateful if you can show me how to do it. It would actually
    be of great help to me in the project I'm currently working on.
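
    One possible route, following the earlier remark in this thread
    that this is what (??{}) is for: return the qr-regex itself from
    the code block instead of interpolating it. This is only a sketch,
    assuming Perl 5.18 or later; its behavior on 5.8.3 and its speed
    are not verified here:

```perl
use strict;
use warnings;

my $re1 = qr{(.)\1};    # uses capture group 1 internally

# Each (??{ ... }) runs $re1 as its own sub-pattern, so its group 1
# stays private and the outer regex has only the one group written here.
my $re2 = qr{((??{ $re1 })(??{ $re1 }))};

if ('aabb' =~ /^$re2$/) {
    print "matched, \$1 = $1\n";
    print "\$2 is ", (defined $2 ? "'$2'" : "undefined"), "\n";
}
```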

    --
    Haakon
     
    Haakon Riiser, Jan 26, 2004
    #13
  14. [A complimentary Cc of this posting was sent to
    Haakon Riiser
    <>], who wrote in article <>:
    > > he claims that some things cannot be done, etc.


    > Yes, I claimed that there was no way to use capturing in an
    > interpolated regex without causing some side effects. E.g.,
    > if you say
    >
    > $re2 = qr($re1);
    >
    > then
    >
    > $data =~ $re2;
    >
    > will capture into $1, $2, ... if you use capturing parentheses in
    > $re1. I claimed that it was impossible to use capturing locally in
    > $re1 without causing this side effect. If you can prove me wrong,


    If you specify your problem, I'm sure a lot of people will be glad to
    help you. I, personally, cannot grok what it is exactly you want to
    achieve.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Jan 26, 2004
    #14
