regex behavior

Discussion in 'Perl Misc' started by Matija Papec, Oct 1, 2003.

  1. Matija Papec

    Matija Papec Guest

    I went through perldoc but didn't find a similar regex:
    print join ',', 'a bb ccc dddd' =~ /(\w)+/g;

    The question is, what exactly does it match, and why?


    --
    Matija
    Matija Papec, Oct 1, 2003
    #1

  2. Abigail wrote:
    >
    > Matija Papec () wrote on MMMDCLXXXIII September MCMXCIII
    > in <URL:news:>:
    > --
    > -- I went through perldoc but didn't find a similar regex:
    > -- print join ',', 'a bb ccc dddd' =~ /(\w)+/g;
    > --
    > -- The question is, what exactly does it match, and why?
    >
    > /(\w)+/ matches a set of consecutive word characters, capturing
    > the *last* one. //g in list context means, do this as often as
    > possible (without overlap), returning a list of each of the submatches.
    >
    > So, 'a bb ccc dddd' =~ /(\w)+/g; returns for each substring of
    > consecutive word characters the last one, resulting in 'a', 'b', 'c' and 'd'.
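For anyone who wants to try this, here is a minimal sketch that runs the original pattern next to the variant with the + inside the parentheses. The second input string with distinct letters is borrowed from later in the thread; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Group outside the quantifier: each match keeps only the last \w it saw.
my @last = 'a bb ccc dddd' =~ /(\w)+/g;

# Quantifier inside the group: each match keeps the whole run.
my @whole = 'a bb ccc dddd' =~ /(\w+)/g;

# Distinct letters make it obvious which character survives.
my @distinct = 'a bc def ghij' =~ /(\w)+/g;

print join(',', @last), "\n";      # a,b,c,d
print join(',', @whole), "\n";     # a,bb,ccc,dddd
print join(',', @distinct), "\n";  # a,c,f,j
```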


    That tests out as you said, so it's MY thinking that's off. :)
    Hopefully, you can clue me in. :)

    I expected it to result in "a,bb,ccc,dddd". Now I realize that
    it's the positioning of the + that causes it to get a single
    character from each group. If the + is inside the (), it
    prints what I expected.

    But... What is causing the original /(\w)+/ to get the LAST
    character from each group instead of the FIRST character from
    each group?

    I changed the input string to 'a bc def ghij' and it printed
    "a,c,f,j" as you noted. But I don't see why it's the LAST
    character per group. At this point, I now expect "a,b,d,g".

    Ignoring the () to populate the result list, the \w+ matches a
    string of one or more characters. On the second match, it will
    grab "bc".

    Now why isn't the () part of that getting the FIRST of those
    characters?

    And what regex would you use to get the FIRST char of each group
    since this one doesn't?

    Mike
    Michael P. Broida, Oct 1, 2003
    #2

  3. [posted & mailed]

    On Wed, 1 Oct 2003, Michael P. Broida wrote:

    > But... What is causing the original /(\w)+/ to get the LAST
    > character from each group instead of the FIRST character from
    > each group?


    The location of the + quantifier.

    > Ignoring the () to populate the result list, the \w+ matches a
    > string of one or more characters. On the second match, it will
    > grab "bc".


    DON'T ignore the (), they're important here. (\w+) is seen by the regex
    engine as something like this:

    OPEN $1
    PLUS
    ALNUM
    CLOSE $1

    whereas (\w)+ is seen as

    PLUS
    OPEN $1
    ALNUM
    CLOSE $1

    > Now why isn't the () part of that getting the FIRST of those
    > characters?


    It does... but then the + quantifier causes $1 to be repopulated with the
    NEXT character \w matches, and so on.

    > And what regex would you use to get the FIRST char of each group
    > since this one doesn't?


    I'd use /(\w)\w*/g, or perhaps /\b\w/g (if a /.../g regex has no
    capturing parens, a list-context match returns the whole text of each match).
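A quick sketch of both suggestions, run against the 'a bc def ghij' string used earlier in the thread. The variable names are made up for the example.

```perl
use strict;
use warnings;

my $str = 'a bc def ghij';

# Capture one word character, then consume the rest of the run uncaptured.
my @first_capture = $str =~ /(\w)\w*/g;

# No parens: //g in list context returns the text of each whole match,
# which here is the single \w that follows a word boundary.
my @first_boundary = $str =~ /\b\w/g;

print join(',', @first_capture), "\n";   # a,b,d,g
print join(',', @first_boundary), "\n";  # a,b,d,g
```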

    --
    Jeff Pinyan RPI Acacia Brother #734 2003 Rush Chairman
    "And I vos head of Gestapo for ten | Michael Palin (as Heinrich Bimmler)
    years. Ah! Five years! Nein! No! | in: The North Minehead Bye-Election
    Oh. Was NOT head of Gestapo AT ALL!" | (Monty Python's Flying Circus)
    Jeff 'japhy' Pinyan, Oct 2, 2003
    #3
  4. Bill

    Bill Guest

    > Ignoring the () to populate the result list, the \w+ matches a
    > string of one or more characters. On the second match, it will
    > grab "bc".
    >
    > Now why isn't the () part of that getting the FIRST of those
    > characters?
    >
    > And what regex would you use to get the FIRST char of each group
    > since this one doesn't?
    >
    > Mike



    From `perldoc perlre`:

    By default, a quantified subpattern is "greedy", that is, it will
    match as many times as possible (given a particular starting location)
    while still allowing the rest of the pattern to match. If you want it
    to match the minimum number of times possible, follow the quantifier
    with a "?".
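To see what greediness contributes here, a small sketch comparing + with +? on the same group. This is only an illustration of the perlre paragraph above, with an input string chosen for the example: the greedy form runs to the end of each word, so the last character is what remains in the capture, while the non-greedy form stops after one character per match.

```perl
use strict;
use warnings;

my $str = 'a bc def';

# Greedy: the group repeats as often as it can, so each match spans a
# whole word and the capture ends up holding the word's last character.
my @greedy = $str =~ /(\w)+/g;

# Non-greedy: one repetition is enough, so each match is one character.
my @lazy = $str =~ /(\w)+?/g;

print join(',', @greedy), "\n";  # a,c,f
print join(',', @lazy), "\n";    # a,b,c,d,e,f
```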
    Bill, Oct 2, 2003
    #4
  5. Jeff 'japhy' Pinyan wrote:
    >
    > On Wed, 1 Oct 2003, Michael P. Broida wrote:
    >
    > > Now why isn't the () part of that getting the FIRST of those
    > > characters?

    >
    > It does... but then the + modifier causes $1 to be repopulated with the
    > NEXT character \w matches, and so on.


    (I e-mailed a different response, then thought about it more.)

    Hmm, that explains it pretty well. I guess my only remaining
    question would be: why does it actually "repopulate"??

    It seems as though, once it matches that single character, it
    would/should save it in $1 as the () directs, and the NEXT
    matched character would go into $2 instead of being thrown
    away, and the next in $3, etc. I mean, the + seems to be
    telling it to repeat the entire (\w) operation, and THAT
    is saving characters.

    Is there an operator precedence kinda thing going on?? Maybe
    the + has to "FINISH" before the () can save a value?? That
    would make it completely understandable to me. <grin>

    Thanks for the answers!
    Mike
    Michael P. Broida, Oct 3, 2003
    #5
  6. David Oswald wrote:
    >
    > > > So, 'a bb ccc dddd' =~ /(\w)+/g; returns for each substring of
    > > > consecutive word characters the last one, resulting in 'a', 'b', 'c' and 'd'.
    > >
    > > That tests out as you said, so it's MY thinking that's off. :)
    > > Hopefully, you can clue me in. :)
    > >
    > > I expected it to result in "a,bb,ccc,dddd". Now I realize that
    > > it's the positioning of the + that causes it to get a single
    > > character from each group. If the + is inside the (), it
    > > prints what I expected.
    > >
    > > But... What is causing the original /(\w)+/ to get the LAST
    > > character from each group instead of the FIRST character from
    > > each group?

    >
    > Because, walking through your string of "a bb ccc dddd" look at what your
    > regexp is doing:
    > Pass 1, step 1: Find and capture 'a'. Return 'a'.
    > Pass 2, step 1: Find and capture the first 'b'.
    > Pass 2, step 2: Find the 2nd 'b', and replace the first 'b' with the second one.
    > Return the 2nd 'b'.
    > Pass 3, step 1: Find the first 'c' and capture it.
    > Pass 3, step 2: Find the second 'c' and put it where the first 'c' had been captured.
    > Pass 3, step 3: Find the third 'c' and put it where the 2nd 'c' had been
    > captured. Return the 3rd 'c'.
    > Pass 4: ... you should get the idea by now.
    >
    > Think of the capturing parens as your pocket, and it only has room for one
    > thing. The regexp puts the first thing it matches into the pocket. When it
    > finds (due to the quantifier) that it matches the 2nd thing, take the first
    > one out and put the 2nd one in. And so on.
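The pocket analogy can be imitated by hand. The sketch below is an emulation, not how the engine is implemented: it steps through one group a character at a time with \G and records what $1 holds after each step, then compares the final value with what /(\w)+/ leaves behind. The string 'def' and the variable names are chosen just for the example.

```perl
use strict;
use warnings;

my $group = 'def';

# What the real pattern leaves in the capture after one match over the group.
my ($pocket) = $group =~ /(\w)+/;

# Emulate the quantifier by hand: \G continues where the previous
# scalar-context //g match stopped, one character per iteration.
my @history;
while ($group =~ /\G(\w)/g) {
    push @history, $1;   # the pocket is emptied and refilled each time
}

print "pocket held in turn: @history\n";  # d e f
print "pocket at the end:   $pocket\n";   # f
```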


    See my answer to Jeff 'japhy' Pinyan.

    Your explanation makes sense, especially since the results are
    just what (all of) you are espousing. <grin>

    But I guess the part about "replacing" the value doesn't sit well
    with me. I don't see any operator telling it to "replace" things.

    It looks to me as though the (\w) part should save EACH char
    that is matched into a separate $n variable. The + tells the
    matching part to continue, but why doesn't the next pass through
    (\w) save a NEW character in a NEW $n variable ($1,$2,etc)??

    As I said in the other response: if the + operation must FINISH
    before the () can save anything (one char), that would make it
    all understandable to me. Operator precedence would cover that.

    I'm not trying to argue here. :) It undeniably works as you've
    said it does: test results bear that out. But I'm trying to
    understand WHY it works that way and not another way that seems
    to make as much sense to me.

    Mike
    Michael P. Broida, Oct 3, 2003
    #6
  7. [posted & mailed]

    On Fri, 3 Oct 2003, Michael P. Broida wrote:

    >> It does... but then the + modifier causes $1 to be repopulated with the
    >> NEXT character \w matches, and so on.

    >
    > It seems as though, once it matches that single character, it
    > would/should save it in $1 as the () directs, and the NEXT
    > matched character would go into $2 instead of being thrown
    > away, and the next in $3, etc. I mean, the + seems to be
    > telling it to repeat the entire (\w) operation, and THAT
    > is saving characters.


    But you're ignoring how a regex is compiled. Watch:

    perl -mre=debug -e 'qr/(a+)/'
    ...
    Compiling REx `(a+)'
    ...
    1: OPEN1(3)
    3: PLUS(6)
    4: EXACT <a>(0)
    6: CLOSE1(8)
    8: END(0)

    versus:

    perl -mre=debug -e 'qr/(a)+/'
    ...
    Compiling REx `(a)+'
    ...
    1: CURLYN[1] {1,32767}(11)
    3: NOTHING(5)
    5: EXACT <a>(0)
    9: WHILEM(0)
    10: NOTHING(11)
    11: END(0)

    A regex is compiled into an array of instructions, opcodes. Some opcodes
    have additional data stored with them, such as the OPEN and CLOSE opcodes,
    which have a number stored telling them WHICH $<DIGIT> variable to store
    the matched content to. You can't change that. Each pair of capturing
    parentheses refers to a SPECIFIC, SINGLE $<DIGIT>.
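A small sketch of that last point: the number of capture variables is fixed by the number of parenthesis pairs written in the pattern, not by how many times a group repeats. The input 'abcd' and the variable names are illustrative only.

```perl
use strict;
use warnings;

# One pair of parens: exactly one capture, however often the group repeats.
my @one = 'abcd' =~ /(\w)+/;

# Three pairs of parens: three captures, $1 through $3.
my @three = 'abcd' =~ /(\w)(\w)(\w)/;

# To keep every character, let //g repeat the whole match instead.
my @each = 'abcd' =~ /(\w)/g;

print scalar(@one), ": @one\n";      # 1: d
print scalar(@three), ": @three\n";  # 3: a b c
print scalar(@each), ": @each\n";    # 4: a b c d
```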

    Jeff 'japhy' Pinyan, Oct 3, 2003
    #7
  8. Abigail wrote:
    >
    > Would you expect:
    >
    > $x = $_ for qw /a b c d/
    > print $x;
    >
    > to print 'a' as well?


    It doesn't print anything without a semi-colon on the first line.
    <grin>

    At first glance, I thought it would print each letter. Then I
    looked deeper and realized it's basically assigning and re-assigning
    $x (via $_) during the "for" loop, but only printing it when it's all
    done. Thus it only prints "d".

    But the prior discussion was about a regex, not a "for" loop.
    If your point is that the regex processing works similarly to
    the "for" loop in your example, then I see what you mean.

    If that's NOT what your point was, then you've lost me. <grin>

    Mike
    Michael P. Broida, Oct 6, 2003
    #8
  9. Abigail wrote:
    >
    > My point is, if you repeatedly assign something to a variable, do you
    > expect the variable to retain the first value it was set to, or the
    > last value? Because that's happening in both the match, and the for loop.
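For comparison, a runnable version of the loop (with the semicolon added) next to the match it is being compared with. This is only a sketch of the analogy; the variable $y and the input 'abcd' are introduced for the example.

```perl
use strict;
use warnings;

# Repeated assignment in a loop: only the last value survives.
my $x;
$x = $_ for (qw/a b c d/);

# Repeated capture under a quantifier: same outcome.
my ($y) = 'abcd' =~ /(\w)+/;

print "$x\n";  # d
print "$y\n";  # d
```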


    Ah. No, I wouldn't expect that. But then, I didn't know
    that the *regex* was repeatedly assigning to the variable
    WITHIN the (\w)+ portion. I -DID- expect it to assign a
    new result for each letter group (a, bb, ccc, and dddd)
    due to the //g. I did NOT know it was reassigning for
    the \w within the () for each letter in a single group.

    But now I do know that, thanks to the discussion here. :)

    Thanks everyone!

    Mike
    Michael P. Broida, Oct 7, 2003
    #9
  10. Anno Siegel

    Anno Siegel Guest

    Michael P. Broida <michael.p.broida@boeing_oops.com> wrote in comp.lang.perl.misc:
    > Jeff 'japhy' Pinyan wrote:
    > >
    > > On Wed, 1 Oct 2003, Michael P. Broida wrote:
    > >
    > > > Now why isn't the () part of that getting the FIRST of those
    > > > characters?

    > >
    > > It does... but then the + modifier causes $1 to be repopulated with the
    > > NEXT character \w matches, and so on.

    >
    > (I e-mailed a different response, then thought about it more.)
    >
    > Hmm, that explains it pretty well. I guess my only remaining
    > question would be: why does it actually "repopulate"??
    >
    > It seems as though, once it matches that single character, it
    > would/should save it in $1 as the () directs, and the NEXT
    > matched character would go into $2 instead of being thrown
    > away, and the next in $3, etc. I mean, the + seems to be
    > telling it to repeat the entire (\w) operation, and THAT
    > is saving characters.


    Yes, but it only has *one* $n variable to save to, determined by the number
    of the opening parenthesis of the capturing pair. It isn't free to use
    more $n variables for additional matches, because those may be occupied
    by other capturing pairs.

    So there's hardly a choice but to overwrite what's already there.
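A sketch of why the numbering has to stay fixed: with two quantified groups in one pattern, the second group must always be $2, no matter how many times the first one repeats. The pattern and the string 'abc-123' are made up for the example.

```perl
use strict;
use warnings;

my $str = 'abc-123';

# Two pairs of parens, so exactly two captures, numbered by their
# opening parenthesis. Each one ends up holding its last repetition.
my @caps = $str =~ /(\w)+-(\d)+/;

print scalar(@caps), " captures: @caps\n";  # 2 captures: c 3
```

If the first group spilled into $2 and $3 on later repetitions, the second group would have nowhere predictable to go.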

    Anno
    Anno Siegel, Oct 8, 2003
    #10
