regex behavior

Matija Papec

I went through perldoc but didn't find a similar regex:
print join ',', 'a bb ccc dddd' =~ /(\w)+/g;

The question is: what exactly does it match, and why?
 
Michael P. Broida

Abigail said:
Matija Papec wrote on MMMDCLXXXIII September MCMXCIII:
-- I went through perldoc but didn't find a similar regex,
-- print join ',', 'a bb ccc dddd' =~ /(\w)+/g;
--
-- the question is, what exactly does it match, and why?

/(\w)+/ matches a set of consecutive word characters, capturing
the *last* one. //g in list context means, do this as often as
possible (without overlap), returning a list of each of the submatches.

So, 'a bb ccc dddd' =~ /(\w)+/g; returns for each substring of
consecutive word characters the last one, resulting in 'a', 'b', 'c' and 'd'.
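
To make the difference concrete, here is a minimal sketch that runs both placements of the + against the string from the question. The variable names (@last_chars, @whole_runs) are only illustrative.

```perl
use strict;
use warnings;

my $text = 'a bb ccc dddd';

# Quantifier outside the group: the group is re-entered for every
# character, so only the last character of each run is kept.
my @last_chars = $text =~ /(\w)+/g;

# Quantifier inside the group: the group spans the whole run.
my @whole_runs = $text =~ /(\w+)/g;

print join(',', @last_chars), "\n";   # a,b,c,d
print join(',', @whole_runs), "\n";   # a,bb,ccc,dddd
```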

That tests out as you said, so it's MY thinking that's off. :)
Hopefully, you can clue me in. :)

I expected it to result in "a,bb,ccc,dddd". Now I realize that
it's the positioning of the + that causes it to get a single
character from each group. If the + is inside the (), it
prints what I expected.

But... What is causing the original /(\w)+/ to get the LAST
character from each group instead of the FIRST character from
each group?

I changed the input string to 'a bc def ghij' and it printed
"a,c,f,j" as you noted. But I don't see why it's the LAST
character per group. At this point, I now expect "a,b,d,g".

Ignoring the () to populate the result list, the \w+ matches a
string of one or more characters. On the second match, it will
grab "bc".

Now why isn't the () part of that getting the FIRST of those
characters?

And what regex would you use to get the FIRST char of each group
since this one doesn't?

Mike
 
Jeff 'japhy' Pinyan

[posted & mailed]

But... What is causing the original /(\w)+/ to get the LAST
character from each group instead of the FIRST character from
each group?

The location of the + quantifier.

Ignoring the () to populate the result list, the \w+ matches a
string of one or more characters. On the second match, it will
grab "bc".

DON'T ignore the (), they're important here. (\w+) is seen by the regex
as something like this:

OPEN $1
PLUS
ALNUM
CLOSE $1

whereas (\w)+ is seen as

PLUS
OPEN $1
ALNUM
CLOSE $1

Now why isn't the () part of that getting the FIRST of those
characters?

It does... but then the + quantifier causes $1 to be repopulated with the
NEXT character \w matches, and so on.

And what regex would you use to get the FIRST char of each group
since this one doesn't?

I'd use /(\w)\w*/g, or perhaps /\b\w/g (if there are no parens in a /.../g
regex, you get whatever the regex matches returned).
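
Both suggestions can be checked with a short sketch like the following, using Mike's test string. The array names are only illustrative.

```perl
use strict;
use warnings;

my $text = 'a bc def ghij';

# Capture the first word character, then consume the rest of the run
# outside the group so it cannot overwrite the capture.
my @first_by_capture = $text =~ /(\w)\w*/g;

# No capturing parens: /g in list context returns each whole match,
# here a single word character that follows a word boundary.
my @first_by_boundary = $text =~ /\b\w/g;

print join(',', @first_by_capture), "\n";    # a,b,d,g
print join(',', @first_by_boundary), "\n";   # a,b,d,g
```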
 
Bill

Ignoring the () to populate the result list, the \w+ matches a
string of one or more characters. On the second match, it will
grab "bc".

Now why isn't the () part of that getting the FIRST of those
characters?

And what regex would you use to get the FIRST char of each group
since this one doesn't?

Mike


from `perldoc perlre` :

By default, a quantified subpattern is "greedy", that is, it will
match as many times as possible (given a particular starting location)
while still allowing the rest of the pattern to match. If you want it
to match the minimum number of times possible, follow the quantifier
with a "?".
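
As a rough illustration of that paragraph applied to this thread's regex: greediness explains why each match swallows a whole run of word characters, but making the quantifier non-greedy does not yield the first character of each run. With nothing after the group, +? stops after one repetition, so every character becomes its own match. A small sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $text = 'a bc def ghij';

# Greedy: each match consumes a whole run; the last character is kept.
my @greedy = $text =~ /(\w)+/g;

# Non-greedy: each match is satisfied by a single repetition,
# so every word character is matched (and captured) on its own.
my @lazy = $text =~ /(\w)+?/g;

print join(',', @greedy), "\n";   # a,c,f,j
print join(',', @lazy), "\n";     # a,b,c,d,e,f,g,h,i,j
```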
 
Michael P. Broida

Jeff said:
It does... but then the + quantifier causes $1 to be repopulated with the
NEXT character \w matches, and so on.

(I e-mailed a different response, then thought about it more.)

Hmm, that explains it pretty well. I guess my only remaining
question would be: why does it actually "repopulate"??

It seems as though, once it matches that single character, it
would/should save it in $1 as the () directs, and the NEXT
matched character would go into $2 instead of being thrown
away, and the next in $3, etc. I mean, the + seems to be
telling it to repeat the entire (\w) operation, and THAT
is saving characters.

Is there an operator precedence kinda thing going on?? Maybe
the + has to "FINISH" before the () can save a value?? That
would make it completely understandable to me. <grin>

Thanks for the answers!
Mike
 
Michael P. Broida

David said:
Because, walking through your string of "a bb ccc dddd" look at what your
regexp is doing:
Pass 1, step 1: Find and capture "a". Return "a".
Pass 2, step 1: Find and capture the first 'b'.
Pass 2, step 2: Find the 2nd 'b', and replace the first 'b' with it. Return the 2nd 'b'.
Pass 3, step 1: Find the first 'c' and capture it.
Pass 3, step 2: Find the second 'c' and put it where the first 'c' had been captured.
Pass 3, step 3: Find the third 'c' and put it where the 2nd 'c' had been captured. Return the 3rd 'c'.
Pass 4: ... you should get the idea by now.

Think of the capturing parens as your pocket, and it only has room for one
thing. The regexp puts the first thing it matches into the pocket. When it
finds (due to the quantifier) that it matches the 2nd thing, take the first
one out and put the 2nd one in. And so on.
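
One way to watch the pocket being refilled is to embed a code block in the pattern. This is only a sketch: it relies on Perl's (?{ ... }) construct and the $^N variable (the text of the most recently closed group), and the name @pocket_history is made up for the illustration. The non-capturing (?: ... ) wrapper is there just so the code block runs once per repetition.

```perl
use strict;
use warnings;

# Package variable so the embedded code block can see it on any perl.
our @pocket_history;

# After each repetition of (\w), record what the group just captured.
'def' =~ /(?:(\w)(?{ push @pocket_history, $^N }))+/;

print join(' -> ', @pocket_history), "\n";   # d -> e -> f
print "final \$1: $1\n";                     # final $1: f
```

Every repetition writes into the same group, and only the last write is still there when the match finishes.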

See my answer to Jeff 'japhy' Pinyan.

Your explanation makes sense, especially since the results are
just what (all of) you are espousing. <grin>

But I guess the part about "replacing" the value doesn't sit well
with me. I don't see any operator telling it to "replace" things.

It looks to me as though the (\w) part should save EACH char
that is matched into a separate $n variable. The + tells the
matching part to continue, but why doesn't the next pass through
(\w) save a NEW character in a NEW $n variable ($1,$2,etc)??

As I said in the other response: if the + operation must FINISH
before the () can save anything (one char), that would make it
all understandable to me. Operator precedence would cover that.

I'm not trying to argue here. :) It undeniably works as you've
said it does: test results bear that out. But I'm trying to
understand WHY it works that way and not another way that seems
to make as much sense to me.

Mike
 
Jeff 'japhy' Pinyan

[posted & mailed]

It seems as though, once it matches that single character, it
would/should save it in $1 as the () directs, and the NEXT
matched character would go into $2 instead of being thrown
away, and the next in $3, etc. I mean, the + seems to be
telling it to repeat the entire (\w) operation, and THAT
is saving characters.

But you're ignoring how a regex is compiled. Watch:

perl -mre=debug -e 'qr/(a+)/'
...
Compiling REx `(a+)'
...
1: OPEN1(3)
3: PLUS(6)
4: EXACT <a>(0)
6: CLOSE1(8)
8: END(0)

versus:

perl -mre=debug -e 'qr/(a)+/'
...
Compiling REx `(a)+'
...
1: CURLYN[1] {1,32767}(11)
3: NOTHING(5)
5: EXACT <a>(0)
9: WHILEM(0)
10: NOTHING(11)
11: END(0)

A regex is compiled into an array of instructions, opcodes. Some opcodes
have additional data stored with them, such as the OPEN and CLOSE opcodes,
which have a number stored telling them WHICH $<DIGIT> variable to store
the matched content to. You can't change that. Each pair of capturing
parentheses refers to a SPECIFIC, SINGLE $<DIGIT>.
 
Michael P. Broida

Abigail said:

Would you expect:

$x = $_ for qw /a b c d/
print $x;

to print 'a' as well?

It doesn't print anything without a semi-colon on the first line.
<grin>

At first glance, I thought it would print each letter. Then I
looked deeper and realized it's basically assigning and re-assigning
$x (via $_) during the "for" loop, but only printing it when it's all
done. Thus it only prints "d".

But the prior discussion was about a regex, not a "for" loop.
If your point is that the regex processing works similarly to
the "for" loop in your example, then I see what you mean.

If that's NOT what your point was, then you've lost me. <grin>

Mike
 
Michael P. Broida

Abigail said:

My point is, if you repeatedly assign something to a variable, do you
expect the variable to retain the first value it was set to, or the
last value? Because that's happening in both the match, and the for loop.
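
Put side by side, the analogy looks roughly like this (a sketch with the missing semicolon added; $last is an illustrative name):

```perl
use strict;
use warnings;

# Four assignments to the same variable: only the last one survives.
my $x;
$x = $_ for qw/a b c d/;
print "$x\n";      # d

# Four captures into the same group: only the last one survives.
my ($last) = 'abcd' =~ /(\w)+/;
print "$last\n";   # d
```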

Ah. No, I wouldn't expect that. But then, I didn't know
that the *regex* was repeatedly assigning to the variable
WITHIN the (\w)+ portion. I -DID- expect it to assign a
new result for each letter group (a, bb, ccc, and dddd)
due to the //g. I did NOT know it was reassigning for
the \w within the () for each letter in a single group.

But now I do know that, thanks to the discussion here. :)

Thanks everyone!

Mike
 
Anno Siegel

Michael P. Broida said:
(I e-mailed a different response, then thought about it more.)

Hmm, that explains it pretty well. I guess my only remaining
question would be: why does it actually "repopulate"??

It seems as though, once it matches that single character, it
would/should save it in $1 as the () directs, and the NEXT
matched character would go into $2 instead of being thrown
away, and the next in $3, etc. I mean, the + seems to be
telling it to repeat the entire (\w) operation, and THAT
is saving characters.

Yes, but it only has *one* $n variable to save to, determined by the number
of the opening parenthesis of the capturing pair. It isn't free to use
more $n variables for additional matches, because those may be occupied
by other capturing pairs.

So there's hardly a choice but to overwrite what's already there.
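
A small sketch of why the numbering has to stay fixed: with two capturing pairs, $2 belongs to the second pair no matter how often the first one repeats. If every character is wanted, one option is to let /g do the repeating instead of the + quantifier. The input strings and variable names here are made up for the illustration.

```perl
use strict;
use warnings;

# Two groups, two variables: each keeps only its own last repetition.
my ($word_char, $digit) = 'abc-123' =~ /(\w)+-(\d)+/;
print "\$1 = $word_char, \$2 = $digit\n";   # $1 = c, $2 = 3

# To keep every character, match one at a time and let /g collect them.
my @chars = 'abc' =~ /(\w)/g;
print join(',', @chars), "\n";              # a,b,c
```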

Anno
 
