if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just 'C'

OwlHoot

To repeat the title, in case it is munged by Google Groups:

if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
'C'

I've been developing with perl for years; but even simple things in it
still sometimes throw up surprises.

The regexp /:(.*?)$/ is anchored on the right by $, then comes a
non-greedy match which, AIUI, is the "shortest string it can get away
with", preceded by a colon. So I would expect this to pick up just the
"C", as it does with /([^:]*)$/.

Am I assuming/doing something silly? It is Friday afternoon after all.


Cheers

John R Ramsden
 
Wolf Behrenhoff

> To repeat the title, in case it is munged by Google Groups:
>
> if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
> 'C'
>
> I've been developing with perl for years; but even simple things in it
> still sometimes throw up surprises.
>
> The regexp /:(.*?)$/ is anchored on the right by $, then comes a
> non-greedy match which, AIUI, is the "shortest string it can get away
> with", preceded by a colon. So I would expect this to pick up just the
> "C", as it does with /([^:]*)$/.

The regexp matches from the left to the right, even if there is an
anchor on the right side of the string.

Thus the : first tries to match the first : in your string, i.e. the one
between A and B. Then .*? tries to match any number of chars, starting
from zero because of the ?. But if zero chars are matched, the $ fails.
So the regexp tries to make the number of characters matched by the .*?
longer and longer, and finally the $ matches. The regexp does not need
to go back and select the next : in this case.

.*? means: take as few chars as possible _at this position_
It does not mean: do backtracking and try to find if it could match
fewer chars at some other place in the string
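
The steps described above can be traced by hand. The sketch below is not
how the Perl engine is implemented; it is a simplified loop (the names
$colon, $len and @tried are made up for illustration) that pins the
leftmost colon and then grows the candidate one character at a time until
the end of the string is reached, which is the order a lazy .*? followed
by $ would try things for this particular input.

```perl
use strict;
use warnings;

my $str   = 'A:B:C';
my $colon = index($str, ':');    # leftmost colon, offset 1
my @tried;                       # candidates for $1, in the order tried

# Emulate .*? followed by $: start with zero chars, extend only on failure.
for my $len (0 .. length($str) - $colon - 1) {
    my $candidate = substr($str, $colon + 1, $len);
    push @tried, $candidate;
    my $at_end = ($colon + 1 + $len == length($str));
    printf "tried '%s': %s\n", $candidate,
        $at_end ? '$ matches, done' : '$ fails, extend by one char';
    last if $at_end;
}

# The real engine settles on the same string as the last candidate tried.
my ($captured) = $str =~ /:(.*?)$/;
print "engine captured '$captured'\n";
```

The loop never revisits the choice of colon, because the first one it
picked eventually leads to a successful match.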

So if you add .* to the beginning, you will get the last : in your string.
/.*:(.*?)$/
In this case the .* would try to eat as many chars as possible, then
search for a :. So this would try the last : first.
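
A minimal check of that variant, using the same test string as the
original question (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $str = 'A:B:C';

# Greedy .* first swallows the whole string, then gives characters back
# until a colon can match, so the colon it settles on is the last one.
my ($last) = $str =~ /.*:(.*?)$/;
print "last field: '$last'\n";    # last field: 'C'
```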

Anyway, you could also use (split /:/, 'A:B:C')[-1] here.
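
A short sketch of the split approach, with one caveat worth checking
against your own data: without a limit argument, split drops trailing
empty fields, so input ending in a colon gives a different answer than
the regex versions would.

```perl
use strict;
use warnings;

my $last = (split /:/, 'A:B:C')[-1];
print "$last\n";    # C

# With no limit, the empty field after the final colon is discarded.
my $no_limit   = (split /:/, 'A:B:')[-1];        # 'B'
# A negative limit keeps trailing empty fields.
my $with_limit = (split /:/, 'A:B:', -1)[-1];    # ''
print "'$no_limit' vs '$with_limit'\n";
```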

Cheers, Wolf
 
sln

> To repeat the title, in case it is munged by Google Groups:
>
> if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
> 'C'
>
> I've been developing with perl for years; but even simple things in it
> still sometimes throw up surprises.
>
> The regexp /:(.*?)$/ is anchored on the right by $, then comes a
> non-greedy match which, AIUI, is the "shortest string it can get away
> with", preceded by a colon. So I would expect this to pick up just the
> "C", as it does with /([^:]*)$/.

It's not the shortest, it's the first to satisfy it.
It is anchored on the left and right. The regex is allowing
another ':' when it traverses the string from the left.
/:(.*)$/ has the same result without checking chars between the
first ':' and the end of string.

Notice that /:(.*?):/ does the same thing, it says get all between
the first ':' and the next ':'. However,
'A:B:C:D' =~ /:(.*):/
greedily grabs all between the first and last ':', but
'A:B:C:D' =~ /:(.*?):/
grabs only that between the first 2 ':'s.

Since there is only one end of line, it gets all between the first ':'
and end of line regardless of ?.
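
The four cases above can be put side by side. This is only a quick
demonstration on the 'A:B:C:D' string from this post; the variable names
are illustrative.

```perl
use strict;
use warnings;

my $str = 'A:B:C:D';

my ($greedy_colon) = $str =~ /:(.*):/;     # first ':' up to the last ':'
my ($lazy_colon)   = $str =~ /:(.*?):/;    # first ':' up to the next ':'
my ($greedy_end)   = $str =~ /:(.*)$/;     # first ':' up to end of string
my ($lazy_end)     = $str =~ /:(.*?)$/;    # same span: only one end exists

print "greedy to colon: $greedy_colon\n";  # B:C
print "lazy to colon:   $lazy_colon\n";    # B
print "greedy to end:   $greedy_end\n";    # B:C:D
print "lazy to end:     $lazy_end\n";      # B:C:D
```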

-sln
 
Keith Thompson

You should ask your question in the body of your message anyway.
Newsreaders vary in how they display subject lines.

> I've been developing with perl for years; but even simple things in
> it still sometimes throw up surprises.
>
> The regexp /:(.*?)$/ is anchored on the right by $, then comes a
> non-greedy match which, AIUI, is the "shortest string it can get away
> with", preceded by a colon. So I would expect this to pick up just
> the "C", as it does with /([^:]*)$/.
>
> The regexp matches from the left to the right, even if there is an
> anchor on the right side of the string.
> [more explanation snipped]
>
> Anyway, you could also use (split /:/, 'A:B:C')[-1] here.

Another possibility is
if ('A:B:C' =~ /:([^:]*)$/)
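
One difference between this pattern and the /([^:]*)$/ from the original
post shows up on input with no colon at all. A small comparison, with
'ABC' chosen here purely as an illustrative second input:

```perl
use strict;
use warnings;

my %result;
for my $str ('A:B:C', 'ABC') {
    # In list context a failed match returns an empty list, leaving undef.
    my ($with_colon)    = $str =~ /:([^:]*)$/;
    my ($without_colon) = $str =~ /([^:]*)$/;
    $result{$str} = [$with_colon, $without_colon];

    printf "%-5s  with leading colon: %-10s  without: %s\n",
        $str,
        defined $with_colon    ? "'$with_colon'"    : 'no match',
        defined $without_colon ? "'$without_colon'" : 'no match';
}
```

Requiring the colon makes the match fail on 'ABC', whereas the version
without it returns the whole string.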
 
Uri Guttman

O> The regexp /:(.*?)$/ is anchored on the right by $, then comes a
O> non-greedy match which, AIUI, is the "shortest string it can get away
O> with", preceded by a colon. So I would expect this to pick up just
O> the "C", as it does with /([^:]*)$/.

as others have said, you didn't get what ? does for quantifiers. perl
will match the leftmost working match. with a greedy quantifier, it will
continue to match chars until it fails and then stop. with the
non-greedy modifier ? it will stop after the first (and locally
shortest) match. it will not globally find the shortest possible match
anywhere in the string. so the key is remembering leftmost correct match
first and then short or greedy based on the modifier.
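
The "leftmost first" rule can be observed through the offsets Perl
records in @- and @+ after a successful match. A minimal sketch (the
@report array is just a convenience for this example):

```perl
use strict;
use warnings;

my $str = 'A:B:C';
my @report;

for my $re (qr/:(.*)$/, qr/:(.*?)$/) {
    next unless $str =~ $re;
    # Index 0 of @- is where the whole match starts; index 1 of @- and @+
    # give the start and end offsets of the first capture group.
    my ($match_start, $cap_start, $cap_end) = ($-[0], $-[1], $+[1]);
    push @report, [$match_start, $cap_start, $cap_end, $1];
    print "$re starts at $match_start, ",
          "capture spans $cap_start to $cap_end ('$1')\n";
}
```

Both the greedy and the non-greedy pattern report a match starting at
offset 1, the first colon, with the capture running from offset 2 to 5.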

uri
 
Xho Jingleheimerschmidt

OwlHoot said:
> To repeat the title, in case it is munged by Google Groups:
>
> if ('A:B:C' =~ /:(.*?)$/) then why the heck is $1 'B:C' and not just
> 'C'
>
> I've been developing with perl for years; but even simple things in it
> still sometimes throw up surprises.
>
> The regexp /:(.*?)$/ is anchored on the right by $, then

There is no "then". Being anchored at the end does not change the order
of evaluation (or at least, does not do so in a way that affects the
outcome--the optimized engine can do things in whatever order it wants,
as long as it behaves as if it were done left to right).

> comes a non-greedy

Really it is not non-greedy. It is still greedy, it is just greedy for
less, rather than greedy for more. It is still greedy because it
satisfies itself, without looking around at the "wants" of others.

> match which, AIUI, is the "shortest string it can get away with",
> preceded by a colon.

The colon is also greedy. It is greedy to match as far left as it can
get away with. And because it comes before the .*? does, its greed wins.

Xho
 
Peter J. Holzer

> Really it is not non-greedy. It is still greedy, it is just greedy for
> less, rather than greedy for more. It is still greedy because it
> satisfies itself, without looking around at the "wants" of others.
>
> The colon is also greedy. It is greedy to match as far left as it can
> get away with. And because it comes before the .*? does, its greed wins.

Please. "Greedy" in the context of regular expressions is a technical
term with a precisely defined meaning. You are not helping by inventing
a different meaning for the word based on its meaning in common English.

hp
 
Xho Jingleheimerschmidt

Peter said:
> Please. "Greedy" in the context of regular expressions is a technical
> term with a precisely defined meaning. You are not helping by inventing
> a different meaning for the word based on its meaning in common English.

Greedy is well defined in the field of computer science, and I am not
the one inventing new meanings for it.

Xho
 
