another strange regexp case

  • Thread starter Kristof Bastiaensen

Kristof Bastiaensen

Hi,

here is another regexp behaviour which surprises me.
There may be some logic behind it, but I fail to see it...

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

irb(main):005:0> / (theone)?/.match(" theone").to_a
=> [" theone", "theone"]

In the first case, it doesn't match "theone", but in
the second and third it does...

Could anyone explain this?

Kristof
 

ts

K> irb(main):004:0> /(theone)?/.match(" theone").to_a
K> => ["", nil]

When the regexp engine tries to match `t' it fails, because the first
character is ` ', but the regexp as a whole still succeeds because `theone'
is optional

K> irb(main):003:0> /(theone)?/.match("theone").to_a
K> => ["theone", "theone"]

it can match `theone' in its first try


Guy Decoux
 

Ara.T.Howard

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.
irb(main):003:0> /(theone)?/.match("theone").to_a
=> ["theone", "theone"]

same here.
irb(main):005:0> / (theone)?/.match(" theone").to_a
=> [" theone", "theone"]

same here. ;-)


remember regexp engines work (well, some of them) by starting at a position and
consuming chars while the pattern matches. iff the whole pattern was used we
have a positive match, otherwise not. so in all these cases we start like so

' theone'
^
^
^
ptr

and drive with the regexp asking "does the regexp match starting here? if so,
how many chars did it consume?" the chars consumed by each group are returned
in $1, $2, etc. in all the cases above this explains the matching.
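
to make the "does it match starting here?" question concrete, here is a small sketch that reports where each of the three matches starts and what it consumed. the helper name describe is made up for illustration; begin(0) and the [] accessors are standard MatchData methods.

```ruby
# Hypothetical helper: report where the match starts and what it consumed.
def describe(re, str)
  m = re.match(str)
  { start: m.begin(0), whole: m[0], group1: m[1] }
end

# All three matches start at offset 0; only the amount consumed differs.
first  = describe(/(theone)?/,  " theone")  # empty match, group is nil
second = describe(/(theone)?/,  "theone")   # consumes "theone"
third  = describe(/ (theone)?/, " theone")  # consumes " theone"

p first, second, third
```

in every case the engine succeeds at the very first position it tries, so it never moves on to look for "theone" further right.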

note that some regexp engines work in the reverse sense but the effect is
largely the same...
In the first case, it doesn't match "theone", but in the second and third it
does...

so it matched in all cases -- sometimes zero times, sometimes one time. this
is what you asked the regexp to do. i try to follow these rules when
composing regexps:

- always use anchors ^ and $
- never use anything that can match 'zero' things

it's the 'zero' thing that surprised you. your first two regexps match even
the empty string!

obviously this is not always possible but i will maintain this:

if you create a regexp without anchors and with portions that can match zero
things and have not done so out of absolute need - your code has a bug.
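
a rough sketch of the difference, assuming the goal was to accept exactly the string "theone" (the thread never says what the real goal was, so the strict pattern here is only an illustration). note that in ruby ^ and $ anchor at line boundaries, while \A and \z anchor at the start and end of the whole string, so the sketch uses the latter.

```ruby
loose  = /(theone)?/    # the pattern from the original post
strict = /\Atheone\z/   # hypothetical anchored variant with nothing optional

checks = {
  loose_on_empty:   loose.match?(""),          # true: zero occurrences is a match
  loose_on_junk:    loose.match?("anything"),  # true: empty match at offset 0
  strict_on_exact:  strict.match?("theone"),   # true
  strict_on_padded: strict.match?(" theone"),  # false: leading space rejected
  strict_on_empty:  strict.match?("")          # false
}

checks.each { |name, result| puts "#{name}: #{result}" }
```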

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| A flower falls, even though we love it;
| and a weed grows, even though we do not love it.
| --Dogen
===============================================================================
 

Florian Gross

Kristof said:
Hi,
Moin!

here is another regexp behaviour which surprises me.
There may be some logic behind it, but I fail to see it...
irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

I think that this is about how greediness in Regexps works:

A Regexp will try to match as much as possible starting at the current
position, but even a "bad" match at the current position will be better
than a "good" match at a later position in the String.

Maybe it would be possible to do a version of .match that finds the
"best" (== longest, if greedy) match in the whole string. I assume that
it would be based on .scan in some kind of way.
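
A minimal sketch of that idea, assuming "best" simply means the longest overall match. longest_match is a made-up helper name, not something Ruby provides; it collects the MatchData for every match that scan finds and keeps the longest one.

```ruby
# Hypothetical helper, not part of Ruby's Regexp or String API.
# Returns the MatchData of the longest match in str, or nil if none.
def longest_match(re, str)
  matches = []
  # With a block, scan sets the last-match data for each match in turn.
  str.scan(re) { matches << Regexp.last_match }
  matches.max_by { |m| m[0].length }
end

m = longest_match(/(theone)?/, " theone")
p m.to_a      # the "theone" match rather than the empty one
p m.begin(0)  # it starts at offset 1, after the space
```

This walks the whole string even when the first match would do, so it is slower than plain .match.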

Regards,
Florian Gross
 

Kristof Bastiaensen

On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.
if you create a regexp without anchors and with portions that can match zero
things and have not done so out of absolute need - your code has a bug.

Thanks for the answer. I expected the pattern to expand greedily,
but I forgot it will return the first match, which is the empty
match. You are right, /(theone)?/ is a silly thing to write,
finally I just needed another regexp for my problem.

Thanks,
Kristof
 

Robert Klemme

Kristof Bastiaensen said:
On Tue, 29 Jun 2004 11:05:43 -0600, Ara.T.Howard wrote:

irb(main):004:0> /(theone)?/.match(" theone").to_a
=> ["", nil]

? means 'zero or one'

we start at the beginning of ' theone' and instantly find a match: zero of
them.
if you create a regexp without anchors and with portions that can match zero
things and have not done so out of absolute need - your code has a
bug.

Thanks for the answer. I expected the pattern to expand greedily,
but I forgot it will return the first match, which is the empty
match. You are right, /(theone)?/ is a silly thing to write,
finally I just needed another regexp for my problem.

This is a case of the simple general rule "Watch out for regular
expressions that match the empty string". All sorts of problems can arise
when using them and usually you don't want to match an empty string
anyway.
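
A short sketch of the kind of trouble meant here, plus a quick check for the condition. matches_empty? is a hypothetical helper name; scan, gsub and match? are standard Ruby methods.

```ruby
# A pattern that can match nothing also matches *between* characters.
pieces   = " theone".scan(/(?:theone)?/)  # empty matches around the real one
replaced = "abc".gsub(/x*/, "-")          # => "-a-b-c-"

# Hypothetical helper: does this pattern accept the empty string?
def matches_empty?(re)
  re.match?("")
end

p pieces
p replaced
p matches_empty?(/(theone)?/)   # true
p matches_empty?(/ (theone)?/)  # false: the literal space is required
```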

Kind regards

robert
 
