Please explain these regexps: funny lookahead behaviour

Jeremy Henty

I'm stumped: why do the regular expressions Z?, (?=Z)? and ((?=Z))?
*not* match the same things?

string = 'a'

[ %r{Z?},
  %r{(?=Z)?},
  %r{((?=Z))?},
].each do |re|
  puts re.match(string) ? "yes" : "no"
end

Result (ruby 1.8.4 on Linux):

yes
no [!!!]
yes

I thought that all three should look for a 'Z', but successfully match
nothing if there is no 'Z'. Why does adding lookahead in the second
make the match fail? Surely lookahead only changes whether the match
consumes anything? It shouldn't change whether there is a match or
not. And why does adding parentheses in the third make the match
succeed again?
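
To make that expectation concrete, here is a minimal sketch that spells the optional element out as an alternation with an empty branch, relying on the usual reading of X? as "X or nothing". The variable names are only illustrative, and the second pattern is a stand-in for what (?=Z)? is expected to mean, not the literal pattern from the question.

```ruby
string = 'a'

# X? is conventionally "X or nothing", so it can be written as (?:X|).
optional_z         = /(?:Z|)/        # stand-in for Z?
optional_lookahead = /(?:(?=Z)|)/    # stand-in for the expected meaning of (?=Z)?

[optional_z, optional_lookahead].each do |re|
  m = re.match(string)
  # Both are expected to succeed by matching the empty string at offset 0.
  puts m ? "yes (matched #{m[0].inspect} at offset #{m.begin(0)})" : "no"
end
```

Written this way, the empty branch is always available, so the presence or absence of 'Z' only changes what is matched, never whether there is a match.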

Regards,

Jeremy Henty
 
Jeremy Henty

J> %r{(?=Z)?},
[...]
J> no [!!!]

> probably a bug in the regexp engine for 1.8.4

For comparison, the equivalent Perl code reports all yeses as I had
expected. Any tips on debugging Ruby regexps? And any chance of a
fix going into Ruby 1.8.x ?

Interestingly, Ruby+Oniguruma rejects the second regexp - "target of
repeat operator is invalid: /(?=Z)?/". For the other two it agrees
with vanilla Ruby and Perl.
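
A hedged sketch for repeating this comparison on whatever Ruby is at hand: it builds each pattern from a string so that an engine that rejects one of them raises a catchable RegexpError instead of aborting the whole script. The helper name try_pattern is made up for this example, and what the second pattern reports depends on the regexp engine in use.

```ruby
# Hypothetical helper: compile and match, reporting rejection separately.
def try_pattern(source, string)
  re = Regexp.new(source)          # raises RegexpError if the engine rejects the pattern
  re.match(string) ? :yes : :no
rescue RegexpError
  :rejected
end

string = 'a'

['Z?', '(?=Z)?', '((?=Z))?'].each do |source|
  puts "#{source}: #{try_pattern(source, string)}"
end
```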

Regards,

Jeremy Henty
 
ts

J> For comparison, the equivalent Perl code reports all yeses as I had
J> expected. Any tips on debugging Ruby regexps? And any chance of a
J> fix going into Ruby 1.8.x ?

Well, you have found the problem

J> Interestingly, Ruby+Oniguruma rejects the second regexp
^^^^^^^^^

Oniguruma will be the next regexp engine for Ruby; I'm not sure whether
the old regexp engine will be maintained.

In your case the problem is simple

svg% ruby -rjj -e '/(?=Z)?/.dump'
Regexp /(?=Z)?/
0 start_nowidth 2
1 exactn "Z" (1)
2 stop_nowidth
3 on_failure_jump ==> 4
4 end
must : Z
svg%

See `must : Z', which means that the string *must* contain Z for a match
(it's a wrong optimization); this is why it doesn't match in your case.

It has too many optimizations, in this case :)
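
As a rough illustration of how such a "must" optimization can go wrong, here is a toy model in Ruby. It is not the code from regex.c: CompiledPattern and prefiltered_match are hypothetical names, and /Z?/ stands in for the optional lookahead so that the sketch runs on any engine.

```ruby
# Toy model: a compiled pattern plus an optional literal the subject must contain.
CompiledPattern = Struct.new(:regexp, :must)

def prefiltered_match(pattern, string)
  # Fast rejection: skip the real matcher when the required literal is absent.
  return nil if pattern.must && !string.include?(pattern.must)
  pattern.regexp.match(string)
end

correct = CompiledPattern.new(/Z?/, nil)   # optional Z: nothing is required
wrong   = CompiledPattern.new(/Z?/, "Z")   # same pattern, wrongly marked as requiring "Z"

puts prefiltered_match(correct, "a") ? "yes" : "no"
puts prefiltered_match(wrong, "a") ? "yes" : "no"
```

In this model the second call answers "no" before the matcher ever runs, which is the same shape of failure as a pattern whose optional part is recorded as mandatory.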
 
Jeremy Henty

J> ... any chance of a fix going into Ruby 1.8.x ?

> Well, you have found the problem

Well, yay me! Now how do I find a fix? I'm looking at
re_compile_pattern() in regex.c and I'm a little intimidated. Where
do I start?

> svg% ruby -rjj -e '/(?=Z)?/.dump'
> [...]
> must : Z
> [...]
> See `must : Z', which means that the string *must* contain Z for a
> match (it's a wrong optimization); this is why it doesn't match in
> your case.

That's clear. What do the other regexps compile to? And what's this
-rjj ? I've Googled and searched mailing lists and all I can find is
other people looking for it too (and some posts from someone with the
handle "JJ"). Is it a secret only the elite may share? :)

Thanks for your help,

Jeremy Henty
 
ts

J> That's clear. What do the other regexps compile to? And what's this

svg% ruby -rjj -e '/Z?/.dump'
Regexp /Z?/
0 on_failure_jump ==> 2
1 exactn "Z" (1)
2 end
svg%

svg% ruby -rjj -e '/((?=Z))?/.dump'
Regexp /((?=Z))?/
0 on_failure_jump ==> 6
1 start_memory $1
2 start_nowidth 4
3 exactn "Z" (1)
4 stop_nowidth
5 stop_memory $1
6 end
subexpressions : 1
svg%

J> -rjj ?

jj doesn't want to leave moulon :)
 
ts

J> re_compile_pattern() in regex.c and I'm a little intimidated.

Can you test this patch ?

svg% diff -u regex.c.~1.96.2.8.~ regex.c
--- regex.c.~1.96.2.8.~ 2006-04-24 17:15:21.000000000 +0200
+++ regex.c 2006-06-04 13:31:23.000000000 +0200
@@ -1963,7 +1963,7 @@
         stackp--;
         fixup_alt_jump = *stackp ? *stackp + bufp->buffer - 1 : 0;
         laststart = *--stackp + bufp->buffer;
-        if (c == '!' || c == '=') laststart = b;
+        if (c == '!' /* || c == '=' */) laststart = b;
         break;
 
       case '|':
svg%

svg% ruby -rjj -e '/(?=Z)?/.dump'
Regexp /(?=Z)?/
0 on_failure_jump ==> 4
1 start_nowidth 3
2 exactn "Z" (1)
3 stop_nowidth
4 end
svg%
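
The two dumps of /(?=Z)?/ also differ in where on_failure_jump sits: after stop_nowidth before the patch, at instruction 0 after it. The sketch below is a toy backtracking VM, not the real engine. It borrows the opcode names from the dumps, leaves out start_nowidth and stop_nowidth entirely, and only tries a match at offset 0. Under those assumptions it shows why a fallback has to be pushed before the step that may fail.

```ruby
# Toy VM: returns the end offset of a match attempted at offset 0, or nil.
def run(program, string)
  pc = 0        # index of the current instruction
  pos = 0       # current position in the string
  saved = []    # stack of [pc, pos] fallbacks pushed by :on_failure_jump

  loop do
    op, arg = program[pc]
    case op
    when :on_failure_jump
      saved.push([arg, pos])         # on a later failure, resume at instruction `arg`
      pc += 1
    when :exactn
      if string[pos, arg.length] == arg
        pos += arg.length
        pc += 1
      else
        return nil if saved.empty?   # no fallback left: overall failure
        pc, pos = saved.pop
      end
    when :end
      return pos                     # success
    end
  end
end

guarded   = [[:on_failure_jump, 2], [:exactn, "Z"], [:end]]   # jump first, like /Z?/
unguarded = [[:exactn, "Z"], [:on_failure_jump, 2], [:end]]   # jump emitted too late

puts run(guarded, "a") ? "yes" : "no"
puts run(unguarded, "a") ? "yes" : "no"
```

In the unguarded layout the literal is tested before any fallback exists, so the "optional" part behaves as if it were mandatory.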
 
Jeremy Henty

> Can you test this patch ?
> [...]

It works!

$ regexp_wibble
yes
no
yes
[patch and reinstall]
$ regexp_wibble
yes
yes
yes

>>> regexp_wibble
#!/usr/bin/env ruby

$VERBOSE = true

string = 'a'

[ %r{Z?},
  %r{(?=Z)?},       # does not match!!!
  %r{((?=Z))?},
].each do |re|
  puts re.match(string) ? "yes" : "no"
end

<<< regexp_wibble

And also

$ make check
1356 tests, 15409 assertions, 0 failures, 0 errors

...so nothing appears to have broken.

Thanks,

Jeremy Henty
 
