Please explain these regexps: funny lookahead behaviour

Jeremy Henty · Jun 3, 2006

I'm stumped: why do the regular expressions Z? , (?=Z)? and ((?=Z))?
*not* match the same things?

string = 'a'

[ %r{Z?},
  %r{(?=Z)?},
  %r{((?=Z))?},
].each do |re|
  puts re.match(string) ? "yes" : "no"
end

Result (ruby 1.8.4 on Linux):

yes
no [!!!]
yes

I thought that all three should look for a 'Z', but successfully match
nothing if there is no 'Z'. Why does adding lookahead in the second
make the match fail? Surely lookahead only changes whether the match
consumes anything? It shouldn't change whether there is a match or
not. And why does adding parentheses in the third make the match
succeed again?

Regards,

Jeremy Henty
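
For readers less familiar with lookahead, here is a minimal sketch of the premise in the question: a lookahead asserts that something follows without consuming it. The sample strings are illustrative only and are not taken from the thread.

```ruby
# Compare a consuming match with a zero-width lookahead on the same input.
consuming = /aZ/.match('aZ')
lookahead = /a(?=Z)/.match('aZ')

puts consuming[0]          # prints "aZ": the Z is part of the match
puts lookahead[0]          # prints "a": the Z was checked but not consumed
puts lookahead.post_match  # prints "Z": it is still there after the match

# The assertion still has to hold for the match to succeed.
puts /a(?=Z)/.match('ab').inspect  # prints "nil"
```

So a mandatory lookahead can make a match fail, but wrapping it in `?` should make it optional again, which is the behaviour the question expects.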

ts · Jun 3, 2006

J> %r{(?=Z)?},
[...]
J> no [!!!]

probably a bug in the regexp engine for 1.8.4

Jeremy Henty · Jun 3, 2006

> J> %r{(?=Z)?},
> [...]
> J> no [!!!]
>
> probably a bug in the regexp engine for 1.8.4

For comparison, the equivalent Perl code reports all yeses as I had
expected. Any tips on debugging Ruby regexps? And any chance of a
fix going into Ruby 1.8.x ?

Interestingly, Ruby+Oniguruma rejects the second regexp - "target of
repeat operator is invalid: /(?=Z)?/". For the other two it agrees
with vanilla Ruby and Perl.

Regards,

Jeremy Henty
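
Since one engine rejects the second pattern outright, a variant of the test script that survives a rejected pattern may be useful. This is only a sketch: `try_match` is a hypothetical helper name, and which of the three outcomes you see for `(?=Z)?` depends on the regexp engine your Ruby was built with.

```ruby
# try_match is a hypothetical helper, not part of Ruby's API.
# It compiles the pattern at run time so that a rejected pattern
# raises RegexpError instead of aborting the whole script.
def try_match(source, string)
  Regexp.new(source).match(string) ? :yes : :no
rescue RegexpError
  :rejected
end

['Z?', '(?=Z)?', '((?=Z))?'].each do |source|
  puts "#{source.ljust(10)} #{try_match(source, 'a')}"
end
```

Compiling from a string matters here: a regexp literal that the engine rejects is reported when the file is parsed, before any `rescue` can run.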

ts · Jun 3, 2006

J> For comparison, the equivalent Perl code reports all yeses as I had
J> expected. Any tips on debugging Ruby regexps? And any chance of a
J> fix going into Ruby 1.8.x ?

Well, you have found the problem

J> Interestingly, Ruby+Oniguruma rejects the second regexp
                       ^^^^^^^^^

Oniguruma will be the next regexp engine for ruby, not sure if the old
regexp engine will be maintained.

In your case the problem is simple

svg% ruby -rjj -e '/(?=Z)?/.dump'
Regexp /(?=Z)?/
0 start_nowidth 2
1 exactn "Z" (1)
2 stop_nowidth
3 on_failure_jump ==> 4
4 end
must : Z
svg%

See `must : Z', which means that the string *must* contain Z for a match
(it's a wrong optimization); this is why it doesn't match in your case.

It has too many optimizations, in this case
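
The effect of a wrong `must` value can be illustrated with a toy model. This is not the code in regex.c: `match_with_prefilter` and its `must` argument are hypothetical names, and `/Z?/` stands in for `/(?=Z)?/` because some engines refuse to compile the latter.

```ruby
# Toy model of a "must contain" prefilter: if a literal is recorded as
# required, skip the real match whenever the string lacks that literal.
def match_with_prefilter(regexp, must, string)
  return nil if must && !string.include?(must)
  regexp.match(string)
end

optional_z = /Z?/  # the Z is optional, so no literal is really required

# Correct: nothing is required, the engine runs and matches the empty string.
puts match_with_prefilter(optional_z, nil, 'a').inspect

# Wrong: recording "Z" as required rejects 'a' before the engine even runs.
puts match_with_prefilter(optional_z, 'Z', 'a').inspect  # prints "nil"
```

The prefilter is only safe when every possible match really contains the literal, which is not true once the sub-pattern is made optional.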

Jeremy Henty · Jun 4, 2006

> J> ... any chance of a fix going into Ruby 1.8.x ?
>
> Well, you have found the problem

Well, yay me! Now how do I find a fix? I'm looking at
re_compile_pattern() in regex.c and I'm a little intimidated. Where
do I start?

> svg% ruby -rjj -e '/(?=Z)?/.dump'
> Regexp /(?=Z)?/
> 0 start_nowidth 2
> 1 exactn "Z" (1)
> 2 stop_nowidth
> 3 on_failure_jump ==> 4
> 4 end
> must : Z
> svg%
>
> See `must : Z', which means that the string *must* contain Z for a
> match (it's a wrong optimization); this is why it doesn't match in
> your case.

That's clear. What do the other regexps compile to? And what's this
-rjj ? I've Googled and searched mailing lists and all I can find is
other people looking for it too (and some posts from someone with the
handle "JJ"). Is it a secret only the elite may share?

Thanks for your help,

Jeremy Henty

ts · Jun 4, 2006

J> That's clear. What do the other regexps compile to? And what's this

svg% ruby -rjj -e '/Z?/.dump'
Regexp /Z?/
0 on_failure_jump ==> 2
1 exactn "Z" (1)
2 end
svg%

svg% ruby -rjj -e '/((?=Z))?/.dump'
Regexp /((?=Z))?/
0 on_failure_jump ==> 6
1 start_memory $1
2 start_nowidth 4
3 exactn "Z" (1)
4 stop_nowidth
5 stop_memory $1
6 end
subexpressions : 1
svg%

J> -rjj ?

jj don't want to leave moulon

ts · Jun 4, 2006

J> re_compile_pattern() in regex.c and I'm a little intimidated.

Can you test this patch ?

svg% diff -u regex.c.~1.96.2.8.~ regex.c
--- regex.c.~1.96.2.8.~ 2006-04-24 17:15:21.000000000 +0200
+++ regex.c 2006-06-04 13:31:23.000000000 +0200
@@ -1963,7 +1963,7 @@
         stackp--;
         fixup_alt_jump = *stackp ? *stackp + bufp->buffer - 1 : 0;
         laststart = *--stackp + bufp->buffer;
-        if (c == '!' || c == '=') laststart = b;
+        if (c == '!' /* || c == '=' */) laststart = b;
         break;
 
       case '|':
svg%

svg% ruby -rjj -e '/(?=Z)?/.dump'
Regexp /(?=Z)?/
0 on_failure_jump ==> 4
1 start_nowidth 3
2 exactn "Z" (1)
3 stop_nowidth
4 end
svg%
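
Comparing the two dumps of `/(?=Z)?/`, the patch moves `on_failure_jump` from after the lookahead to before it. A toy interpreter shows why that matters. It is a rough sketch only: the opcode semantics below are guessed from the opcode names, the operands are treated as instruction indexes, matching is attempted at position 0 only, and `vm_match?`, `BEFORE_PATCH` and `AFTER_PATCH` are hypothetical names.

```ruby
# Minimal backtracking interpreter for the five opcodes seen in the dumps.
def vm_match?(program, string)
  pc = 0
  pos = 0
  failures = []  # choice points: [pc to resume at, string position, saved]
  saved = []     # string positions remembered by start_nowidth
  loop do
    op, arg = program[pc]
    failed = false
    case op
    when :on_failure_jump          # remember where to resume on failure
      failures.push([arg, pos, saved.dup])
      pc += 1
    when :start_nowidth            # remember position before the lookahead
      saved.push(pos)
      pc += 1
    when :stop_nowidth             # lookahead succeeded: restore position
      pos = saved.pop
      pc += 1
    when :exactn                   # match a literal string
      if string[pos, arg.size] == arg
        pos += arg.size
        pc += 1
      else
        failed = true
      end
    when :end
      return true
    end
    next unless failed
    return false if failures.empty?
    pc, pos, saved = failures.pop
  end
end

BEFORE_PATCH = [[:start_nowidth, 2], [:exactn, 'Z'], [:stop_nowidth],
                [:on_failure_jump, 4], [:end]].freeze
AFTER_PATCH  = [[:on_failure_jump, 4], [:start_nowidth, 3], [:exactn, 'Z'],
                [:stop_nowidth], [:end]].freeze

puts vm_match?(BEFORE_PATCH, 'a') ? "yes" : "no"  # prints "no"
puts vm_match?(AFTER_PATCH, 'a') ? "yes" : "no"   # prints "yes"
```

Under this reading, the unpatched program has no choice point in place when the lookahead fails, so the `?` never gets a chance to skip it; the patched program registers the choice point first, in the same way as the dump for `/Z?/`.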

Jeremy Henty · Jun 4, 2006

> Can you test this patch ?
>
> svg% diff -u regex.c.~1.96.2.8.~ regex.c
> --- regex.c.~1.96.2.8.~ 2006-04-24 17:15:21.000000000 +0200
> +++ regex.c 2006-06-04 13:31:23.000000000 +0200
> @@ -1963,7 +1963,7 @@
> stackp--;
> fixup_alt_jump = *stackp ? *stackp + bufp->buffer - 1 : 0;
> laststart = *--stackp + bufp->buffer;
> - if (c == '!' || c == '=') laststart = b;
> + if (c == '!' /* || c == '=' */) laststart = b;
> break;
>
> case '|':

It works!

$ regexp_wibble
yes
no
yes
[patch and reinstall]
$ regexp_wibble
yes
yes
yes

#!/usr/bin/env ruby

$VERBOSE = true

string = 'a'

[ %r{Z?},
  %r{(?=Z)?}, # does not match!!!
  %r{((?=Z))?},
].each do |re|
  puts re.match(string) ? "yes" : "no"
end

<<< regexp_wibble

And also

$ make check
1356 tests, 15409 assertions, 0 failures, 0 errors

...so nothing appears to have broken.

Thanks,

Jeremy Henty
