Ruby regex engine behavior question

Daniel Berger · Sep 13, 2004

I read this in a journal entry:

"[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
(it's 'start of current match' rather than 'end of last match'), which
makes it relatively useless to write complex parsers with."

Can anyone comment on this? I'm not quite certain what he means. And
is it still the same in 1.8?

Regards,

Dan

ts · Sep 13, 2004

D> "[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
^^^^^^^

are you sure of this ?

D> (it's 'start of current match' rather than 'end of last match'), which
D> makes it relatively useless to write complex parsers with."

Guy Decoux

Daniel Berger · Sep 14, 2004

ts said:
D> "[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
^^^^^^^

are you sure of this ?

D> (it's 'start of current match' rather than 'end of last match'), which
D> makes it relatively useless to write complex parsers with."

Guy Decoux

No. That's why I'm asking. I'm merely quoting the entry I saw. Thoughts?

Dan

nobu.nokada · Sep 14, 2004

Hi,

At Tue, 14 Sep 2004 01:04:58 +0900,
Daniel Berger wrote in [ruby-talk:112395]:

"[In the Ruby 1.6 regex engine] \G doesn't prohibit regex bump-along
(it's 'start of current match' rather than 'end of last match'), which
makes it relatively useless to write complex parsers with."

I don't understand what he means either. The 'start' and the 'end'
should be the same, since a global match starts matching from the end
of the last match.

Daniel Berger · Sep 14, 2004

The OP has further clarified. To quote:

When trying to match abcde with /\Gx?/g, the first match is
successful, because no x is found but the question mark allows zero
characters to be consumed. This match ends zero characters into
the string, that is, at start-of-string. To avoid an infinite loop on
zero-length matches, the engine then retries the match one position
further down the string.

In Perl, \G means end-of-last-match, and since end-of-last-match was
at start-of-string, \G can't possibly match at one character into the
string:

$ perl -le'$_="abcde"; s/\Gx?/!/g; print'
!abcde

In Ruby (both 1.6 and 1.8, I found), \G merely means
start-of-current-match, which, of course, is satisfiable at that
point:

$ ruby1.6 -e'puts "abcde".gsub(/\Gx?/,"!")'
!a!b!c!d!e!
$ ruby1.8 -e'puts "abcde".gsub(/\Gx?/,"!")'
!a!b!c!d!e!

Perl's \G is a powerful tool for writing parsers because the regex engine
is prohibited from skipping characters to find a match: you can work
your way through a string by applying a multitude of patterns to it in
turn, using /c (to avoid resetting the end-of-last-match on match
failure), without them sabotaging each other.

End quote.
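
For comparison, here is a rough sketch of the same "several patterns in turn, no skipping" style in Ruby. It uses StringScanner from the standard library, which nobody in this thread mentioned, so treat it as one possible approach rather than the answer to the journal entry. The token names and patterns are made up for illustration. StringScanner#scan only matches at the current position and leaves that position untouched on failure, which is roughly what \G with /c gives you in Perl.

```ruby
require "strscan"

# Hypothetical token patterns, tried in turn at the current scan position.
TOKEN_PATTERNS = [
  [:number, /\d+/],
  [:ident,  /[a-z]+/],
  [:space,  /\s+/]
].freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    # scan matches only at the current position; on failure it returns nil
    # and does not move, so the next pattern is tried at the same spot.
    kind, _pattern = TOKEN_PATTERNS.find { |_, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{scanner.pos}" unless kind
    tokens << [kind, scanner.matched] unless kind == :space
  end
  tokens
end

p tokenize("foo 42 bar")
```

Input that no pattern accepts (for example the "-" in "foo-42") raises instead of being silently skipped, which is the property the quoted text cares about.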

Thoughts?

Dan

ts · Sep 14, 2004

D> In Perl, \G means end-of-last-match, and since end-of-last-match was
D> at start-of-string, \G can't possibly match at one character into the
D> string:

This is one way to say it; another is:

* on a zero-length match, Perl prohibits a second zero-length match

* on a zero-length match, Ruby moves its internal cursor

Guy Decoux
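
A small sketch that shows both sides of this in Ruby, assuming a current Ruby still behaves like the 1.8 run quoted earlier for the gsub case. With non-empty matches, \G does stop the engine from skipping ahead; only after a zero-length match does the cursor move and \G move along with it.

```ruby
# \G is anchored at the position where each match attempt starts.

# Non-empty matches: the next attempt starts where the last match ended,
# so \G stops the scan at the first character that does not match.
letters = "ab cd".scan(/\G\w/)

# Zero-length matches: gsub steps one character forward before retrying,
# and \G is satisfied again at the new starting position.
banged = "abcde".gsub(/\Gx?/, "!")

p letters  # ["a", "b"]
p banged   # "!a!b!c!d!e!"
```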
