Slow ruby regexes

Emmanuel · Apr 10, 2007

Hello i've been reading this article, wich has a few benchmarks
(inclding a ruby related one) and many theorical explanations of regex
matching algorithms:

http://swtch.com/~rsc/regexp/regexp1.html

I became a little worried since i'm making hevy use of regexes in a
program of mine that shortly i'll have to run on each of thousands of
text files.

* Do anyone now if what is said relates to ruby 1.8.6
* If so, is it going to change ?

Thank you!

MenTaLguY · Apr 10, 2007

I became a little worried since i'm making hevy use of regexes in a
program of mine that shortly i'll have to run on each of thousands of
text files.

Ruby's regexp implementation is a backtracking matcher similar to Perl's -- while the article's criticism is valid, you should be fine if you write your regular expressions to minimize the amount of backtracking required.

(The author's point is that such care is unnecessary with an NFA-based regexp implementation.)

-mental

SonOfLilit · Apr 10, 2007

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

For example, a regex was shown in this list lately that tells you if a
number is prime. That needs backtracking.

Aur

MenTaLguY · Apr 10, 2007

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

-mental

Robert Dober · Apr 10, 2007

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

Well the good news is that Ruby 2 Regexs will be much better
performing, I do not know if they will be NFA, though.
The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

But you are right of course that Ruby's current Regex implementation
is not state of the art, useful, very useful nevertheless.

Cheers
Robert

MenTaLguY · Apr 10, 2007

The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

But you are right of course that Ruby's current Regex implementation
is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental

Marcin Raczkowski · Apr 11, 2007

No, the bad news is that the bad design of most "regex" implementations
will not let you escape with using "regexes" as if they were true regular
expressions; knowing a lot about the implementation of the regex engine is
a necessary thing to use them well.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in
other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful
sometimes), but rather that NFAs should be used for evaluating the majority
of "regexes" which don't use non-regular features.

-mental

Click to expand...

Well topic is rather interesting, i'll try to check it soon, and i'll try to
write ruby interface for c library that author of article reference, i was
looking for good case to test including c into ruby.
I'm not promising anything tho

regexps are great tool for text processing, but serious log analyzers (like
one my company use) are written in c and base on lots of low-level string
manipulation, raw binary access to DB et cecera, not to mention that ruby is
much much slower then c anyway.

Greets
Marcin Raczkowski

Robert Dober · Apr 11, 2007

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental

Robert Dober · Apr 11, 2007

oops wrong button here

That might be, I am not qualified to discuss this issue.
What I actually meant is that given the need for backtracking
abilities it will just not be possible to "guess" what the user meant
when he/she uses too general an expression and I am talking about NFA
here.
Well just wanted to clear that point up.Cheers

MenTaLguY · Apr 11, 2007

What I actually meant is that given the need for backtracking
abilities it will just not be possible to "guess" what the user meant
when he/she uses too general an expression and I am talking about NFA
here.

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

-mental

Robert Dober · Apr 11, 2007

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match, let me see the NFA should be something like
X start second
a start term
. second second

how could backtracking avoid when no a is found in the <second> state?
The DFA of course could never backtrack but is there a DFA for this regex?

Hope to learn a little bit more about what I seem to have forgotten

Cheers
Robert
%r

Robert Dober · Apr 11, 2007

<snip>

stupid me make this simply /a*.*a/ sorry

Ola Bini · Apr 11, 2007

Robert said:
<snip>

stupid me make this simply /a*.*a/ sorry

Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the
main problem that is currently solved with backtracking implementations.

What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need it.

Cheers

--
Ola Bini (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

"Yields falsehood when quined" yields falsehood when quined.

Robert Dober · Apr 11, 2007

Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the
main problem that is currently solved with backtracking implementations.
Thx Ola, I'll look it up.
What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need it.

Yeah I was not concerned with the backreferences, just trying to
understand the basic stuff

Cheers

--

Robert

Robert Dober · Apr 11, 2007

Thx Ola, I'll look it up.

Done

just wanted to share the link
http://swtch.com/~rsc/regexp/regexp1.html
Robert

MenTaLguY · Apr 11, 2007

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star}
),

{any_star}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star}
),

{a_star,any_star,final}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star,final}
),

{any_star,final}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental

Robert Klemme · Apr 12, 2007

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

Click to expand...

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Click to expand...

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

While this is true in theory there is a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now for pratical purposes most modern RX engines are NFA or
hybrids, sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see "Mastering Regular Expressions" for details).

Kind regards

robert

Robert Klemme · Apr 12, 2007

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

I think you got NFA and DFA mixed up: DFA's are the ones that do not
suffer backtracking issues.

robert

Robert Dober · Apr 12, 2007

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

Click to expand...

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.

The DFA of course could never backtrack but is there a DFA for this regex?

Click to expand...

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star}
),

{any_star}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star}
),

{a_star,any_star,final}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star,final}
),

{any_star,final}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental

Thanx for the explanation

, I guess I learnt this algorithm once
upon a time.

.
Anyway relearning is learning too.
Now there does not seem to be any practical need to do this
transformation though as the Thompson NFA algorithm kinda does the
same thing in runtime, not creating states that are sets of states but
just keeping a set of states.
Thx again for your time and patience, and yes I have to change my
initial statement

there are no bad news

) [ still ignoring backreferences ]

Cheers
Robert

Robert Dober · Apr 12, 2007

Guys, I think your possibly mixing up backtracking and backreferences. It is
backreferences (referencing a earlier match, often to use as part of the
definition of a latter match) that NFA is not able to do.

No Tim I was not mixing it up, I said that I was not interested in the
backreferencing history.
You see I am just forgetting the basic stuff but Ola and Mental kindly
gave me a refresher.

both approaches use
backtracking.

The DFA cannot use backtracking, surely I am missing something *again*!

The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to follow,
instead of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I htink thats what the
article was saying form a very quick read and from what I can remember from my
computability course and finite automata (oh so many years ago now!).

For me it is 25 years and for you?
Yeah I understood that in the same way, actually it is quite simple

Tim

Cheers
Robert

Ruby Regexp implementation?	2	Mar 12, 2010
Slow regular-expression engine	5	Jul 31, 2009
Ruby Weekly News 18th - 24th April 2005	1	Apr 26, 2005
Ruby Weekly News 17th - 23rd January 2005	3	Jan 23, 2005
Ruby Weekly News 6th - 12th June 2005	0	Jun 14, 2005
[Minor Change Proposal]Allow back reference with nest level in Onigurumafor Ruby again	1	May 14, 2007
Ruby Weekly News 11th - 17th April 2005	5	Apr 18, 2005
Ruby Weekly News 12th - 18th September 2005	0	Sep 20, 2005

Slow ruby regexes

Emmanuel

MenTaLguY

SonOfLilit

MenTaLguY

Robert Dober

MenTaLguY

Marcin Raczkowski

Robert Dober

Robert Dober

MenTaLguY

Robert Dober

Robert Dober

Ola Bini

Robert Dober

Robert Dober

MenTaLguY

Robert Klemme

Robert Klemme

Robert Dober

Robert Dober

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads