Slow ruby regexes

E

Emmanuel

Hello i've been reading this article, wich has a few benchmarks
(inclding a ruby related one) and many theorical explanations of regex
matching algorithms:

http://swtch.com/~rsc/regexp/regexp1.html

I became a little worried since i'm making hevy use of regexes in a
program of mine that shortly i'll have to run on each of thousands of
text files.

* Do anyone now if what is said relates to ruby 1.8.6
* If so, is it going to change ?

Thank you!
 
M

MenTaLguY

I became a little worried since i'm making hevy use of regexes in a
program of mine that shortly i'll have to run on each of thousands of
text files.

Ruby's regexp implementation is a backtracking matcher similar to Perl's -- while the article's criticism is valid, you should be fine if you write your regular expressions to minimize the amount of backtracking required.

(The author's point is that such care is unnecessary with an NFA-based regexp implementation.)

-mental
 
S

SonOfLilit

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

For example, a regex was shown in this list lately that tells you if a
number is prime. That needs backtracking.


Aur
 
M

MenTaLguY

Read wikipedia on Regex. It explains better than I can why one is used
over the other (more power).

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

-mental
 
R

Robert Dober

It's unfortunate, though, that an NFA-based approach isn't selectively used for those expressions which don't use features like backreferences (i.e. the majority of regular expressions people write). The performance is so much better it's not funny.

Well the good news is that Ruby 2 Regexs will be much better
performing, I do not know if they will be NFA, though.
The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

But you are right of course that Ruby's current Regex implementation
is not state of the art, useful, very useful nevertheless.

Cheers
Robert
 
M

MenTaLguY

The bad news is that the power of regexs will just not let you escape
with bad design all the time, so knowing a lot about regexes is a
necessary thing to use them well.

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.
But you are right of course that Ruby's current Regex implementation
is not state of the art, useful, very useful nevertheless.

NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental
 
M

Marcin Raczkowski

No, the bad news is that the bad design of most "regex" implementations
will not let you escape with using "regexes" as if they were true regular
expressions; knowing a lot about the implementation of the regex engine is
a necessary thing to use them well.


NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in
other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful
sometimes), but rather that NFAs should be used for evaluating the majority
of "regexes" which don't use non-regular features.

-mental

Well topic is rather interesting, i'll try to check it soon, and i'll try to
write ruby interface for c library that author of article reference, i was
looking for good case to test including c into ruby.
I'm not promising anything tho :)

regexps are great tool for text processing, but serious log analyzers (like
one my company use) are written in c and base on lots of low-level string
manipulation, raw binary access to DB et cecera, not to mention that ruby is
much much slower then c anyway.

Greets
Marcin Raczkowski
 
R

Robert Dober

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.


NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

-mental
 
R

Robert Dober

oops wrong button here :(

That might be, I am not qualified to discuss this issue.
What I actually meant is that given the need for backtracking
abilities it will just not be possible to "guess" what the user meant
when he/she uses too general an expression and I am talking about NFA
here.
Well just wanted to clear that point up.Cheers
 
M

MenTaLguY

What I actually meant is that given the need for backtracking
abilities it will just not be possible to "guess" what the user meant
when he/she uses too general an expression and I am talking about NFA
here.

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.

-mental
 
R

Robert Dober

I'm a little confused -- the regexp features which require backtracking correspond to well-defined elements of the syntax. There isn't any need to guess whether the user requires backtracking or not, one can simply see whether such expressions are present in a particular regexp.

For "pure" regular expressions (which can be matched by NFAs), backtracking and NFA-based evaluation yield equivalent results, except that the NFA approach is O(n) in the number of characters, rather than backtracking's worst-case exponential complexity.
No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match, let me see the NFA should be something like
X start second
a start term
. second second

how could backtracking avoid when no a is found in the <second> state?
The DFA of course could never backtrack but is there a DFA for this regex?

Hope to learn a little bit more about what I seem to have forgotten ;)


Cheers
Robert
%r
 
O

Ola Bini

Robert said:
<snip>

stupid me make this simply /a*.*a/ sorry
Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the
main problem that is currently solved with backtracking implementations.

What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need it.

Cheers

--
Ola Bini (http://ola-bini.blogspot.com)
JvYAML, RbYAML, JRuby and Jatha contributor
System Developer, Karolinska Institutet (http://www.ki.se)
OLogix Consulting (http://www.ologix.com)

"Yields falsehood when quined" yields falsehood when quined.
 
R

Robert Dober

Actually, to make it non-backtracking, you just need to update all the
possible states at the same time. It's quite easy. Just look for
Thompson NFA for a closer look at it.
The problem is that it gets problematic with backREFERENCES. That's the
main problem that is currently solved with backtracking implementations.
Thx Ola, I'll look it up.
What Mentalguy said is basically that there should be better ways of
implementing regex engines with backreferences. Since you don't use
backreferences so often, you could have Thompson NFA for those cases
where it's fast, and switch to a backtracking engine only when you need it.
Yeah I was not concerned with the backreferences, just trying to
understand the basic stuff ;)
Cheers

--
Robert
 
M

MenTaLguY

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.
The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star}
),

{any_star}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star}
),

{a_star,any_star,final}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star,final}
),

{any_star,final}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental
 
R

Robert Klemme

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.
The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

While this is true in theory there is a number of modern RX features
that are extremely difficult or impossible to do with a DFA. So as it
stands now for pratical purposes most modern RX engines are NFA or
hybrids, sometimes (Perl) with heavy optimizations to prevent certain
ugly effects (see "Mastering Regular Expressions" for details).

Kind regards

robert
 
R

Robert Klemme

No, the bad news is that the bad design of most "regex" implementations will not let you escape with using "regexes" as if they were true regular expressions; knowing a lot about the implementation of the regex engine is a necessary thing to use them well.


NFAs were state-of-the-art in the 1960s. Backreferences are nice, but in other respects we've actually regressed since then.

I'm not suggesting we abandon backreferences and so forth (which are useful sometimes), but rather that NFAs should be used for evaluating the majority of "regexes" which don't use non-regular features.

I think you got NFA and DFA mixed up: DFA's are the ones that do not
suffer backtracking issues.

robert
 
R

Robert Dober

No it must be me who is confused, I thought that backtracking cannot
be avoided in regexs like e.g. %r{[a]*.*a} when trying for a greedy
match

using a Ragel-ish notation, the NFA for /^a*.*a/ should look like this (after collapsing most of the epsilon transitions introduced by Thompson's Algorithm):

start: ('' ->a_star | '' ->any_star),
a_star: ('a' ->a_star | any ->any_star),
any_star: (any ->any_star | 'a' =>final),
final: (any ->final)

Here, -> is a regular transition, and => is a labeled transition which sets a counter (initialized to -1) to the current input position if that is larger than the counter's current value. When we have explored all possible matches, the value of the counter will be the position of the last character in the (greedy) match.
The DFA of course could never backtrack but is there a DFA for this regex?

Every NFA can be rewritten as a DFA. Each state in the resulting DFA corresponds to a superposition of states in the original NFA (again in pseudo-Ragel, where {a,b,c} denotes the superposition of NFA states a, b, and c):

{start,a_star,any_star}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star}
),

{any_star}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star}
),

{a_star,any_star,final}: (
'a' =>{a_star,any_star,final} |
(any-'a') ->{any_star,final}
),

{any_star,final}: (
'a' =>{any_star,final} |
(any-'a') ->{any_star,final}
)

The resulting FA is deterministic, and {a_star,any_star,final} and {any_star,final} are both accepting states.

If we need to capture subexpressions, we can introduce additional counters and labeled transitions with priorities to resolve greedy/non-greedy ambiguities (as is often done in Ragel to resolve ambiguities). Prioritized transitions are also necessary for regexps which are not anchored at the beginning.

-mental
Thanx for the explanation :), I guess I learnt this algorithm once
upon a time.:(.
Anyway relearning is learning too.
Now there does not seem to be any practical need to do this
transformation though as the Thompson NFA algorithm kinda does the
same thing in runtime, not creating states that are sets of states but
just keeping a set of states.
Thx again for your time and patience, and yes I have to change my
initial statement

there are no bad news :)) [ still ignoring backreferences ]

Cheers
Robert
 
R

Robert Dober

Guys, I think your possibly mixing up backtracking and backreferences. It is
backreferences (referencing a earlier match, often to use as part of the
definition of a latter match) that NFA is not able to do.
No Tim I was not mixing it up, I said that I was not interested in the
backreferencing history.
You see I am just forgetting the basic stuff but Ola and Mental kindly
gave me a refresher.
both approaches use
backtracking.
The DFA cannot use backtracking, surely I am missing something *again*!
The NFA approach is able to parallelise the choices, reducing the
overheads of backtracking. So, given n different possible paths to follow,
instead of, in the worst case, trying n-1 paths and having to backtrack after
each one, you can try all n paths at once. At least, I htink thats what the
article was saying form a very quick read and from what I can remember from my
computability course and finite automata (oh so many years ago now!).
For me it is 25 years and for you?
Yeah I understood that in the same way, actually it is quite simple :)
Cheers
Robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top