Slow regular expressions :(


Roman Hausner

I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
  print "Matching #{s}\n";
  if(s =~ /^(\s*foo\s*)*$/)
    print "YES!\n";
  else
    print "NO!\n";
  end
}

and compare it with this perl program:
my $str1 = ("foo " x 70) . "foo ";
my $str2 = ("foo " x 70) . "fo ";

my @strings = ( $str1, $str2 );

foreach $s ( @strings ) {
    print "Matching $s\n";
    if($s =~ /^(\s*foo\s*)*$/) {
        print "YES!\n";
    } else {
        print "NO!\n";
    }
}

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?
 

Caio Chassot

Roman Hausner said:
I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Is this being taken care of for the next release?

Ruby 1.9 uses a different regex engine, oniguruma. Not sure if it
addresses this particular problem.
 

Keith Gaughan

Roman Hausner said:
I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
  print "Matching #{s}\n";
  if(s =~ /^(\s*foo\s*)*$/)
    print "YES!\n";
  else
    print "NO!\n";
  end
}

Ah, now while I'm not saying that Ruby's regex engine is slow--it
is--I think it's more likely here that you hit a pathological edge
case in how it works, specifically the '\s*' on each side of 'foo'.

When the code is changed to

strings.each do |s|
  print "Matching #{s}\n"
  if s =~ /^(foo\s*)*$/
    print "Yes!\n"
  else
    print "No!\n"
  end
end

The problem disappears and the Ruby and Perl versions of that code
benchmark similarly.

Patterns of the form (x*y*x*)* have a habit of acting pathologically,
and there's almost always a better and clearer way of writing them,
mainly because they cause a nasty amount of backtracking. This applies
to a good number of regex engines, not just Ruby's.
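
A rough sketch of how one might reproduce that comparison (illustrative
only, not from the original post; timings vary by Ruby version and
engine, and the repeat count is kept small so the slow pattern still
finishes on engines without the Perl-style optimisation):

require 'benchmark'

n = 22                                 # small enough to finish even with naive backtracking
bad = "foo " * n + "fo "               # the failing input

[/^(\s*fooe\s*)*$/, /^(foo\s*)*$/].each do |re|
  t = Benchmark.realtime { bad =~ re }
  printf("%-22s %10.4f s\n", re.inspect, t)
end
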
Roman said:
On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

My suspicion is that Ruby's regex package doesn't account for this
pathological case, whereas Perl's--which, to be frank, is pretty much
the best regex engine out there bar none except for its support for
Unicode properties--does.

Not a clue as to whether it'll be taken care of, though. JARH.

K.
 

Caleb Clausen

Roman Hausner said:
I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
  print "Matching #{s}\n";
  if(s =~ /^(\s*foo\s*)*$/)
    print "YES!\n";
  else
    print "NO!\n";
  end
}


If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn't needed. Another trick that
may help here is changing the *'s to +'s.
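
For concreteness, the rewrites suggested above might look like this (a
sketch only; these are not exactly equivalent to the original pattern,
since leading and trailing whitespace is treated slightly differently):

original = /^(\s*foo\s*)*$/   # two unbounded \s* inside the repeated group: pathological
hoisted  = /^\s*(foo\s*)*$/   # first \s* moved outside the parens
stripped = /^(foo\s*)*$/      # first \s* removed entirely
plussed  = /^(foo\s+)*$/      # * tightened to +: each "foo" must be followed by whitespace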

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it...

Yes, perl optimizes this case, but the fact is that you're asking for
a lot of silly extra backtracking, and that is what ruby gives you. I
imagine that you didn't know that you'd created a backtracking
monster...

Basically, as long as everything is matching, you're fine and matches
will be nice and zippy. But once that 'fo' causes a mismatch, the
engine has to go back and check every possible way that the spaces in
the string could be matched by either of the 2 whitespace matchers.
Since there are ~70 spaces in your string and each could be matched by
either \s*, there are about 2**70 possibilities to explore, total. You
can see how that might take a little time.
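
To see that growth directly, one can time the failing match while
increasing the number of repetitions (a sketch, assuming an engine that
does the naive backtracking; on engines with the Perl-style optimisation
the times stay essentially flat):

require 'benchmark'

(16..22).each do |n|
  s = "foo " * n + "fo "                            # the broken tail forces a match failure
  t = Benchmark.realtime { s =~ /^(\s*foo\s*)*$/ }
  printf("n = %2d  %10.4f s\n", n, t)               # roughly doubles with each extra "foo"
end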

I'm afraid I'm not explaining very well why this Regexp is a
monster... maybe someone else will take a stab.
 

Daniel Berger

Caleb said:

If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn't needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it...

Yes, perl optimizes this case, but the fact is that you're asking for
a lot of silly extra backtracking, and that is what ruby gives you. I
imagine that you didn't know that you'd created a backtracking
monster...

Basically, as long as everything is matching, you're fine and matches
will be nice and zippy. But once that 'fo' causes a mismatch, the
engine has to go back and check every possible way that the spaces in
the string could be matched by either of the 2 whitespace matchers.
Since there are ~70 spaces in your string and each could be matched by
either \s*, there are about 2**70 possibilities to explore, total. You
can see how that might take a little time.

I'm afraid I'm not explaining very well why this Regexp is a
monster... maybe someone else will take a stab.

It would take too long and I'm tired. Jeffrey Friedl's "Mastering
Regular Expressions" is the definitive source on the subject. That
being said, I think the OP needs a lesson on greedy vs non-greedy
regular expressions.

Also note that Perl must have optimized for this sort of backtracking
only after 5.005_03, because that version hangs as well. I'd be curious
as to their reasons for doing this, but I have my suspicions.

Regards,

Dan


 

William James

Roman said:
I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just try this program:

str1 = "foo " * 70 + "foo ";
str2 = "foo " * 70 + "fo ";

strings = str1, str2;

strings.each { |s|
  print "Matching #{s}\n";
  if(s =~ /^(\s*foo\s*)*$/)
    print "YES!\n";
  else
    print "NO!\n";
  end
}

and compare it with this perl program:
my $str1 = ("foo " x 70) . "foo ";
my $str2 = ("foo " x 70) . "fo ";

my @strings = ( $str1, $str2 );

foreach $s ( @strings ) {
    print "Matching $s\n";
    if($s =~ /^(\s*foo\s*)*$/) {
        print "YES!\n";
    } else {
        print "NO!\n";
    }
}

On my machine, the perl version takes 0.229 seconds. The ruby version
was still running after 519.662 seconds when I killed it.

Is this being taken care of for the next release?

Here's the way to do it.

strings = " foo " * 70 + "foo ",
"foo " * 70 + "foo ",
" foo " * 70 + "fo ",
""

strings.each { |s|
puts "Matching #{s}"
if s =~ /^ \s* # May start with whitespace.
# The rest of the string must be 0 or more
# occurrences of "foo" plus whitespace.
(?: foo \s+ ) *
$/x
puts "YES!"
else
puts "NO!"
end
}
 

Daniel Martin

Roman Hausner said:
I am disappointed to learn that Ruby obviously implements yet another
regular expression library that does not avoid very slow matching
failures.

Just a note that there are some patterns where I found (in my tests,
on a cygwin environment I can't recreate anymore) that ruby beat perl.

http://snowplow.org/martin/rebench/

I should note that the people at perlmonks.org disbelieve my results,
or rather, believe that my results are being skewed by some strange
cygwin issue. I haven't gotten off my butt and redone the same tests
on linux.
 

Roman Hausner

Caleb said:
If you take out the first \s* (or move it outside the parens), the
slowness goes away. That first \s* isn't needed. Another trick that
may help here is changing the *'s to +'s.

This is a case of pathological backtracking. Usually, these things are
solved by carefully tuning the Regexp, rather than expecting the
Regexp engine to fix it...

No. I am not looking for help on regular expressions here and I know
that this particular expression could be optimized. I also know that
some regular expressions with backtracking can get exponentially slow.

However, regular expressions can get generated automatically, or there
could be other reasons why an optimization a la perl is helpful.

The example simply points out that ruby uses a strategy that does not do
the optimization that is present in perl. My question was, whether Ruby
is going to fix this.

If you can answer that question, I would appreciate it. If not, maybe
somebody else can.
 

William James

Roman said:

I.e., "No, I won't take out the first \s* (or move it outside the
parens). No, I won't tune the regular expression. No, this isn't
a case of pathological backtracking. No, I won't stop holding
my breath until I turn blue."
I am not looking for help on regular expressions here and I know
that this particular expression could be optimized.

No. This regular expression has been pessimized.
I also know that
some regular expressions with backtracking can get exponentially slow.

However, regular expressions can get generated automatically

Stop generating crappy regular expressions.
or there
could be other reasons why an optimization a la perl is helpful.

The example simply points out that ruby uses a strategy that does not do
the optimization that is present in perl. My question was, whether Ruby
is going to fix this.

The arrogance. He refuses to listen to any advice about cleaning
up his crap and then demands to know when someone is going
to "fix" Ruby.

I hope that Ruby doesn't start pandering to those who obdurately
refuse to clean up their feculent waste.
 

Just Another Victim of the Ambient Morality

William James said:
I.e., "No, I won't take out the first \s* (or move it outside the
parens). No, I won't tune the regular expression. No, this isn't
a case of pathological backtracking. No, I won't stop holding
my breath until I turn blue."

This attitude is noted, below...

Stop generating crappy regular expressions.

This is easier said than done. It can be very hard to write code that
can specifically avoid generating pathological regexes, especially if you
don't understand how regexes work. Indeed, it would be nice if the regexp
engine could do this for you. After all, it is the job of the programming
language to make programming easier for us, is it not?

The arrogance. He refuses to listen to any advice about cleaning
up his crap and then demands to know when someone is going
to "fix" Ruby.

I hope that Ruby doesn't start pandering to those who obdurately
refuse to clean up their feculent waste.

While Roman Hausner obviously created this thread with a somewhat
belligerent attitude, it is rarely wise to reply in kind.
There is much merit in his suggestion to include this PERL regexp
optimisation feature in Ruby...
 

Bill Kelly

From: "Just Another Victim of the Ambient Morality said:
This is easier said than done. It can be very hard to write code that
can specifically avoid generating pathological regexes, especially if you
don't understand how regexes work. Indeed, it would be nice if the regexp
engine can do this for you. After all, it is the job of the programming
language to make programming easier for us, is it not?

Is it worth noting that the article at
http://perlmonks.org/index.pl?node_id=502408
indicates:

- a performance penalty is incurred for keeping track of the
"have we been here before" state;

- the solution doesn't cover all regexps, so some regexps
can still produce exponential backtracking.

?


The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn't seem 100% clear cut to me.


Regards,

Bill
 

Just Another Victim of the Ambient Morality

Bill Kelly said:
Is it worth noting that the article at
http://perlmonks.org/index.pl?node_id=502408
indicates:

- a performance penalty is incurred for keeping track of the
"have we been here before" state;

- the solution doesn't cover all regexps, so some regexps
can still produce exponential backtracking.

?


The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn't seem 100% clear cut to me.

It is not clear cut to me, either, and yes, it is worth noting that
there are performance trade-offs. However, the poster to whom I was
responding dismissed it, out of hand, by blaming the programmer, much like
the 60's auto industry blaming accidents on drivers as an excuse for not
bothering with safety features...

If the performance hit is great and the regex pattern uncommon, it may
very well not be worth it. This is, of course, debatable and that's my
point. Personally, I don't use Ruby to write fast programs, I use it to
write (correct) programs fast.

At the very least, perhaps the PERL Regexp engine can be written as a
Ruby extension and used as another Regexp class that we "require" in our
code when we feel it's more appropriate? I'm a big believer in the Best of
Both Worlds solution...
 

Roman Hausner

Bill said:
Is it worth noting that the article at
http://perlmonks.org/index.pl?node_id=502408
indicates:

- a performance penalty is incurred for keeping track of the
"have we been here before" state;

- the solution doesn't cover all regexps, so some regexps
can still produce exponential backtracking.

?


The Perl guys seem to feel the trade-off was worth it. Maybe
so. But it doesn't seem 100% clear cut to me.

The important point is that the performance penalty is linear: those
matches that get slower will get slower by a constant factor. But this
penalty prevents an *exponential* slowdown in many typical cases. I did
not wait to see how long Ruby would have taken in the simple example I
gave, but the slowdown was already more than 2000-fold when I killed
it -- much, much more than the performance penalty would ever be.

Regarding the example: yes, this is an extremely simple example meant for
illustration only. However, with complex hand-crafted regular
expressions or with automatically generated ones the same effect can
arise when it is much more complicated or even impossible to find an
equivalent regular expression that does not show the exponential
slowdown.

Exponential slowdown can be an extremely bad thing, rendering an
application practically unusable. So I would say, the trade-off is
definitely worth it. Even if it cannot prevent *all* pathological cases
(my feeling is that most pathological cases that occur in practice *can*
be prevented though).
 

Chad Perrin

At the very least, perhaps the PERL Regexp engine can be written as a

Just a quick pseudo-cultural note . . .

The language is Perl. The parser/compiler/interpreter/blah is perl.
The term PERL is something used to identify texts one shouldn't bother
reading.
 

Alexandru E. Ungur

sender: "Just Another Victim of the Ambient Morality" date: "Fri, Jul 28, 2006 at 02:55:05PM +0900" <<<EOQ
It is not clear cut to me, either, and yes, it is worth noting that
there are performance trade-offs. However, the poster to whom I was
responding dismissed it, out of hand, by blaming the programmer, much like
the 60's auto industry blaming accidents on drivers as an excuse for not
bothering with safety features...

If the performance hit is great and the regex pattern uncommon, it may
very well not be worth it. This is, of course, debatable and that's my
point. Personally, I don't use Ruby to write fast programs, I use it to
write (correct) programs fast.

At the very least, perhaps the PERL Regexp engine can be written as a
Ruby extension and used as another Regexp class that we "require" in our
code when we feel it's more appropriate? I'm a big believer in the Best of
Both Worlds solution...

No offence, but who stops you from using Perl :) ?!?
Just use perl when you're too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn't this be "the best of both worlds"... ?

Of course you and all the other wise "haha I found one more bug in Ruby"
guys could actually follow the very good advice of reading 'Mastering
Regular Expressions', and stop dumping crap on this list, but hey I know
it's not a perfect world :)

This is just disgusting, to take pride in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It's a little
harder than whining on a list, but hey, you're smart guys, after all you
"found where Ruby sucks" and others do a great job... and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart... :)

I'd say "Good luck" but you're probably just too smart to need it
anyway...



To all the other decent readers on the list:
I apologise for my post, I may be breaking the netiquette here, but
this was just too much... :(
I'm a Ruby nuby too, I am not the Grand Master of regular expressions,
but I love Ruby just the way it is, and it really hurts me to see this
crap attitude about Ruby's "limitations" coming from people not able
to understand their own limitations...


All the best to all the decent people,
Alex
 

Roman Hausner

Alexandru said:
No offence, but who stops you from using Perl :) ?!?
Just use perl when you're too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn't this be "the best of both worlds"... ?

Of course you and all the other wise "haha I found one more bug in Ruby"
guys could actually follow the very good advice of reading 'Mastering
Regular Expressions', and stop dumping crap on this list, but hey I know
it's not a perfect world :)

This is just disgusting, to take pride in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It's a little
harder than whining on a list, but hey, you're smart guys, after all
you
"found where Ruby sucks" and others do a great job... and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart... :)

I'd say "Good luck" but you're probably just too smart to need it
anyway...

How often do you want to keep embarrassing yourself here by showing
a total lack of understanding of the actual issue?
Why do you join a discussion with your childish fanboyism when
you actually have no idea what you are talking about?
As I have pointed out, the example is illustrative -- in the general case
it is much harder and sometimes impossible to simply manually correct
a regular expression that shows exponential backtracking behavior.
It *is* in many cases possible, however, to prevent that exponential
backtracking behavior with a nifty optimization of the matching engine
(though it comes at a penalty and does not work for all cases, as has
been pointed out).
Also, the obvious reason for not using perl in the first place is that
Ruby has a lot of very fine things to offer that many prefer over perl.
My experience with Ruby is exactly that -- I love the beauty of the
design and many details about the language, but there are still flaws.
Apart from the one I pointed out here, my main other concern is the lack
of proper Unicode support -- though I hear that this is being worked on
for the next release.

So could I humbly ask you to try to understand these points (and the
issues with complex regular expressions, for that matter) and either
contribute something useful finally to this thread or else go somewhere
else to live out your childishness?
 

Just Another Victim of the Ambient Morality

Alexandru E. Ungur said:
No offence, but who stops you from using Perl :) ?!?

Your claim of not wanting to offend is questionable; see below...

Just use perl when you're too lazy to think about a regex and hope that
the engine will fix the crap you put in there, and use Ruby, when you do
have the time and the interest to actually think. How about that?
Wouldn't this be "the best of both worlds"... ?

Except that, in my personal opinion, writing the rest of the script in
Perl (does this capitalization follow protocol?) is not the better part of
the two worlds. I just suggested that the robustness of Perl's regexp
engine might be beneficial. Indeed, I have never encountered this
pathological case but I'm glad I know of it, now!

Of course you and all the other wise "haha I found one more bug in Ruby"
guys could actually follow the very good advice of reading 'Mastering
Regular Expressions', and stop dumping crap on this list, but hey I know
it's not a perfect world :)

Me and all the others? I suggest you read my posts again because you're
lumping me into a group to which I don't belong. All the happy faces in the
world won't hide your passive-aggressive attitude towards honest suggestions
and criticisms.

This is just disgusting, to take pride in your own stupidity and have
the audacity that others should turn it into wisdom on their time and
money. How about you two guys get together and write that extension you
so badly need and make yourself happy with your own hands? It's a little
harder than whining on a list, but hey, you're smart guys, after all you
"found where Ruby sucks" and others do a great job... and nobody else
from all the people on the list discovered this! Gosh, you must be
really smart... :)

Honestly, who's taking pride in anything?
The idea that not having intimate knowledge of regular expressions is
stupid and audacious is pretentious, arrogant and elitist. Even if you knew
what conditions cause the exponential regex growth, that doesn't mean you
will know how to remove them if you generated the regexes programmatically.
It is useful for the regexp implementation to reduce this for us.
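
As a minimal illustration of what "generated programmatically" can mean
in practice (the pieces array and its contents are invented for this
example, not taken from anyone's post):

pieces = ['\s*', 'foo', '\s*']                 # imagine these came from user input or a config file
generated = Regexp.new("^(#{pieces.join})*$")
# generated is /^(\s*foo\s*)*$/ again -- the same backtracking monster
# from the top of this thread, produced without anyone typing it by hand.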

I'd say "Good luck" but you're probably just too smart to need it
anyway...

Good luck with what? To whom do you think you're speaking? With what
do you think I need luck? What do you think the issue is? In short: what
are you talking about?

To all the other decent readers on the list:
I apologise for my post, I may be breaking the netiquette here, but
this was just too much... :(
I'm a Ruby nuby too, I am not the Grand Master of regular expressions,
but I love Ruby just the way it is, and it really hurts me to see this
crap attitude about Ruby's "limitations" coming from people not able
to understand their own limitations...

Ironically, you speak of people who are "not able to understand their
own limitations" but, perhaps, they do understand their own limitations and
that's why they want a more robust regexp engine to protect themselves from
their own mistakes?

You are far too defensive. Attempts to improve the language are not
attacks on the language. Take a look at this article, titled: How Ruby
Sucks...

http://www.rubyist.net/~matz/slides/rc2003/mgp00003.html

...and you'll never guess who wrote it.

I can tell you're a newbie, too, because most people here actually read
what's posted and respond with reasoned answers. In particular, they are
not defensive about ideas, suggestions, or even criticisms of the language.
How do you think a language improves? How do you think Ruby became as good
as it is, now?


Here's the only insult I am going to throw.
I don't understand how a newbie can be such a fanboy...
 

Kev Jackson

If regexp stuff matters so much in the course of your programming for
a particular case, then using perl is a valid suggestion as it is
rightly seen as pretty much the best regexp engine so far (Oniguruma
[1] has made some waves and may be able to best perl in the end, but
it's not widely available and is an extension now; perhaps for
Ruby2/Rite it will be the built-in regexp engine, who knows or dares
to dream).

So I don't find it unreasonable to say to people who need to do a lot
of regexp work - "use perl", it was practically designed for it. But
most of the time I find using regexps complicates things unnecessarily
[2].

Most of the time Ruby is good enough; if you are generating regexps
automatically, perhaps perl is the better choice as it will prevent
this exponential slowdown for you. To say that Ruby must be 'fixed'
to allow you to write crappy regexps and have the engine take care of
it for you is, in my opinion, not the answer. If you're going to use a
regexp, you should be aware of what you are doing. If for some
reason you generate a load of them automatically, then I think that
there could possibly be a better way of achieving your goals in ruby
(perhaps a DSL instead of treating everything as raw text?). And
finally, Ruby might not be the right language to choose and you'd be
better off throwing your funky generated regexps at perl to handle.

All this may change as soon as Ruby2/Rite appears with Oniguruma, but
for now Ruby isn't the best tool for every possible problem - and if
it's a major worry for you, perhaps lend a hand with Onig... or
indeed write a perl-alike regexp engine extension - competition is
good :)

Kev

[1] http://www.geocities.jp/kosako3/oniguruma/
[2] Some people, when confronted with a problem, think "I know, I'll
use regular expressions." Now they have two problems. -- Jamie Zawinski,
in comp.lang.emacs
 
