Slow regular-expression engine

w_a_x_man · Jul 31, 2009

This Awk code takes less than one hundredth of a second to run:

BEGIN {
    regex = \
        "o?o?o?o?o?o?o?o?o?o?o?o?o?o?" \
        "o?o?o?o?o?o?o?o?o?o?o?o?o?" \
        "ooooooooooooooooooooooooooooo"

    print "ooooooooooooooooooooooooooooo" ~ regex
}

This Ruby code takes 27.5 seconds:

t = Time.now

regex = Regexp.new(
  "o?o?o?o?o?o?o?o?o?o?o?o?o?o?" +
  "o?o?o?o?o?o?o?o?o?o?o?o?o?" +
  "ooooooooooooooooooooooooooooo")

p "ooooooooooooooooooooooooooooo" =~ regex

puts "#{Time.now - t} seconds"

See http://swtch.com/~rsc/regexp/regexp1.html
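To see how the running time grows, the pattern can be built for several sizes and timed. This is only a rough sketch: the helper names pathological_regex and time_match are made up for illustration, and it uses the same number n of optional and literal o's as the linked article rather than the exact counts in the code above. Timings depend heavily on the Ruby version and the machine, so the script prints them without promising any particular value.

```ruby
require "benchmark"

# Hypothetical helper: n optional o's followed by n literal o's.
def pathological_regex(n)
  Regexp.new("o?" * n + "o" * n)
end

# Hypothetical helper: returns [match position, elapsed seconds]
# for matching a string of n o's against the pattern of size n.
def time_match(n)
  regex = pathological_regex(n)
  subject = "o" * n
  result = nil
  seconds = Benchmark.realtime { result = subject =~ regex }
  [result, seconds]
end

[5, 10, 15, 20].each do |n|
  result, seconds = time_match(n)
  puts format("n=%2d  match at %p  %.6f s", n, result, seconds)
end
```

On a backtracking engine the time is expected to grow roughly exponentially with n, so raise n carefully.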

dominikho · Jul 31, 2009

In Ruby 1.9 it takes about a quarter of the time (still slow).

But one might ask whether this is a problem at all: the Regexp is
awful to look at and could be written in a much better way, which
takes 0.000139965 seconds.

t = Time.now
regex = /o{0,27}ooooooooooooooooooooooooooooo/
p "ooooooooooooooooooooooooooooo" =~ regex
puts "#{ Time.now - t } seconds"
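A quick way to check that the bounded form accepts the same strings as the chain of o? terms is to compare both, anchored, over a range of lengths. The sketch below uses smaller counts so that it finishes quickly on any engine; the helper names original_form, bounded_form and disagreeing_lengths are invented for this example. It also folds the literal run into the bound, on the assumption that o{0,a} followed by b literal o's is the same as o{b,a+b}.

```ruby
# Hypothetical helper: the chained form, anchored to the whole string.
def original_form(optional, required)
  Regexp.new("\\A(?:" + "o?" * optional + "o" * required + ")\\z")
end

# Hypothetical helper: the same language written as one bounded repeat.
def bounded_form(optional, required)
  Regexp.new("\\Ao{#{required},#{optional + required}}\\z")
end

# Returns the lengths (0..max_len) for which the two forms disagree.
def disagreeing_lengths(optional, required, max_len)
  slow = original_form(optional, required)
  fast = bounded_form(optional, required)
  (0..max_len).reject do |len|
    subject = "o" * len
    (subject =~ slow).nil? == (subject =~ fast).nil?
  end
end

puts "lengths where the forms disagree: " \
     "#{disagreeing_lengths(10, 12, 30).inspect}"
```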

w_a_x_man · Jul 31, 2009

Quoting the article:

... it is possible to write so-called "pathological" regular
expressions that Perl matches very very slowly. In contrast,
there are no regular expressions that are pathological for
the Thompson NFA implementation. Seeing the two graphs side
by side prompts the question, "why doesn't Perl use the
Thompson NFA approach?" It can, it should ...
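The idea behind the approach the article describes is to track every pattern position the match could be in at once, instead of trying one path and backtracking. The following is a minimal sketch of that idea, not the article's code and not how any real engine is written: it only understands literal characters each optionally followed by "?", it matches the whole string rather than searching, and the names parse_simple, closure and nfa_match? are made up here.

```ruby
require "set"

# Parse a tiny pattern language: literal characters, each optionally
# followed by "?". Returns an array of [char, optional] pairs.
def parse_simple(pattern)
  pattern.scan(/(.)(\??)/).map { |char, q| [char, q == "?"] }
end

# State i means "about to match item i". If item i is optional,
# state i + 1 is reachable without consuming any input.
def closure(states, items)
  result = Set.new
  states.each do |i|
    until result.include?(i)
      result << i
      break unless i < items.length && items[i][1]
      i += 1
    end
  end
  result
end

# True if the whole text matches the pattern. Each input character
# is examined once against the current set of states.
def nfa_match?(pattern, text)
  items = parse_simple(pattern)
  current = closure([0], items)
  text.each_char do |c|
    advanced = current.select { |i| i < items.length && items[i][0] == c }
                      .map { |i| i + 1 }
    current = closure(advanced, items)
    return false if current.empty?
  end
  current.include?(items.length)
end

pattern = "o?" * 27 + "o" * 29
puts nfa_match?(pattern, "o" * 29)
```

Because the set of states can never be larger than the pattern, the work per input character is bounded by the pattern length, which is why this style of matcher has no exponential case.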

Ben Bleything · Jul 31, 2009

So what? I'm sure Ruby core would be happy to consider a patch.

Ben

Robert Dober · Aug 1, 2009

You should prepare a mail template, inspired by this historic example:

--------------------- 8< -----------------------
Dear reader,

we have received your proof of Fermat's Last Theorem. We regret to
inform you that there is an error in line ###.

--------------------- <8 ----------------------

Cheers
Robert

--
module Kernel
  alias_method :λ, :lambda
end

Robert Klemme · Aug 1, 2009

I am not sure what exactly your point is. The awk you have been
using might use a DFA implementation (sed does so, AFAIK).
Nowadays most regular expression engines are NFA-based, for various
reasons (see "Mastering Regular Expressions").

NFAs can usually be tricked into bad runtime performance with
expressions like the one you wrote. I do not consider this a
disadvantage, because on the other hand you can optimize your expression
when working with an NFA, for example by deliberately choosing the order
of subexpressions. And the expression really is pathological, i.e. you
would not use something like that in practice.
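As a small illustration of why the order of subexpressions matters in a backtracking engine: alternatives are tried from left to right, and the first one that lets the overall match succeed is taken. The constant names and the sample string below are invented for the example; it shows only the ordering behaviour, not a measured speed-up.

```ruby
# Alternatives are tried left to right, so the order is visible
# in which text ends up being matched.
SUBJECT     = "foobar"
SHORT_FIRST = /foo|foobar/
LONG_FIRST  = /foobar|foo/

puts SUBJECT[SHORT_FIRST]  # prints "foo"
puts SUBJECT[LONG_FIRST]   # prints "foobar"
```

Putting the alternative that is most likely to succeed first means the engine usually has less to try before it finds a match.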

Btw, why don't you declare the regular expression directly, i.e.

regex = %r(
o?o?o?o?o?o?o?o?o?o?o?o?o?o?
o?o?o?o?o?o?o?o?o?o?o?o?o?
ooooooooooooooooooooooooooooo
)x

?

Kind regards

robert
