Slow regular expressions

S3 · Jul 24, 2006

--Boundary_(ID_wcHSDvAB6xTg9VN04MWj4Q)
Content-type: text/plain; charset=us-ascii
Content-transfer-encoding: 7BIT

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I am trying to write a simple script to parse
an Apache log file, but it is taking an extremely
long time. I used profile and the problem appears
to be with the regular expression matcher.

I have made a simple script to run it with different
lengths and it appears that the regular expression
matcher is being very slow. See the attached script.

Here are some timings:
~>time ./mklong.rb 1000
real 0m11.930s
~>time ./mklong.rb 2000
real 0m50.400s
~>time ./mklong.rb 3000
real 1m55.693s
~>time ./mklong.rb 4000
real 3m16.004s

So, dividing it out,
1000/11 = 91
2000/50 = 40
3000/120 = 25
4000/200 = 20

So, the matching appears to be much slower than O(n).
Isn't the whole point of regular expressions to be
fast and O(n)?
Whenever my script encounters a long string,
it grinds to a halt.

Why is this?
Did I make the regular expression correctly?
Is there some way to optimize it?
Is there a problem with the matcher?

Any help would be appreciated.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2-ecc0.1.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFExRiSxzVgPqtIcfsRAmJGAJ9FmvUTT7Q0692yIVvexWoSvg8FDQCdGBgZ
5R2ieCXoflMUgiwYVCQuMaI=
=d8Zp
-----END PGP SIGNATURE-----

--Boundary_(ID_wcHSDvAB6xTg9VN04MWj4Q)
Content-type: text/plain; name=mklong.rb
Content-transfer-encoding: 7BIT
Content-disposition: inline; filename=mklong.rb

#!/usr/bin/ruby

str='67.39.177.137 - - [05/Jun/2004:12:54:44 -0500] "SEARCH '
str<<'\xb1\x02'*(ARGV[0].to_i)
str<<'" 414 326 "-" "-"'

logformat=/(\S+)\s+(\S+)\s+(.+)\s+\[([^\]]+)\]\s+"(\S+) +([^"]+) +[A-Za-z\/]*([0-9.]+)"\s+(\S+)\s+(\S+)\s+"([^"]+)"\s+"([^"]+)"/

str.match(logformat)

--Boundary_(ID_wcHSDvAB6xTg9VN04MWj4Q)--

Justin Collins · Jul 24, 2006

S3 said:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I am trying to write a simple script to parse
an Apache log file, but it is taking an extremely
long time. I used profile and the problem appears
to be with the regular expression matcher.

I have made a simple script to run it with different
lengths and it appears that the regular expression
matcher is being very slow. See the attached script.

Here are some timings:
~>time ./mklong.rb 1000
real 0m11.930s
~>time ./mklong.rb 2000
real 0m50.400s
~>time ./mklong.rb 3000
real 1m55.693s
~>time ./mklong.rb 4000
real 3m16.004s

So, dividing it out,
1000/11 = 91
2000/50 = 40
3000/120 = 25
4000/200 = 20

So, the matching appears to be much slower than O(n).
Isn't the whole point of regular expressions to be
fast and O(n)?
Whenever my script encounters a long string,
it grinds to a halt.

Why is this?
Did I make the regular expression correctly?
Is there some way to optimize it?
Is there a problem with the matcher?

Any help would be appreciated.
<snip code>

I'm not a regex expert, so I can't help you with optimizing yours, but
if your pattern causes the regex engine to backtrack repeatedly, then it
will not match in linear time. This usually becomes apparent on long
strings where there can be lots of backtracking.

Sorry I can't offer more help than that.

-Justin

William James · Jul 24, 2006

S3 said:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I am trying to write a simple script to parse
an Apache log file, but it is taking an extremely
long time. I used profile and the problem appears
to be with the regular expression matcher.

I have made a simple script to run it with different
lengths and it appears that the regular expression
matcher is being very slow. See the attached script.

Here are some timings:
~>time ./mklong.rb 1000
real 0m11.930s
~>time ./mklong.rb 2000
real 0m50.400s
~>time ./mklong.rb 3000
real 1m55.693s
~>time ./mklong.rb 4000
real 3m16.004s

So, dividing it out,
1000/11 = 91
2000/50 = 40
3000/120 = 25
4000/200 = 20

So, the matching appears to be much slower than O(n).
Isn't the whole point of regular expressions to be
fast and O(n)?
Whenever my script encounters a long string,
it grinds to a halt.

Why is this?
Did I make the regular expression correctly?
Is there some way to optimize it?
Is there a problem with the matcher?

Any help would be appreciated.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2-ecc0.1.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFExRiSxzVgPqtIcfsRAmJGAJ9FmvUTT7Q0692yIVvexWoSvg8FDQCdGBgZ
5R2ieCXoflMUgiwYVCQuMaI=
=d8Zp
-----END PGP SIGNATURE-----

--Boundary_(ID_wcHSDvAB6xTg9VN04MWj4Q)
Content-Type: text/plain
Content-Transfer-Encoding: 7BIT
Content-Disposition: inline;
filename="mklong.rb"
X-Google-AttachSize: 288

#!/usr/bin/ruby

str='67.39.177.137 - - [05/Jun/2004:12:54:44 -0500] "SEARCH '
str<<'\xb1\x02'*(ARGV[0].to_i)
str<<'" 414 326 "-" "-"'

logformat=/(\S+)\s+(\S+)\s+(.+)\s+\[([^\]]+)\]\s+"(\S+) +([^"]+) +[A-Za-z\/]*([0-9.]+)"\s+(\S+)\s+(\S+)\s+"([^"]+)"\s+"([^"]+)"/

str.match(logformat)

--Boundary_(ID_wcHSDvAB6xTg9VN04MWj4Q)--

Your regular expression does not match your string.

Try this.

logformat =
%r{
\A
(\S+) \s+ (\S+) \s+ (\S+) \s+
\[ ( [^\]]+ ) \] \s+
" (\S+) [ ]+ ( [^"]+ ) "
\s+ ( \d+ ) \s+ ( \d+ ) \s+
" ([^"]+) " \s+ " ([^"]+) "
\Z
}x

Keith Gaughan · Jul 24, 2006

So, the matching appears to be much slower than O(n).
Isn't the whole point of regular expressions to be
fast and O(n)?

No. The point behind regular expressions is to provide a succinct
mini-language for matching text that can be expressed using a Type-3
grammar[1] without the pain of having to write the code to implement the
necessary FSM. Speed is a secondary consideration, and down to the
person implementing the regex engine and the person writing the regex.
Especially the person writing the regex.

Whenever my script encounters a long string,
it grinds to a halt.

Why is this?

You're making the matcher work harder than it needs to by not giving it
enough information about what it has to match. The bigger the file is,
the more it has to backtrack to match each line. See my note on anchors
below.

The regex engine will do as much as it can match the

Did I make the regular expression correctly?
Is there some way to optimize it?
Is there a problem with the matcher?

No, the matcher is fine.

<snip>

There's a few problems I can see straight off:

1. No anchors: without them, your regex ends up doing an awful lot more
work because it will end up attempting to match the pattern in the
middle of lines rather than only matching from the start of lines.
By putting '^' at the start of your regex to match the beginning of
the line, you'll see a big increase in performance. That's probably
your single biggest problem.

It's also possibly worth matching the end of the line with '$'.

2. To quote Jamie Zawinski:

"Some people, when confronted with a problem, think I know, I'll
use regular expressions. Now they have two problems."

If you don't need a regex, don't use them. In this case, try
reading the file in line by line (or slurp it all into memory if
it's not too big), and break the line up along spaces using the
split() method.

3. Get Jeffrey Freidl's Mastering Regular Expression[2]. You'll thank me
later.

K.

[1] http://en.wikipedia.com/wiki/Chomsky_hierarchy
[2] http://regex.info/

Keith Gaughan · Jul 24, 2006

logformat=/(\S+)\s+(\S+)\s+(.+)\s+\[([^\]]+)\]\s+"(\S+) +([^"]+) +[A-Za-z\/]*([0-9.]+)"\s+(\S+)\s+(\S+)\s+"([^"]+)"\s+"([^"]+)"/

A few other things:

1. What's with the all the '\s+'? All you need to do is match a space,
so the regex becomes...

rx = /^(\S+) (\S+) (\S+) \[([^\]])\] "(GET|POST|HEAD) (\S+) HTTP\/\d\.\d" \d{3} \d+/

As you can see, I've simplified a few other things, such as including
anything that's constant, minimising all the places where you're
matching whitespace to simple spaces. And I added the anchor at the
beginning to match the start of the line, which is *very* important.

K.

Eric Hodel · Jul 28, 2006

Is there some way to optimize it?

#!/usr/bin/ruby

str='67.39.177.137 - - [05/Jun/2004:12:54:44 -0500] "SEARCH '
str<<'\xb1\x02'*(ARGV[0].to_i)
str<<'" 414 326 "-" "-"'

logformat=/(\S+)\s+(\S+)\s+(.+)\s+\[([^\]]+)\]\s+"(\S+) +([^"]+) +
[A-Za-z\/]*([0-9.]+)"\s+(\S+)\s+(\S+)\s+"([^"]+)"\s+"([^"]+)"/

str.match(logformat)

In addition to the other advice you've received, don't use String#match.

=~ with a literal regexp is faster than =~ with anything else, and
match is slower than both.

$ ruby rebm.rb

Rehearsal -----------------------------------------------
empty 0.320000 0.000000 0.320000 ( 0.345024)
=~ literal 1.490000 0.010000 1.500000 ( 1.620720)
=~ variable 1.750000 0.010000 1.760000 ( 1.985988)
match 4.460000 0.040000 4.500000 ( 4.980295)
-------------------------------------- total: 8.080000sec

user system total real
empty 0.260000 0.000000 0.260000 ( 0.280089)
=~ literal 1.490000 0.010000 1.500000 ( 1.786133)
=~ variable 1.760000 0.020000 1.780000 ( 3.459860)
match 4.550000 0.050000 4.600000 ( 7.792777)
$ cat rebm.rb
require 'benchmark'

N = 1_000_000

string = 'a b c d e'

Benchmark.bmbm do |bm|

bm.report 'empty' do
N.times do end
end

bm.report '=~ literal' do
N.times do string =~ /a (.) c\s. e/ end
end

bm.report '=~ variable' do
re = /a (.) c\s. e/
N.times do string =~ re end
end

bm.report 'match' do
re = /a (.) c\s. e/
N.times do string.match re end
end

end

Jeremy Henty · Jul 29, 2006

=~ with a literal regexp is faster than =~ with anything else, and
match is slower than both.

[snip]
=~ variable 1.750000 0.010000 1.760000 ( 1.985988)
match 4.460000 0.040000 4.500000 ( 4.980295)
[snip]
=~ variable 1.760000 0.020000 1.780000 ( 3.459860)
match 4.550000 0.050000 4.600000 ( 7.792777)

Yow! Calling String#match is 2-3 times as slow! Why is that? The
overhead of calling the method?

Regards,

Jeremy Henty

Caleb Clausen · Jul 29, 2006

Yow! Calling String#match is 2-3 times as slow! Why is that? The
overhead of calling the method?

I was curious about this too.
It looks like it's the effort of computing a MatchData that takes all
the extra time. This snippet is just as slow as using #match:

bm.report '=~ literal + $~' do
N.times do string =~ /a (.) c\s. e/ ; $~ end
end

Eric Hodel · Jul 30, 2006

I was curious about this too.
It looks like it's the effort of computing a MatchData that takes all
the extra time. This snippet is just as slow as using #match:

bm.report '=~ literal + $~' do
N.times do string =~ /a (.) c\s. e/ ; $~ end
end

Yes, it is the work of constructing the MatchData object.

Slow regular expressions :(	28	Jul 27, 2006
The power of regular expressions without regular expressions.	0	Jul 17, 2013
Special chars in regular expressions - Problems	1	Oct 6, 2007
FAQ 6.17 How do I efficiently match many regular expressions at once?	0	Apr 28, 2011
FAQ 6.12 Can I use Perl regular expressions to match balanced text?	0	Jan 9, 2011
Regular Expressions	7	Jan 31, 2005
Help me understand use of regular expressions to validate data	5	May 29, 2006
Moving files with ruby...	4	Aug 24, 2006

Slow regular expressions

S3

Justin Collins

William James

Keith Gaughan

Keith Gaughan

Eric Hodel

Jeremy Henty

Caleb Clausen

Eric Hodel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads