Why is regex so slow?


Roy Smith

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
 

Skip Montanaro

I don't understand why the first way is so much slower.

I have no obvious answers, but a couple suggestions:

1. Can you anchor the pattern at the beginning of the line? (use
match() instead of search())
2. Does it get faster if you eliminate the "(.*)" part of the pattern?
It seems that if you find a line matching the first part of the
pattern, you could just as easily split the line yourself instead of
creating a group. (A rough sketch of both variants follows below.)
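
A rough sketch of what the two suggestions might look like, reusing the
file name and pattern from the original script (not code from the
thread, just an illustration):

import re

# Suggestion 1: an anchored pattern usable with match(); it has to cover
# the leading timestamp prefix, since match() only looks at the start of
# the string.
anchored = re.compile(r'\[.*ENQUEUEING: /listen/')

# Suggestion 2: drop the capturing group and slice the line yourself.
prefix = re.compile(r'ENQUEUEING: /listen/')

count = 0
for line in open('error.log'):
    m = prefix.search(line)                # or: anchored.match(line)
    if m:
        station = line[m.end():].rstrip()  # what the (.*) group would have captured
        count += 1
print(count)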

Skip
 

Roy Smith

I have no obvious answers, but a couple suggestions:

1. Can you anchor the pattern at the beginning of the line? (use
match() instead of search())

That's one of the things we tried. Didn't make any difference.
2. Does it get faster if you eliminate the "(.*)" part of the pattern?

Just tried that, it also didn't make any difference.
It seems that if you find a line matching the first part of the
pattern, you could just as easily split the line yourself instead of
creating a group.


At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).
 

Chris Angelico

I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

Yeah, I'd try that against 3.3 before opening a performance bug. Also,
it's entirely possible that performance is majorly different in 3.x
anyway, on account of strings being Unicode. Definitely merits another
look imho.

ChrisA
 

MRAB

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
I'd be interested in how the 'regex' module
(http://pypi.python.org/pypi/regex) compares. :)
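
As a sketch of how one might try that comparison (assuming the module is
installed from PyPI; the pattern and file name are taken from the earlier
scripts), regex aims to be a drop-in replacement for re:

import regex    # third-party module: http://pypi.python.org/pypi/regex

# V0 asks for re-compatible behaviour; V1 enables the newer semantics.
pattern = regex.compile(r'ENQUEUEING: /listen/(.*)', flags=regex.V0)

count = 0
for line in open('error.log'):
    if pattern.search(line):
        count += 1
print(count)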
 

Mark Lawrence

That's one of the things we tried. Didn't make any difference.


Just tried that, it also didn't make any difference.



At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

Out of curiosity, have you tried the new regex module from pypi rather
than the stdlib version? A heck of a lot of work has gone into it, see
http://bugs.python.org/issue2636

--
"Steve is going for the pink ball - and for those of you who are
watching in black and white, the pink is next to the green." Snooker
commentator 'Whispering' Ted Lowe.

Mark Lawrence
 

Johannes Bauer

Yeah, I'd try that against 3.3 before opening a performance bug. Also,
it's entirely possible that performance is majorly different in 3.x
anyway, on account of strings being Unicode. Definitely merits another
look imho.

Hmmm, at least Python 3.2 seems to have the same issue. I generated test
data with:

#!/usr/bin/python3
import random
random.seed(0)
f = open("error.log", "w")
for i in range(1500000):
    q = random.randint(0, 99)
    if q == 0:
        print("ENQUEUEING: /listen/ fhsduifhsd uifhuisd hfuisd hfuihds iufhsd", file = f)
    else:
        print("fiosdjfoi sdmfio sdmfio msdiof msdoif msdoimf oisd mfoisdm f", file = f)

Resulting file has a size of 91530018 and md5 of
2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
page because I'm not sure the PRNG is stable across Python versions).

Testing with:

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    # if 'ENQ' not in line:
    #     continue
    m = pattern.search(line)
    if m:
        count += 1
print(count)

The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

Regards,
Johannes

--
At least not publicly!
Ah, the newest and to this day most ingenious trick of our great
cosmologists: the Secret Prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 

Rick Johnson

I've got a 170 MB file I want to search for lines that look like:
[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
This code runs in 1.3 seconds:
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1
print count

Is the power of regexps required to solve such a simplistic problem? I believe string methods should suffice.

py> line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one"
py> idx = line.find('ENQ')
py> if idx != -1:        # find() returns -1 on a miss, so don't test "> 0"
...     match = line[idx:]
...
py> match
'ENQUEUEING: /listen/the-station-one'
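
A sketch of how the whole counting loop might look with string methods
only, pulling out what the regex group would have captured (the prefix
string is assumed from the log format above):

PREFIX = 'ENQUEUEING: /listen/'

count = 0
for line in open('error.log'):
    idx = line.find(PREFIX)
    if idx != -1:
        station = line[idx + len(PREFIX):].rstrip()   # e.g. 'the-station-one'
        count += 1
print(count)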
 

Roy Smith

Mark Lawrence said:
Out of curiosity, have you tried the new regex module from pypi rather
than the stdlib version? A heck of a lot of work has gone into it, see
http://bugs.python.org/issue2636

I just installed that and gave it a shot. It's *slower* (and shows much
higher variation from run to run). I'm too exhausted fighting with
OpenOffice to get this into some sane spreadsheet format, so here's
the raw timings:

Built-in re module:
0:01.32
0:01.33
0:01.32
0:01.33
0:01.35
0:01.32
0:01.35
0:01.36
0:01.33
0:01.32

regex with flags=V0:
0:01.66
0:01.53
0:01.51
0:01.47
0:01.81
0:01.58
0:01.78
0:01.57
0:01.64
0:01.60

regex with flags=V1:
0:01.53
0:01.57
0:01.65
0:01.61
0:01.83
0:01.82
0:01.59
0:01.60
0:01.55
0:01.82
 

MRAB

I just installed that and gave it a shot. It's *slower* (and shows much
higher variation from run to run). I'm too exhausted fighting with
OpenOffice to get this into some sane spreadsheet format, so here's
the raw timings:

Built-in re module:
0:01.32
0:01.33
0:01.32
0:01.33
0:01.35
0:01.32
0:01.35
0:01.36
0:01.33
0:01.32

regex with flags=V0:
0:01.66
0:01.53
0:01.51
0:01.47
0:01.81
0:01.58
0:01.78
0:01.57
0:01.64
0:01.60

regex with flags=V1:
0:01.53
0:01.57
0:01.65
0:01.61
0:01.83
0:01.82
0:01.59
0:01.60
0:01.55
0:01.82
I reckon that about 1/3 of that time is spent in
PyArg_ParseTupleAndKeywords, just getting the arguments!

There's a higher initial overhead in using regex than string methods,
so working just a line at a time will take longer.
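
One way to amortise that per-call overhead is to hand the engine the whole
file instead of one line at a time. A sketch (not from this post, and
assuming the 170 MB log fits comfortably in memory):

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

# One read() and one findall() instead of ~2.1 million search() calls, so
# the per-call argument-parsing overhead is paid once.  (.*) stops at each
# newline, so the per-line results are unchanged.
with open('error.log') as f:
    stations = pattern.findall(f.read())
print(len(stations))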
 

André Malo

* Johannes Bauer said:
The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

A lot of time is spent on dict lookups in your inner loop (1500000
iterations). Timings below are from my box, Python 3.2.3.

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    #if 'ENQ' not in line:
    #    continue
    m = pattern.search(line)
    if m:
        count += 1
print(count)

runs ~ 1.39 s

replacing some dict lookups with index lookups:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = pattern.search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.15s

and further:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.08 s

and for reference:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        if 'ENQ' not in line:
            continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 0.71 s

The final difference is probably just the difference between a hardcoded
string search and a generic NFA.

nd
 

Antoine Pitrou

Roy Smith said:
Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster. That's like saying "I don't understand why my CPU is slower
than my GPU at calculating 3D structures".

That said, there may be ways to improve the regex engine to deal with such
special cases in a speedier way. But there will still be some overhead related
to the fact that you are invoking a powerful generic engine rather than a
lean and mean specialized routine.

(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)
Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well.

You should read up again on O(...) notation. It's an asymptotic complexity;
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which is always twice as fast as the
other (say, one taking 2n steps and one taking 4n).

Regards

Antoine.
 

André Malo

* André Malo said:
A lot of time is spent on dict lookups in your inner loop (1500000
iterations). Timings below are from my box, Python 3.2.3.

[...]

Missed one basic timing BTW:

#!/usr/bin/python3
def main():
    for line in open('error.log'):
        pass
main()

runs ~ 0.53 s

nd
 

Roy Smith

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster.

Except that the complexity in regexes is compiling the pattern down to an FSM. Once you've got the FSM built, the inner loop should be pretty quick. In C, the inner loop for executing an FSM should be something like:

for (char *p = input; *p; ++p) {
    next_state = current_state[(unsigned char) *p];
    if (next_state == MATCH) {
        break;
    }
    /* (a real DFA would also advance current_state here) */
}

which should compile down to a couple of machine instructions which run entirely in the instruction pipeline cache. But I'm probably simplifying it more than I should :)
(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)

I've been doing some experimenting, and I'm inclined to believe this is indeed a significant part of it. I also took some ideas from André Malo and factored out some name lookups from the inner loop. That bumped me another 10% in speed.
 

Grant Edwards

Roy Smith <roy <at> panix.com> writes:

You should read up again on O(...) notation. It's an asymptotic complexity;
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which is always twice as fast as the
other.

And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).

All the O() tells you is the general shape of the line. It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0). It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm. [Though that becomes a lot less
likely as n gets large.]
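
A toy illustration of that point, with made-up cost constants, showing an
O(n^2) routine beating an O(log n) one over a small range of n:

import math

def quadratic(n):             # O(n**2), tiny constant factor
    return 0.001 * n * n

def logarithmic(n):           # O(log n), big constant factor
    return 50 * math.log(n, 2)

for n in (10, 100, 1000, 10000):
    winner = "quadratic" if quadratic(n) < logarithmic(n) else "logarithmic"
    print(n, winner)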
 

Terry Reedy

Or one that is a million times as fast.
And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).

All the O() tells you is the general shape of the line. It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0). It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm.

In fact, Tim Peters put together two facts to create the current list.sort.
1. O(n*n) binary insert sort is faster than O(n*logn) merge sort, with
both competently coded in C, for n up to about 64. Part of the reason is
that binary insert sort is actually O(n*logn) (n binary searches) +
O(n*n) (n insertions with a shift averaging n/2 items). The multiplier
for the O(n*n) part is much smaller because on modern CPUs, the shift
needed for the insertion is a single machine instruction.
2. O(n*logn) sorts have a lower asymptotic complexity because they
divide the sequence roughly in half about logn times. In other words,
they are 'fast' because they split a list into lots of little pieces. So
Tim's aha moment was to think 'Let's stop splitting when pieces are less
than or equal to 64, rather than splitting all the way down to 1 or 2.'
(A rough sketch of the binary-insert part follows below.)
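
A rough Python sketch of the binary-insert idea from point 1 (illustrative
only; the real implementation is C code in CPython's listobject.c):

from bisect import insort   # binary search for the slot, then insert

def binary_insert_sort(items):
    # O(n*logn) comparisons but O(n*n) element shifts; the shifts are the
    # cheap machine-level moves mentioned above, so this wins for small n.
    result = []
    for x in items:
        insort(result, x)
    return result

print(binary_insert_sort([5, 2, 9, 1, 5, 6]))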
 

Steven D'Aprano

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count

You say that as if it were surprising. It's not. Regex searches have
higher overhead than bog standard dumb `substring in string` tests, so
reducing the number of high-overhead regex search calls with a low-
overhead `in` test is often a nett win.

Not always, it depends on how many hits/misses you have, and the relative
overhead of each, but as a general rule, I would always try pre-filtering
as the obvious way to optimize a regex.

Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
eleven characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".

[...]
Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ".

Yes, but O(N) doesn't tell you how long it takes to run, only how it
*scales*.

import time

class MyDumbString(str):
    def __contains__(self, substr):
        time.sleep(1000)
        return super(MyDumbString, self).__contains__(substr)

MyDumbString `in` is also O(N), but it is a wee less efficient than the
built-in version...

Regexes do a lot of work, because they are far more complex than dumb
string __contains__. It should not surprise you that the overhead is
greater, even when matching a plain-ol' substring with no metacharacters.
Especially since Python does not traditionally use regexes for
everything, like Perl does, and so has not spent the effort to complicate
the implementation in order to squeeze every last microsecond out of it.

And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

It's not doing twice the work. It's doing less work, overall, by spending
a moment to trivially filter out the lines which cannot possibly match,
before spending a few moments to do a more careful match. If the number
of non-matching lines is high, as it is in your data, then the cheaper
pre-filter pays for itself and then some.
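
A quick way to see those relative per-line costs on a line that does not
match (the common case in Roy's data); this is just a sketch and the
numbers will vary by machine:

import re
import timeit

line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): something - else entirely"
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

print(timeit.timeit(lambda: 'ENQ' in line, number=1000000))         # cheap pre-filter
print(timeit.timeit(lambda: pattern.search(line), number=1000000))  # full regex miss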
 

Dave Angel

On 06/18/2013 09:51 PM, Steven D'Aprano wrote:

Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
eleven characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".

The rest of your post was valid, and useful, but there's a misconception
in this paragraph; I hope you don't mind me pointing it out.

In general, for simple substring searches, you can search for a large
string faster than you can search for a smaller one. I'd expect

if "ENQUEUING" in bigbuffer

to be faster than

if "ENQ" in bigbuffer

assuming that all occurrences of ENQ will actually match the whole
thing. If CPython's implementation doesn't show the speed difference,
maybe there's some room for optimization.

See Boyer-Moore if you want a peek at the algorithm.

When I was writing a simple search program, I could typically search
for a 4-character string faster than REP SCASB could match a one
character string. And that's a single instruction (with prefix).
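
For the curious, a minimal Boyer-Moore-Horspool sketch showing why a longer
needle can skip further on each mismatch (simplified; CPython's own substring
search uses a related but more tuned algorithm):

def horspool_find(haystack, needle):
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # How far we may shift when this character sits under the needle's last
    # position; anything not in the needle allows a full-length skip.
    skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        i += skip.get(haystack[i + m - 1], m)
    return -1

print(horspool_find("blah " * 5 + "ENQUEUEING: /listen/x", "ENQUEUEING"))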
 

Steven D'Aprano

On 06/18/2013 09:51 PM, Steven D'Aprano wrote:


The rest of your post was valid, and useful, but there's a misconception
in this paragraph; I hope you don't mind me pointing it out.

Of course not, I'm always happy to learn if I'm mistaken.

In general, for simple substring searches, you can search for a large
string faster than you can search for a smaller one. I'd expect

if "ENQUEUING" in bigbuffer

to be faster than

if "ENQ" in bigbuffer

And so it is:


steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQ'" \
-s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
"sub in s"
10000 loops, best of 3: 38.3 usec per loop
steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQUIRING'" \
-s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
"sub in s"
100000 loops, best of 3: 15.4 usec per loop



Thank you.
 
