Why is regex so slow?


Roy Smith

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
 

Skip Montanaro

I don't understand why the first way is so much slower.

I have no obvious answers, but a couple suggestions:

1. Can you anchor the pattern at the beginning of the line? (use
match() instead of search())
2. Does it get faster if you eliminate the "(.*)" part of the pattern?
It seems that if you find a line matching the first part of the
pattern, you could just as easily split the line yourself instead of
creating a group. (A rough sketch of both variants follows below.)
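
A rough sketch of what the two suggestions might look like, reusing the
file name and pattern from the original script (not code from the
thread, just an illustration):

import re

# Suggestion 1: an anchored pattern usable with match(); it has to cover
# the leading timestamp prefix, since match() only looks at the start of
# the string.
anchored = re.compile(r'\[.*ENQUEUEING: /listen/')

# Suggestion 2: drop the capturing group and slice the line yourself.
prefix = re.compile(r'ENQUEUEING: /listen/')

count = 0
for line in open('error.log'):
    m = prefix.search(line)                # or: anchored.match(line)
    if m:
        station = line[m.end():].rstrip()  # what the (.*) group would have captured
        count += 1
print(count)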

Skip
 

Roy Smith

I have no obvious answers, but a couple suggestions:

1. Can you anchor the pattern at the beginning of the line? (use
match() instead of search())

That's one of the things we tried. Didn't make any difference.
2. Does it get faster if you eliminate the "(.*)" part of the pattern?

Just tried that, it also didn't make any difference.
It seems that if you find a line matching the first part of the
pattern, you could just as easily split the line yourself instead of
creating a group.


At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).
 

Chris Angelico

I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

Yeah, I'd try that against 3.3 before opening a performance bug. Also,
it's entirely possible that performance is majorly different in 3.x
anyway, on account of strings being Unicode. Definitely merits another
look imho.

ChrisA
 

MRAB

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0

for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1

print count
------------------------------

Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
I'd be interested in how the 'regex' module
(http://pypi.python.org/pypi/regex) compares. :)
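
As a sketch of how one might try that comparison (assuming the module is
installed from PyPI; the pattern and file name are taken from the earlier
scripts), regex aims to be a drop-in replacement for re:

import regex    # third-party module: http://pypi.python.org/pypi/regex

# V0 asks for re-compatible behaviour; V1 enables the newer semantics.
pattern = regex.compile(r'ENQUEUEING: /listen/(.*)', flags=regex.V0)

count = 0
for line in open('error.log'):
    if pattern.search(line):
        count += 1
print(count)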
 

Mark Lawrence

That's one of the things we tried. Didn't make any difference.


Just tried that, it also didn't make any difference.



At this point, I'm not so much interested in making this faster as understanding why it's so slow. I'm tempted to open this up as a performance bug against the regex module (which I assume will be rejected, at least for the 2.x series).

Out of curiosity, have you tried the new regex module from pypi rather
than the stdlib version? A heck of a lot of work has gone into it, see
http://bugs.python.org/issue2636

--
"Steve is going for the pink ball - and for those of you who are
watching in black and white, the pink is next to the green." Snooker
commentator 'Whispering' Ted Lowe.

Mark Lawrence
 

Johannes Bauer

Yeah, I'd try that against 3.3 before opening a performance bug. Also,
it's entirely possible that performance is majorly different in 3.x
anyway, on account of strings being Unicode. Definitely merits another
look imho.

Hmmm, at least Python 3.2 seems to have the same issue. I generated test
data with:

#!/usr/bin/python3
import random
random.seed(0)
f = open("error.log", "w")
for i in range(1500000):
    q = random.randint(0, 99)
    if q == 0:
        print("ENQUEUEING: /listen/ fhsduifhsd uifhuisd hfuisd hfuihds iufhsd", file = f)
    else:
        print("fiosdjfoi sdmfio sdmfio msdiof msdoif msdoimf oisd mfoisdm f", file = f)

Resulting file has a size of 91530018 and md5 of
2d20c3447a0b51a37d28126b8348f6c5 (just to make sure we're on the same
page because I'm not sure the PRNG is stable across Python versions).

Testing with:

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    # if 'ENQ' not in line:
    #     continue
    m = pattern.search(line)
    if m:
        count += 1
print(count)

The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

Regards,
Johannes

--
At least not publicly!
Ah, the newest and to this day most ingenious trick of our great
cosmologists: the Secret Prediction.
 - Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 

Rick Johnson

I've got a 170 MB file I want to search for lines that look like:
[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
This code runs in 1.3 seconds:
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1
print count

Is the power of regexps required to solve such a simplistic problem? I believe string methods should suffice.

py> line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one"
py> idx = line.find('ENQ')
py> if idx != -1:        # find() returns -1 on a miss, so don't test "> 0"
...     match = line[idx:]
...
py> match
'ENQUEUEING: /listen/the-station-one'
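
A sketch of how the whole counting loop might look with string methods
only, pulling out what the regex group would have captured (the prefix
string is assumed from the log format above):

PREFIX = 'ENQUEUEING: /listen/'

count = 0
for line in open('error.log'):
    idx = line.find(PREFIX)
    if idx != -1:
        station = line[idx + len(PREFIX):].rstrip()   # e.g. 'the-station-one'
        count += 1
print(count)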
 

Roy Smith

Mark Lawrence said:
Out of curiosity, have you tried the new regex module from pypi rather
than the stdlib version? A heck of a lot of work has gone into it, see
http://bugs.python.org/issue2636

I just installed that and gave it a shot. It's *slower* (and shows much
higher variation from run to run). I'm too exhausted fighting with
OpenOffice to get this into some sane spreadsheet format, so here's
the raw timings:

Built-in re module:
0:01.32
0:01.33
0:01.32
0:01.33
0:01.35
0:01.32
0:01.35
0:01.36
0:01.33
0:01.32

regex with flags=V0:
0:01.66
0:01.53
0:01.51
0:01.47
0:01.81
0:01.58
0:01.78
0:01.57
0:01.64
0:01.60

regex with flags=V1:
0:01.53
0:01.57
0:01.65
0:01.61
0:01.83
0:01.82
0:01.59
0:01.60
0:01.55
0:01.82
 

MRAB

I just installed that and gave it a shot. It's *slower* (and shows much
higher variation from run to run). I'm too exhausted fighting with
OpenOffice to get this into some sane spreadsheet format, so here's
the raw timings:

Built-in re module:
0:01.32
0:01.33
0:01.32
0:01.33
0:01.35
0:01.32
0:01.35
0:01.36
0:01.33
0:01.32

regex with flags=V0:
0:01.66
0:01.53
0:01.51
0:01.47
0:01.81
0:01.58
0:01.78
0:01.57
0:01.64
0:01.60

regex with flags=V1:
0:01.53
0:01.57
0:01.65
0:01.61
0:01.83
0:01.82
0:01.59
0:01.60
0:01.55
0:01.82
I reckon that about 1/3 of that time is spent in
PyArg_ParseTupleAndKeywords, just getting the arguments!

There's a higher initial overhead in using regex than string methods,
so working just a line at a time will take longer.
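
One way to amortise that per-call overhead is to hand the engine the whole
file instead of one line at a time. A sketch (not from this post, and
assuming the 170 MB log fits comfortably in memory):

import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

# One read() and one findall() instead of ~2.1 million search() calls, so
# the per-call argument-parsing overhead is paid once.  (.*) stops at each
# newline, so the per-line results are unchanged.
with open('error.log') as f:
    stations = pattern.findall(f.read())
print(len(stations))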
 

André Malo

* Johannes Bauer said:
The pre-check version is about 42% faster in my case (0.75 sec vs. 1.3
sec). Curious. This is Python 3.2.3 on Linux x86_64.

A lot of time is spent on dict lookups in your inner loop (1500000
iterations). Timings below are from my box, Python 3.2.3.

#!/usr/bin/python3
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    #if 'ENQ' not in line:
    #    continue
    m = pattern.search(line)
    if m:
        count += 1
print(count)

runs ~ 1.39 s

replacing some dict lookups with index lookups:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = pattern.search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.15s

and further:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        #if 'ENQ' not in line:
        #    continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 1.08 s

and for reference:

#!/usr/bin/python3
def main():
    import re
    pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
    search = pattern.search
    count = 0
    for line in open('error.log'):
        if 'ENQ' not in line:
            continue
        m = search(line)
        if m:
            count += 1
    print(count)
main()

runs ~ 0.71 s

The final difference is probably just the difference between a hardcoded
string search and a generic NFA.

nd
 

Antoine Pitrou

Roy Smith said:
Every line which contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster. That's like saying "I don't understand why my CPU is slower
than my GPU at calculating 3D structures".

That said, there may be ways to improve the regex engine to deal with such
special cases in a speedier way. But there will still be some overhead related
to the fact that you are invoking a powerful generic engine rather than a
lean and mean specialized routine.

(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)
Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ". And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well.

You should read up again on O(...) notation. It's an asymptotic complexity;
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which is always twice as fast as the
other (say, one taking 2n steps and one taking 4n).

Regards

Antoine.
 

André Malo

* André Malo said:
A lot of time is spent on dict lookups in your inner loop (1500000
iterations). Timings below are from my box, Python 3.2.3.

[...]

Missed one basic timing BTW:

#!/usr/bin/python3
def main():
    for line in open('error.log'):
        pass
main()

runs ~ 0.53 s

nd
 

Roy Smith

One invokes a fast special-purpose substring searching routine (the
str.__contains__ operator), the other a generic matching engine able to
process complex patterns. It's hardly a surprise for the specialized routine
to be faster.

Except that the complexity in regexes is compiling the pattern down to an FSM. Once you've got the FSM built, the inner loop should be pretty quick. In C, the inner loop for executing an FSM should be something like:

for (char *p = input; *p; ++p) {
    next_state = current_state[(unsigned char) *p];
    if (next_state == MATCH) {
        break;
    }
    /* (a real DFA would also advance current_state here) */
}

which should compile down to a couple of machine instructions which run entirely in the instruction pipeline cache. But I'm probably simplifying it more than I should :)
(to be fair, on CPython there's also the fact that operators are faster
than method calls, so some overhead is added by that too)

I've been doing some experimenting, and I'm inclined to believe this is indeed a significant part of it. I also took some ideas from André Malo and factored out some name lookups from the inner loop. That bumped me another 10% in speed.
 

Grant Edwards

Roy Smith <roy <at> panix.com> writes:

You should read up again on O(...) notation. It's an asymptotic complexity;
it tells you nothing about the exact function values at different data points.
So you can have two O(n) routines, one of which is always twice as fast as the
other.

And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).

All the O() tells you is the general shape of the line. It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0). It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm. [Though that becomes a lot less
likely as n gets large.]
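
A toy illustration of that point, with made-up cost constants, showing an
O(n^2) routine beating an O(log n) one over a small range of n:

import math

def quadratic(n):             # O(n**2), tiny constant factor
    return 0.001 * n * n

def logarithmic(n):           # O(log n), big constant factor
    return 50 * math.log(n, 2)

for n in (10, 100, 1000, 10000):
    winner = "quadratic" if quadratic(n) < logarithmic(n) else "logarithmic"
    print(n, winner)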
 

Terry Reedy

Or one that is a million times as fast.
And you can have two O(n) routines, one of which is twice as fast for
one value of n and the other is twice as fast for a different value of
n (and that's true for any value of 'twice': 2X 10X 100X).

All the O() tells you is the general shape of the line. It doesn't
tell you where the line is or how steep the slope is (except in the
case of O(1), where you do know the slope is 0). It's perfectly
feasible that for the range of values of n that you care about in a
particular application, there's an O(n^2) algorithm that's way faster
than another O(log(n)) algorithm.

In fact, Tim Peters put together two facts to create the current list.sort.
1. O(n*n) binary insert sort is faster than O(n*logn) merge sort, with
both competently coded in C, for n up to about 64. Part of the reason is
that binary insert sort is actually O(n*logn) (n binary searches) +
O(n*n) (n insertions with a shift averaging n/2 items). The multiplier
for the O(n*n) part is much smaller because on modern CPUs, the shift
needed for the insertion is a single machine instruction.
2. O(n*logn) sorts have a lower asymptotic complexity because they
divide the sequence roughly in half about logn times. In other words,
they are 'fast' because they split a list into lots of little pieces. So
Tim's aha moment was to think 'Let's stop splitting when pieces are less
than or equal to 64, rather than splitting all the way down to 1 or 2.'
(A rough sketch of the binary-insert part follows below.)
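
A rough Python sketch of the binary-insert idea from point 1 (illustrative
only; the real implementation is C code in CPython's listobject.c):

from bisect import insort   # binary search for the slot, then insert

def binary_insert_sort(items):
    # O(n*logn) comparisons but O(n*n) element shifts; the shifts are the
    # cheap machine-level moves mentioned above, so this wins for small n.
    result = []
    for x in items:
        insort(result, x)
    return result

print(binary_insert_sort([5, 2, 9, 1, 5, 6]))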
 

Steven D'Aprano

I've got a 170 MB file I want to search for lines that look like:

[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history -
ENQUEUEING: /listen/the-station-one

This code runs in 1.3 seconds:

------------------------------
import re

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1

print count

You say that as if it were surprising. It's not. Regex searches have
higher overhead than bog standard dumb `substring in string` tests, so
reducing the number of high-overhead regex search calls with a low-
overhead `in` test is often a nett win.

Not always, it depends on how many hits/misses you have, and the relative
overhead of each, but as a general rule, I would always try pre-filtering
as the obvious way to optimize a regex.

Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
eleven characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".

[...]
Once the regex is compiled, you should have a state machine pattern
matcher. It should be O(n) in the length of the input to figure out
that it doesn't match as far as "ENQ".

Yes, but O(N) doesn't tell you how long it takes to run, only how it
*scales*.

import time

class MyDumbString(str):
    def __contains__(self, substr):
        time.sleep(1000)
        return super(MyDumbString, self).__contains__(substr)

MyDumbString `in` is also O(N), but it is a wee less efficient than the
built-in version...

Regexes do a lot of work, because they are far more complex than dumb
string __contains__. It should not surprise you that the overhead is
greater, even when matching a plain-ol' substring with no metacharacters.
Especially since Python does not traditionally use regexes for
everything, like Perl does, and so has not spent the effort to complicate
the implementation in order to squeeze every last microsecond out of it.

And that's exactly how long it
should take for "if 'ENQ' not in line" to run as well. Why is doing
twice the work also twice the speed?

It's not doing twice the work. It's doing less work, overall, by spending
a moment to trivially filter out the lines which cannot possibly match,
before spending a few moments to do a more careful match. If the number
of non-matching lines is high, as it is in your data, then the cheaper
pre-filter pays for itself and then some.
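
A quick way to see those relative per-line costs on a line that does not
match (the common case in Roy's data); this is just a sketch and the
numbers will vary by machine:

import re
import timeit

line = "[2010-10-20 16:47:50.339229 -04:00] INFO (6): something - else entirely"
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

print(timeit.timeit(lambda: 'ENQ' in line, number=1000000))         # cheap pre-filter
print(timeit.timeit(lambda: pattern.search(line), number=1000000))  # full regex miss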
 

Dave Angel

On 06/18/2013 09:51 PM, Steven D'Aprano wrote:

Even if the regex engine is just as efficient at doing simple character
matching as `in`, and it probably isn't, your regex tries to match all
eleven characters of "ENQUEUEING" while the `in` test only has to match
three, "ENQ".

The rest of your post was valid, and useful, but there's a misconception
in this paragraph; I hope you don't mind me pointing it out.

In general, for simple substring searches, you can search for a large
string faster than you can search for a smaller one. I'd expect

if "ENQUEUING" in bigbuffer

to be faster than

if "ENQ" in bigbuffer

assuming that all occurrences of ENQ will actually match the whole
thing. If CPython's implementation doesn't show the speed difference,
maybe there's some room for optimization.

See Boyer-Moore if you want a peek at the algorithm.

When I was writing a simple search program, I could typically search
for a 4-character string faster than REP SCASB could match a one
character string. And that's a single instruction (with prefix).
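
For the curious, a minimal Boyer-Moore-Horspool sketch showing why a longer
needle can skip further on each mismatch (simplified; CPython's own substring
search uses a related but more tuned algorithm):

def horspool_find(haystack, needle):
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    # How far we may shift when this character sits under the needle's last
    # position; anything not in the needle allows a full-length skip.
    skip = {c: m - 1 - i for i, c in enumerate(needle[:-1])}
    i = 0
    while i + m <= n:
        if haystack[i:i + m] == needle:
            return i
        i += skip.get(haystack[i + m - 1], m)
    return -1

print(horspool_find("blah " * 5 + "ENQUEUEING: /listen/x", "ENQUEUEING"))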
 

Steven D'Aprano

On 06/18/2013 09:51 PM, Steven D'Aprano wrote:


The rest of your post was valid, and useful, but there's a misconception
in this paragraph; I hope you don't mind me pointing it out.

Of course not, I'm always happy to learn if I'm mistaken.

In general, for simple substring searches, you can search for a large
string faster than you can search for a smaller one. I'd expect

if "ENQUEUING" in bigbuffer

to be faster than

if "ENQ" in bigbuffer

And so it is:


steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQ'" \
-s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
"sub in s"
10000 loops, best of 3: 38.3 usec per loop
steve@runes:~$ python2.7 -m timeit -s "sub = 'ENQUIRING'" \
-s "s = 'blah '*10000 + 'ENQUIRING blah blah blah'" \
"sub in s"
100000 loops, best of 3: 15.4 usec per loop



Thank you.
 
