re.search much slower than grep on some regular expressions

  • Thread starter Henning_Thornblad

Henning_Thornblad

What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad
 

Bruno Desthuilliers

Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string objects?
 

Bruno Desthuilliers

Bruno Desthuilliers wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string
objects?

Mmm... That aside, after testing it (building the string in a
somewhat more efficient way), the call to re.search indeed takes
ages to return. Please forget my previous post.
 

Peter Otten

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter
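Peter's workaround can be sketched in modern Python 3 (the thread's scripts are Python 2, so print is a function and the string is built differently here):

```python
import re

PATTERN = re.compile('[^ "=]*/')

def guarded_search(s):
    # A cheap substring test first: the regex can only match if a '/'
    # is present, so skip the expensive backtracking search otherwise.
    return PATTERN.search(s) if "/" in s else None

row = "a" * 156000                 # the pathological input from the thread
print(guarded_search(row))         # None, returned almost instantly
print(guarded_search(row + "/"))   # a match, since '/' is now present
```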
 

Paddy

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python REs have
more capabilities than grep REs, and those extra capabilities need a
slower, more complex algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

- Paddy.
 

Filipe Fernandes

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is), but why would you say this particular
regex isn't common enough in real code?

filipe
 

Carl Banks

Henning_Thornblad wrote:
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I don't think it's possible to avoid quadratic behavior in regexps in
general, but clearly there are ways to optimize in some cases.

I'm guessing that grep builds a table of locations of individual
characters as it scans and, when the regexp exhausts the input, it
tries to backtrack to the last slash it saw, except there wasn't one
so it knows the regexp cannot be satisfied and it exits early.
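Carl's n-squared estimate can be checked with a small sketch (input sizes kept modest so it finishes quickly; the exact timings are machine-dependent):

```python
import re
import time

# On an all-'a' input with no '/', the engine tries every start
# position and scans to the end of the string from each one, so
# doubling the input length should roughly quadruple the running time.
def search_time(n):
    s = "a" * n
    t0 = time.perf_counter()
    assert re.search('[^ "=]*/', s) is None   # the pattern never matches
    return time.perf_counter() - t0

for n in (5000, 10000, 20000):
    print(n, search_time(n))
```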

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is),

I'm pretty sure it isn't: the regexp documentation makes no claims
about performance that I'm aware of.

but why would you say this particular
regex isn't common enough in real code?

When re.search regexps start with things like [^...]* or .*, the
excluded characters are typically found frequently in the
input. For example, the pattern .*hello.* could be used to find a
line with hello in it, with the expectation that there are lots of
newlines. But if there aren't any newlines the regexp wouldn't be
very useful.



Carl Banks
 

Peter Pearson

Henning_Thornblad wrote:
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row) [snip]
Is this a bug in python?

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I did this:

$ python -m timeit -s "import re" "re.search( '[^13]*x', 900*'a' )"
100 loops, best of 3: 16.7 msec per loop

for values of the repeat count (900 above) ranging from 300 to 1000,
and the time taken per loop was indeed quadratic.
 

John Nagle

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.

John Nagle
 

Peter Otten

John said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.

Now that's premature optimization :)

Apart from the fact that re.search() is executed only once in the above
script, the re library uses a caching scheme, so that even if the re.search()
call were in a loop the overhead would be a few microseconds for the cache
lookup.

Peter
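Peter's point about the cache can be sketched directly (the caching is a CPython implementation detail of the re module, not a documented guarantee):

```python
import re

# re.search(pattern, s) compiles the pattern once and then reuses the
# compiled object from an internal cache on later calls, so moving
# re.compile out of a loop saves only the cache lookup, a matter of
# microseconds, not a recompilation.
compiled = re.compile(r"\d+")
for s in ("a1", "b22", "c333"):
    # module-level call (a cache hit) and precompiled call agree
    assert re.search(r"\d+", s).group() == compiled.search(s).group()
print("cache behaviour demo ok")
```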
 

Peter Otten

Paddy said:
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

So you're saying the Python algo is alternatively gifted...

Peter
 

Sebastian \lunar\ Wiesner

Paddy said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command
takes roughly as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.
 

Carl Banks

Paddy <[email protected]>:


Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

I'm not sure of the specifics and maybe that is the case, but it could
also be a case of a different codebase which is optimized differently.


Carl Banks
 

Sebastian \lunar\ Wiesner

Carl Banks said:
Paddy <[email protected]>:


Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look-aheads or non-greedy
quantifiers.

I just wanted to illustrate that the speed of the given search is somehow
related to the complexity of the engine.

Btw, other PCRE implementations are as slow as Python or "grep -P". I tried
sample C++ code using pcre++ (a wrapper around libpcre) and saw it
running equally long.
 

Carl Banks

Sebastian \lunar\ Wiesner <[email protected]>:


Paddy <[email protected]>:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:
# grep -P '[^ "=]*/' input
-P tells grep to use real perl-compatible regular expressions.
This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look-aheads or non-greedy
quantifiers.

I just wanted to illustrate that the speed of the given search is somehow
related to the complexity of the engine.

I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)


Carl Banks
 

bearophileHUGS

Paddy:
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

Many Python installations contain a Tcl interpreter. Tcl REs may be
better than Python ones, so it could be interesting to benchmark the
same RE with Tcl, to see how much time it needs. If someone here knows
Tcl, it may take just a few lines of Tcl code (used through
tkinter, by direct call).

Bye,
bearophile
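bearophile's suggestion can be sketched via the Tcl interpreter that ships with tkinter (assuming tkinter is available; Tcl's automaton-based regexp engine should return quickly even on the pathological input):

```python
import tkinter

tcl = tkinter.Tcl()  # a bare Tcl interpreter; no Tk window is created
tcl.eval("set row [string repeat a 156000]")
# Tcl's [regexp] command returns 1 on a match, 0 otherwise.
matched = tcl.eval('regexp {[^ "=]*/} $row')
print(matched)  # "0": no '/' in the input, and the answer comes back fast
```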
 

Mark Dickinson

I don't think you've illustrated that at all.  What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't.  It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)

It seems like an appropriate moment to point out *this* paper:

http://swtch.com/~rsc/regexp/regexp1.html

Apparently, grep and Tcl convert a regex to a finite state machine.
Matching is then *very* fast: essentially linear time in the
length of the string being matched, even in the worst case.
Though it is possible for the size of the finite state machine
to grow exponentially with the size of the regex.

But not all PCREs can be converted to a finite state machine, so
Perl, Python, etc. use a backtracking approach, which has
exponential running time in the worst case. In particular,
it's not possible to use a finite state machine to represent
a regular expression that contains backreferences.
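As an illustrative sketch (not from the thread): a backreference pattern that matches a doubled word, something no fixed finite automaton can express:

```python
import re

# A backreference (\1) matches whatever text group 1 captured at run
# time, so the language accepted depends on runtime captures; no fixed
# finite state machine can do this, which is why engines supporting it
# fall back to backtracking.
doubled_word = re.compile(r"\b(\w+) \1\b")

print(doubled_word.search("the the cat"))  # matches "the the"
print(doubled_word.search("the cat sat"))  # None: no doubled word
```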

Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression

Mark
 

Terry Reedy

Mark said:
On Jul 5, 1:54 pm, Carl Banks <[email protected]> wrote:
Part of the problem is a lack of agreement on what
'regular expression' means.

Twenty years ago, there was. Calling an extended re-derived grammar
expression like Perl's a 'regular expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
 
