re.search much slower than grep on some regular expressions

  • Thread starter Henning_Thornblad

Henning_Thornblad

What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad
 

Bruno Desthuilliers

Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string objects?
 

Bruno Desthuilliers

Bruno Desthuilliers wrote:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Please re-read your Python code carefully. Don't you think there's a
subtle difference between reading a file and building 156000 string
objects?

Mmm... That aside, after testing it (building the string in a
somewhat more efficient way), the call to re.search indeed takes
ages to return. Please forget my previous post.
 

Peter Otten

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter
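Peter's workaround can be sketched in modern Python 3 (the thread's scripts are Python 2, so print is a function and the string is built differently here):

```python
import re

PATTERN = re.compile('[^ "=]*/')

def guarded_search(s):
    # A cheap substring test first: the regex can only match if a '/'
    # is present, so skip the expensive backtracking search otherwise.
    return PATTERN.search(s) if "/" in s else None

row = "a" * 156000                 # the pathological input from the thread
print(guarded_search(row))         # None, returned almost instantly
print(guarded_search(row + "/"))   # a match, since '/' is now present
```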
 

Paddy

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python REs have
more capabilities than grep REs, and those extra capabilities need a
slower, more complex algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

- Paddy.
 

Filipe Fernandes

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is), but why would you say this particular
regex isn't common enough in real code?

filipe
 

Carl Banks

Henning_Thornblad wrote:
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.

Wow... I'm rather surprised at how slow this is... using re.match
yields much quicker results, but of course it's not quite the same as
re.search

Incidentally, if you add the '/' to "row" at the end of the string,
re.search returns instantly with a match object.

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I don't think it's possible to avoid quadratic behavior in regexps in
general, but clearly there are ways to optimize in some cases.

I'm guessing that grep builds a table of locations of individual
characters as it scans and, when the regexp exhausts the input, it
tries to backtrack to the last slash it saw, except there wasn't one
so it knows the regexp cannot be satisfied and it exits early.
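Carl's n-squared estimate can be checked with a small sketch (input sizes kept modest so it finishes quickly; the exact timings are machine-dependent):

```python
import re
import time

# On an all-'a' input with no '/', the engine tries every start
# position and scans to the end of the string from each one, so
# doubling the input length should roughly quadruple the running time.
def search_time(n):
    s = "a" * n
    t0 = time.perf_counter()
    assert re.search('[^ "=]*/', s) is None   # the pattern never matches
    return time.perf_counter() - t0

for n in (5000, 10000, 20000):
    print(n, search_time(n))
```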

@ Peter
I'm not versed enough in regex to tell if this is a bug or not
(although I suspect it is),

I'm pretty sure it isn't: the regexp documentation makes no claims
about performance that I'm aware of.

but why would you say this particular
regex isn't common enough in real code?

When re.search regexps start with things like [^...]* or .*, the
excluded characters are typically found frequently in the
input. For example, the pattern .*hello.* could be used to find a
line with hello in it, with the expectation that there are lots of
newlines. But if there aren't any newlines the regexp wouldn't be
very useful.



Carl Banks
 

Peter Pearson

Henning_Thornblad wrote:
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row) [snip]
Is this a bug in python?

This behavior is showing that you're getting n-squared performance;
the regexp seems to be checking 156000*(156000-1)/2 substrings for a
match.

I did this:

$ python -m timeit -s "import re" "re.search( '[^13]*x', 900*'a' )"
100 loops, best of 3: 16.7 msec per loop

for values of the repeat count (900 above) ranging from 300 to 1000,
and the time taken per loop was indeed quadratic.
 

John Nagle

Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.

John Nagle
 

Peter Otten

John said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re

row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)


While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

Thanks...
Henning Thornblad

You're recompiling the regular expression on each use.
Use "re.compile" before the loop to do it once.

Now that's premature optimization :)

Apart from the fact that re.search() is executed only once in the above
script, the re library uses a caching scheme, so that even if the re.search()
call were in a loop the overhead would be a few microseconds for the cache
lookup.

Peter
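Peter's point about the cache can be sketched directly (the caching is a CPython implementation detail of the re module, not a documented guarantee):

```python
import re

# re.search(pattern, s) compiles the pattern once and then reuses the
# compiled object from an internal cache on later calls, so moving
# re.compile out of a loop saves only the cache lookup, a matter of
# microseconds, not a recompilation.
compiled = re.compile(r"\d+")
for s in ("a1", "b22", "c333"):
    # module-level call (a cache hit) and precompiled call agree
    assert re.search(r"\d+", s).group() == compiled.search(s).group()
print("cache behaviour demo ok")
```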
 

Peter Otten

Paddy said:
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

So you're saying the Python algo is alternatively gifted...

Peter
 

Sebastian \lunar\ Wiesner

Paddy said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?

grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?

You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command
takes roughly as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.
 

Carl Banks

Paddy <[email protected]>:


Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command roughly
takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

I'm not sure of the specifics and maybe that is the case, but it could
also be a case of a different codebase which is optimized differently.


Carl Banks
 

Sebastian \lunar\ Wiesner

Carl Banks said:
Paddy <[email protected]>:


Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.

It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.

FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:

# grep -P '[^ "=]*/' input

-P tells grep to use real perl-compatible regular expressions.

This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look-aheads or non-greedy
quantifiers.

I just wanted to illustrate that the speed of the given search is somehow
related to the complexity of the engine.

Btw, other PCRE implementations are as slow as Python or "grep -P". I tried
sample C++ code using pcre++ (a wrapper around libpcre) and saw it
running equally long.
 

Carl Banks

Sebastian \lunar\ Wiesner <[email protected]>:


Paddy <[email protected]>:
Henning_Thornblad wrote:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)
This script takes about 5 min to run on my computer:
#!/usr/bin/env python
import re
row=""
for a in range(156000):
    row+="a"
print re.search('[^ "=]*/',row)
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
You could call this a performance bug, but it's not common enough in
real code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.
re.search('[^ "=]*/', row) if "/" in row else None
might be good enough.
Peter
It is not a smarter algorithm that is used in grep. Python RE's have
more capabilities than grep RE's which need a slower, more complex
algorithm.
FWIW, grep itself can confirm this statement. The following command
roughly takes as long as Python's re.search:
# grep -P '[^ "=]*/' input
-P tells grep to use real perl-compatible regular expressions.
This confirms that a particular engine might not be optimized for it,
but it's not necessarily a reflection that the engine is more complex.

My posting wasn't intended to reflect the differences in complexity between
normal GNU grep expressions (which are basically extended POSIX
expressions) and perl-compatible expressions. The latter just _are_ more
complex, having additional features like look-aheads or non-greedy
quantifiers.

I just wanted to illustrate that the speed of the given search is somehow
related to the complexity of the engine.

I don't think you've illustrated that at all. What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't. It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)


Carl Banks
 

bearophileHUGS

Paddy:
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in but ....

Many Python installations contain a Tcl interpreter. Tcl REs may be
better than Python ones, so it could be interesting to benchmark the
same RE with Tcl, to see how much time it needs. If someone here knows
Tcl, it may take just a few lines of Tcl code (used through
tkinter, by direct call).

Bye,
bearophile
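bearophile's suggestion can be sketched via the Tcl interpreter that ships with tkinter (assuming tkinter is available; Tcl's automaton-based regexp engine should return quickly even on the pathological input):

```python
import tkinter

tcl = tkinter.Tcl()  # a bare Tcl interpreter; no Tk window is created
tcl.eval("set row [string repeat a 156000]")
# Tcl's [regexp] command returns 1 on a match, 0 otherwise.
matched = tcl.eval('regexp {[^ "=]*/} $row')
print(matched)  # "0": no '/' in the input, and the answer comes back fast
```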
 

Mark Dickinson

I don't think you've illustrated that at all.  What you've illustrated
is that one implementation of regexp optimizes something that another
doesn't.  It might be due to differences in complexity; it might not.
(Maybe there's something about PCREs that precludes the optimization
that the default grep uses, but I'd be inclined to think not.)

It seems like an appropriate moment to point out *this* paper:

http://swtch.com/~rsc/regexp/regexp1.html

Apparently, grep and Tcl convert a regex to a finite state machine.
Matching is then *very* fast: essentially linear time in the
length of the string being matched, even in the worst case.
Though it is possible for the size of the finite state machine
to grow exponentially with the size of the regex.

But not all PCREs can be converted to a finite state machine, so
Perl, Python, etc. use a backtracking approach, which has
exponential running time in the worst case. In particular,
it's not possible to use a finite state machine to represent
a regular expression that contains backreferences.
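As an illustrative sketch (not from the thread): a backreference pattern that matches a doubled word, something no fixed finite automaton can express:

```python
import re

# A backreference (\1) matches whatever text group 1 captured at run
# time, so the language accepted depends on runtime captures; no fixed
# finite state machine can do this, which is why engines supporting it
# fall back to backtracking.
doubled_word = re.compile(r"\b(\w+) \1\b")

print(doubled_word.search("the the cat"))  # matches "the the"
print(doubled_word.search("the cat sat"))  # None: no doubled word
```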

Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression

Mark
 

Terry Reedy

Mark said:
On Jul 5, 1:54 pm, Carl Banks <[email protected]> wrote:
Part of the problem is a lack of agreement on what
'regular expression' means.

Twenty years ago, there was. Calling an extended re-derived grammar
expression like Perl's a 'regular expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.
 
