re.search much slower than grep on some regular expressions

  • Thread starter Henning_Thornblad

Sebastian \lunar\ Wiesner

Terry Reedy said:
Twenty years ago, there was. Calling an extended re-derived grammar
expression like Perl's a 'regular-expression' is a bit like calling a
Hummer a 'car' -- perhaps to hide its gas-guzzling behavior.

Yet, to many programmers the Hummer is not only just a car, it's _the_ car.
Everything else is just silly, bloody old POSIX crap ;)

(meaning that PCREs are often seen as true regular expressions, and
everything else is referred to as "being somehow restricted")
 

sjdevnull

Apparently, grep and Tcl convert a regex to a finite state machine. ....
But not all PCREs can be converted to a finite state machine ....
Part of the problem is a lack of agreement on what
'regular expression' means. Strictly speaking, PCREs aren't
regular expressions at all, for some values of the term
'regular expression'. See

http://en.wikipedia.org/wiki/Regular_expression

Formally, there's no lack of agreement that I'm aware of. Anyone
formally defining it will use something along the lines of what
Wikipedia specifies (with "Regular expressions in this sense can
express the regular languages, exactly the class of languages accepted
by finite state automata."). Colloquially it's used to mean "any
text-matching scheme that looks vaguely like sed/grep regexes".

PCREs are certainly not "real" regular expressions in the formal sense,
but they are colloquially.
 

Mark Wooding

Sebastian "lunar" Wiesner said:
I just wanted to illustrate that the speed of the given search is
related to the complexity of the engine.

Btw, other PCRE implementations are as slow as Python or "grep -P". I tried
some sample C++ code using pcre++ (a wrapper around libpcre) and saw it
run for just as long.

So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.
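
For illustration, a minimal sketch of that idea for the specific pattern
[^ "=]*/ (plain Python, not mdw's code): one left-to-right pass that only
tracks whether a [^ "=]* run is still alive, and where it started, is
enough to report where a match first completes. The greedy extension that
a backtracking engine would then perform is deliberately left out.

def first_match_end(text):
    # Tiny automaton for [^ "=]*/ : run_start is the earliest position of
    # a still-alive [^ "=]* run (None if none is alive); '/' completes it.
    excluded = ' "='
    run_start = None
    for i, ch in enumerate(text):
        if run_start is None:
            run_start = i            # an attempt may start at every position
        if ch == '/':
            return run_start, i + 1  # '/' completes the earliest pending attempt
        if ch in excluded:
            run_start = None         # this character kills every pending attempt
    return None                      # no match anywhere in the text

print first_match_end('a' * 156000)       # None, returned after one pass
print first_match_end('aaa///bbb///ccc')  # (0, 4); a greedy engine reports (0, 12)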

-- [mdw]
 

Sebastian \lunar\ Wiesner

Mark Wooding said:
So some other implementations are equally poor. I note that Perl itself
handles this case very quickly, as does Edi Weitz's CL-PPCRE library.

I don't know about the latter, but Perl doesn't use any smarter algorithm;
it just relies heavily on memoization to gain speed.

This makes perl perform poorly in other situations, as mentioned in the
paper cited by Mark Dickinson:

# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

In Python, on the other hand, this pattern works fine, at the cost of
performance losses on other patterns.
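
For reference, the Python counterpart of that one-liner (Python 2 syntax,
as used elsewhere in the thread):

import re

# The pattern that crashes the quoted perl one-liner; Python's engine
# completes it (how quickly depends on the input, as noted above).
print re.search(r'^(ab?)*$', 'a' * 100000)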

It'd be interesting to know how CL-PPCRE performs here (I don't know this
library).
Yes, Perl-compatible `regular expressions' are more complicated than
POSIX extended regular expressions; but that doesn't mean that you have
to implement them in a dumb way. Indeed, it stands to reason that
expressions describing truly regular languages can be matched using the
same old algorithms that grep uses.

I completely agree. I'd just assume that combining a finite state
machine for "classic" expressions with backtracking code is terribly
hard to implement. But I'm not an expert in this; probably some
libraries out there already do this. In that case, it'd be time to give
Python's re engine a more sophisticated implementation ;)
 

Terry Reedy

Sebastian said:
I completely agree. I'd just assume that combining a finite state
machine for "classic" expressions with backtracking code is terribly
hard to implement. But I'm not an expert in this; probably some
libraries out there already do this. In that case, it'd be time to give
Python's re engine a more sophisticated implementation ;)

One thing to remember in making comparisons is that Python moved to its
own implementation to support Unicode as well as extended ASCII (bytes
in 3.0). Does grep do that? Has PCRE, which Python once used, been
upgraded to do that?
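
For example, a quick check that a Unicode pattern and subject string go
straight through Python's re (Python 2; the sample string is made up):

import re

# A unicode pattern applied to a unicode string; the match is u'caf\xe9/'.
print repr(re.search(u'[^ "=]*/', u'caf\u00e9/menu').group(0))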
 

Mark Wooding

Sebastian "lunar" Wiesner said:
# perl -e '("a" x 100000) =~ /^(ab?)*$/;'
zsh: segmentation fault perl -e '("a" x 100000) =~ /^(ab?)*$/;'

(Did you really run that as root?)
It'd be interesting to know how CL-PPCRE performs here (I don't know this
library).

Stack overflow condition. :-(

-- [mdw]
 

Marc 'BlackJack' Rintsch

How come you think so?

The '#' is usually the prompt of root accounts, while ordinary users get a '$'.

Ciao,
Marc 'BlackJack' Rintsch
 

Henning Thornblad

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

/Henning Thornblad
 

Kris Kennaway

Paddy said:
Henning_Thornblad said:
What can be the cause of the large difference between re.search and
grep?
grep uses a smarter algorithm ;)


This script takes about 5 min to run on my computer:

#!/usr/bin/env python
import re
row = ""
for a in range(156000):
    row += "a"
print re.search('[^ "=]*/', row)

While doing a simple grep:

grep '[^ "=]*/' input    (input contains 156,000 a's in one row)

doesn't even take a second.
Is this a bug in Python?
You could call this a performance bug, but it's not common enough in real
code to get the necessary brain cycles from the core developers.
So you can either write a patch yourself or use a workaround.

re.search('[^ "=]*/', row) if "/" in row else None

might be good enough.

Peter

It is not a smarter algorithm that is used in grep. Python REs have
more capabilities than grep REs, and those capabilities need a slower,
more complex algorithm.
You could argue that if the costly RE features are not used then maybe
simpler, faster algorithms should be automatically swapped in, but ....

I can and do :)

It's a major problem that regular expression matching in Python has
exponential complexity when polynomial algorithms (for a subset of
regexps, e.g. excluding back-references) are well-known.

It rules out using Python for entire classes of applications where
regexp matching is on the critical path.

Kris
 

John Machin

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N**2) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**3)
process.

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
 

John Machin

When trying to find an alternative way of solving my problem i found
that running this script:

#!/usr/bin/env python

import re

row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)

wouldn't take even a second (The re.search part of course)

This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).
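
For illustration, a rough timing sketch of the two forms (sizes kept small
so the quadratic case finishes quickly; absolute numbers will vary):

import re
import time

# X*Y backtracks: doubling the input roughly quadruples the time.
# YX* gives up after a single linear scan.
for n in (2000, 4000, 8000):
    row = 'a' * n
    for pattern in ('[^ "=]*/', '/[^ "=]*'):
        start = time.time()
        re.search(pattern, row)
        print '%-12r n=%-6d %.3f s' % (pattern, n, time.time() - start)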

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John
 

samwyse

What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *

lexicon = Lexicon([
    (Rep(AnyBut(' "=')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break
 

Kris Kennaway

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

Very interesting! Thanks very much for the pointer.

Kris
 

Kris Kennaway

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input (input contains 156.000 a in
one row)
doesn't even take a second.

Is this a bug in python?

You might want to look at Plex.
http://www.cosc.canterbury.ac.nz/greg.ewing/python/Plex/

"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "

I haven't tested this, but I think it would do what you want:

from Plex import *

lexicon = Lexicon([
    (Rep(AnyBut(' "=')) + Str('/'), TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Kris

 

Henning Thornblad

When trying to find an alternative way of solving my problem i found
that running this script:
#!/usr/bin/env python
import re
row = ""
for a in range(156000):
    row += "a"
print "How many, dude?"
print re.search('/[^ "=]*', row)    # (the / has moved)
wouldn't take even a second (The re.search part of course)
This is for me very strange since this,
in theory, is the same problem as before.

In theory. In practice, it is the same problem *iff* you start at the
end of the text and work backwards.

Your original pattern can be characterised as X*Y, where X and Y are
character classes; your new one is YX*. You are asking each to search
a text that contains many Xs and no Ys.

The second pattern will cause a naive search engine to examine each
character in the text (is it a Y?) as a candidate for the start of a
successful match; each test fails, and the whole expedition has cost
O(N).

OTOH, the first pattern will start at offset 0, cheerfully cruising
past all those lovely Xs, but failing (no Y) at the end of the text.
It will then think that it's been too greedy, reduce the matched span
of X* from N characters to N-1, fail to find Y, N-2, fail, ... so
that's O(N) just for the first trial starting from offset 0. Being
naive, it will try again starting from offset 1, 2, 3, ... an O(N**2)
process. If Y were instead something more complicated, it would be
worse than O(N**2).

If you were to tell us what you are actually trying to do, we could
probably help you with a faster method.

BTW, if the text is 'aaa///bbb///ccc', your original search will find
'aaa///bbb///'; if you want it to find 'aaa/', the pattern should be:
'[^ "=/]*/'

Cheers,
John

What we had was a regex that searched text files, row by row,
for an absolute path starting/ending with '"', '=' or a blank space.

The path has a known signature which is pretty unique, and the regex
looked like this:
[^ ="]*/[^ ="]*signature[^ =]*/

So we didn't really see any performance loss until we got a text file
with some really long rows, one reaching 156,000 characters. We
removed the cause of these long rows, but also decided to tune the
regex to make it more scalable.

I have thought of two possible solutions:
1. Invert [^ ="] so that it becomes [a lot of characters you could
find in a path]
2. Add [ ="] on each side: [ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="]
and just trim the delimiters later
They both perform well, but I can't figure out which one is less ugly
(see the sketch below).
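
A rough sketch of the two options (the positive character class in option 1
and the sample row are made up for illustration; 'signature' stands in for
the real marker):

import re

row = 'foo bar /opt/thing/signature/data/ baz'

# Option 1: spell out the characters a path may contain instead of negating.
opt1 = re.compile(r'[A-Za-z0-9_./-]*/[A-Za-z0-9_./-]*signature[A-Za-z0-9_./-]*/')

# Option 2: require a delimiter on each side and strip it off afterwards.
opt2 = re.compile(r'[ ="][^ ="]*/[^ ="]*signature[^ =]*/[ ="]')

m1 = opt1.search(row)
m2 = opt2.search(row)
print m1.group(0) if m1 else None               # /opt/thing/signature/data/
print m2.group(0).strip(' ="') if m2 else None  # /opt/thing/signature/data/

One caveat: as written, option 2 needs a delimiter character on both sides,
so it would miss a path at the very start or end of a row; whether that
matters depends on the real data.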

Anyway, I'm glad I asked the question. I have learned more about
regexes than I ever thought I'd need :)

Many Thanks
Henning Thornblad
 

John Machin

samwyse said:
What can be the cause of the large difference between re.search and
grep?
While doing a simple grep:
grep '[^ "=]*/' input                  (input contains 156.000 a in
one row)
doesn't even take a second.
Is this a bug in python?
"Another advantage of Plex is that it compiles all of the regular
expressions into a single DFA. Once that's done, the input can be
processed in a time proportional to the number of characters to be
scanned, and independent of the number or complexity of the regular
expressions. Python's existing regular expression matchers do not have
this property. "
I haven't tested this, but I think it would do what you want:
from Plex import *
lexicon = Lexicon([
    (Rep(AnyBut(' "='))+Str('/'),  TEXT),
    (AnyBut('\n'), IGNORE),
])
filename = "my_file.txt"
f = open(filename, "r")
scanner = Scanner(lexicon, f, filename)
while 1:
    token = scanner.read()
    print token
    if token[0] is None:
        break

Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute).  If the matching was fast then I could possibly pickle the
lexer though (but it's not).

Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re? How large is "large"?
What kind of text?

Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on CiteSeer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's
home page.

Cheers,
John
 

Kris Kennaway

John said:
Can you give us some examples of the kinds of patterns that you are
using in practice and are slow using Python re?

Trivial stuff like:

(Str('error in pkg_delete'), ('mtree', 'mtree')),
(Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
(Str('list of extra files and directories'), ('mtree', 'mtree')),
(Str('list of files present before this port was installed'), ('mtree', 'mtree')),
(Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

(re('Configuration .* not supported'), ('arch', 'arch')),

(re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'),
 ('configure_error', 'configure')),
....

There are about 150 of them and I want to find which is the first match
in a text file that ranges from a few KB up to 512MB in size.

> How large is "large"? What kind of text?

It's compiler/build output.
Instead of grep, you might like to try nrgrep ... google("nrgrep
Navarro Raffinot"): PDF paper about it on CiteSeer (if it's up),
postscript paper and C source findable from Gonzalo Navarro's
home page.

Thanks, looks interesting but I don't think it is the best fit here. I
would like to avoid spawning hundreds of processes to process each file
(since I have tens of thousands of them to process).
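
For what it's worth, here is a rough in-process sketch (rule strings and
tags are made up, standing in for the ~150 real ones) of folding many
literal and regex rules into a single compiled alternation, so each line
of build output is scanned once. Python's backtracking engine still tries
each alternative, so this mostly removes per-rule call overhead rather
than changing the complexity:

import re

# (pattern, tag) pairs in the spirit of the rules quoted above.  None of
# these patterns may contain capturing groups of their own, otherwise the
# lastgroup trick below picks up the wrong group.
rules = [
    (re.escape('error in pkg_delete'), 'mtree'),
    (re.escape('list of extra files and directories'), 'mtree'),
    ('Configuration .* not supported', 'arch'),
    ('configure: error:|Script.*configure.*failed unexpectedly', 'configure_error'),
]

# One named group per rule, joined into a single alternation.
combined = re.compile('|'.join('(?P<r%d>%s)' % (i, pat)
                               for i, (pat, tag) in enumerate(rules)))

def first_match(path):
    for line in open(path):
        m = combined.search(line)
        if m:
            return rules[int(m.lastgroup[1:])][1], line   # (tag, matching line)
    return None

# print first_match('some-build-log.txt')   # hypothetical log file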

Kris
 

Jeroen Ruigrok van der Werven

-On [20080709 14:08] Kris Kennaway said:
It's compiler/build output.

Sounds like the FreeBSD ports build cluster. :)

Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.

Also, a while ago I did tests with various GCC compilers and their effect on
Python running time, as well as Intel's cc. Intel won on (nearly) all
counts, meaning it was faster overall.
 

Kris Kennaway

Jeroen said:
-On [20080709 14:08] said:
It's compiler/build output.

Sounds like the FreeBSD ports build cluster. :)

Yes indeed!
Kris, have you tried a PGO build of Python with your specific usage? I
cannot guarantee it will significantly speed things up though.

I am pretty sure the problem is algorithmic, not bad byte code :) If it
were a matter of a few % then that would be in the scope of compiler
tweaks, but we're talking orders of magnitude.

Kris
 
