re.search much slower than grep on some regular expressions

  • Thread starter Henning_Thornblad

samwyse

Kris Kennaway wrote:
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching were fast then I could possibly pickle the
lexer (but it's not).

That's funny, the compilation is almost instantaneous for me.
However, I just tested it against several files, the first containing
4875*'a', each of the rest twice the size of the previous. And you're
right: for each doubling of the file size, the match takes four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs
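
A quick sanity check of that extrapolation, as back-of-the-envelope
Python (illustrative arithmetic only, using the timings above):

# file-0 (4875 chars) took ~24.8 secs; 156000 chars is 2**5 times
# bigger, so quadratic scaling predicts a factor of 4**5.
predicted = 24.8 * 4**5     # about 25400 secs
print predicted / 3600.0    # just over 7 hours -- roughly the "8 hours"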

And here's my (probably over-engineered) testbed:

from __future__ import with_statement
from os.path import exists
from timeit import Timer

from Plex import *

filename = "file-%d.txt"

def create_files(n):
    # Create test files of doubling size: file-0.txt holds 4875 'a's,
    # file-1.txt twice that, and so on.
    for x in range(0, n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875 * 2**x) * 'a',

def compile_lexicon():
    global lexicon
    # On a file of nothing but 'a's, the first rule gobbles input while
    # looking for a '/' that never comes, then fails and falls back to
    # the one-character rule; restarting that longest-match attempt at
    # every position is the likely source of the O(n^2) behaviour.
    lexicon = Lexicon([
        (Rep(AnyBut(' "=')) + Str('/'), TEXT),
        (AnyBut('\n'), IGNORE),
    ])

def test(fname):
    # Scan the whole file, reading and discarding tokens until EOF.
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:
                break

def my_timed_test(func_name, *args):
    # Time a single call of func_name(*args) and report it.
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import " + func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0, 4):
        my_timed_test('test', filename % x)
 

Kris Kennaway

samwyse said:
That's funny, the compilation is almost instantaneous for me.

My lexicon was quite a bit bigger, containing about 150 strings and regexps.

samwyse said:
However, I just tested it against several files, the first containing
4875*'a', each of the rest twice the size of the previous. And you're
right: for each doubling of the file size, the match takes four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

The docs say it is supposed to be linear in the file size ;-) ;-(

Kris
 

John Machin

Kris Kennaway said:
Trivial stuff like:

(Str('error in pkg_delete'), ('mtree', 'mtree')),
(Str('filesystem was touched prior to .make install'), ('mtree', 'mtree')),
(Str('list of extra files and directories'), ('mtree', 'mtree')),
(Str('list of files present before this port was installed'), ('mtree', 'mtree')),
(Str('list of filesystem changes from before and after'), ('mtree', 'mtree')),

(re('Configuration .* not supported'), ('arch', 'arch')),

(re('(configure: error:|Script.*configure.*failed unexpectedly|script.*failed: here are the contents of)'),
 ('configure_error', 'configure')),
...

There are about 150 of them and I want to find which is the first match
in a text file that ranges from a few KB up to 512MB in size.



It's compiler/build output.


Kris Kennaway said:
Thanks, looks interesting but I don't think it is the best fit here. I
would like to avoid spawning hundreds of processes to process each file
(since I have tens of thousands of them to process).

Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.
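
As a minimal sketch of that two-pass idea -- assuming the pyahocorasick
package (https://pypi.org/project/pyahocorasick/) as a stand-in, since
the module linked above has a different API, and reusing a couple of the
patterns quoted earlier:

import re
import ahocorasick

# Plain Str() cases: the literal is the whole pattern.
LITERALS = [
    ('error in pkg_delete', ('mtree', 'mtree')),
    ('list of extra files and directories', ('mtree', 'mtree')),
]

# re() cases, split by hand at the first metacharacter: find the
# literal prefix with Aho-Corasick, confirm the tail with re.
PREFIXED = [
    ('Configuration ', re.compile(r'.* not supported'), ('arch', 'arch')),
]

automaton = ahocorasick.Automaton()
for lit, action in LITERALS:
    automaton.add_word(lit, (None, action))
for prefix, tail, action in PREFIXED:
    automaton.add_word(prefix, (tail, action))
automaton.make_automaton()

def classify(line):
    # iter() yields (index of last matched char, stored value) for
    # every keyword occurring in the line.
    for end, (tail, action) in automaton.iter(line):
        if tail is None or tail.match(line, end + 1):
            return action
    return None

The automaton pass is linear in the line length no matter how many
keywords there are; re only runs on lines that already contain a known
literal prefix.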

Cheers,
John
 

Kris Kennaway

John said:
Uh-huh ... try this, then:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

You could use this to find the "Str" cases and the prefixes of the
"re" cases (which seem to be no more complicated than 'foo.*bar.*zot')
and use something slower like Python's re to search the remainder of
the line for 'bar.*zot'.

If it were just strings, then sure... with regexps it might be possible to
make it work, but it doesn't sound particularly maintainable. I will
stick with my shell script until Python gets a regexp engine of
equivalent performance.

Kris
 

J. Cliff Dyer

samwyse said:
That's funny, the compilation is almost instantaneous for me.
However, I just tested it against several files, the first containing
4875*'a', each of the rest twice the size of the previous. And you're
right: for each doubling of the file size, the match takes four times
as long, meaning O(n^2). 156000*'a' would probably take 8 hours.
Here are my results:

compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs

Sounds like a good strategy would be to find the smallest chunk of the
file that a match can't cross, and iterate your search over units of
those chunks. For example, if none of your regexes cross line
boundaries, search each line of the file individually. That may help
avoid the quadratic slowdown you're seeing.
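
For illustration, a line-at-a-time version of that (a sketch; the
function name and arguments are made up):

import re

def first_hit(path, pattern):
    # Search line by line: any pathological rescanning by the regexp
    # engine is then bounded by the longest line, not the whole file.
    rx = re.compile(pattern)
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            m = rx.search(line)
            if m:
                return lineno, m.group()
    return None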

Cheers,
Cliff
 

Sebastian \lunar\ Wiesner

Marc 'BlackJack' Rintsch said:
The '#' is usually the prompt of root accounts while ordinary users get a
'$'.

Oh ... I just used "#" as a general placeholder, because I didn't want to
copy my whole two-line prompt ;) I can assure you that I did not try this
as root ;) and will use "$" in the future ;)
 

Kris Kennaway

J. Cliff Dyer said:
Sounds like a good strategy would be to find the smallest chunk of the
file that a match can't cross, and iterate your search over units of
those chunks. For example, if none of your regexes cross line
boundaries, search each line of the file individually. That may help
avoid the quadratic slowdown you're seeing.

That's what I'm doing. I've also tried various other things like
mmapping the file and searching it all at once, etc., but almost all of
the time is spent in the regexp engine, so optimizing other things gives
only marginal improvement.

Kris
 
