samwyse wrote:
Hmm, unfortunately it's still orders of magnitude slower than grep in my
own application that involves matching lots of strings and regexps
against large files (I killed it after 400 seconds, compared to 1.5 for
grep), and that's leaving aside the much longer compilation time (over a
minute). If the matching were fast I could possibly pickle the lexer,
but it's not.
That's funny, the compilation is almost instantaneous for me.
However, I just tested it on several files, the first containing
4875*'a' and each of the rest twice the size of the previous. And you're
right: for each doubling of the file size, the match takes four times
as long, meaning O(n^2). 156000*'a' would probably take about seven hours.
Here are my results:
compile_lexicon() took 0.0236021580595 secs
test('file-0.txt') took 24.8322969831 secs
test('file-1.txt') took 99.3956799681 secs
test('file-2.txt') took 398.349623132 secs
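(Sanity-checking the quadratic fit myself from the numbers above: the
constant t/n^2 stays essentially flat across the three files, and
extrapolating it to 156000 chars gives roughly seven hours.)

sizes = [4875, 9750, 19500]            # chars in file-0 .. file-2
times = [24.8, 99.4, 398.3]            # seconds, from the runs above
for n, t in zip(sizes, times):
    print n, t / float(n)**2           # ~1.05e-06 every time => O(n^2)
k = times[-1] / float(sizes[-1])**2
print 'predicted for 156000 chars:', k * 156000**2 / 3600, 'hours'  # ~7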
And here's my (probably over-engineered) testbed:
from __future__ import with_statement
from os.path import exists
from timeit import Timer
from Plex import *
filename = "file-%d.txt"
def create_files(n):
    # Create file-0 .. file-(n-1); each file doubles the size of the
    # previous one, starting at 4875 'a' characters.
    for x in range(0, n):
        fname = filename % x
        if not exists(fname):
            print 'creating', fname
            with open(fname, 'w') as f:
                print >>f, (4875 * 2**x) * 'a',

def compile_lexicon():
    global lexicon
    lexicon = Lexicon([
        # A run of chars other than space, '"' or '=', ending in '/'.
        (Rep(AnyBut(' "=')) + Str('/'), TEXT),
        # Any other single char (except newline) is skipped.
        (AnyBut('\n'), IGNORE),
    ])

def test(fname):
    # Scan the whole file, discarding the tokens.
    with open(fname, 'r') as f:
        scanner = Scanner(lexicon, f, fname)
        while 1:
            token = scanner.read()
            #print token
            if token[0] is None:    # end of input
                break

def my_timed_test(func_name, *args):
    # Time a single call of func_name(*args) via timeit.
    stmt = func_name + '(' + ','.join(map(repr, args)) + ')'
    t = Timer(stmt, "from __main__ import " + func_name)
    print stmt, 'took', t.timeit(1), 'secs'

if __name__ == '__main__':
    create_files(6)
    my_timed_test('compile_lexicon')
    for x in range(0, 4):
        my_timed_test('test', filename % x)
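For comparison, here's a rough re-based version of the same two rules
(my sketch, not part of the thread: re picks the leftmost alternative
rather than Plex's longest match, so it's only an approximation), which
should scan in roughly linear time:

import re

# Rule 1: a run of chars other than space, '"' or '=' ending in '/' -> TEXT
# Rule 2: any other single non-newline char -> ignored
token_re = re.compile(r'([^ "=]*/)|[^\n]')

def test_re(fname):
    data = open(fname, 'r').read()
    for m in token_re.finditer(data):
        if m.group(1) is not None:
            pass    # m.group(1) is a TEXT token; everything else is ignored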
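And on the pickling idea from the quoted post: something like this
might at least salvage the one-minute compile, *if* a compiled Lexicon
pickles cleanly, which I haven't verified (load_lexicon and
'lexicon.pkl' are just my placeholders; assumes the same Plex imports
as the testbed):

import cPickle as pickle

def load_lexicon(cache='lexicon.pkl'):
    # Reuse a previously compiled lexicon when a cache file exists;
    # otherwise compile once and cache it for next time.
    try:
        return pickle.load(open(cache, 'rb'))
    except (IOError, pickle.UnpicklingError):
        lex = Lexicon([
            (Rep(AnyBut(' "=')) + Str('/'), TEXT),
            (AnyBut('\n'), IGNORE),
        ])
        pickle.dump(lex, open(cache, 'wb'), -1)
        return lex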