Roy Smith
I've got a 170 MB file I want to search for lines that look like:
[2010-10-20 16:47:50.339229 -04:00] INFO (6): songza.amie.history - ENQUEUEING: /listen/the-station-one
This code runs in 1.3 seconds:
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    m = pattern.search(line)
    if m:
        count += 1
print count
------------------------------
If I add a pre-filter before the regex, it runs in 0.78 seconds (about
twice the speed!)
------------------------------
import re
pattern = re.compile(r'ENQUEUEING: /listen/(.*)')
count = 0
for line in open('error.log'):
    if 'ENQ' not in line:
        continue
    m = pattern.search(line)
    if m:
        count += 1
print count
------------------------------
Every line that contains 'ENQ' also matches the full regex (61425
lines match, out of 2.1 million total). I don't understand why the
first way is so much slower.
Once the regex is compiled, you should have a state-machine pattern
matcher. It should take O(n) in the length of the input to figure out
that a line doesn't match, just as far as "ENQ" does, and that's
exactly how long "if 'ENQ' not in line" should take to run as well.
Why does doing twice the work run at twice the speed?
I'm running Python 2.7.3 on Ubuntu Precise, x86_64.
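For anyone who wants to reproduce the comparison without the original
log, here is a self-contained sketch that times both loops over
synthetic lines with the timeit module (the line contents below are
made up to mimic the log format; 'error.log' itself isn't needed):

```python
import re
import timeit

# Synthetic stand-in for error.log: mostly non-matching lines, plus a
# few hundred that contain the ENQUEUEING pattern (proportions are an
# assumption, loosely mirroring 61425 matches out of 2.1 million).
lines = ['[2010-10-20 16:47:50] INFO (6): noise line %d' % i
         for i in range(10000)]
lines += ['[2010-10-20 16:47:50] INFO (6): songza.amie.history - '
          'ENQUEUEING: /listen/the-station-one'] * 300

pattern = re.compile(r'ENQUEUEING: /listen/(.*)')

def regex_only():
    # Run the regex against every line.
    return sum(1 for line in lines if pattern.search(line))

def prefiltered():
    # Cheap substring test first; regex only on candidate lines.
    return sum(1 for line in lines
               if 'ENQ' in line and pattern.search(line))

# Both approaches must agree on the count.
assert regex_only() == prefiltered() == 300

print(timeit.timeit(regex_only, number=10))
print(timeit.timeit(prefiltered, number=10))
```

The ratio between the two printed timings will depend on line lengths
and the match rate, so treat it as illustrative rather than a
faithful reproduction of the 1.3 s vs 0.78 s numbers above.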