Fredrik said:
Ron Adam wrote:
that's probably because your benchmark has a lot of dubious overhead:
I think it does what the OP described, but that may not be what he
really needs.
However, the test that was supposed to find the best of n was instead
finding the worst of n, which explains why I was getting a larger
variance than I thought I should have been getting.
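For reference, a minimal sketch of a best-of-n timer, assuming the fix is simply to keep the minimum of the repeated timings rather than the maximum. The names best_of and sample_func are made up for illustration, and time.perf_counter stands in for time.clock so the sketch runs on current Python versions.

```python
import time

def best_of(func, arg, repeats=3):
    """Return the smallest elapsed time over `repeats` calls of func(arg)."""
    best_time = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        elapsed = time.perf_counter() - start
        # min() keeps the fastest run; max() here would report the worst of n
        best_time = min(best_time, elapsed)
    return best_time

def sample_func(text):
    # hypothetical workload, only here so there is something to time
    return len(text.split())

print('%s: %.8f (best of %d)'
      % (sample_func.__name__, best_of(sample_func, 'a b c ' * 1000), 3))
```

Taking the minimum filters out runs that were slowed down by whatever else the machine was doing, which is why it varies less than the maximum.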
word_finder = re.compile('[\w@]+', re.I)
no need to force case-insensitive search here; \w looks for both lower-
and uppercase characters.
But the dictionary keys need to be either all upper or all lower case;
otherwise you count 'The' separately from 'the'.
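A small sketch of the point about keys: re.I changes what the pattern matches, not the text that findall() hands back, so the case of each word survives unless the input is folded first. The sample text and the count helper are made up for illustration.

```python
import re

text = 'The cat saw the other cat'

# re.I affects matching only; the matched text is returned unchanged
word_finder = re.compile(r'[\w@]+', re.I).findall

def count(words):
    # hypothetical helper: tally each word in a plain dict
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

mixed = count(word_finder(text))           # 'The' and 'the' are separate keys
folded = count(word_finder(text.lower()))  # a single 'the' key
print(mixed)
print(folded)
```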
since you're using a case-insensitive RE, that lower() call is not necessary.
and findall() is of course faster than finditer() + m.group().
Cool, I don't use re that often so I just used what was posted to test
against.
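To make the findall() versus finditer() comparison concrete, here is a short sketch showing that the two produce the same words for a pattern without groups; findall() just skips building a match object per word. The sample text is invented for the example.

```python
import re

pattern = re.compile(r'[\w@]+')
text = 'spam@eggs ham, spam & ham'

# finditer() yields match objects, so each word needs a group() call
via_finditer = [m.group(0) for m in pattern.finditer(text)]

# findall() returns the matched strings directly
via_findall = pattern.findall(text)

print(via_finditer == via_findall)
print(via_findall)
```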
and if you want performance, why are you creating a new dictionary for
each line in the sample?
Because that's what the OP apparently wanted: a line-by-line word
count. I originally did it to get the overall count and then changed
it so it matched the re version that was posted.
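A rough sketch of the two readings of the task, one dictionary per line versus one dictionary for the whole sample. The simple_count helper is hypothetical and uses str.split() rather than either of the word finders discussed here, just to keep the example short.

```python
def simple_count(text):
    # hypothetical helper: lower-case, split on whitespace, tally in a dict
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts

sample = 'the cat\nThe dog\nthe end'

# line-by-line: a new dictionary is built for every line
per_line = [simple_count(line) for line in sample.splitlines()]

# overall: a single call and a single dictionary for the whole sample
overall = simple_count(sample)

print(per_line)
print(overall)
```

The per-line form pays the call and dictionary-creation overhead once per line, which is part of why it times slower.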
here's a more optimized RE word finder:
word_finder_2 = re.compile('[\w@]+').findall
def count_words_2(string, word_finder=word_finder_2):
    # avoid global lookups
    countDict = {}
    for word in word_finder(string):
        countDict[word] = countDict.get(word,0) + 1
    return countDict
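A sketch of what the "avoid global lookups" comment refers to, as I understand it: a default argument is evaluated once, when the def statement runs, so inside the function the finder is a local name instead of a global looked up on every call. The names count_global and count_local are made up for the comparison, and no timing is claimed here.

```python
import re

word_finder_2 = re.compile(r'[\w@]+').findall

def count_global(string):
    countDict = {}
    # word_finder_2 is resolved as a global name each time this runs
    for word in word_finder_2(string):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

def count_local(string, word_finder=word_finder_2):
    countDict = {}
    # word_finder is a local, bound once when the def statement executed
    for word in word_finder(string):
        countDict[word] = countDict.get(word, 0) + 1
    return countDict

print(count_global('a b a') == count_local('a b a'))
print(count_local.__defaults__[0] is word_finder_2)
```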
with your original test on a slow machine, I get
count_words: 0.29868684 (best of 3)
count_words_2: 0.17244873 (best of 3)
if I call the function once, on the entire sample string, I get
count_words: 0.23096036 (best of 3)
count_words_2: 0.11690620 (best of 3)
</F>
Wow, a much bigger difference than on my machine. <curious> An Athlon
64 3000+ on WinXP. I'm not sure how much difference that would make?
This is what I get after adding the above version to it, with the
lower(). The difference isn't quite as big as the one you get, but the
findall version is still faster than both of the others.
Cheers,
Ron
Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.06245989 (best of 3)
count_words: 0.07309812 (best of 3)
count_words_2: 0.04981024 (best of 3)
And when counting all the words at once...
Character count: 100000
Word count: 16477
Average word size: 6.06906597075
word_counter: 0.05325006 (best of 3)
count_words: 0.05910528 (best of 3)
count_words_2: 0.03748158 (best of 3)
They all improve, but the re findall version is clearly better.
#####################
import string
import re
import time
import random
# Create a really ugly n length string to test with.
# The word length are
n = 100000
random.seed(1)
lines = ''.join([ random.choice(string.ascii_letters * 2
                  + '_@$&*()#/<>' + ' \n' * 6) for x in range(n) ])
print 'Character count:', n
print 'Word count:', len(lines.split())
print 'Average word size:', float(n)/len(lines.split())
letters = string.lowercase + string.digits + '_@'
def word_iter(text, letters=letters):
    wd = ''
    for c in text + ' ':
        if c in letters:
            wd += c
        elif wd != '':
            yield wd
            wd = ''
def word_counter(text):
    countDict={}
    for wd in word_iter(text.lower()):
        if wd in countDict:
            countDict[wd] += 1
        else:
            countDict[wd] = 1
    return countDict
word_finder = re.compile('[\w@]+', re.I).finditer
def count_words(string, word_finder=word_finder):
    # avoid global lookups
    countDict = {}
    for match in word_finder(string.lower()):
        word = match.group(0)
        countDict[word] = countDict.get(word,0) + 1
    return countDict
word_finder_2 = re.compile('[\w@]+').findall
def count_words_2(string, word_finder=word_finder_2):
    # avoid global lookups
    countDict = {}
    for word in word_finder(string.lower()):
        countDict[word] = countDict.get(word,0) + 1
    return countDict
foos = [word_counter, count_words, count_words_2]
r1 = r2 = None
for foo in foos:
    best_time = 1000000  # too large to be useful on purpose
    for n in range(3):
        t = time.clock()
        #for line in lines.splitlines():
        countDict = foo(lines)
        tt = time.clock()-t
        best_time = min(tt, best_time)
    r1 = r2
    r2 = countDict
    if r1 != None:
        # change to 1 if assert fails to find problem
        if 0:
            for k in r1.keys():
                if r1[k] != r2[k]:
                    print k,r1[k],r2[k]
        assert r1 == r2
    print '%s: %.8f (best of %d)' \
          % (foo.__name__, best_time, n+1)