Fatest standard way to sum bytes (and their squares)?

Erik Max Francis

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?
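
For reference, the explicit loop being compared against presumably looks
something like this (my reconstruction, not code from the post):

ordinalSum = 0
ordinalSumSquared = 0
for x in data:
    o = ord(x)
    ordinalSum += o
    ordinalSumSquared += o ** 2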
 
Erik Max Francis

And of course I meant fastest, not "fatest." Fattest wouldn't be good,
either.
 
Alexander Schmolck

Erik Max Francis said:
For a file hashing system (finding similar files, rather than identical ones),
I need to be able to efficiently and quickly sum the ordinals of the bytes of
a file and their squares. Because of the nature of the application, it's a
requirement that I do it in Python, or only with standard library modules (if
such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to be
processing massive amounts of data, the faster the better. Are there any
tricks I'm not thinking of, or perhaps helper functions in other modules that
I'm not thinking of?

Is this any faster?

ordSum, ordSumSq = (lambda c: (c.real, c.imag))(sum(complex(ord(x), ord(x)**2)
                                                    for x in data))

'as
 
Peter Otten

Erik said:
For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

Your actual code may be even faster because you can read the bytes directly
from the file.
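
For instance, a block-wise variant of summit_array() that reads straight into
the array might look like this (a rough sketch, not Peter's code; the function
name, path, and block size are placeholders):

def summit_file(path, lookup=lookup, blocksize=1 << 20):
    total = total_sq = 0
    f = open(path, "rb")
    try:
        while True:
            a = array.array("B")
            try:
                a.fromfile(f, blocksize)   # read up to blocksize bytes
            except EOFError:
                pass                       # a keeps whatever was read
            if not a:
                break
            total += sum(a)
            total_sq += sum(lookup[x] for x in a)
    finally:
        f.close()
    return total, total_sq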

Peter
 
Peter Otten

Peter said:
Two ideas:

Use a lookup-table for ord(c)**2
Use array.array()

$ cat summit.py
import array

data = "dere gewizzede bizzede bizz" * 1000 + chr(255)

lookup = dict((i, i**2) for i in range(256))

Oops, make that a list:

lookup = [i**2 for i in range(256)]
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

if __name__ == "__main__":
    assert summit_array() == summit_str()

$ python -m timeit -s'from summit import summit_str as summit' 'summit()'
10 loops, best of 3: 32.2 msec per loop
$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 13.4 msec per loop

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

Not a big gain, but still...
 
Peter Otten

Peter said:
lookup = [i**2 for i in range(256)]
def summit_str(data=data):
    return sum(ord(x) for x in data), sum(ord(x)**2 for x in data)

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(lookup[x] for x in a)

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 12.9 msec per loop

from itertools import imap

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(imap(lookup.__getitem__, a))

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 9.15 msec per loop

I think I'll stop here...

Peter
 
Duncan Booth

Erik Max Francis said:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)
 
Hrvoje Niksic

Erik Max Francis said:
So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.
 
Erik Max Francis

Alexander said:
Is this any faster?

ordSum, ordSumSq = (lambda c: (c.real, c.imag))(sum(complex(ord(x), ord(x)**2)
                                                    for x in data))

That's pretty clever, but I neglected to mention that I need to
accumulate the sums as ints/longs to avoid losing precision, so
converting to floating point isn't an option. (The sums are
normalized by the sizes of the files and expanded to 32 bits in order to
maximize precision.)
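
For reference, a quick interpreter check of the limit in question (my
illustration, not from the thread): past 2**53, a Python float can no longer
distinguish consecutive integers, while ints/longs keep exact values
indefinitely.

>>> float(2**53) == float(2**53) + 1
True
>>> 2**53 == 2**53 + 1
False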
 
Erik Max Francis

Hrvoje said:
For ordinalSum, using imap is almost twice as fast:

$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

You're using data which is a list of chars (strings), rather than a
string itself, which is the format the data actually comes in. The imap optimization
doesn't appear to work quite as dramatically well for me with strings
instead of lists, but it certainly is an improvement.
Of course, that optimization doesn't work for the squared sum; using a
lambda only pessimizes it.

Right.
 
Erik Max Francis

Duncan said:
I haven't timed it, but at the very least I'd avoid going through all the
data twice:

count = {}
for i in range(256):
    count[chr(i)] = 0
for x in data:
    count[x] += 1

ordinalSum = sum(ord(c)*count[c] for c in count)
ordinalSumSquared = sum(ord(c)**2 * count[c] for c in count)

This approach is definitely faster than using a generator in my tests,
and was second fastest overall. I'm actually a bit surprised that it's
as fast as it is; it never would have occurred to me to try it this way.
 
Erik Max Francis

Peter said:
from itertools import imap

def summit_array(data=data, lookup=lookup):
    a = array.array("B")
    a.fromstring(data)
    return sum(a), sum(imap(lookup.__getitem__, a))

$ python -m timeit -s'from summit import summit_array as summit' 'summit()'
100 loops, best of 3: 9.15 msec per loop

I think I'll stop here...

Yes, this was the fastest approach of all the alternatives I tried.
Plugging the algorithm into the tool makes it about twice as fast as
the generator-sum method I was using before, and about _five_ times as
fast as a naive for loop. The use of an array is particularly clever; I
never would have thought of that.

Thanks to everyone for their ideas. Amusingly, I posted it late last
night (well, this morning I guess), not expecting much, at least until
the next day, but got a factor of two speedup within half an hour!
 
Hrvoje Niksic

Erik Max Francis said:
Hrvoje said:
For ordinalSum, using imap is almost twice as fast:
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]' 'sum(ord(x) for x in data)'
10000 loops, best of 3: 92.4 usec per loop
$ python -m timeit -s 'data=[chr(x) for x in xrange(256)]; from itertools import imap' 'sum(imap(ord, data))'
10000 loops, best of 3: 55.4 usec per loop

You're using data which is a list of chars (strings), rather than a
string itself, which is the format the data actually comes in. The imap
optimization doesn't appear to work quite as dramatically well for
me with strings instead of lists, but it certainly is an
improvement.

I wouldn't expect to see any difference between strings and lists. In this
simple test I get approximately the same ~1.7x speedup:

$ python -m timeit 'sum(ord(x) for x in "abcdefghijklmnopqrstuvwxyz")'
100000 loops, best of 3: 12.7 usec per loop
$ python -m timeit -s 'from itertools import imap' 'sum(imap(ord, "abcdefghijklmnopqrstuvwxyz"))'
100000 loops, best of 3: 7.42 usec per loop
 
Alexander Schmolck

Erik Max Francis said:
That's pretty clever, but I neglected to mention that I need to accumulate the
sums as ints/longs to avoid losing precision, so converting to floating point
isn't an option.

I think you can safely use this trick (provided it really makes things faster)
as long as you sum blocks no larger than 2**(53-8) bytes; if your files are
really that big you'd certainly want to split the summing into several blocks
anyway, because otherwise you'll be doing *lots* of extra bignum arithmetic
instead of int32/int64 addition (I'd assume this slows things down noticeably
even in Python). Another trick you could try, again using table lookup:
work on words (i.e. 2 bytes) instead of single bytes, with a table mapping
word -> (byte-sum, sq-byte-sum) tuples; this halves the number of function
calls, and the table is hopefully still small enough not to ruin your
cache-hit rate (you might want to try array).
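
A rough sketch of the word-at-a-time idea (untested, not Alexander's code;
Python 2, assumes len(data) is even, and splits the tuple table into two flat
lists so the inner loop doesn't have to unpack tuples):

import array
from itertools import imap

# For each possible 16-bit word, precompute the sum of its two bytes and
# the sum of their squares. Byte order doesn't matter, since addition
# commutes.
hi_lo = [(w >> 8, w & 0xFF) for w in xrange(65536)]
word_sums = [a + b for a, b in hi_lo]
word_sq_sums = [a * a + b * b for a, b in hi_lo]

def summit_words(data):
    a = array.array("H")      # unsigned 2-byte words
    a.fromstring(data)        # requires len(data) % 2 == 0
    return (sum(imap(word_sums.__getitem__, a)),
            sum(imap(word_sq_sums.__getitem__, a)))

Whether the 64K-entry tables actually stay cache-friendly enough to beat the
byte-level version would need timing.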

'as
 
Alexander Schmolck

Alexander Schmolck said:
I think you can safely use this trick (provided it really makes things faster)
as long as you sum blocks no larger than 2**(53-8) bytes;

Forgot about the squares; make this 2**(53-16); still pretty big.
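
(Roughly: a squared byte is under 2**16, and a double stays exact up to 2**53,
so that limit works out to 2**37 bytes, about 137 GB per block.)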

'as
 
Dan Stromberg

For a file hashing system (finding similar files, rather than identical
ones), I need to be able to efficiently and quickly sum the ordinals of
the bytes of a file and their squares. Because of the nature of the
application, it's a requirement that I do it in Python, or only with
standard library modules (if such facilities exist) that might assist.

So far the fastest way I've found is using the `sum` builtin and
generators::

ordinalSum = sum(ord(x) for x in data)
ordinalSumSquared = sum(ord(x)**2 for x in data)

This is about twice as fast as an explicit loop, but since it's going to
be processing massive amounts of data, the faster the better. Are there
any tricks I'm not thinking of, or perhaps helper functions in other
modules that I'm not thinking of?

I see a lot of messages attacking the CPU optimization, but what about the
I/O optimization - which, admittedly, the question seems to sidestep.

You might experiment with using mmap() instead of read()... If it helps,
it may help big, because the I/O time is likely to dominate the CPU time.
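
A rough sketch of the mmap() variant (untested, not Dan's code; the function
name, path, and block size are placeholders, and the per-byte summing is still
the simple generator version):

import mmap

def sums_from_file(path, blocksize=1 << 20):
    f = open(path, "rb")
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    try:
        ordinalSum = ordinalSumSquared = 0
        for start in xrange(0, len(m), blocksize):
            block = m[start:start + blocksize]   # slicing gives a plain str
            ordinalSum += sum(ord(x) for x in block)
            ordinalSumSquared += sum(ord(x) ** 2 for x in block)
        return ordinalSum, ordinalSumSquared
    finally:
        m.close()
        f.close()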
 
