From a purely academic standpoint, I'm not convinced that sum() is
inefficient in terms of big-O complexity, though.
showell@showell-laptop:~$ python
Python 2.6.2 (release26-maint, Apr 19 2009, 01:56:41) [GCC 4.3.3] on
linux2
[...]
But it's not *sum* that is inefficient, it is sum *with a particular data
structure*.
Sure, if you create your own data structure, you can make it as efficient
as you like. Obviously the sum algorithm itself has to perform one
addition per item, or O(N), which scales tolerably well. But each
addition has a cost. If the cost is constant, then sum() as a whole
remains O(N). But if the cost of addition varies with N, sum() degrades
badly.
We can compare the performance of sum with different data structures,
starting with plain integers versus long integers:
from timeit import Timer
setup = 'data = [%d]*%d'
for i in range(6):
.... t1 = Timer('sum(data, 0)', setup % (1, 10**i))
.... t2 = Timer('sum(data, 0)', setup % (10**50, 10**i))
.... print min(t1.repeat(number=1000)),
.... print min(t2.repeat(number=1000))
....
0.00179290771484 0.00263810157776
0.00340414047241 0.00854396820068
0.0190401077271 0.0502791404724
0.155302047729 0.645124912262
0.794432878494 2.55748295784
7.97877693176 25.3812758923
Both scale about as well, but the cost varies significantly: arithmetic
on very large longints is expensive.
Strings, with a trick to fool sum into accepting them, versus lists. Note
that I changed the number of iterations from 6 down to 5. The reason why
will become obvious:
.... def __add__(self, other):
.... return other
....
empty = EmptyStringStarter()
setup = """from __main__ import empty; data = [%r]*%d"""
for i in range(5):
.... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
.... t2 = Timer('sum(data, [])', setup % ([1], 10**i))
.... print min(t1.repeat(number=1000)),
.... print min(t2.repeat(number=1000))
....
0.00849103927612 0.00226998329163
0.0121459960938 0.0082700252533
0.0489149093628 0.186735153198
0.428920030594 5.28623914719
14.6552250385 589.102822065
Strings perform tolerably well, up to a point, but lists perform
terribly. And in fact, the relatively good performance of strings is an
artifact of recent versions of CPython. In Jython and IronPython, and
older versions of CPython, it will behave as poorly as lists.
I wonder how severely sum(), without
the restriction, would underperform join() on modern versions of Python,
though.
Take note that, in order to get an answer in reasonable time, I've
reduced the number of timing iterations drastically:
.... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
.... t2 = Timer('"".join(data)', setup % ('a', 10**i))
.... print min(t1.repeat(number=10)),
.... print min(t2.repeat(number=10))
....
8.89301300049e-05 1.09672546387e-05
0.000131845474243 2.19345092773e-05
0.000591993331909 9.29832458496e-05
0.0101289749146 0.00082802772522
0.365957021713 0.00884819030762
24.2072279453 0.0421011447906
Note the performance degradation of sum. It gets worse. Much worse:
.... t1 = Timer('sum(data, empty)', setup % ('a', 10**i))
.... t2 = Timer('"".join(data)', setup % ('a', 10**i))
.... print min(t1.repeat(number=1)), # note fewer iterations
.... print min(t2.repeat(number=1))
....
0.031229019165 0.000817060470581
2.45445990562 0.00365781784058
1024.79705095 0.0398509502411
This is absolutely catastrophic performance degradation.