generator expressions: performance anomaly?


John Machin

Please consider the timings below, where a generator expression starts
out slower than the equivalent list comprehension, and gets worse:
python -m timeit -s "orig=range(100000)" "lst=orig[:];lst[:]=(x for x in orig)"
10 loops, best of 3: 6.84e+004 usec per loop
python -m timeit -s "orig=range(200000)" "lst=orig[:];lst[:]=(x for x in orig)"
10 loops, best of 3: 5.22e+005 usec per loop
python -m timeit -s "orig=range(300000)" "lst=orig[:];lst[:]=(x for x in orig)"
10 loops, best of 3: 1.32e+006 usec per loop
python -m timeit -s "orig=range(100000)" "lst=orig[:];lst[:]=[x for x in orig]"
10 loops, best of 3: 6.15e+004 usec per loop
python -m timeit -s "orig=range(200000)" "lst=orig[:];lst[:]=[x for x in orig]"
10 loops, best of 3: 1.43e+005 usec per loop
python -m timeit -s "orig=range(300000)" "lst=orig[:];lst[:]=[x for x in orig]"
10 loops, best of 3: 2.33e+005 usec per loop

Specs: Python 2.4, Windows 2000, 1.4GHz Athlon chip, 768Mb of memory.

Background: There was/is a very recent thread about ways of removing
all instances of x from a list. /F proposed a list comprehension to
build the result list. Given a requirement to mutate the original list,
this necessitates the assignment to lst[:]. I tried a generator
expression as well. However while the listcomp stayed competitive up to
a million-element list, the genexp went into outer space, taking about
20 times as long. The above timeit runs show a simpler scenario where
the genexp also seems to be going quadratic.
Comments, clues, ... please.

TIA,
John
 

Fredrik Lundh

John said:
Given a requirement to mutate the original list, this necessitates the assignment
to lst[:].

do you always pull requirements out of thin air?

</F>
 

Nick Coghlan

John said:
Background: There was/is a very recent thread about ways of removing
all instances of x from a list. /F proposed a list comprehension to
build the result list. Given a requirement to mutate the original list,
this necessitates the assignment to lst[:]. I tried a generator
expression as well. However while the listcomp stayed competitive up to
a million-element list, the genexp went into outer space, taking about
20 times as long. The above timeit runs show a simpler scenario where
the genexp also seems to be going quadratic.
Comments, clues, ... please.

Py> lc = [x for x in range(100)]
Py> len(lc)
100
Py> ge = (x for x in range(100))
Py> len(ge)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: len() of unsized object

It would be nice if unconditional ge's with known length inputs propagated
__len__, but that is not currently the case. There's a similar performance
glitch associated with constructing a tuple from a generator expression (with
vanilla 2.4, detouring via list is actually faster).

Cheers,
Nick.
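The behaviour Nick shows persists in modern Python: generators still refuse len(). The hint machinery did eventually arrive, though, as operator.length_hint(), which consumers may use for pre-sizing. A small sketch in Python 3 syntax:

```python
import operator

ge = (x for x in range(100))

# Generators are still "unsized": len() raises TypeError.
try:
    len(ge)
except TypeError:
    print("no len() for generator expressions")

# A generator offers no hint, so the supplied default is returned...
print(operator.length_hint(ge, -1))  # -1

# ...but a range iterator knows how many items remain.
it = iter(range(100))
next(it)
print(operator.length_hint(it))      # 99
```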
 

John Machin

Fredrik said:
John said:
Given a requirement to mutate the original list, this necessitates the assignment
to lst[:].

do you always pull requirements out of thin air?

</F>

I was interested in doing a fair comparison between all the proposed
methods of deleting multiple occurrences from a list, and as the
majority were permuting the list in place, I shoe-horned the listcomp
and genexp methods into the same mould. If you choose to describe that
as pulling requirements out of thin air, feel free.

Do you have any suggestions as to why a generator expression would
perform much worse than the equivalent list comprehension?

[OT]
Anyway, I don't have time for chit-chat right now. My wife got a
digital camera for Xmas and dumped a whole bunch of 5 megapixel JPEGs
on me and I'm in the middle of writing a quick script to resize them,
using some imaging library I just downloaded off the web, written by
some grumpy PILlock from Sweden, evidently :)

Cheers,
John
 

Raymond Hettinger

John Machin said:
Please consider the timings below, where a generator expression starts
out slower than the equivalent list comprehension, and gets worse:
python -m timeit -s "orig=range(100000)" "lst=orig[:];lst[:]=(x for x in orig)"
. . .
python -m timeit -s "orig=range(200000)" "lst=orig[:];lst[:]=(x for x in orig)"

This has nothing to do with genexps and everything to do with list slice
assignment.

List slice assignment is an example of a tool with a special case optimization
for inputs that know their own length -- that enables the tool to pre-allocate
its result rather than growing and resizing in spurts. Other such tools include
tuple(), map() and zip().
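That pre-sizing path can be observed directly: later CPythons formalised the hint as __length_hint__ (PEP 424), and list() consults it when building from an iterator. A sketch in Python 3 syntax, with a hypothetical Hinted iterator that records the lookup:

```python
class Hinted:
    """Iterator over range(n) that records whether a size hint was requested."""
    def __init__(self, n):
        self.n = n
        self.i = 0
        self.hint_asked = False

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        self.i += 1
        return self.i - 1

    def __length_hint__(self):
        # Consumers such as list() call this (via PyObject_LengthHint)
        # so they can pre-allocate instead of growing in spurts.
        self.hint_asked = True
        return self.n - self.i

it = Hinted(5)
print(list(it))        # [0, 1, 2, 3, 4]
print(it.hint_asked)   # True: the consumer asked for a size hint
```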

The incredulous tone of the posting suggests a presumption that genexps always
outperform list comps. That is not the case. Sometimes you want a list instead
of a generator because you want more than just iteration; perhaps the consumer
function may be able to take advantage of length reporting, indexing, slicing,
re-iterability or other useful list behaviors.

Also, the timing jig bites. Use xrange() instead of range(). Otherwise,
there's an immediate forfeit of the memory savings normally being sought by the
use of generators and iterators.


Background: There was/is a very recent thread about ways of removing
all instances of x from a list. /F proposed a list comprehension to
build the result list.

FWIW, deques have a natural idiom for this kind of situation:

for i in xrange(len(mydeque)):
    x = mydeque.popleft()
    if some_condition(x):
        mydeque.append(x)
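A runnable version of the idiom (Python 3 syntax; some_condition is stood in by a hypothetical keep() predicate). Because exactly len(d) elements are popped from the left and survivors appended on the right, relative order is preserved:

```python
from collections import deque

def keep(x):
    # hypothetical predicate: keep everything except multiples of 3
    return x % 3 != 0

d = deque(range(10))
for _ in range(len(d)):
    x = d.popleft()
    if keep(x):
        d.append(x)

print(list(d))  # [1, 2, 4, 5, 7, 8]
```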


Given a requirement to mutate the original list,
this necessitates the assignment to lst[:].

Sometimes people introduce arbitrary requirements and twist themselves in knots
when real applications tend to allow a wider range of alternative solutions.
There is a funky code smell emanating from any code resembling
a[:]=some_iterator. It would make me examine how 'a' is being used and check
whether the surrounding code can use the iterator or a new object.


Comments, clues, ... please.

As a point of interest, note that Py2.4 improved some of its built-in iterators
to report their length if requested. Now you know why.



Raymond Hettinger
 

Paul Rubin

Raymond Hettinger said:
As a point of interest, note that Py2.4 improved some of its
built-in iterators to report their length if requested. Now you know why.


Is len(d.iteritems()) ever different from just len(d)?
 

Raymond Hettinger

[Raymond Hettinger]
[Paul Rubin]
Is len(d.iteritems()) ever different from just len(d)?

Yes, after the iteration starts the reported length shrinks: consuming one item
from a four-item dict leaves the iterator reporting a length of 3.

In general, len(it) should give you what you would get with len(list(it)) but
without consuming the iterator.
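In Python 2.4 those iterators exposed the length through __len__ itself; later Pythons moved it behind the __length_hint__ protocol instead. A Python 3 sketch of the same behaviour via operator.length_hint():

```python
import operator

d = dict(a=1, b=2, c=3, d=4)
it = iter(d.items())

# The fresh iterator knows four items remain...
print(operator.length_hint(it))  # 4

next(it)  # consume one item

# ...and the hint shrinks as the iterator is consumed.
print(operator.length_hint(it))  # 3
```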


Raymond
 

Martin v. Löwis

Paul said:
Is len(d.iteritems()) ever different from just len(d)?

You mean, for an arbitrary object d? Certainly:

>>> class D:
...     def iteritems(self): return {}.iteritems()
...     def __len__(self): return 1
...
>>> len(D().iteritems())
0

Why do you ask?

Regards,
Martin
 

bearophileHUGS

Nick Coghlan:
There's a similar performance glitch associated with constructing a
tuple from a generator expression (with vanilla 2.4, detouring via list
is actually faster)

You seem to be right:

from time import clock

print "[x for x in l], list(x for x in l), aux = list(x for x in l); tuple(aux), tuple(x for x in l):"
for np in range(13, 21):
    n = 1 * 2**np
    print "n:", n, " s:",
    l = xrange(n)

    t1 = clock()
    [x for x in l]
    t2 = clock()
    print round(t2 - t1, 3),

    t1 = clock()
    list(x for x in l)
    t2 = clock()
    print round(t2 - t1, 3),

    t1 = clock()
    aux = list(x for x in l)
    tuple(aux)
    t2 = clock()
    print round(t2 - t1, 3),

    # l = tuple(l)  (useless)
    t1 = clock()
    tuple(x for x in l)
    t2 = clock()
    print round(t2 - t1, 3)


Output:
[x for x in l], list(x for x in l), aux = list(x for x in l);
tuple(aux), tuple(x for x in l):
n: 8192 s: 0.01 0.007 0.01 0.013
n: 16384 s: 0.024 0.019 0.021 0.032
n: 32768 s: 0.054 0.042 0.049 0.113
n: 65536 s: 0.111 0.078 0.101 0.088
n: 131072 s: 0.196 0.155 0.183 0.177
n: 262144 s: 0.436 0.385 0.429 1.832
n: 524288 s: 0.921 0.744 0.877 8.271
n: 1048576 s: 1.86 1.546 1.866 33.154

The timings for tuple(x for x in l) seem to grow as n**2.
Bear hugs,
Bearophile
 

John Machin

Raymond Hettinger said:
John Machin said:
Please consider the timings below, where a generator expression starts
out slower than the equivalent list comprehension, and gets worse:
python -m timeit -s "orig=range(100000)" "lst=orig[:];lst[:]=(x for x in orig)"
. . .
python -m timeit -s "orig=range(200000)" "lst=orig[:];lst[:]=(x for x in orig)"

This has nothing to do with genexps and everything to do with list slice
assignment.

List slice assignment is an example of a tool with a special case optimization
for inputs that know their own length -- that enables the tool to pre-allocate
its result rather than growing and resizing in spurts. Other such tools include
tuple(), map() and zip().

My reading of the source: if the input is not a list or tuple, a
(temporary) tuple is built from the input, using PySequence_Tuple() in
abstract.c. If the input cannot report its own length, then that
function resorts to "growing and resizing in spurts", using the
following code:

if (j >= n) {
    if (n < 500)
        n += 10;
    else
        n += 100;
    if (_PyTuple_Resize(&result, n) != 0) {

Perhaps it could be changed to use a proportional increase, like
list_resize() in listobject.c, which advertises (amortised) linear
time. Alternative: build a temporary list instead?
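The difference John is pointing at can be sketched numerically: constant-increment growth performs O(n) resizes and O(n**2) total element copying, while proportional growth needs only O(log n) resizes. A small Python 3 simulation (counting simulated resizes and copies, not real allocations; the growth functions are simplified stand-ins for the two C strategies):

```python
def count_resizes(n, grow):
    """Simulate appending n items; count buffer resizes and elements copied."""
    cap = resizes = copied = size = 0
    while size < n:
        size += 1
        if size > cap:
            cap = grow(cap)
            resizes += 1
            copied += size - 1  # existing elements moved to the new buffer
    return resizes, copied

def arithmetic(cap):
    # constant increment, like the old PySequence_Tuple code (simplified)
    return cap + 100

def geometric(cap):
    # proportional growth, like list_resize in listobject.c (simplified)
    return max(8, cap * 2)

print(count_resizes(100000, arithmetic))  # (1000, 49950000)
print(count_resizes(100000, geometric))   # (15, 131064)
```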
 

Raymond Hettinger

[Raymond Hettinger]
[John Machin]
My reading of the source: if the input is not a list or tuple, a
(temporary) tuple is built from the input, using PySequence_Tuple() in
abstract.c. If the input cannot report its own length, then that
function resorts to "growing and resizing in spurts", using the
following code:

if (j >= n) {
    if (n < 500)
        n += 10;
    else
        n += 100;
    if (_PyTuple_Resize(&result, n) != 0) {

Perhaps it could be changed to use a proportional increase, like
list_resize() in listobject.c, which advertises (amortised) linear
time.

Check out the current source. The time machine beat you to it.

Keep the faith,


Raymond Hettinger
 
