Performance of lists vs. list comprehensions


Gerald Britton

Yesterday I stumbled across some old code in a project I was working
on. It does something like this:

mystring = '\n'.join( [ line for line in lines if <some conditions
depending on line> ] )

where "lines" is a simple list of strings. I realized that the code
had been written before you could put a list comprehension as an
argument to join(). I figured that I could drop the '[' and ']' and
leave the rest as a list comprehension since that works with current
Python (and the project's base is 2.5). So I rewrote the original
statement like this:

mystring = '\n'.join( line for line in lines if <some conditions
depending on line> )

It works as expected. Then I got curious as to how it performs. I
was surprised to learn that the rewritten expression runs more than
twice as _slow_. e.g.:
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
2.9967339038848877
Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
7.2045478820800781

Notice that I dropped the condition testing that was in my original
code. I just wanted to see the effect of two different expressions.

I thought that maybe there was some lower bound on the number of the
items in the list or list comprehension beyond which the comprehension
would prove more efficient. There doesn't appear to be one. I scaled
the length of the input list up to 1 million items and got more or
less the same relative performance.

Now I'm really curious and I'd like to know:

1. Can anyone else confirm this observation?

2. Why should the "pure" list comprehension be slower than the same
comprehension enclosed in '[...]' ?
 

Alf P. Steinbach

* Gerald Britton:
[snip]
Now I'm really curious and I'd like to know:

1. Can anyone else confirm this observation?
>>> Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
5.8625191190500345
>>> Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
12.093135300715574
>>> _

2. Why should the "pure" list comprehension be slower than the same
comprehension enclosed in '[...]' ?

Regarding (2) the unparenthesized expression in join is *not* a list
comprehension but a generator expression.

And as such it involves join calling next() on the generator object repeatedly,
with each next() call involving a light-weight context shift.

In addition the docs mumble something about "lazy" evaluation, and that may also
contribute to the overhead.

I think that in contrast, the interpreter can evaluate a list comprehension, [x
for x in blah], directly without any context shifting, just by transforming it
to equivalent code and putting the target expressions innermost there.

And so the main factor causing a slowdown for a list comprehension would, I
think, be paging and such if the list it produced was Really Huge, while for the
generator there's no memory issue but rather much calling & context shifting.
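
For concreteness, here is a minimal sketch of the distinction (any recent
Python will do; the variable names are made up for illustration):

lines = ["alpha", "beta", "gamma"]

as_list = [line for line in lines]   # list comprehension: builds the whole list up front
as_gen  = (line for line in lines)   # generator expression: builds a generator object

print(type(as_list).__name__)        # list
print(type(as_gen).__name__)         # generator

print("\n".join(as_list))            # join() gets a ready-made list
print("\n".join(as_gen))             # join() pulls items out one next() call at a time
print(repr("".join(as_gen)))         # ''  -- the generator is now exhausted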


Cheers & hth.,

- Alf
 

Gerald Britton

Thanks! Good explanation.

[snip]
 

Wolfram Hinderer

Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
2.9967339038848877
Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
7.2045478820800781

[...]

2. Why should the "pure" list comprehension be slower than the same
comprehension enclosed in '[...]' ?

Others have already commented on "generator expression" vs. "list
comprehension". I'll try to shed some light on the cause of the
slowness.

For me it's

Timer("' '.join([x for x in l])", 'l = map(str,range(10))').timeit()
0.813948839866498
Timer("' '.join(x for x in l)", 'l = map(str,range(10))').timeit()
2.226825476422391

But wait! I'm on Python 3.1 and the setup statement has to be changed
to make this test meaningful.
Timer("' '.join([x for x in l])", 'l = list(map(str,range(10)))').timeit() 2.5788493369966545
Timer("' '.join(x for x in l)", 'l = list(map(str,range(10)))').timeit()
3.7431774848480472

Much smaller factor now.
But wait! If we want to test list comprehension against generator
comprehension, we should try a function that just consumes the
iterable.

setup = """
l = list(map(str,range(10)))
def f(a):
    for i in a: pass
"""
Timer("f([x for x in l])", setup).timeit()
3.288511528699928
Timer("f(x for x in l)", setup).timeit()
2.410873798206012

Oops! Iteration over generator expressions is not inherently more
expensive than iteration over list comprehensions. But certainly
building a list from a generator expression is more expensive than a
list comprehension?

Timer("[x for x in l]", 'l = list(map(str,range(10)))').timeit()
2.088602950933364
Timer("list(x for x in l)", 'l = list(map(str,range(10)))').timeit()
3.691566805277944

Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.
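
For anyone who wants to rerun the comparison, here is a self-contained
variant of the timings above (a sketch; the setup builds the list with a
list comprehension rather than list(map(...)), and the absolute numbers
will differ by machine and Python version -- only the ratios are
interesting):

from timeit import Timer

setup = "l = [str(x) for x in range(10)]"

# list comprehension vs. generator expression as the argument to join()
print(Timer("' '.join([x for x in l])", setup).timeit())
print(Timer("' '.join(x for x in l)", setup).timeit())

# the list-building step in isolation
print(Timer("[x for x in l]", setup).timeit())
print(Timer("list(x for x in l)", setup).timeit())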
 

Gerald Britton

[snip]
Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.

if join has to iterate twice over the iterable, how does this work?

$ python3.1
Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
If join had to iterate twice over l, it would be consumed on the first
pass. If it works as you say then join would have to copy the
iterable on the first pass, effectively turning it into a list.
Though I didn't read through it, I would suppose that join could use a
dynamic-table approach to hold the result, starting with some
guesstimate then expanding the result buffer if and when needed.
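
To make the consumption point concrete, a quick check (a sketch along the
same lines as the session above; in Python 3, map() returns a one-shot
iterator):

l = map(str, (x for x in range(10) if x % 2))

print(".".join(l))   # prints 1.3.5.7.9
print(".".join(l))   # prints an empty string: the iterator was drained by the first join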
 

Arnaud Delobelle

Gerald Britton said:
[snip]
Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.

if join has to iterate twice over the iterable, how does this work?

$ python3.1
Python 3.1.1+ (r311:74480, Nov 2 2009, 14:49:22)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
If join had to iterate twice over l, it would be consumed on the first
pass. If it works as you say then join would have to copy the
iterable on the first pass, effectively turning it into a list.
Though I didn't read through it, I would suppose that join could use a
dynamic-table approach to hold the result, starting with some
guesstimate then expanding the result buffer if and when needed.

Looking at the source (py3k):

PyObject *
PyUnicode_Join(PyObject *separator, PyObject *seq)
{
    [skip declarations]

    fseq = PySequence_Fast(seq, "");
    if (fseq == NULL) {
        return NULL;
    }

    [code that works out the length of the joined string then allocates
     memory, then fills it]
}

Where PySequence_Fast(seq, "") returns seq if seq is already a tuple or
a list and otherwise returns a new tuple built from the elements of seq.

So no, it doesn't guess the size of the joined string and yes, it
iterates twice over the "sequence" (I would have thought it should be
called an iterable) by copying it into a tuple.
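
A rough Python-level paraphrase of that flow (a sketch only, not the real C
code; sequence_fast and join_sketch are made-up names):

def sequence_fast(seq):
    """Rough analogue of PySequence_Fast: lists and tuples pass through
    unchanged, any other iterable is materialized into a sequence."""
    if isinstance(seq, (list, tuple)):
        return seq
    return tuple(seq)

def join_sketch(separator, iterable):
    items = sequence_fast(iterable)   # pass 1: materialize if needed
    # pass 2: the C code can now compute the total length and allocate the
    # result once; here we simply delegate to the real str.join on the
    # materialized sequence.
    return separator.join(items)

print(join_sketch(".", (str(x) for x in range(10) if x % 2)))   # 1.3.5.7.9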
 

Wolfram Hinderer

[snip]


Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.

if join has to iterate twice over the iterable, how does this work?

$ python3.1
Python 3.1.1+ (r311:74480, Nov  2 2009, 14:49:22)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> l = map(str, (x for x in range(10) if int(x)%2))
>>> '.'.join(l)
'1.3.5.7.9'

If join had to iterate twice over l, it would be consumed on the first
pass.

Yes. (Coincidentally, l is consumed in the first execution of the Timer()
statement, which is why I had to add the call to list. Not to mention the
xrange example of Stephen. But all this is not linked to what join does
internally.)
If it works as you say then join would have to copy the
iterable on the first pass, effectively turning it into a list.

Yes, that's what I'm saying above in the first two lines of what you
quoted.
I should have added something like "AFAIK CPython does it that way".
Though I didn't read through it, I would suppose that join could use a
dynamic-table approach to hold the result, starting with some
guesstimate then expanding the result buffer if and when needed.

Probably. But that's not what happens. Try
"".join("" for x in range(10**10)
and watch join eating memory.
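
A tamer way to observe the same effect without exhausting the machine (a
sketch, assuming Python 3.4+ for the tracemalloc module):

import tracemalloc

tracemalloc.start()
"".join("x" * 100 for _ in range(100000))
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# On CPython the peak is noticeably larger than the final string alone,
# reflecting the temporary sequence that join builds from the generator.
print("peak bytes during join:", peak)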
 

Gerald Britton

That's surprising. I wouldn't implement it that way at all. I'd use a
dynamically-expanding buffer as I suggested. That way you get a
single pass and don't have to calculate anything before you begin. In
the best case, you'd use half the memory (the result just fits in the
buffer after its last expansion and no saved tuple). In the worst
case, the memory use is about the same (you just expanded the buffer
using a 2x expansion rule then you hit the last item).

Still I suppose the author thought of that approach and rejected it
for reasons I can't yet see.
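
Roughly the kind of single-pass approach being suggested, as an illustrative
sketch (buffered_join is a made-up name, and it leans on bytearray's
amortized growth instead of managing the buffer by hand):

def buffered_join(separator, iterable, encoding="utf-8"):
    sep = separator.encode(encoding)
    buf = bytearray()   # bytearray over-allocates as it grows, so appends are amortized O(1)
    first = True
    for item in iterable:
        if not first:
            buf += sep
        buf += item.encode(encoding)
        first = False
    return buf.decode(encoding)

print(buffered_join(".", (str(x) for x in range(10) if x % 2)))   # 1.3.5.7.9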

[snip]
 

Raymond Hettinger

[Wolfram Hinderer]
Yes, list building from a generator expression *is* expensive. And
join has to do it, because it has to iterate twice over the iterable
passed in: once for calculating the memory needed for the joined
string, and once more to actually do the join (this is implementation
dependent, of course). If the iterable is a list already, the list
building is not needed.

Good analysis. That is exactly what is going on under the hood.


Raymond
 

Arnaud Delobelle

Gerald Britton said:
That's surprising. I wouldn't implement it that way at all. I'd use a
dynamically-expanding buffer as I suggested. That way you get a
single pass and don't have to calculate anything before you begin. In
the best case, you'd use half the memory (the result just fits in the
buffer after its last expansion and no saved tuple). In the worst
case, the memory use is about the same (you just expanded the buffer
using a 2x expansion rule then you hit the last item).

Still I suppose the author thought of that approach and rejected it
for reasons I can't yet see.

I don't know the reasons, but I'm guessing they could be historic.
Before Python had iterators, str.join would mostly have been given only
lists and tuples as arguments, in which case the current approach seems to
be the most appropriate. Later, when things like generator functions and
generator expressions were introduced, perhaps str.join wasn't optimized
to accommodate them.
 

Steven D'Aprano

That's surprising. I wouldn't implement it that way at all. I'd use a
dynamically-expanding buffer as I suggested. That way you get a single
pass and don't have to calculate anything before you begin. In the best
case, you'd use half the memory (the result just fits in the buffer
after its last expansion and no saved tuple). In the worst case, the
memory use is about the same (you just expanded the buffer using a 2x
expansion rule then you hit the last item).

In the worst case, you waste 50% of the memory allocated. And because
strings are immutable (unlike lists and dicts, which also use this
approach), you can never use that memory until the string is garbage
collected.

In the current approach, join produces a temporary sequence, but it
doesn't last very long. With your suggested approach, you could end up
with a large number of long-lasting strings up to twice the size
necessary. Since join is particularly useful for building large strings,
this could be a significant memory pessimization.

The obvious fix is for join to shrink the buffer once it has finished
building the string, but the success of that may be operating system
dependent. I don't know -- it sounds like a recipe for memory
fragmentation to me. And even if it can be done, reclaiming the memory
will take time, potentially more time than the current approach.

Still, it's probably worth trying, and seeing if it speeds join up.
 

Alf P. Steinbach

* Steven D'Aprano:
In the worst case, you waste 50% of the memory allocated.

Yes. That is a good argument for not doing the expanding buffer thing. But such
buffers may be generally present anyway, resulting from optimization of "+".

Using CPython 2.6.4 in Windows XP:

import timeit

def elapsed_time_for( f, n_calls ):
    return timeit.timeit( f, number = n_calls )

def appender( n ):
    def makestr( n = n ):
        s = ""
        for i in xrange( n ):
            s = s + "A"
        return s
    return makestr

for i in xrange( 10 ):
    print( elapsed_time_for( appender( 10000*(i+1) ), 100 ) )

0.782596670811
1.37728454314
2.10189898437
2.76442173517
3.34536707878
4.08251830889
4.79620119317
5.42201844089
6.12892811796
6.84236460221

Here the linear increase of times indicates that the "+" is being optimized using
an expanding buffer for the string. If only the minimal space was allocated each
time then one would expect O(n^2) behavior instead of the apparently O(n) above.
An example of that O(n^2) behavior is given below.

def exact_appender( n ):
    def makestr( n = n ):
        s = ""
        for i in xrange( n ):
            new_s = s + "A"
            s = new_s
        return s
    return makestr

for i in xrange( 10 ):
    print( elapsed_time_for( exact_appender( 10000*(i+1) ), 100 ) )

3.28094241027
9.30584501661
19.5319170453
33.6563767183
52.3327800042
66.5475022663
84.8809736992
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "<stdin>", line 2, in elapsed_time_for
File "C:\Program Files\cpython\python26\lib\timeit.py", line 227, in timeit
return Timer(stmt, setup, timer).timeit(number)
File "C:\Program Files\cpython\python26\lib\timeit.py", line 193, in timeit
timing = self.inner(it, self.timer)
File "C:\Program Files\cpython\python26\lib\timeit.py", line 99, in inner
_func()


So, given that apparently the simple '+' in the first example is optimized using
an expanding buffer, which then hangs around, it's not clear to me that the
space optimization in 'join' really helps. It may be (but isn't necessarily)
like shoveling snow in a snowstorm. Then the effort/cost could be for naught.

And because
strings are immutable (unlike lists and dicts, which also use this
approach), you can never use that memory until the string is garbage
collected.

I think that the simple '+', with the apparent optimization shown in the first
example above, can use that space. I know for a fact that when you control a
string implementation then it can do that (since I've implemented that). But I
don't know for a fact that it's practical to do so in Python. In order to use
the space the implementation must know that there's only one reference to the
string. And I don't know whether that information is readily available in a
CPython implementation (say), although I suspect that it is.


Cheers,

- Alf
 

Steven D'Aprano

* Steven D'Aprano:

Yes. That is a good argument for not doing the expanding buffer thing.
But such buffers may be generally present anyway, resulting from
optimization of "+".


As near as I can determine, the CPython optimization you are referring to
doesn't use the "double the buffer when needed" technique. It operates on
a completely different strategy. As near as I can tell (as a non-C
speaker), it re-sizes the string in place to the size actually needed,
thus reducing the amount of copying needed.

The optimization patch is here:

http://bugs.python.org/issue980695

and some history (including Guido's opposition to the patch) here:

http://mail.python.org/pipermail/python-dev/2004-August/046686.html

Nevertheless, the patch has been part of CPython since 2.4, but it must be
considered an implementation-specific optimization. It doesn't apply to
Jython, and likely not to other implementations.


Using CPython 2.6.4 in Windows XP: [snip time measurements]
Here the linear increase of times indicate that the "+" is being
optimized using an expanding buffer for the string.

Be careful of jumping to the conclusion from timings that a certain
algorithm is used. All you can really tell is that, whatever the
algorithm is, it has such-and-such big-oh behaviour.
If only the minimal
space was allocated each time then one would expect O(n^2) behavior
instead of the apparently O(n) above.

I don't follow that reasoning.

Example of that O(n^2) behavior given below.

The example shown demonstrates that the + optimization only applies in
certain specific cases. In fact, it only applies to appending:

s += t
s = s + t

but not prepending or concatenation with multiple strings:

s = t + s
s += t + u


However, keep in mind that the CPython keyhole optimizer will take
something like this:

s += "x" + "y"

and turn it into this:

s += "xy"

which the concatenation optimization does apply to. Optimizations make it
hard to reason about the behaviour of algorithms!
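
One way to probe which patterns a given interpreter optimizes is to time an
appending loop against a prepending loop and watch how the cost grows with n
(a sketch; the absolute numbers are meaningless and the effect is
CPython-specific):

from timeit import timeit

def append_loop(n):
    s = ""
    for _ in range(n):
        s = s + "A"      # eligible for the in-place resize optimization
    return s

def prepend_loop(n):
    s = ""
    for _ in range(n):
        s = "A" + s      # not eligible: s is the right-hand operand
    return s

for n in (10000, 20000, 40000):
    print(n,
          timeit(lambda: append_loop(n), number=10),
          timeit(lambda: prepend_loop(n), number=10))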
 

Stefan Behnel

Steven D'Aprano, 20.01.2010 07:12:
That is a good argument for not doing the expanding buffer thing.
But such buffers may be generally present anyway, resulting from
optimization of "+".

As near as I can determine, the CPython optimization you are referring to
doesn't use the "double the buffer when needed" technique. It operates on
a completely different strategy. As near as I can tell (as a non-C
speaker), it re-sizes the string in place to the size actually needed,
thus reducing the amount of copying needed.
Using CPython 2.6.4 in Windows XP: [snip time measurements]
Here the linear increase of times indicate that the "+" is being
optimized using an expanding buffer for the string.

Be careful of jumping to the conclusion from timings that a certain
algorithm is used. All you can really tell is that, whatever the
algorithm is, it has such-and-such big-oh behaviour.

Which is particularly tricky here since the algorithm depends more on the
OS than on the code in CPython. The better timings come from the fact that
the OS does *not* need to copy the buffer on each iteration, but does
smarter things when asked to enlarge the buffer. If you ran the benchmark
on an OS that *did* copy the buffer each time, the runtime would really be
quadratic.

BTW, I think it would actually be worth trying to apply the same approach
to str.join() if the argument is not a sequence (obviously followed by a
benchmark on different platforms).

Stefan
 

Alf P. Steinbach

* Steven D'Aprano:
As near as I can determine, the CPython optimization you are referring to
doesn't use the "double the buffer when needed" technique. It operates on
a completely different strategy. As near as I can tell (as a non-C
speaker), it re-sizes the string in place to the size actually needed,
thus reducing the amount of copying needed.

The optimization patch is here:

http://bugs.python.org/issue980695

and some history (including Guido's opposition to the patch) here:

http://mail.python.org/pipermail/python-dev/2004-August/046686.html

Nevertheless, the patch is now part of CPython since 2.4, but must be
considered an implementation-specific optimization. It doesn't apply to
Jython, and likely other implementations.

Provided that the CPython code is that code, then you're right: it only calls
PyObject_REALLOC, relying on that operation having (amortized) constant time.

And PyObject_REALLOC can in principle achieve that by relying on paging, e.g.
using the OS' virtual memory allocator, instead of doubling and copying.

However, I'm baffled by the code I find via Google Code search: there
PyObject_REALLOC simply calls malloc and copies. If that were the code used in
CPython 2.4 and greater, and if the '+' code is the code that you refer to
above, it would produce O(n^2) time for the first '+' example.

Using CPython 2.6.4 in Windows XP: [snip time measurements]
Here the linear increase of times indicate that the "+" is being
optimized using an expanding buffer for the string.

Be careful of jumping to the conclusion from timings that a certain
algorithm is used. All you can really tell is that, whatever the
algorithm is, it has such-and-such big-oh behaviour.

Well, as for the intended meaning you're right about that: it need not be a
doubling buffer.

I don't follow that reasoning.

The reasoning assumes independent allocations rather than reallocation
(extension of existing allocation).

Quadratic time for the example then follows, because the total amount of data
copied is the sum of the intermediate string sizes, sum(range(n)) = (n^2-n)/2.

But with PyObject_REALLOC being essentially constant time, 'join' could also
take advantage of this. :)



Cheers,

- Alf
 
