Interesting list() un-optimization

Discussion in 'Python' started by Roy Smith, Mar 7, 2013.

  1. Roy Smith

    Roy Smith Guest

    I stumbled upon an interesting bit of trivia concerning lists and list
    comprehensions today.

    We use mongoengine as a database model layer. A mongoengine query
    returns an iterable object called a QuerySet. The "obvious" way to
    create a list of the query results would be:

    my_objects = list(my_query_set)

    and, indeed, that works. But, then I found this code:

    my_objects = [obj for obj in my_query_set]

    which seemed a bit silly. I called over the guy who wrote it and asked
    him why he didn't just write it using list(). I was astounded when it
    turned out there's a good reason!

    Apparently, list() has an "optimization" where it calls len() on its
    argument to try and discover the number of items it's going to put into
    the list. Presumably, list() uses this information to pre-allocate the
    right amount of memory the first time, without any resizing. If len()
    fails, it falls back to just iterating and resizing as needed.
    Normally, this would be a win.

    The problem is, QuerySets have a __len__() method. Calling it is a lot
    faster than iterating over the whole query set and counting the items,
    but it does result in an additional database query, which is a lot
    slower than the list resizing! Writing the code as a list comprehension
    prevents list() from trying to optimize when it shouldn't!
     
    Roy Smith, Mar 7, 2013
    #1
    1. Advertising

  2. Roy Smith

    Dave Angel Guest

    On 03/06/2013 10:20 PM, Roy Smith wrote:
    > I stumbled upon an interesting bit of trivia concerning lists and list
    > comprehensions today.
    >
    > We use mongoengine as a database model layer. A mongoengine query
    > returns an iterable object called a QuerySet. The "obvious" way to
    > create a list of the query results would be:
    >
    > my_objects = list(my_query_set)
    >
    > and, indeed, that works. But, then I found this code:
    >
    > my_objects = [obj for obj in my_query_set]
    >
    > which seemed a bit silly. I called over the guy who wrote it and asked
    > him why he didn't just write it using list(). I was astounded when it
    > turned out there's a good reason!
    >
    > Apparently, list() has an "optimization" where it calls len() on its
    > argument to try and discover the number of items it's going to put into
    > the list. Presumably, list() uses this information to pre-allocate the
    > right amount of memory the first time, without any resizing. If len()
    > fails, it falls back to just iterating and resizing as needed.
    > Normally, this would be a win.
    >
    > The problem is, QuerySets have a __len__() method. Calling it is a lot
    > faster than iterating over the whole query set and counting the items,
    > but it does result in an additional database query, which is a lot
    > slower than the list resizing! Writing the code as a list comprehension
    > prevents list() from trying to optimize when it shouldn't!
    >


    That is very interesting. list() assumes the __len__() method would be
    very quick.

    Perhaps list() should take an optional second argument that specifies
    the initial length to allocate. That way code that either doesn't want
    __len__() to be used, or that already knows a reasonable number to use,
    can supply the value to preallocate.

    --
    DaveA
     
    Dave Angel, Mar 7, 2013
    #2
    1. Advertising

  3. Roy Smith

    Tim Chase Guest

    On 2013-03-06 22:20, Roy Smith wrote:
    > I stumbled upon an interesting bit of trivia concerning lists and
    > list comprehensions today.


    I agree with Dave Angel that this is interesting. A little testing
    shows that this can be rewritten as

    my_objects = list(iter(my_query_set))

    which seems to then skip the costly __len__ call. Performance geeks
    are welcome to time it against the list-comprehension version :)

    -tkc


    class Foo(object):
    def __init__(self):
    self.items = range(10)
    def __iter__(self):
    return iter(self.items)
    def __len__(self):
    print "Calling costly __len__"
    return len(self.items)

    print "Ensuring we can iterate over it:"
    for x in Foo():
    print x

    print "\nJust call list():"
    lst = list(Foo())
    print lst

    print "\nCall list(iter())"
    lst = list(iter(Foo()))
    print lst
     
    Tim Chase, Mar 7, 2013
    #3
  4. Roy Smith

    Kev Dwyer Guest

    Roy Smith wrote:

    > I stumbled upon an interesting bit of trivia concerning lists and list
    > comprehensions today.
    >
    > We use mongoengine as a database model layer. A mongoengine query
    > returns an iterable object called a QuerySet. The "obvious" way to
    > create a list of the query results would be:
    >
    > my_objects = list(my_query_set)
    >
    > and, indeed, that works. But, then I found this code:
    >
    > my_objects = [obj for obj in my_query_set]
    >
    > which seemed a bit silly. I called over the guy who wrote it and asked
    > him why he didn't just write it using list(). I was astounded when it
    > turned out there's a good reason!
    >
    > Apparently, list() has an "optimization" where it calls len() on its
    > argument to try and discover the number of items it's going to put into
    > the list. Presumably, list() uses this information to pre-allocate the
    > right amount of memory the first time, without any resizing. If len()
    > fails, it falls back to just iterating and resizing as needed.
    > Normally, this would be a win.
    >
    > The problem is, QuerySets have a __len__() method. Calling it is a lot
    > faster than iterating over the whole query set and counting the items,
    > but it does result in an additional database query, which is a lot
    > slower than the list resizing! Writing the code as a list comprehension
    > prevents list() from trying to optimize when it shouldn't!



    Interesting discovery. Yet isn't this as much an issue with the mongoengine
    library as with list()? Queryset.count() can be called if the "length" of a
    resultset needs to be retrieved, so the __len__() methd seems redundant.
    And given that it's not unheard of to call list() on iterables, perhaps the
    library designers should either optimise the __len__() method, or document
    the performance implications of calling list on the queryset?

    Anyway, thanks for this thought-provoking post.

    Cheers,

    Kev
     
    Kev Dwyer, Mar 7, 2013
    #4
  5. Tim Chase <python.list <at> tim.thechases.com> writes:

    > On 2013-03-06 22:20, Roy Smith wrote:
    > > I stumbled upon an interesting bit of trivia concerning lists and
    > > list comprehensions today.

    >
    > A little testing
    > shows that this can be rewritten as
    >
    > my_objects = list(iter(my_query_set))
    >
    > which seems to then skip the costly __len__ call. Performance geeks
    > are welcome to time it against the list-comprehension version
    >
    > class Foo(object):
    > def __init__(self):
    > self.items = range(10)
    > def __iter__(self):
    > return iter(self.items)
    > def __len__(self):
    > print "Calling costly __len__"
    > return len(self.items)
    >


    Well, it skips the costly len() call because your iter(Foo()) returns
    iter(range()) under the hood and list() uses that object's __len__() method. In
    most cases, such a workaround will not be feasible. Why should iter(QuerySet())
    have a faster __len__() method defined than QuerySet() itself. Most likely,
    iter(QuerySet()) just returns self anyway?
    Best,
    Wolfgang
     
    Wolfgang Maier, Mar 7, 2013
    #5
  6. Roy Smith

    Ian Kelly Guest

    On Thu, Mar 7, 2013 at 4:22 AM, Wolfgang Maier
    <-freiburg.de> wrote:
    > Well, it skips the costly len() call because your iter(Foo()) returns
    > iter(range()) under the hood and list() uses that object's __len__() method.


    Iterators do not generally have __len__ methods.

    >>> len(iter(range(10)))

    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    TypeError: object of type 'range_iterator' has no len()

    > In
    > most cases, such a workaround will not be feasible. Why should iter(QuerySet())
    > have a faster __len__() method defined than QuerySet() itself.


    iter(QuerySet()) should not have any __len__ method defined at all,
    which is why the optimization would be skipped.

    > Most likely,
    > iter(QuerySet()) just returns self anyway?


    But on this point, you are correct. The mongoengine QuerySet.__iter__
    method is defined as:

    def __iter__(self):
    self.rewind()
    return self

    This is unfortunate design. Not only does it mean that the iterator's
    __len__ method cannot be trusted (what should the __len__ of a
    partially exhausted iterator return?), but it also means that requesting
    an iterator over the QuerySet will also silently invalidate any
    existing iterators.
     
    Ian Kelly, Mar 7, 2013
    #6
  7. Roy Smith

    Ian Kelly Guest

    On Thu, Mar 7, 2013 at 9:20 AM, Christian Heimes <> wrote:
    > Am 07.03.2013 17:00, schrieb Ian Kelly:
    >> On Thu, Mar 7, 2013 at 4:22 AM, Wolfgang Maier
    >> <-freiburg.de> wrote:
    >>> Well, it skips the costly len() call because your iter(Foo()) returns
    >>> iter(range()) under the hood and list() uses that object's __len__() method.

    >>
    >> Iterators do not generally have __len__ methods.
    >>
    >>>>> len(iter(range(10)))

    >> Traceback (most recent call last):
    >> File "<stdin>", line 1, in <module>
    >> TypeError: object of type 'range_iterator' has no len()

    >
    > But iterators have a length hint method that are used for some
    > optimizations and preallocations, too.
    >
    >>>> i = iter(range(10))
    >>>> i.__length_hint__()

    > 10
    >
    > See http://www.python.org/dev/peps/pep-0424/


    Didn't know about that, thanks. Presumably a proper iter(QuerySet())
    object could implement __length_hint__ in an efficient manner rather
    than by just calling the __len__ of the underlying QuerySet,
     
    Ian Kelly, Mar 7, 2013
    #7
  8. Roy Smith

    Ian Kelly Guest

    On Thu, Mar 7, 2013 at 12:19 PM, Stefan Behnel <> wrote:
    >> Didn't know about that, thanks. Presumably a proper iter(QuerySet())
    >> object could implement __length_hint__ in an efficient manner rather
    >> than by just calling the __len__ of the underlying QuerySet,

    >
    > And how exactly would it do that, without either doing what __len__ does or
    > reading the whole result set into memory?


    If the underlying cursor provides its own efficient length hint, it
    could return that. Or if the query is result-limited, use the limit
    as a length hint, provided it's not absurdly large. And if you really
    can't efficiently determine anything about the length of the result
    set at all, you can always fall back on returning NotImplemented.
     
    Ian Kelly, Mar 7, 2013
    #8
  9. Roy Smith

    Terry Reedy Guest

    On 3/7/2013 11:00 AM, Ian Kelly wrote:

    > But on this point, you are correct. The mongoengine QuerySet.__iter__
    > method is defined as:
    >
    > def __iter__(self):
    > self.rewind()
    > return self
    >
    > This is unfortunate design. Not only does it mean that the iterator's
    > __len__ method cannot be trusted (what should the __len__ of a
    > partially exhausted iterator return?), but it also means that requesting
    > an iterator over the QuerySet will also silently invalidate any
    > existing iterators.


    I view that design as a violation of the iterator protocol and hence a
    program bug. __iter__ should either *just* return self (if the self is
    an iterator) or return a new object (if self is a non-iterator
    iterable). File objects are iterators and .__iter__ does not rewind.

    >>> f = open("f:/python/mypy/tem.py")
    >>> next(f)

    'class myit(list):\n'
    >>> f2 = iter(f)
    >>> f2 is f

    True
    >>> next(f2)

    " def __bytes__(self): return b'hello'\n"

    --
    Terry Jan Reedy
     
    Terry Reedy, Mar 7, 2013
    #9
  10. Roy Smith

    Terry Reedy Guest

    On 3/7/2013 11:20 AM, Christian Heimes wrote:

    > But iterators have a length hint method that are used for some
    > optimizations and preallocations, too.


    This is easy when the base iterable has a length method, as do range
    objects.

    >>>> i = iter(range(10))
    >>>> i.__length_hint__()

    > 10


    And the length_hint can (should be) decremented with each next call.

    >>> next(i); next(i)

    0
    1
    >>> i.__length_hint__()

    8

    --
    Terry Jan Reedy
     
    Terry Reedy, Mar 7, 2013
    #10

  11. > >>> Iterators do not generally have __len__ methods.
    > >>>
    > >>> >>> len(iter(range(10)))
    > >>> Traceback (most recent call last):
    > >>> File "<stdin>", line 1, in <module>
    > >>> TypeError: object of type 'range_iterator' has no len()
    > >>
    > >> But iterators have a length hint method that are used for some
    > >> optimizations and preallocations, too.
    > >>
    > >> >>> i = iter(range(10))
    > >> >>> i.__length_hint__()
    > >> 10
    > >>
    > >> See http://www.python.org/dev/peps/pep-0424/

    > >


    very interesting (hadn't heard of it)! Just checked the PEP,
    then tested list()'s behavior, and it is just as described:

    class stupid(list):
    def __len__(self):
    print ('len() called')
    return NotImplemented

    def __length_hint__(self):
    print ('hint requested')
    l=iter(self).__length_hint__()
    print (l)
    return l

    a=stupid((1,2,3))
    len(d)
    ======>
    len() called

    Traceback (most recent call last):
    File "<pyshell#79>", line 1, in <module>
    len(d)
    TypeError: an integer is required

    list(d)
    ======>
    len() called
    hint requested
    3
    [1, 2, 3]

    so list() first tries to call the iterable's __len__ method. If that raises a
    TypeError it falls back to __length_hint__ .
    What I still don't know is how the listiterator object's __length_hint__ works.
    Why, in this case, does it know that it has a length of 3 ? The PEP does not
    provide any hint how a reasonable hint could be calculated.

    > And how exactly would it do that, without either doing what __len__ does or
    > reading the whole result set into memory?
    >
    > Stefan
    >


    a very good question.

    Best,
    Wolfgang
     
    Wolfgang Maier, Mar 7, 2013
    #11
  12. Roy Smith

    Terry Reedy Guest

    On 3/7/2013 3:41 PM, Wolfgang Maier wrote:
    >
    >>>>> Iterators do not generally have __len__ methods.
    >>>>>
    >>>>>>>> len(iter(range(10)))
    >>>>> Traceback (most recent call last):
    >>>>> File "<stdin>", line 1, in <module>
    >>>>> TypeError: object of type 'range_iterator' has no len()
    >>>>
    >>>> But iterators have a length hint method that are used for some
    >>>> optimizations and preallocations, too.
    >>>>
    >>>>>>> i = iter(range(10))
    >>>>>>> i.__length_hint__()
    >>>> 10
    >>>>
    >>>> See http://www.python.org/dev/peps/pep-0424/
    >>>

    >
    > very interesting (hadn't heard of it)! Just checked the PEP,
    > then tested list()'s behavior, and it is just as described:
    >
    > class stupid(list):
    > def __len__(self):
    > print ('len() called')
    > return NotImplemented
    >
    > def __length_hint__(self):
    > print ('hint requested')
    > l=iter(self).__length_hint__()
    > print (l)
    > return l
    >
    > a=stupid((1,2,3))
    > len(d)
    > ======>
    > len() called
    >
    > Traceback (most recent call last):
    > File "<pyshell#79>", line 1, in <module>
    > len(d)
    > TypeError: an integer is required
    >
    > list(d)
    > ======>
    > len() called
    > hint requested
    > 3
    > [1, 2, 3]
    >
    > so list() first tries to call the iterable's __len__ method. If that raises a
    > TypeError it falls back to __length_hint__ .
    > What I still don't know is how the listiterator object's __length_hint__ works.
    > Why, in this case, does it know that it has a length of 3 ? The PEP does not
    > provide any hint how a reasonable hint could be calculated.


    'How' depends on the iterator, but when it has an underlying concrete
    iterable of known length, it should be rather trivial as .__next__ will
    explicitly or implicitly use the count remaining in its operation. Part
    of the justification of adding __length_hint__ is that in many cases it
    is so easy. Iterators based on an iterator with a length_hint can just
    pass it along.

    The list iterator might work with a C pointer to the next item and a
    countdown count initialized to the list length. The __next__ method
    might be something like this mixed C and Python pseudocode:

    if (countdown--) return *(current--);
    else raise StopIteration

    (For non-C coders, postfix -- decrements value *after* retrieving it.)
    Then __length_hint__ would just return countdown. Tuples would work the
    same way. Sets and dicts would need current-- elaborated to skip over
    empty hash table slots. Range iterators would add the step to current,
    instead of 1.

    --
    Terry Jan Reedy
     
    Terry Reedy, Mar 7, 2013
    #12
  13. On Wed, 06 Mar 2013 22:20:11 -0500, Roy Smith wrote:

    > I stumbled upon an interesting bit of trivia concerning lists and list
    > comprehensions today.
    >
    > We use mongoengine as a database model layer. A mongoengine query
    > returns an iterable object called a QuerySet. The "obvious" way to
    > create a list of the query results would be:
    >
    > my_objects = list(my_query_set)
    >
    > and, indeed, that works. But, then I found this code:
    >
    > my_objects = [obj for obj in my_query_set]
    >
    > which seemed a bit silly. I called over the guy who wrote it and asked
    > him why he didn't just write it using list(). I was astounded when it
    > turned out there's a good reason!


    And why was that not documented in the code?

    I hope you took this fellow out and gave him a damned good thrashing!



    --
    Steven
     
    Steven D'Aprano, Mar 8, 2013
    #13
  14. Roy Smith

    Roy Smith Guest

    In article <513a26fa$0$30001$c3e8da3$>,
    Steven D'Aprano <> wrote:

    > On Wed, 06 Mar 2013 22:20:11 -0500, Roy Smith wrote:
    >
    > > I stumbled upon an interesting bit of trivia concerning lists and list
    > > comprehensions today.
    > >
    > > We use mongoengine as a database model layer. A mongoengine query
    > > returns an iterable object called a QuerySet. The "obvious" way to
    > > create a list of the query results would be:
    > >
    > > my_objects = list(my_query_set)
    > >
    > > and, indeed, that works. But, then I found this code:
    > >
    > > my_objects = [obj for obj in my_query_set]
    > >
    > > which seemed a bit silly. I called over the guy who wrote it and asked
    > > him why he didn't just write it using list(). I was astounded when it
    > > turned out there's a good reason!

    >
    > And why was that not documented in the code?
    >
    > I hope you took this fellow out and gave him a damned good thrashing!


    Nah. I'm a pacifist. Besides, he's my boss :)
     
    Roy Smith, Mar 8, 2013
    #14
  15. Roy Smith

    Roy Smith Guest

    In article <>,
    Roy Smith <> wrote:

    > The problem is, QuerySets have a __len__() method. Calling it is a lot
    > faster than iterating over the whole query set and counting the items,
    > but it does result in an additional database query, which is a lot
    > slower than the list resizing! Writing the code as a list comprehension
    > prevents list() from trying to optimize when it shouldn't!


    Hmmm, I think I've found a good solution.

    It turns out, we don't actually use QuerySet in our models. We've
    defined our own QuerySet subclass which adds a few convenience methods.
    Adding

    def __len__(self):
    raise NotImplemented

    to our subclass should do the job. It looks like list() respects that,
    calls __iter__(), and does the right thing. I can't find any place
    where that behavior for list() is documented, but it's logical and
    experimentally, it seems to work.

    Can anybody see any downside to this?
     
    Roy Smith, Mar 10, 2013
    #15
  16. Roy Smith

    Terry Reedy Guest

    On 3/10/2013 9:05 AM, Roy Smith wrote:
    > In article <>,
    > Roy Smith <> wrote:
    >
    >> The problem is, QuerySets have a __len__() method. Calling it is a lot
    >> faster than iterating over the whole query set and counting the items,
    >> but it does result in an additional database query, which is a lot
    >> slower than the list resizing! Writing the code as a list comprehension
    >> prevents list() from trying to optimize when it shouldn't!

    >
    > Hmmm, I think I've found a good solution.
    >
    > It turns out, we don't actually use QuerySet in our models. We've
    > defined our own QuerySet subclass which adds a few convenience methods.
    > Adding
    >
    > def __len__(self):
    > raise NotImplemented
    >
    > to our subclass should do the job. It looks like list() respects that,
    > calls __iter__(), and does the right thing. I can't find any place
    > where that behavior for list() is documented,


    It is a cpython implementation detail that probably has changed with the
    versions.

    > but it's logical and experimentally, it seems to work.


    > Can anybody see any downside to this?


    No. Give it a try.


    --
    Terry Jan Reedy
     
    Terry Reedy, Mar 10, 2013
    #16
  17. Roy Smith

    Roy Smith Guest

    In article <>,
    Terry Reedy <> wrote:

    > > It turns out, we don't actually use QuerySet in our models. We've
    > > defined our own QuerySet subclass which adds a few convenience methods.
    > > Adding
    > >
    > > def __len__(self):
    > > raise NotImplemented
    > >
    > > to our subclass should do the job. It looks like list() respects that,
    > > calls __iter__(), and does the right thing. I can't find any place
    > > where that behavior for list() is documented,

    >
    > It is a cpython implementation detail that probably has changed with the
    > versions.


    Yeah, that's what I was afraid of. The "obvious" solution of:

    class QuerySet(mongoengine.queryset.QuerySet):
    def __init__(self, document, collection):
    super(QuerySet, self).__init__(document, collection)
    [...]
    del self.__len__

    results in:

    [rest of stack dump elided]
    del self.__len__
    AttributeError: __len__

    which I don't understand.
     
    Roy Smith, Mar 10, 2013
    #17
  18. On Sun, 10 Mar 2013 18:34:58 -0400, Roy Smith wrote:

    > Yeah, that's what I was afraid of. The "obvious" solution of:
    >
    > class QuerySet(mongoengine.queryset.QuerySet):
    > def __init__(self, document, collection):
    > super(QuerySet, self).__init__(document, collection) [...]
    > del self.__len__
    >
    > results in:
    >
    > [rest of stack dump elided]
    > del self.__len__
    > AttributeError: __len__
    >
    > which I don't understand.


    You don't define a per-instance custom QuerySet attribute called __len__,
    so you can't delete it from the instance.

    The existing __len__ attribute is attached to the mongoengine
    queryset.QuerySet class, not the instance. You could monkeypatch the
    parent class, but that will probably break something else:

    del mongoengine.queryset.QuerySet.__len__


    or you could try over-riding __len__ with a fake that pretends it doesn't
    exist:


    def __len__(self):
    raise AttributeError


    Try that and see it it works. (It may not.)


    --
    Steven
     
    Steven D'Aprano, Mar 10, 2013
    #18
  19. Roy Smith

    Roy Smith Guest

    In article <513d18d6$0$6512$c3e8da3$>,
    Steven D'Aprano <> wrote:

    > On Sun, 10 Mar 2013 18:34:58 -0400, Roy Smith wrote:
    >
    > > Yeah, that's what I was afraid of. The "obvious" solution of:
    > >
    > > class QuerySet(mongoengine.queryset.QuerySet):
    > > def __init__(self, document, collection):
    > > super(QuerySet, self).__init__(document, collection) [...]
    > > del self.__len__
    > >
    > > results in:
    > >
    > > [rest of stack dump elided]
    > > del self.__len__
    > > AttributeError: __len__
    > >
    > > which I don't understand.

    >
    > You don't define a per-instance custom QuerySet attribute called __len__,
    > so you can't delete it from the instance.
    >
    > The existing __len__ attribute is attached to the mongoengine
    > queryset.QuerySet class, not the instance. You could monkeypatch the
    > parent class, but that will probably break something else:
    >
    > del mongoengine.queryset.QuerySet.__len__
    >
    >
    > or you could try over-riding __len__ with a fake that pretends it doesn't
    > exist:
    >
    >
    > def __len__(self):
    > raise AttributeError
    >
    >
    > Try that and see it it works. (It may not.)


    Yeah, that was one of the things I experimented with. It seems to work,
    but like most of these other things, I suspect it's version specific.
    If list() does:

    try:
    l = obj.__len__()
    except AttributeErrror:
    l = 0

    then we're good. On the other hand, if it does:

    if obj.getattr('__len__', None):
    # etc

    then it fails.

    At this point, I'm thinking the monkeypatch is the right solution.
     
    Roy Smith, Mar 10, 2013
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. John
    Replies:
    2
    Views:
    4,695
    Mythran
    Oct 23, 2004
  2. DENG
    Replies:
    6
    Views:
    314
  3. Replies:
    40
    Views:
    931
    Gabriel Genellina
    May 18, 2007
  4. El Kabong

    Interesting list

    El Kabong, Jun 13, 2007, in forum: HTML
    Replies:
    15
    Views:
    604
    Blinky the Shark
    Jun 15, 2007
  5. Ravikiran

    Zero Optimization and Sign Optimization???

    Ravikiran, Nov 17, 2008, in forum: C Programming
    Replies:
    22
    Views:
    900
    Thad Smith
    Nov 24, 2008
Loading...

Share This Page