Re: Dangerous behavior of list(generator)

Discussion in 'Python' started by Gabriel Genellina, Dec 13, 2009.

  1. En Sat, 12 Dec 2009 23:43:20 -0300, Ned Deily <> escribió:
    > In article <>,
    > Ned Deily <> wrote:
    >> In article
    >> <>,
    >> Benjamin Kaplan <> wrote:
    >> > On Sat, Dec 12, 2009 at 7:15 PM, Tom Machinski

    >> <>
    >> > wrote:



    >> > > >>> def sit(): raise StopIteration()
    >> > > ...
    >> > > >>> [f() for f in (lambda:1, sit, lambda:2)]
    >> > > Traceback (most recent call last):
    >> > > File "<stdin>", line 1, in <module>
    >> > > File "<stdin>", line 1, in sit
    >> > > StopIteration
    >> > > >>> list(f() for f in (lambda:1, sit, lambda:2))
    >> > > [1]


    >> > > In most cases, `list(generator)` works as expected. Thus,
    >> > > `list(<generator expression>)` is generally equivalent to

    >> `[<generator
    >> > > expression>]`.


    >> > Actually, it's list(generator) vs. a list comprehension. I agree that
    >> > it can be confusing, but Python considers them to be two different
    >> > constructs.


    I think nobody has addressed the OP arguments (as I understand them).

    First, except the obvious outer delimiters (and some corner cases in 2.x,
    fixed in Python 3), a list comprehension and a generator expression share
    the same syntax: (x for x in some_values) vs [x for x in some_values].

    Also, *almost* always, both list(<comprehension>) and [<comprehension>],
    when evaluated, yield the same result [1]. *Almost* because StopIteration
    is handled differently as the OP discovered: the list comprehension
    propagates a StopIteration exception to its caller; the list constructor
    swallows the exception and the caller never sees it.

    Despite a promise in PEP 289, generator expressions semantics isn't
    explained in detail in the language reference. I can't tell if the
    difference is intentional, accidental, undocumented behavior, an
    implementation accident, a bug, or what...

    [1] <comprehension> being a syntactic construct like:
    x**2 for x in range(5)
    or:
    f() for f in [lambda:1, sit, lambda:2]

    --
    Gabriel Genellina
    Gabriel Genellina, Dec 13, 2009
    #1
    1. Advertising

  2. In article <>,
    Gabriel Genellina <> wrote:
    <SNIP>
    >
    >Despite a promise in PEP 289, generator expressions semantics isn't
    >explained in detail in the language reference. I can't tell if the
    >difference is intentional, accidental, undocumented behavior, an
    >implementation accident, a bug, or what...


    Philosophically speaking ...
    An important feature that is not documented is a severe defect.
    (important maps to severe).
    Before it is documented, there can be no discrepancy between
    specification and implementation so other defects are formally
    not present in relation to this situation.

    >--
    >Gabriel Genellina
    >


    Groetjes Albert.


    --
    --
    Albert van der Horst, UTRECHT,THE NETHERLANDS
    Economic growth -- being exponential -- ultimately falters.
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
    Albert van der Horst, Dec 18, 2009
    #2
    1. Advertising

  3. Albert van der Horst wrote:

    > An important feature that is not documented is a severe defect.


    This isn't something that I would expect to find documented
    under the heading of generator expressions, because it doesn't
    have anything to do with them. It's an interaction between
    the iterator protocol and the list() constructor. Any other
    iterable that leaked a StopIteration exception would cause
    the same effect.

    --
    Greg
    Gregory Ewing, Dec 19, 2009
    #3
  4. Thanks for the comment and discussion guys.

    Bottom line, I'm going to have to remove this pattern from my code:

    foo = (foo for foo in foos if foo.bar).next()

    I used to have that a lot in cases where not finding at least one
    valid foo is an actual fatal error. But using StopIteration to signal
    a fatal condition becomes a bug when interacting with list() as shown
    in the original post.

    It would be nice if there was a builtin for "get the first element in
    a genexp, or raise an exception (which isn't StopIteration)", sort of
    like:

    from itertools import islice

    def first_or_raise(genexp):
    L = list(islice(genexp, 1))
    if not L:
    raise RuntimeError('no elements found')
    return L[0]

    I also think Jean-Paul's had a good point about how the problems in
    the list/genexp interaction could be addressed.

    Thank you,

    -- Tom
    Tom Machinski, Dec 30, 2009
    #4
  5. On Wed, 30 Dec 2009 15:18:11 -0800, Tom Machinski wrote:

    > Thanks for the comment and discussion guys.
    >
    > Bottom line, I'm going to have to remove this pattern from my code:
    >
    > foo = (foo for foo in foos if foo.bar).next()


    I don't see why. What's wrong with it? Unless you embed it in a call to
    list, or similar, it will explicitly raise StopIteration as expected.


    > I used to have that a lot in cases where not finding at least one valid
    > foo is an actual fatal error.


    What's wrong with the obvious solution?

    if not any(foo for foo in foos if foo.bar):
    raise ValueError('need at least one valid foo')



    > But using StopIteration to signal a fatal
    > condition becomes a bug when interacting with list() as shown in the
    > original post.


    You shouldn't use StopIteration to signal fatal conditions, because
    that's not what it is for. It's acceptable to catch it when *directly*
    calling next, but otherwise you should expect that StopIteration will be
    caught and suppressed by just about anything.


    > It would be nice if there was a builtin for "get the first element in a
    > genexp, or raise an exception (which isn't StopIteration)",


    Not everything needs to be a built-in.

    def get_first_or_fail(iterable_or_sequence):
    it = iter(iterable_or_sequence)
    try:
    return it.next() # use next(it) in Python 3
    except StopIteration:
    raise ValueError('empty iterable')

    This is perfectly usable as a helper function, or it's short enough to be
    used in-line if you prefer.


    --
    Steven
    Steven D'Aprano, Dec 31, 2009
    #5
  6. On Wed, Dec 30, 2009 at 7:01 PM, Steven D'Aprano
    <> wrote:
    >
    > I don't see why. What's wrong with it? Unless you embed it in a call to
    > list, or similar, it will explicitly raise StopIteration as expected.
    >
    >
    >> I used to have that a lot in cases where not finding at least one valid
    >> foo is an actual fatal error.

    >
    > What's wrong with the obvious solution?
    >
    > if not any(foo for foo in foos if foo.bar):
    >    raise ValueError('need at least one valid foo')


    That would require 2 iterations through foos- once in the test, once
    for the assignment if successful. If foos takes a long time to iterate
    through, it might be faster to put a try-except around the original
    statement, catch the StopIteration, and raise a ValueError in its
    place. Which I agree is much better practice than letting the
    StopIteration signal the fatal error.
    Benjamin Kaplan, Dec 31, 2009
    #6
  7. Gabriel Genellina

    Peter Otten Guest

    Tom Machinski wrote:

    > It would be nice if there was a builtin for "get the first element in
    > a genexp, or raise an exception (which isn't StopIteration)", sort of
    > like:
    >
    > from itertools import islice
    >
    > def first_or_raise(genexp):
    > L = list(islice(genexp, 1))
    > if not L:
    > raise RuntimeError('no elements found')
    > return L[0]


    Somewhat related in 2.6 there's the next() built-in which accepts a default
    value. You can provide a sentinel and test for that instead of using
    try...except:

    >>> from random import randrange
    >>> from functools import partial
    >>> def g():

    .... return iter(partial(randrange, 3), 2)
    ....
    >>> next(g(), "empty")

    1
    >>> next(g(), "empty")

    1
    >>> next(g(), "empty")

    'empty'
    >>> next(g(), "empty")

    'empty'
    >>> next(g(), "empty")

    'empty'
    >>> next(g(), "empty")

    0

    Peter
    Peter Otten, Dec 31, 2009
    #7
  8. On Wed, 30 Dec 2009 23:20:06 -0500, Benjamin Kaplan wrote:

    >>> I used to have that a lot in cases where not finding at least one
    >>> valid foo is an actual fatal error.

    >>
    >> What's wrong with the obvious solution?
    >>
    >> if not any(foo for foo in foos if foo.bar):
    >>    raise ValueError('need at least one valid foo')

    >
    > That would require 2 iterations through foos- once in the test, once for
    > the assignment if successful.


    Remember though that any is a lazy test: it returns as soon as it gets a
    result. In the case of an empty list, it returns immediately with False,
    and in the case of a non-empty list, it returns immediately it reaches a
    true item. It doesn't matter if there are twenty thousand items, it will
    only look at the first so long as it is true.

    Which of course answers my own question... what's wrong with using any is
    that it fails if the objects are all considered false in a boolean
    context, or if they might be. That means it will work for some objects
    (e.g. the re module's MatchObject instances which are always true), but
    not for arbitrary objects which may be false.



    --
    Steven
    Steven D'Aprano, Dec 31, 2009
    #8
  9. On Wed, Dec 30, 2009 at 4:01 PM, Steven D'Aprano
    <> wrote:
    > On Wed, 30 Dec 2009 15:18:11 -0800, Tom Machinski wrote:
    >> Bottom line, I'm going to have to remove this pattern from my code:
    >>
    >>   foo = (foo for foo in foos if foo.bar).next()

    >
    > I don't see why. What's wrong with it? Unless you embed it in a call to
    > list, or similar, it will explicitly raise StopIteration as expected.


    Exactly; this seems innocuous, but if some caller of this code uses it
    in a list() constructor, a very subtle and dangerous bug is introduced
    - see OP. This is the entire point of this post.

    In a large, non-trivial application, you simply cannot afford the
    assumption that no caller will ever do that. Even if you have perfect
    memory, some of your other developers or library users may not.

    As for what's wrong with the "if not any" solution, Benjamin Kaplan's
    post hits the nail on its head. This is a bioinformatics application,
    so the iterable "foos" tends to be very large, so saving half the
    runtime makes a big difference.

    -- Tom
    Tom Machinski, Dec 31, 2009
    #9
  10. On Thu, Dec 31, 2009 at 1:54 AM, Peter Otten <> wrote:
    > Somewhat related in 2.6 there's the next() built-in which accepts a default
    > value. You can provide a sentinel and test for that instead of using
    > try...except:


    Thanks. This can be useful in some of the simpler cases. As you surely
    realize, to be perfectly safe, especially when the iterable can
    contain any value (including your sentinel), we must use an
    out-of-band return value, hence an exception is the only truly safe
    solution.

    -- Tom
    Tom Machinski, Dec 31, 2009
    #10
  11. On Thu, Dec 31, 2009 at 12:18 PM, Stephen Hansen <> wrote:
    > Hmm? Just use a sentinel which /can't/ exist in the list: then its truly
    > safe. If the list can contain all the usual sort of sentinels (False, None,
    > 0, -1, whatever), then just make a unique one all your own.
    > sentinel = object()
    > if next(g(), sentinel) is sentinel:
    >     ...
    > Its impossible to get a false-positive then, as nothing g() can ever produce
    > would ever be precisely "sentinel" (which would usually for me be some
    > global const if I need to do such things in multiple places).
    > --S


    That's not a bad idea. Another nice feature is support for callable
    "default" values; it would make several new things easier, including
    raising an exception when you really want that (i.e. if not finding a
    single element is truly exceptional).

    -- Tom
    Tom Machinski, Jan 1, 2010
    #11
  12. On Thu, 31 Dec 2009 11:34:39 -0800, Tom Machinski wrote:

    > On Wed, Dec 30, 2009 at 4:01 PM, Steven D'Aprano
    > <> wrote:
    >> On Wed, 30 Dec 2009 15:18:11 -0800, Tom Machinski wrote:
    >>> Bottom line, I'm going to have to remove this pattern from my code:
    >>>
    >>>   foo = (foo for foo in foos if foo.bar).next()

    >>
    >> I don't see why. What's wrong with it? Unless you embed it in a call to
    >> list, or similar, it will explicitly raise StopIteration as expected.

    >
    > Exactly; this seems innocuous, but if some caller of this code uses it
    > in a list() constructor, a very subtle and dangerous bug is introduced -
    > see OP. This is the entire point of this post.


    Then don't use it in a list() constructor.

    That's a glib answer, of course. A better answer is to point out that the
    problem is not with the above expression, but with letting StopIteration
    bubble up as an error exception instead of dealing with it immediately.
    That's not what it's for, and you can't trust it not to be captured by
    something. If StopIteration represents an error condition, you need to
    deal with it immediately and convert it to an exception which isn't
    likely to disappear.


    > In a large, non-trivial application, you simply cannot afford the
    > assumption that no caller will ever do that. Even if you have perfect
    > memory, some of your other developers or library users may not.


    You shouldn't put the responsibility of dealing with the StopIteration on
    the caller, because StopIteraction is a signal not an error condition,
    and you can't tell when that signal will disappear. The responsibility
    lies on the writer of the function containing the line (that is, the
    Original Poster of this thread).

    So you need something like this:

    def my_function():
    try:
    foo = (foo for foo in foos if foo.bar).next()
    except StopIteration:
    handle_empty_foos()
    else:
    handle_at_least_one_foo()


    handle_empty_foos may be as simple as raising a new exception and letting
    that bubble up to whatever layer of the application is expected to deal
    with it.



    > As for what's wrong with the "if not any" solution, Benjamin Kaplan's
    > post hits the nail on its head. This is a bioinformatics application, so
    > the iterable "foos" tends to be very large, so saving half the runtime
    > makes a big difference.


    Possibly you haven't seen my reply to Benjamin, so I'll paraphrase:
    that's incorrect, because any() is lazy and will return as soon as it
    hits a non-false item. See the docs:

    http://docs.python.org/library/functions.html#any

    If the foo items are considered true (e.g. non-empty strings), then you
    can guarantee that any() will return on the very first item.

    If the foo items are arbitrary objects which have an equal chance of
    being considered true or false, then on average it will have to look at
    half the list, which is O(N) and may be a tad expensive for large N. But
    how likely is that? One has to be realistic here, and consider the type
    of data you realistically need to deal with and not pathological cases.
    There's no limit to the problems you may have with sufficiently
    pathological data:

    class Evil(object):
    @property
    def bar(self):
    import time
    time.sleep(1e8)
    return True

    foos = [Evil(), "a", "b", "c", "d"]
    foo = (foo for foo in foos if foo.bar).next()


    any() is the standard, idiomatic solution for solving this sort of
    problem. Before rejecting it on the basis of slowness, you need to
    determine that long runs of false items ahead of the first true item is a
    realistic scenario, and that calling any() really is a bottleneck.
    Anything less is premature optimization.



    --
    Steven
    Steven D'Aprano, Jan 1, 2010
    #12
  13. On 1 Jan., 02:47, Steven D'Aprano <st...@REMOVE-THIS-
    cybersource.com.au> wrote:
    > On Thu, 31 Dec 2009 11:34:39 -0800, Tom Machinski wrote:



    > > On Wed, Dec 30, 2009 at 4:01 PM, Steven D'Aprano
    > > <> wrote:
    > >> On Wed, 30 Dec 2009 15:18:11 -0800, Tom Machinski wrote:
    > >>> Bottom line, I'm going to have to remove this pattern from my code:

    >
    > >>>   foo = (foo for foo in foos if foo.bar).next()

    >
    > >> I don't see why. What's wrong with it? Unless you embed it in a call to
    > >> list, or similar, it will explicitly raise StopIteration as expected.

    >
    > > Exactly; this seems innocuous, but if some caller of this code uses it
    > > in a list() constructor, a very subtle and dangerous bug is introduced -
    > > see OP. This is the entire point of this post.

    >
    > Then don't use it in a list() constructor.
    >
    > That's a glib answer, of course. A better answer is to point out that the
    > problem is not with the above expression, but with letting StopIteration
    > bubble up as an error exception instead of dealing with it immediately.
    > That's not what it's for, and you can't trust it not to be captured by
    > something. If StopIteration represents an error condition, you need to
    > deal with it immediately and convert it to an exception which isn't
    > likely to disappear.
    >
    > > In a large, non-trivial application, you simply cannot afford the
    > > assumption that no caller will ever do that. Even if you have perfect
    > > memory, some of your other developers or library users may not.

    >
    > You shouldn't put the responsibility of dealing with the StopIteration on
    > the caller, because StopIteraction is a signal not an error condition,
    > and you can't tell when that signal will disappear. The responsibility
    > lies on the writer of the function containing the line (that is, the
    > Original Poster of this thread).
    >
    > So you need something like this:
    >
    > def my_function():
    >     try:
    >         foo = (foo for foo in foos if foo.bar).next()
    >     except StopIteration:
    >         handle_empty_foos()
    >     else:
    >         handle_at_least_one_foo()
    >
    > handle_empty_foos may be as simple as raising a new exception and letting
    > that bubble up to whatever layer of the application is expected to deal
    > with it.
    >


    > > As for what's wrong with the "if not any" solution, Benjamin Kaplan's
    > > post hits the nail on its head. This is a bioinformatics application, so
    > > the iterable "foos" tends to be very large, so saving half the runtime
    > > makes a big difference.

    >
    > Possibly you haven't seen my reply to Benjamin, so I'll paraphrase:
    > that's incorrect, because any() is lazy and will return as soon as it
    > hits a non-false item.


    Tom's point is that
    if not any(foo for foo in foos if foo.bar):
    foo = (foo for foo in foos if foo.bar).next()
    iterates twice over (the same first few elements of) foos, which
    should take about twice as long as iterating once. The lazyness of
    "any" does not seem to matter here.
    Of course, you're right that the iteration might or might not be the
    bottleneck. On the other hand, foos might not even be reiterable.

    > If the foo items are arbitrary objects which have an equal chance of
    > being considered true or false, then on average it will have to look at
    > half the list,


    By which definition of chance? :)

    Wolfram
    Wolfram Hinderer, Jan 1, 2010
    #13
  14. On Fri, 01 Jan 2010 05:19:02 -0800, Wolfram Hinderer wrote:

    > On 1 Jan., 02:47, Steven D'Aprano <st...@REMOVE-THIS-
    > cybersource.com.au> wrote:
    >> On Thu, 31 Dec 2009 11:34:39 -0800, Tom Machinski wrote:

    [...]
    >> > As for what's wrong with the "if not any" solution, Benjamin Kaplan's
    >> > post hits the nail on its head. This is a bioinformatics application,
    >> > so the iterable "foos" tends to be very large, so saving half the
    >> > runtime makes a big difference.

    >>
    >> Possibly you haven't seen my reply to Benjamin, so I'll paraphrase:
    >> that's incorrect, because any() is lazy and will return as soon as it
    >> hits a non-false item.

    >
    > Tom's point is that
    > if not any(foo for foo in foos if foo.bar):
    > foo = (foo for foo in foos if foo.bar).next()
    > iterates twice over (the same first few elements of) foos, which should
    > take about twice as long as iterating once. The lazyness of "any" does
    > not seem to matter here.


    That's no different from any "Look Before You Leap" idiom. If you do this:

    if key in dict:
    x = dict[key]

    you search the dict twice, once to see if the key is there, and the
    second time to fetch the value. Whether that is better or faster than the
    alternative:

    try:
    x = dict[key]
    except KeyError:
    pass

    depends on how often you expect the lookup to fail.

    In any case, I would claim that Tom's argument is a classic example of
    premature optimization: by his own admission:

    'the iterable "foos" tends to be very large'

    which implies that whatever happens to the foos after this test, it will
    probably be very time consuming. If it takes (for the sake of the
    argument) 10 milliseconds to process the entire iterable, who cares
    whether it takes 0.01 or 0.02 ms to check that the iterable is valid?



    > Of course, you're right that the iteration might or might not be the
    > bottleneck. On the other hand, foos might not even be reiterable.


    If that's the case, then the existing solution potentially throws away
    the first value of foos every time the caller tests to see if it is empty.

    Dealing with non-reiterable iterators can be a nuisance. In such a case,
    it may be best to avoid Look Before You Leap altogether:

    empty = True
    for foo in foos:
    if foo.bar:
    empty = False
    process(foo)
    if empty:
    handle_error_condition()




    --
    Steven
    Steven D'Aprano, Jan 1, 2010
    #14
  15. >> Bottom line, I'm going to have to remove this pattern from my code:
    >>
    >> foo = (foo for foo in foos if foo.bar).next()


    I recommend to rewrite this like so:

    def first(gen):
    try:
    return gen.next()
    except StopIteration:
    raise ValueError, "No first value"

    foo = first(foo for foo in foos if foo.bar)

    As others have said: don't let StopIteration appear unexpectedly;
    IOW, consume generators right away in a loop construct (where
    this first function is a loop construct as well). A different
    way of writing it would be

    def first(gen):
    for value in gen:
    return value
    raise ValueError, "empty collection"

    Regards,
    Martin
    Martin v. Loewis, Jan 2, 2010
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Martin Maurer
    Replies:
    3
    Views:
    4,752
    Peter
    Apr 19, 2006
  2. Tom Machinski

    Dangerous behavior of list(generator)

    Tom Machinski, Dec 13, 2009, in forum: Python
    Replies:
    0
    Views:
    220
    Tom Machinski
    Dec 13, 2009
  3. Replies:
    10
    Views:
    415
    Peter Otten
    Dec 14, 2009
  4. Replies:
    3
    Views:
    274
    Steven D'Aprano
    Dec 15, 2009
  5. Replies:
    4
    Views:
    251
    Martin v. Loewis
    Jan 2, 2010
Loading...

Share This Page