Big time WTF with generators - bug?

Discussion in 'Python' started by James Stroud, Feb 13, 2008.

  1. James Stroud

    Hello,

    I'm boggled.

    I have this function which takes a keyer that keys a table (an iterable). I
    filter based on these keys, then groupby based on the filtered keys and
    a keyfunc. Then, to make the resulting generator behave a little nicer
    (no requirement for the user to unpack the keys), I strip the keys in a
    generator expression that unpacks them and generates the (k, g) pairs I
    want ("regrouped"). I then append each of these regrouped series
    generators to the growing "serieses" list ("serieses" is the plural of
    series, if your vocabulary isn't that big).

    Here's the function:

    def serialize(table, keyer=_keyer,
                  selector=_selector,
                  keyfunc=_keyfunc,
                  series_keyfunc=_series_keyfunc):
      keyed = izip(imap(keyer, table), table)
      filtered = ifilter(selector, keyed)
      serialized = groupby(filtered, series_keyfunc)
      serieses = []
      for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
        serieses.append((s_name, regrouped))
      for s in serieses:
        yield s


    I defined a little debugging function called iterprint:

    def iterprint(thing):
      if isinstance(thing, str):
        print thing
      elif hasattr(thing, 'items'):
        print thing.items()
      else:
        try:
          for x in thing:
            iterprint(x)
        except TypeError:
          print thing

    The gist is that iterprint will print any generator down to its
    non-iterable components--it works fine for my purpose here, but I
    included the code for the curious.

    When I apply iterprint in the following manner (the only change is the
    added iterprint line), everything looks fine and my "regrouped" generators
    in "serieses" generate what they are supposed to when iterprinted. The
    iterprint at this point shows that everything is working just the way I
    want (I can see that the last item in "serieses" iterprints just fine).

    def serialize(table, keyer=_keyer,
                  selector=_selector,
                  keyfunc=_keyfunc,
                  series_keyfunc=_series_keyfunc):
      keyed = izip(imap(keyer, table), table)
      filtered = ifilter(selector, keyed)
      serialized = groupby(filtered, series_keyfunc)
      serieses = []
      for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
        serieses.append((s_name, regrouped))
        iterprint(serieses)
      for s in serieses:
        yield s

    Now, here's the rub. When I apply iterprint in the following manner, it
    looks like my generator ("regrouped") gets consumed (note that the only
    change is a two-space dedent of the iterprint call; the printing now
    happens outside the loop):

    def serialize(table, keyer=_keyer,
                  selector=_selector,
                  keyfunc=_keyfunc,
                  series_keyfunc=_series_keyfunc):
      keyed = izip(imap(keyer, table), table)
      filtered = ifilter(selector, keyed)
      serialized = groupby(filtered, series_keyfunc)
      serieses = []
      for s_name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
        serieses.append((s_name, regrouped))
      iterprint(serieses)
      for s in serieses:
        yield s

    Now, what is consuming my "regrouped" generator when going from inside
    the loop to outside?

    Thanks in advance for any clue.

    py> print version
    2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
    [GCC 4.0.1 (Apple Computer, Inc. build 5367)]

    --
    James Stroud
    UCLA-DOE Institute for Genomics and Proteomics
    Box 951570
    Los Angeles, CA 90095

    http://www.jamesstroud.com
     
    James Stroud, Feb 13, 2008
    #1

  2. Paul Rubin

    Of course this mutates the thing that is being printed. Try using
    itertools.tee to fork a copy of the iterator and print from that.
    I didn't look at the rest of your code enough to spot any errors,
    but take note of the warnings in the groupby documentation about the
    pitfalls of using the results any number of times other than
    exactly once.
     
    Paul Rubin, Feb 13, 2008
    #2
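
    A minimal sketch of the itertools.tee() suggestion above (a generic,
    hypothetical illustration, not code from the thread); note that tee()
    copies only the iterator it is handed, not generators nested inside the
    items that iterator yields:

    from itertools import tee

    def numbers():
      for i in range(3):
        yield i

    # Fork the stream: print from one copy, hand the other to the real consumer.
    debug_copy, real_copy = tee(numbers())
    print list(debug_copy)   # [0, 1, 2] -- inspected without disturbing real_copy
    print list(real_copy)    # [0, 1, 2] -- still intact for actual use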

  3. James Stroud

    Thank you for your answer, but I am aware of this caveat. Something is
    consuming my generator *before* I iterprint it. Please give it another
    look if you would be so kind.

     
    James Stroud, Feb 13, 2008
    #3
  4. James Stroud

    I can see I didn't explain so well. This one must be a bug if my code
    looks good to you. Here is a summary:

    - If I iterprint inside the loop, iterprint looks correct.
    - If I iterprint outside the loop, my generators get consumed and I am
    left with only the last item, so iterprint prints just that one item
    outside the loop.

    Conclusion: something consumes my generator going from inside the loop
    to outside.

    Please note that I am not talking about the yielded values, or the
    for-loop that creates them. I left them there to show my intent with the
    function. The iterprint function is there to show that the generator
    gets consumed just moving from inside the loop to outside.

    I know this one is easy to dismiss as my having consumed the generator
    with the iterprint, since that would be a common mistake.

    James


     
    James Stroud, Feb 13, 2008
    #4
  5. Paul Rubin

    I'll see if I can look at it some more later; I'm in the middle of
    something else right now. All I can say at the moment is that I've
    encountered problems like this in my own code many times, and it's
    always been a matter of having to carefully keep track of how the
    nested iterators coming out of groupby are being consumed. I doubt
    there is a library bug. Using groupby for things like this is
    powerful, but unfortunately bug-prone because of how these mutable
    iterators work. I suggest making some sample sequences and stepping
    through with a debugger seeing just how the iterators advance.
     
    Paul Rubin, Feb 13, 2008
    #5
  6. Paul Rubin

    I didn't spot any obvious errors, but I didn't look closely enough
    to say that the code looked good or bad.
    I'm not so sure of this; the thing is, you're building these internal
    grouper objects that don't expect to be consumed in the wrong order.

    Really, I'd try making a test iterator that prints something every
    time you advance it, then step through your function with a debugger.
     
    Paul Rubin, Feb 13, 2008
    #6
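
    A small sketch of the "test iterator" idea above (the traced() wrapper is
    hypothetical, not from the thread): wrapping the input so that every
    advance is printed makes it visible exactly when groupby() pulls items.

    from itertools import groupby

    def traced(iterable, label):
      # Debugging wrapper: report every item as it is pulled from the iterable.
      for item in iterable:
        print '%s -> %r' % (label, item)
        yield item

    for key, group in groupby(traced('aabb', 'advance')):
      print 'group', key, list(group)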
  7. James Stroud

    Thank you for your suggestion. I replied twice to your first post before
    you made your suggestion to step through with a debugger, so it looks
    like I ignored it.

    Thanks again.

    James

     
    James Stroud, Feb 13, 2008
    #7
  8. Peter Otten

    James Stroud wrote:

    groupby() is "all you can eat", but "no doggy bag". You are trying to
    store a group for later consumption here. That doesn't work:

    ....     print list(g)
    ....
    []
    []
    []
    [9]

    You cannot work around that, because what invalidates a group is the call
    of groups.next():

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration

    Perhaps Python should throw an out-of-band exception for an invalid group
    instead of yielding bogus data.

    Peter
     
    Peter Otten, Feb 13, 2008
    #8
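
    The interactive session in the post above was mangled by the archive; a
    reconstruction that reproduces the output shown (the data, range(10) keyed
    into groups of three, is an assumption) looks like this:

    from itertools import groupby

    # Storing groups for later ("doggy bag") does not work: by the time the
    # stored groups are read, the shared iterator has moved past their data.
    groups = groupby(range(10), lambda x: x // 3)
    stored = [(key, group) for key, group in groups]
    for key, group in stored:
      print list(group)        # prints [], [], [], [9]

    # What invalidates a stored group is the next call on the outer iterator:
    groups = groupby(range(10), lambda x: x // 3)
    key, first_group = groups.next()
    groups.next()              # advancing the outer groupby invalidates first_group
    try:
      first_group.next()
    except StopIteration:
      print 'first_group is already exhausted'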
  9. Paul Rubin

    Good catch; the solution is to turn that loop into a generator,
    but then it has to be consumed very carefully. This stuff
    maybe presses the limits of what one can do with Python iterators
    while staying sane.
     
    Paul Rubin, Feb 13, 2008
    #9
  10. James Stroud

    Thank you for your clear explanation--a satisfying conclusion to nine
    hours of head scratching.

    James

     
    James Stroud, Feb 13, 2008
    #10
  11. James Stroud

    Brilliant suggestion. Worked like a charm. Here is the final product:


    def dekeyed(serialized, keyfunc):
      for name, series in serialized:
        grouped = groupby(series, keyfunc)
        regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
        yield (name, regrouped)

    def serialize(table, keyer=_keyer,
                  selector=_selector,
                  keyfunc=_keyfunc,
                  series_keyfunc=_series_keyfunc):
      keyed = izip(imap(keyer, table), table)
      filtered = ifilter(selector, keyed)
      serialized = groupby(filtered, series_keyfunc)
      return dekeyed(serialized, keyfunc)


    Thank you!

    James



     
    James Stroud, Feb 13, 2008
    #11
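
    A hedged sketch of how this result might be consumed, with made-up
    stand-ins for the _keyer/_selector/_keyfunc/_series_keyfunc helpers (they
    were never shown in the thread); the point is that each series and each
    group must be read strictly in order, exactly once:

    from itertools import groupby, ifilter, imap, izip

    # Hypothetical helpers, assumed to be defined before serialize()/dekeyed()
    # above so the default arguments bind; rows are (series, group, value)
    # tuples, already sorted by series and then group, as groupby() requires.
    rows = [('s1', 'a', 1), ('s1', 'a', 2), ('s1', 'b', 3),
            ('s2', 'a', 4), ('s2', 'b', 5)]
    _keyer = lambda row: (row[0], row[1])                # key a row by (series, group)
    _selector = lambda keyed_row: True                   # keep every (key, row) pair
    _series_keyfunc = lambda keyed_row: keyed_row[0][0]  # outer grouping: series name
    _keyfunc = lambda keyed_row: keyed_row[0][1]         # inner grouping: group name

    for name, series in serialize(rows):
      print name                       # 's1', then 's2'
      for key, group in series:        # consume in order, each group exactly once
        print ' ', key, list(group)    # e.g.  a [('s1', 'a', 1), ('s1', 'a', 2)]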
  12. Paul Rubin

    Cool, glad it worked out. When writing this type of code I like to
    use doctest to spell out some solid examples of what each function is
    supposed to do, as part of the function. It's the only way I can
    remember the shapes of the sequences going in and out, and the
    automatic testing doesn't hurt either. Even with that though, at
    least for me, Python starts feeling really scary when the iterators
    get this complicated. I start wishing for a static type system,
    re-usable iterators, etc.
     
    Paul Rubin, Feb 13, 2008
    #12
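
    A tiny sketch of the doctest habit described above (runs() is a made-up
    example, not a function from the thread):

    import doctest
    from itertools import groupby

    def runs(seq):
      """Group consecutive equal items, yielding (key, group-iterator) pairs.

      >>> [(k, list(g)) for k, g in runs('aabbbc')]
      [('a', ['a', 'a']), ('b', ['b', 'b', 'b']), ('c', ['c'])]
      """
      for key, group in groupby(seq):
        yield key, group

    if __name__ == '__main__':
      doctest.testmod()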
  13. Steve Holden

    > ("serieses" is the plural of series, if your vocabulary isn't that big)

    Not as big as your ego, apparently ;-) And don't be coming back with any
    argumentses.

    regardses
    Steve
     
    Steve Holden, Feb 13, 2008
    #13
  14. Jeff Schwab

    Nasty hobbitses... We hates them!
     
    Jeff Schwab, Feb 13, 2008
    #14
  15. James Stroud

    It's getting so you can't make a post on this list without getting
    needled by this irritating minority who never know when to quit. If you
    have nothing to add to the thread, you might want to practice humor on
    your own time--and you need practice, because needling is not funny, just
    irritating.

     
    James Stroud, Feb 13, 2008
    #15
  16. James Stroud

    Where is this coming from? Please see posts by Otten and Rubin for
    proper human conduct.

     
    James Stroud, Feb 13, 2008
    #16
  17. bearophileHUGS

    Paul Rubin:
    This is an interesting topic. I agree with you; I too was scared in a
    similar situation. The language features allow you to do some things
    in a simple way, but if you pile up too many of them, you end up losing
    track of what you are doing.
    The D language has static typing, and its classes allow a standard
    opApply method that gives lazy iteration; those are re-usable
    iterators (but to scan two iterators in parallel you need a big trick;
    it's a matter of the stack). They require more syntax, and it gets in
    the way, so in the end I am better able to write recursive generators in
    Python, because its less cluttered syntax allows my brain to manage
    that extra bit of algorithmic complexity necessary for that kind of
    convoluted code.
    The Haskell language is often used by very intelligent programmers; it
    too allows the use of lazy computations and iterations, but it has the
    advantage that its iterators behave better, and during the generation
    of some items you can, when you want, refer to and use the items already
    generated. Those things make lazy Python code very different from lazy
    Haskell code.

    Bye,
    bearophile
     
    bearophileHUGS, Feb 13, 2008
    #17
