Big time WTF with generators - bug?


James Stroud

Hello,

I'm boggled.

I have this function which takes a keyer that keys a table (iterable). I
filter based on these keys, then groupby based on the filtered keys and
a keyfunc. Then, to make the resulting generator behave a little nicer
(no requirement for user to unpack the keys), I strip the keys in a
generator expression that unpacks them and generates the k,g pairs I
want ("regrouped"). I then append each series generator to the growing
"serieses" list ("serieses" is the plural of series if your
vocabulary isn't that big).

Here's the function:

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
    serieses.append((s_name, regrouped))
  for s in serieses:
    yield s


I defined a little debugging function called iterprint:

def iterprint(thing):
  if isinstance(thing, str):
    print thing
  elif hasattr(thing, 'items'):
    print thing.items()
  else:
    try:
      for x in thing:
        iterprint(x)
    except TypeError:
      print thing

The gist is that iterprint will print any generator down to its
non-iterable components--it works fine for my purpose here, but I
included the code for the curious.

When I apply iterprint in the following manner (only change is the
iterprint line) everything looks fine and my "regrouped" generators in
"serieses" generate what they are supposed to when iterprinting. The
iterprint at this point shows that everything is working just the way I
want (I can see the last item in "serieses" iterprints just fine).

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
    serieses.append((s_name, regrouped))
    iterprint(serieses)
  for s in serieses:
    yield s

Now, here's the rub. When I apply iterprint in the following manner, it
looks like my generator ("regrouped") gets consumed (note the only
change is a two space de-dent of the iterprint call--the printing is
outside the loop):

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  serieses = []
  for s_name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
    serieses.append((s_name, regrouped))
  iterprint(serieses)
  for s in serieses:
    yield s

Now, what is consuming my "regrouped" generator when going from inside
the loop to outside?

Thanks in advance for any clue.

py> print version
2.5.1 (r251:54869, Apr 18 2007, 22:08:04)
[GCC 4.0.1 (Apple Computer, Inc. build 5367)]

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095

http://www.jamesstroud.com

Paul Rubin

James Stroud said:
I defined a little debugging function called iterprint:

def iterprint(thing):
  ...
      for x in thing:
        iterprint(x)

of course this mutates the thing that is being printed. Try using
itertools.tee to fork a copy of the iterator and print from that.
I didn't look at the rest of your code enough to spot any errors
but take note of the warnings in the groupby documentation about
pitfalls with using the results some number of times other than
exactly once.
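[A minimal sketch of the tee approach Paul describes, in Python 3 syntax (the thread's code is Python 2): tee forks an iterator so that consuming one branch for debug printing leaves the other branch intact.]

```python
from itertools import tee

# Fork the iterator: printing from one branch leaves the other untouched.
gen = (x * x for x in range(5))
debug_copy, real_copy = tee(gen)

printed = list(debug_copy)    # consume the debug branch completely
remaining = list(real_copy)   # the real branch still yields every item
print(printed)    # [0, 1, 4, 9, 16]
print(remaining)  # [0, 1, 4, 9, 16]
```

[Note that, as the thread goes on to establish, tee on the outer groupby iterator cannot rescue groupby's sub-groups: advancing the outer iterator still invalidates the current group.]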
 

James Stroud

Paul said:
of course this mutates the thing that is being printed. Try using
itertools.tee to fork a copy of the iterator and print from that.
I didn't look at the rest of your code enough to spot any errors
but take note of the warnings in the groupby documentation about
pitfalls with using the results some number of times other than
exactly once.

Thank you for your answer, but I am aware of this caveat. Something is
consuming my generator *before* I iterprint it. Please give it another
look if you would be so kind.


James Stroud

Paul said:
of course this mutates the thing that is being printed. Try using
itertools.tee to fork a copy of the iterator and print from that.
I didn't look at the rest of your code enough to spot any errors
but take note of the warnings in the groupby documentation about
pitfalls with using the results some number of times other than
exactly once.

I can see I didn't explain so well. This one must be a bug if my code
looks good to you. Here is a summary:

- If I iterprint inside the loop, iterprint looks correct.
- If I iterprint outside the loop, my generator gets consumed and I am
only left with the last item, so my iterprint prints only one item
outside the loop.

Conclusion: something consumes my generator going from inside the loop
to outside.

Please note that I am not talking about the yielded values, or the
for-loop that creates them. I left them there to show my intent with the
function. The iterprint function is there to show that the generator
gets consumed just moving from inside the loop to outside.

I know this one is easy to dismiss to my consuming the generator with
the iterprint, as this would be a common mistake.

James



Paul Rubin

James Stroud said:
Thank you for your answer, but I am aware of this caveat. Something is
consuming my generator *before* I iterprint it. Please give it another
look if you would be so kind.

I'll see if I can look at it some more later, I'm in the middle of
something else right now. All I can say at the moment is that I've
encountered problems like this in my own code many times, and it's
always been a matter of having to carefully keep track of how the
nested iterators coming out of groupby are being consumed. I doubt
there is a library bug. Using groupby for things like this is
powerful, but unfortunately bug-prone because of how these mutable
iterators work. I suggest making some sample sequences and stepping
through with a debugger seeing just how the iterators advance.
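[The "test iterator that prints something every time you advance it" idea can be sketched like this; `traced` is a hypothetical helper, in Python 3 syntax:]

```python
from itertools import groupby

def traced(iterable, label="it"):
    # Wrap an iterable so every advance is announced -- handy for watching
    # how groupby and friends pull items through a lazy pipeline.
    for i, item in enumerate(iterable):
        print("%s advanced to item %d: %r" % (label, i, item))
        yield item

# Watching groupby consume its source makes the advance pattern visible:
collected = [(k, list(g)) for k, g in groupby(traced(range(6)), lambda x: x // 3)]
```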
 

Paul Rubin

James Stroud said:
I can see I didn't explain so well. This one must be a bug if my code
looks good to you.

I didn't spot any obvious errors, but I didn't look closely enough
to say that the code looked good or bad.
Conclusion: something consumes my generator going from inside the loop
to outside.

I'm not so sure of this, the thing is you're building these internal
grouper objects that don't expect to be consumed in the wrong order, etc.

Really, I'd try making a test iterator that prints something every
time you advance it, then step through your function with a debugger.
 

James Stroud

Paul said:
I didn't spot any obvious errors, but I didn't look closely enough
to say that the code looked good or bad.


I'm not so sure of this, the thing is you're building these internal
grouper objects that don't expect to be consumed in the wrong order, etc.

Really, I'd try making a test iterator that prints something every
time you advance it, then step through your function with a debugger.

Thank you for your suggestion. I replied twice to your first post before
you made your suggestion to step through with a debugger, so it looks
like I ignored it.

Thanks again.

James


Peter Otten

James Stroud wrote:

groupby() is "all you can eat", but "no doggy bag".
def serialize(table, keyer=_keyer,
                      selector=_selector,
                      keyfunc=_keyfunc,
                      series_keyfunc=_series_keyfunc):
   keyed = izip(imap(keyer, table), table)
   filtered = ifilter(selector, keyed)
   serialized = groupby(filtered, series_keyfunc)
   serieses = []
   for s_name, series in serialized:
     grouped = groupby(series, keyfunc)
     regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
     serieses.append((s_name, regrouped))

You are trying to store a group for later consumption here.
   for s in serieses:
     yield s

That doesn't work:
>>> groups = [g for k, g in groupby(range(10), lambda x: x//3)]
>>> for g in groups:
...     print list(g)
...
[]
[]
[]
[9]

You cannot work around that, because what invalidates a group is the call of
groups.next():

>>> groups = groupby(range(10), lambda x: x//3)
>>> g = groups.next()[1]
>>> g.next()
0
>>> groups.next()
(1, <itertools._grouper object at 0x...>)
>>> g.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Perhaps Python should throw an out-of-band exception for an invalid group
instead of yielding bogus data.

Peter
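[A sketch of the standard workaround implied by Peter's demo: snapshot each group with list() while it is still the current one, i.e. before the outer groupby iterator advances (Python 3 syntax):]

```python
from itertools import groupby

# Copy each group *inside* the loop, before groupby moves on; the lists
# survive where the lazy group iterators would have been invalidated.
groups = [(k, list(g)) for k, g in groupby(range(10), lambda x: x // 3)]

for k, items in groups:
    print(k, items)   # every group is intact, including the first three
```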
 

Paul Rubin

Peter Otten said:
   for s_name, series in serialized:
     grouped = groupby(series, keyfunc)
     regrouped = ((k, (v[1] for v in g)) for (k,g) in grouped)
     serieses.append((s_name, regrouped))

You are trying to store a group for later consumption here.

Good catch, the solution is to turn that loop into a generator,
but then it has to be consumed very carefully. This stuff
maybe presses the limits of what one can do with Python iterators
while staying sane.
 

James Stroud

Peter said:
groupby() is "all you can eat", but "no doggy bag".

Thank you for your clear explanation--a satisfying conclusion to nine
hours of head scratching.

James


James Stroud

Paul said:
Good catch, the solution is to turn that loop into a generator,
but then it has to be consumed very carefully.

Brilliant suggestion. Worked like a charm. Here is the final product:


def dekeyed(serialized, keyfunc):
  for name, series in serialized:
    grouped = groupby(series, keyfunc)
    regrouped = ((k, (v[1] for v in g)) for (k, g) in grouped)
    yield (name, regrouped)

def serialize(table, keyer=_keyer,
              selector=_selector,
              keyfunc=_keyfunc,
              series_keyfunc=_series_keyfunc):
  keyed = izip(imap(keyer, table), table)
  filtered = ifilter(selector, keyed)
  serialized = groupby(filtered, series_keyfunc)
  return dekeyed(serialized, keyfunc)


Thank you!

James
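[For later readers: a self-contained Python 3 sketch of the same fully lazy pattern. The sample table and the keyer/selector/keyfunc lambdas below are made-up stand-ins (the thread never shows its `_keyer` defaults); the point is that the pipeline works when each level is consumed exactly once, in order.]

```python
from itertools import groupby

def dekeyed(serialized, keyfunc):
    # Stay lazy: each (name, regrouped) pair must be consumed while its
    # underlying groupby group is still the current one.
    for name, series in serialized:
        grouped = groupby(series, keyfunc)
        yield name, ((k, (row for _, row in g)) for k, g in grouped)

def serialize(table, keyer, selector, keyfunc, series_keyfunc):
    keyed = ((keyer(row), row) for row in table)  # izip/imap in the Py2 original
    filtered = filter(selector, keyed)
    return dekeyed(groupby(filtered, series_keyfunc), keyfunc)

# Hypothetical sample table: (series, group, value) rows.
table = [("s1", "a", 1), ("s1", "a", 2), ("s1", "b", 3), ("s2", "a", 4)]
result = [
    (name, [(k, [row[2] for row in rows]) for k, rows in groups])
    for name, groups in serialize(
        table,
        keyer=lambda row: row,                # the key is the whole row here
        selector=lambda kv: True,             # keep every keyed row
        keyfunc=lambda kv: kv[0][1],          # inner grouping: 'group' column
        series_keyfunc=lambda kv: kv[0][0],   # outer grouping: 'series' column
    )
]
# result == [('s1', [('a', [1, 2]), ('b', [3])]), ('s2', [('a', [4])])]
```

[The list comprehension consumes each inner level completely before asking the outer level for its next item, which is exactly the "consumed very carefully" discipline Paul mentions.]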




Paul Rubin

James Stroud said:
Brilliant suggestion. Worked like a charm. Here is the final product:

Cool, glad it worked out. When writing this type of code I like to
use doctest to spell out some solid examples of what each function is
supposed to do, as part of the function. It's the only way I can
remember the shapes of the sequences going in and out, and the
automatic testing doesn't hurt either. Even with that though, at
least for me, Python starts feeling really scary when the iterators
get this complicated. I start wishing for a static type system,
re-usable iterators, etc.
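[A minimal sketch of the doctest habit Paul mentions; the generator below is a made-up example, not code from the thread. The embedded example records the shape of the output and doubles as an automatic test.]

```python
def squares(n):
    """Yield the first n squares.

    The doctest documents the shape of the output and is runnable:

    >>> list(squares(4))
    [0, 1, 4, 9]
    """
    for i in range(n):
        yield i * i

if __name__ == "__main__":
    import doctest
    doctest.testmod()   # runs the examples embedded in the docstrings
```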
 

Steve Holden

James said:
[...] I then append each series generator to the growing
"serieses" list ("serieses" is the plural of series if your
vocabulary isn't that big).
Not as big as your ego, apparently ;-) And don't be coming back with any
argumentses.

regardses
Steve
 

Jeff Schwab

Steve said:
James said:
[...] I then append each series generator to the growing
"serieses" list ("serieses" is the plural of series if your
vocabulary isn't that big).
Not as big as your ego, apparently ;-) And don't be coming back with any
argumentses.

Nasty hobbitses... We hates them!
 

James Stroud

Steve said:
James said:
[...] I then append each series generator to the growing
"serieses" list ("serieses" is the plural of series if your
vocabulary isn't that big).
Not as big as your ego, apparently ;-) And don't be coming back with any
argumentses.

It's getting so you can't make a post on this list without getting
needled by this irritating minority who never know when to quit. If you
have nothing to add to the thread, you might want to practice humor on
your own time--and you need practice, because needling is not funny, just
irritating.


James Stroud

Steve said:
James said:
[...] I then append each series generator to the growing
"serieses" list ("serieses" is the plural of series if your
vocabulary isn't that big).
Not as big as your ego, apparently ;-) And don't be coming back with any
argumentses.

Where is this coming from? Please see posts by Otten and Rubin for
proper human conduct.


bearophileHUGS

Paul Rubin:
Even with that though, at least for me, Python starts feeling really
scary when the iterators get this complicated. I start wishing for
a static type system, re-usable iterators, etc.

This is an interesting topic. I agree with you; I too was scared in a
similar situation. The language features let you do some things in a
simple way, but if you pile up too many of them, you end up losing
track of what you are doing.
The D language has static typing, and its classes can define a standard
opApply method that allows lazy iteration; those are re-usable
iterators (though to scan two iterators in parallel you need a big
trick -- it's a matter of stack). They require more syntax, and it gets
in the way, so in the end I am better able to write recursive
generators in Python, because its less cluttered syntax lets my brain
manage the extra algorithmic complexity that kind of convoluted code
requires.
The Haskell language is often used by very intelligent programmers; it
relies heavily on lazy computation and iteration, but it has the
advantage that its iterators behave better: during the generation of
some items you can, when you want, refer to and use the items already
generated. Those things make lazy Python code very different from lazy
Haskell code.

Bye,
bearophile
 
