itertools.izip brokenness


rurpy

The code below should be pretty self-explanatory.
I want to read two files in parallel, so that I
can print corresponding lines from each, side by
side. itertools.izip() seems the obvious way
to do this.

izip() will stop iterating when it reaches the
end of the shortest file.  I don't know how to
tell which file was exhausted, so I just try printing
them both.  The exhausted one will generate a
StopIteration; the other will continue to be
iterable.

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output nor in
the subsequent direct file iteration.  I would
guess that it was in izip's buffer when izip
terminated due to the exception on the other file.

This behavior seems plain-out broken, especially
because it is dependent on the order of izip's
arguments, and not documented anywhere I saw.
It makes using izip() for iterating over files in
parallel essentially useless (unless you are
lucky enough to have files of the same length).

Also, it seems to me that this is likely a problem
with any iterables of different lengths.
I am hoping I am missing something...

#---------------------------------------------------------
# Task: print contents of file1 in column 1, and
# contents of file2 in column 2.  Iterators and
# izip() are the "obvious" way to do it.

from itertools import izip
import cStringIO, pdb

def prt_files (file1, file2):

    for line1, line2 in izip (file1, file2):
        print line1.rstrip(), "\t", line2.rstrip()

    try:
        for line1 in file1:
            print line1,
    except StopIteration: pass

    try:
        for line2 in file2:
            print "\t", line2,
    except StopIteration: pass

if __name__ == "__main__":
    # Use StringIO to simulate files.  Real files
    # show the same behavior.
    f = cStringIO.StringIO

    print "Two files with same number of lines work ok."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\nstu\n"))

    print "\nFirst file shorter is also ok."
    prt_files (f("abc\nde\n"), f("xyz\nwv\nstu\n"))

    print "\nSecond file shorter is a problem."
    prt_files (f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
    print "What happened to \"fgh\" line that should be in column 1?"

    print "\nBut only a problem for one line."
    prt_files (f("abc\nde\nfgh\nijk\nlm\n"), f("xyz\nwv\n"))
    print "The line \"fgh\" is still missing, but following\n" \
          "line(s) are ok!  Looks like izip() ate a line."
 

Paul Rubin

The problem is that sometimes, depending on which file is the
shorter, a line ends up missing, appearing neither in the izip()
output nor in the subsequent direct file iteration.  I would guess
that it was in izip's buffer when izip terminated due to the
exception on the other file.

Oh man, this is ugly. The problem is there's no way to tell whether
an iterator is empty, other than by reading from it.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/413614

has a kludge that you can use inside a function but that's no good
for something like izip.

For a temporary hack you could make a wrapped iterator that allows
pushing items back onto the iterator (sort of like ungetc) and a
version of izip that uses it, or a version of izip that tests the
iterators you pass it using the above recipe.
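A minimal sketch of such a pushback wrapper might look like this (the class
name PushbackIter and its push() method are made up for illustration, not an
existing API):

class PushbackIter(object):
    """Wrap an iterator so that items can be pushed back onto it."""
    def __init__(self, iterable):
        self.it = iter(iterable)
        self.pushed = []          # stack of pushed-back items
    def __iter__(self):
        return self
    def next(self):
        if self.pushed:
            return self.pushed.pop()
        return self.it.next()     # may raise StopIteration
    def push(self, item):
        self.pushed.append(item)

An izip built on such wrappers could, when it hits an exhausted iterator,
push the items it had already read in that round back onto their wrappers
instead of dropping them.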

It's probably not reasonable to ask that an emptiness test be added to
the iterator interface, since the zillion iterator implementations now
existing won't support it.

A different possible long term fix: change StopIteration so that it
takes an optional arg that the program can use to figure out what
happened. Then change izip so that when one of its iterator args runs
out, it wraps up the remaining ones in a new tuple and passes that
to the StopIteration it raises. Untested:

def izip(*iterlist):
    while True:
        z = []
        finished = []      # iterators that have run out
        still_alive = []   # iterators that are still alive
        for i in iterlist:
            try:
                z.append(i.next())
                still_alive.append(i)
            except StopIteration:
                finished.append(i)
        if not finished:
            yield tuple(z)
        else:
            raise StopIteration, (still_alive, finished)

You would want some kind of extended for-loop syntax (maybe involving
the new "with" statement) with a clean way to capture the exception info.
You'd then use it to continue the izip where it left off, with the
new (smaller) list of iterators.
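In the meantime the extra information could be picked up with an explicit
loop, since a plain for statement silently swallows the StopIteration (an
untested sketch against the modified izip above):

it = izip(file1, file2)
while True:
    try:
        line1, line2 = it.next()
    except StopIteration, info:
        # as raised above: args[0] is the (still_alive, finished) pair
        still_alive, finished = info.args[0]
        break
    print line1.rstrip(), "\t", line2.rstrip()
# ... then go on to drain whatever is left in still_alive ...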
 

bonono

But that is exactly the behaviour of Python iterators, I don't see what
is broken.

izip/zip just reads from the respective streams and gives back a tuple
if it can get one from each, otherwise it stops.  And because Python
iterators can only go in one direction, the items already consumed are
lost in the zip/izip calls.

I think you need to use map(None,...), which would not drop anything,
just fill with None.  Though you don't have a lazy version of it, as
imap(None,...) doesn't behave like map but rather like zip.
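For example (Python 2.x behaviour):

>>> map(None, [1,2,3], [4,5])           # pads the shorter sequence with None
[(1, 4), (2, 5), (3, None)]
>>> from itertools import imap
>>> list(imap(None, [1,2,3], [4,5]))    # stops at the shortest, like izip
[(1, 4), (2, 5)]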
 

David Murmann

[izip() eats one line]

as far as i can see the current implementation cannot be changed
to do the Right Thing in your case.  python's iterators don't allow
you to "look ahead", so izip can only get the next element.  if this
fails for an iterator, everything read up to that point is lost.

maybe the documentation for izip should note that the given
iterators are not necessarily in a sane state afterwards.

for your problem you can do something like:

def izipall(*args):
    iters = [iter(it) for it in args]
    while iters:
        result = []
        for it in list(iters):    # iterate over a copy so we can remove safely
            try:
                x = it.next()
            except StopIteration:
                iters.remove(it)
            else:
                result.append(x)
        if result:                # don't yield a trailing empty tuple
            yield tuple(result)

note that this does not yield tuples that are always the same
length, so "for x, y in izipall()" won't work. instead, do something
like "for seq in izipall(): print '\t'.join(seq)".

hope i was clear enough, David.
 

Peter Otten

The problem is that sometimes, depending on which
file is the shorter, a line ends up missing,
appearing neither in the izip() output nor in
the subsequent direct file iteration.  I would
guess that it was in izip's buffer when izip
terminated due to the exception on the other file.

With the current iterator protocol you cannot feed an item back into an
iterator once you have read it by calling its next() method; but invoking
next() is the only way to see whether the iterator is exhausted.  Therefore
the behaviour that breaks your prt_files() function has nothing to do with
the itertools.
I think of itertools more as a toolbox than as a set of ready-made
solutions, and came up with

from itertools import izip, chain, repeat

def prt_files (file1, file2):
    file1 = chain(file1, repeat(""))
    file2 = chain(file2, repeat(""))
    for line1, line2 in iter(izip(file1, file2).next, ("", "")):
        print line1.rstrip(), "\t", line2.rstrip()

which can easily be generalized for an arbitrary number of files.
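Applied to the failing case from the original post it prints the leftover
line instead of dropping it (a quick sketch; it assumes the cStringIO-based
f() shim from the first message, and the tab spacing shown is approximate):

f = cStringIO.StringIO
prt_files(f("abc\nde\nfgh\n"), f("xyz\nwv\n"))
# abc     xyz
# de      wv
# fgh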

Peter
 

Paul Rubin

But that is exactly the behaviour of Python iterators, I don't see what
is broken.

What's broken is that the iterator interface is insufficient to deal with
this cleanly.

And because Python iterators can only go in one direction, the items
already consumed are lost in the zip/izip calls.

Yes, that's the problem. It's proven useful for i/o streams to support
a pushback operation like ungetc. Maybe something like it can be done
for iterators.
I think you need to use map(None,...), which would not drop anything,
just fill with None.  Though you don't have a lazy version of it, as
imap(None,...) doesn't behave like map but rather like zip.

I don't understand what you mean by this? None is not callable.

How about this (untested):

def myzip(iterlist):
    """return zip of smaller and smaller list of iterables as the
       individual iterators run out"""
    sentinel = object()  # unique sentinel
    def sentinel_append(iterable):
        return itertools.chain(iterable, itertools.repeat(sentinel))
    for tup in itertools.izip(*map(sentinel_append, iterlist)):
        r = [x for x in tup if x is not sentinel]
        if r: yield r
        else: break
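For what it's worth, with the fixed-up version above (and itertools
imported) the result shrinks as the shorter iterable runs out; note that
myzip takes a single list of iterables:

>>> import itertools
>>> list(myzip([["abc", "de", "fgh"], ["xyz", "wv"]]))
[['abc', 'xyz'], ['de', 'wv'], ['fgh']]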
 

bonono

Paul said:
I don't understand what you mean by this? None is not callable.

zip([1,2,3],[4,5]) gives [(1,4),(2,5)]

map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3,None)]

So the result of map() can be filtered out for special processing. Of
course, your empty/sentinel filled version is doing more or less the
same thing.
How about this (untested):

def myzip(iterlist):
    """return zip of smaller and smaller list of iterables as the
       individual iterators run out"""
    sentinel = object()  # unique sentinel
    def sentinel_append(iterable):
        return itertools.chain(iterable, itertools.repeat(sentinel))
    for tup in itertools.izip(*map(sentinel_append, iterlist)):
        r = [x for x in tup if x is not sentinel]
        if r: yield r
        else: break
 

Paul Rubin

map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3,None)]

I didn't know that until checking the docs just now. Oh man, what a
hack! I always thought Python should have a built-in identity
function for situations like that. I guess it does the above instead.
Thanks. Jeez ;-)
 

bonono

Paul said:
map(None,[1,2,3],[4,5]) gives [(1,4),(2,5),(3,None)]

I didn't know that until checking the docs just now. Oh man, what a
hack! I always thought Python should have a built-in identity
function for situations like that. I guess it does the above instead.
Thanks. Jeez ;-)
Of course, for the OP's particular case, I think a specialized func() is
even better, as the Nones are turned into "" in the process, which is
needed for the string operations.

map(lambda *arg: tuple(map(lambda x: x is not None and x or "", arg)),
    ["a","b","c"], ["d","e"])
 

Diez B. Roggisch

What's broken is that the iterator interface is insufficient to deal with
this cleanly.

I don't consider it broken.  You just think too much in terms of the OP's
problem, or probably other fields where the actual data is available for
"rewinding".

But as iterators serve as an abstraction for lots of things - especially
generators - you can't enhance the interface.
Yes, that's the problem. It's proven useful for i/o streams to support
a pushback operation like ungetc. Maybe something like it can be done
for iterators.

No. If you want that, use

list(iterable)

Then you have random access. If you _know_ there will be only so much data
needed to "unget", write yourself a buffered iterator like this:

buffered(iterable, size)

Maybe something like that _could_ go in the itertools. But I'm not really
convinced, as it is too tied to special cases - and besides that very
easily done.

How about this (untested):

def myzip(iterlist):
    """return zip of smaller and smaller list of iterables as the
       individual iterators run out"""
    sentinel = object()  # unique sentinel
    def sentinel_append(iterable):
        return itertools.chain(iterable, itertools.repeat(sentinel))
    for tup in itertools.izip(*map(sentinel_append, iterlist)):
        r = [x for x in tup if x is not sentinel]
        if r: yield r
        else: break

If that fits your semantics - of course. But the general zip shouldn't
behave that way.

Regards,

Diez
 

Paul Rubin

Diez B. Roggisch said:
No. If you want that, use

list(iterable)

Then you have random access. If you _know_ there will be only so much data
needed to "unget", write yourself a buffered iterator like this:

You can't use list(iterable) in general because the iterable may be
infinite.
buffered(iterable, size)

Maybe something like that _could_ go in the itertools. But I'm not really
convinced, as it is too tied to special cases -

The usual pushback depth needed is just one item and it would solve
various situations like this. The ASPN recipe I cited for detecting
an empty iterable shows that such problems come up in more than one
place.

Note that besides the buffered iterable, there would also have to be a
version of izip that pushed unused items back onto its iterables.
and besides that very easily done.

Just about everything in itertools is very easily done, so that's not
a valid argument against it.
If that fits your semantics - of course. But the general zip shouldn't
behave that way.

Of course it shouldn't. In another post I suggested a way to extend
the general izip, to throw a list of the remaining non-empty iterables
once it hit an empty one. Maybe there's some problem with that too,
but if so, it's more subtle.

Any idea how Haskell would deal with this?
 

Duncan Booth

Peter said:
from itertools import izip, chain, repeat

def prt_files (file1, file2):
    file1 = chain(file1, repeat(""))
    file2 = chain(file2, repeat(""))
    for line1, line2 in iter(izip(file1, file2).next, ("", "")):
        print line1.rstrip(), "\t", line2.rstrip()

which can easily be generalized for an arbitrary number of files.

Generalizing for an arbitrary number of files and for an arbitrary value to
pad out the shorter sequences:

def paddedizip(pad, *args):
    terminator = [pad] * (len(args)-1)
    def padder():
        if not terminator:
            return
        t = terminator.pop()
        while 1:
            yield t
    return izip(*(chain(a, padder()) for a in args))
>>> for (p,q) in paddedizip(0,[1,2,3],[4,5]):
...     print repr(p), repr(q)
...
1 4
2 5
3 0
>>> for (p,q) in paddedizip(0,[1,2,3],[4,5,6,7,8]):
...     print repr(p), repr(q)
...
1 4
2 5
3 6
0 7
0 8
>>> for (p,q) in paddedizip("",[1,2,3],[4,5,6,7,8]):
...     print repr(p), repr(q)
...
1 4
2 5
3 6
'' 7
'' 8
>>> for (p,q,r) in paddedizip(None,[1,2,3],[4,5,6,7,8],[9]):
...     print repr(p), repr(q), repr(r)
...
1 4 9
2 5 None
3 6 None
None 7 None
None 8 None
 

bonono

Paul said:
Any idea how Haskell would deal with this?
I don't recall Haskell having the map(None,...) behaviour in the standard
Prelude.  But then, I don't see how the iterator concept would fit into
Haskell either.
 

Tom Anderson

A different possible long term fix: change StopIteration so that it
takes an optional arg that the program can use to figure out what
happened. Then change izip so that when one of its iterator args runs
out, it wraps up the remaining ones in a new tuple and passes that
to the StopIteration it raises.

+1

I think you also want to send back the items you read out of the iterators
which are still alive, which otherwise would be lost. Here's a somewhat
minimalist (but tested!) implementation:

def izip(*iters):
    while True:
        z = []
        try:
            for i in iters:
                z.append(i.next())
            yield tuple(z)
        except StopIteration:
            raise StopIteration, z

The argument you get back with the exception is z, the list of items read
before the first empty iterator was encountered; if you still have your
array iters hanging about, you can find the iterator which stopped with
iters[len(z)], the ones which are still going with iters[:len(z)], and the
ones which are in an uncertain state, since they were never tried, with
iters[(len(z) + 1):]. This code could easily be extended to return more
information explicitly, of course, but simple, sparse, etc.
You would want some kind of extended for-loop syntax (maybe involving
the new "with" statement) with a clean way to capture the exception
info.

How about for ... except?

for z in izip(a, b):
    lovingly_fondle(z)
except StopIteration, leftovers:
    angrily_discard(leftovers)

This has the advantage of not giving entirely new meaning to an existing
keyword. It does, however, afford the somewhat dubious use:

for z in izip(a, b):
    lovingly_fondle(z)
except ValueError, leftovers:
    pass # execution should almost certainly never get here

Perhaps that form should be taken as meaning:

try:
    for z in izip(a, b):
        lovingly_fondle(z)
except ValueError, leftovers:
    pass # execution could well get here if the fondling goes wrong

Although i think it would be more strictly correct if, more generally, it
made:

for LOOP_VARIABLE in ITERATOR:
    SUITE
except EXCEPTION:
    HANDLER

Work like:

try:
    while True:
        try:
            LOOP_VARIABLE = ITERATOR.next()
        except EXCEPTION:
            raise __StopIteration__, sys.exc_info()
        except StopIteration:
            break
        SUITE
except __StopIteration__, exc_info:
    somehow_set_sys_exc_info(exc_info)
    HANDLER

As it stands, throwing a StopIteration in the suite inside a for loop
doesn't terminate the loop - the exception escapes; by analogy, the
for-except construct shouldn't trap exceptions from the loop body, only
those raised by the iterator.

tom
 

rurpy

But that is exactly the behaviour of Python iterators, I don't see what
is broken.

izip/zip just reads from the respective streams and gives back a tuple
if it can get one from each, otherwise it stops.  And because Python
iterators can only go in one direction, the items already consumed are
lost in the zip/izip calls.

[This is really a reply to the thread in general, not specifically
to your response Bonono...]

Yes, I can understand how the properties of iterators
and izip's design lead to the behavior I observed.
I am saying that the unfortunate interaction of those
properties leads to behavior that makes izip essentially
useless in many cases where one would naively expect it
not to be, that this behavior is not pointed out in the docs,
and that it is subtle enough that it is not realistic to expect
most users to anticipate it based on the properties of izip
and iterators alone.

izip's uses can be partitioned into two cases:
1. All iterables have equal lengths.
2. Iterables have different lengths.

Case 1 is no problem obviously.
In Case 2 there are two sub-cases:

2a. You don't care what values occur in the other iterators
    after the end of the shortest.
2b. You do care.

In my experience 1 and 2b are the cases I encounter the most.
Seldom do I need case 2a. That is, when I can have iterators
of unequal length, usually I want to do *something* with the
extra items in the longer iterators. Seldom do I want to just
ignore them.

In case 2b one cannot (naively) use izip, because izip
irretrievably throws away data when the end of the
shortest iterable is reached.

The whole point of using izip is to make the code shorter,
more concise, and easier to write and understand. If I
have to add a lot of extra code to work around izip's problem,
or write my own izip function, then there is no point using
izip(). Or I could just write a simple while loop and handle
the iterators' exhaustions individually.
Ergo, izip is useless for situations involving case 2b.
This should be pointed out in the docs, particularly
since, depending on the order of izip's arguments,
it can appear to work the way one might initially
(but erroneously) expect it to.

However, it would be better if izip could be made useful
for case 2b situations.  Or maybe an izip2 (or something)
could be added.
 

rurpy

Duncan Booth said:
Peter said:
from itertools import izip, chain, repeat

def prt_files (file1, file2):
    file1 = chain(file1, repeat(""))
    file2 = chain(file2, repeat(""))
    for line1, line2 in iter(izip(file1, file2).next, ("", "")):
        print line1.rstrip(), "\t", line2.rstrip()

which can easily be generalized for an arbitrary number of files.

Generalizing for an arbitrary number of files and for an arbitrary value to
pad out the shorter sequences:

def paddedizip(pad, *args):
    terminator = [pad] * (len(args)-1)
    def padder():
        if not terminator:
            return
        t = terminator.pop()
        while 1:
            yield t
    return izip(*(chain(a, padder()) for a in args))

>>> for (p,q) in paddedizip(0,[1,2,3],[4,5]):
...     print repr(p), repr(q)
...
1 4
2 5
3 0
[...more examples snipped...]

Here is what I came up with:

def izipl (*iterables, **kwds):
    sentinel = ""               # Default value, maybe None would be better?
    for k, v in kwds.items():   # Look for "sentinel" arg, error on any other.
        if k != "sentinel":
            raise TypeError, "got an unexpected keyword argument '%s'" % k
        else: sentinel = v
    iterables = map (iter, iterables)   # itertools.izip does this.

    while iterables:
        result = []; cnt = 0
        for i in iterables:
            try: result.append (i.next())
            except StopIteration:
                result.append (sentinel)
                cnt += 1
        if cnt == len (iterables): raise StopIteration
        yield tuple(result)
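Used on the problem case from the original post (a sketch; it assumes the
cStringIO-based f() helper from the first message, tab spacing approximate)
it would give:

f = cStringIO.StringIO
for line1, line2 in izipl(f("abc\nde\nfgh\n"), f("xyz\nwv\n")):
    print line1.rstrip(), "\t", line2.rstrip()
# abc     xyz
# de      wv
# fgh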

Hmm, your function returns an izip object, mine just returns
the results of the iteration.  So I guess my function would
be the next() method of an izipl class?  I still have not got
my head around this stuff :-(

But here is my real question...
Why isn't something like this in itertools, or why shouldn't
it go into itertools?

It is clear that there is a real need for iterating in parallel
over multiple iterators to the end of the longest one. Why
does something that stops at the shortest get included in
the standard library, but one that stops after the longest
doesn't? Is there any hope for something like this being
included in 2.5?
 

bonono

It is clear that there is a real need for iterating in parallel
over multiple iterators to the end of the longest one. Why
does something that stops at the shortest get included in
the standard library, but one that stops after the longest
doesn't? Is there any hope for something like this being
included in 2.5?
I wonder as well.  map(None, ...) does what you want, but it is not lazy.
imap(None, ...) for some reason doesn't do the same thing as map.  And if
you don't want None as the sentinel, you can just write your own lambda
(see my other post).  So for non-lazy needs, map already supports what you
want.

The suggestion of an "unget" for iterators is interesting too, as I also
once wrote a restartable iterator wrapper using tee(), though I later
scrapped it as I found that turning them into lists is easier so long as I
don't need the lazy version.
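A rough sketch of the tee() idea (the name izip_with_leftovers is made up
here, and this is not the poster's original code): keep a second,
independent copy of each stream, so whatever izip consumed or skipped can
be recovered afterwards.  It is not lazy, which matches the caveat above.

from itertools import izip, tee

def izip_with_leftovers(a, b):
    a1, a2 = tee(a)
    b1, b2 = tee(b)
    pairs = list(izip(a1, b1))   # consumes only the first copies
    n = len(pairs)
    # the second copies still hold everything; skip the items already paired
    return pairs, list(a2)[n:], list(b2)[n:]

pairs, rest_a, rest_b = izip_with_leftovers("abc", "xy")
# pairs == [('a', 'x'), ('b', 'y')], rest_a == ['c'], rest_b == []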
 

rurpy

But here is my real question...
Why isn't something like this in itertools, or why shouldn't
it go into itertools?

Never mind...
I just read this in the source code for itertools.imap:

1) Itertools are designed to be easily combined and chained together.
   Having all tools stop with the shortest input is a unifying principle
   that makes it easier to combine finite iterators (supplying data) with
   infinite iterators like count() and repeat() (for supplying sequential
   or constant arguments to a function).

2) In typical use cases for combining itertools, having one finite data
   supplier run out before another is likely to be an error condition
   which should not pass silently by automatically supplying None.

3) The use cases for automatic None fill-in are rare -- not many functions
   do something useful when a parameter suddenly switches type and
   becomes None.

4) If a need does arise, it can be met by __builtins__.map() or by
   writing: chain(iterable, repeat(None)).

5) Similar toolsets in Haskell and SML do not have automatic None fill-in.

I know I shouldn't post this but...

<FLAME ON>
Jezuz Freekin' Cripes!!
This is the crap that drives me up a wall using Python.
I spent 10 hours writing the original non-working code,
finding the problem, posting to c.l.p, writing working
replacement code. Why? Because some pythonic
devdude sitting in his god throne declared that I don't
need to do what I needed to do!!

1) Itertools are designed to be easily combined and chained together.
   Having all tools stop with the shortest input is a unifying principle
   that makes it easier to combine finite iterators (supplying data) with
   infinite iterators like count() and repeat() (for supplying sequential
   or constant arguments to a function).

There is not a count or a repeat to be seen anywhere in
my code. So let's see... I am prevented from using a
tool I need to use so that I will have the ability to use
some tools which I don't need. Wow, that makes a lot
of f'ing sense.

2) In typical use cases for combining itertools, having one finite data
   supplier run out before another is likely to be an error condition
   which should not pass silently by automatically supplying None.

Just plain wrong. Files are very commonly used iterators,
and the typical case with them is that they are of different
lengths. And I have run into plenty of non-file iterators of
differing lengths. Declaring by fiat that these are all
error cases and shouldn't be permitted is bullshit.

3) The use cases for automatic None fill-in are rare -- not many functions
   do something useful when a parameter suddenly switches type and
   becomes None.

So allow a caller-specified fill. Not rocket science.

4) If a need does arise, it can be met by __builtins__.map() or by
writing: chain(iterable, repeat(None)).

Yes, if you're a Python guru.  I don't even understand the
code presented in this thread that uses chain/repeat, let
alone have any chance of writing it in less than a week.
For the average Python user, having a tool to iterate to
the longest input is about 4 orders of magnitude simpler.

5) Similar toolsets in Haskell and SML do not have automatic None fill-in.

What the hell does this have to do with anything? Maybe
there are other reasons like they have alternate, better
ways of doing the same thing?  Does the fact that C++
lacks some feature justify leaving it out of Python?

Python is basically a pretty good language but there are
these big time holes in it. I spend WAY too much time
trying to figure out how to do something that should be
easy, but isn't because someone thought that it might
hurt the "purity" of the language or violate some "principle".

pissed-offedly-yr's, rurpy

<FLAME OFF>
 

Raymond Hettinger

izip's uses can be partitioned into two cases:
1. All iterables have equal lengths.
2. Iterables have different lengths.

Case 1 is no problem obviously.
In Case 2 there are two sub-cases:

2a. You don't care what values occur in the other iterators
    after the end of the shortest.
2b. You do care.

In my experience 1 and 2b are the cases I encounter the most.
Seldom do I need case 2a.  That is, when I can have iterators
of unequal length, usually I want to do *something* with the
extra items in the longer iterators.  Seldom do I want to just
ignore them.

That is a reasonable use case that is not supported by zip() or izip()
as currently implemented.


The whole point of using izip is to make the code shorter,
more concise, and easier to write and understand.

That should be the point of using anything in Python.  The specific
goal for izip() was to provide an iterator version of zip().  Unfortunately,
neither tool fits your problem. At the root of it is the iterator
protocol not having an unget() method for pushing back unused elements
of the data stream.


This should be pointed out in the docs,

I'll add a note to the docs.

However, it would be better if izip could be made useful
for case 2b situations.  Or maybe an izip2 (or something)
could be added.

Feel free to submit a feature request to the SF tracker (surprisingly,
this behavior has not been previously reported, nor have there been any
related feature requests, nor was the use case contemplated in the PEP
discussions: http://www.python.org/peps/pep-0201 ).



Raymond Hettinger
 
