Itertools

R

Ronald Legere

The new itertools stuff is pretty cool. One thing that bothers me though is
that
there seems to be no way to copy an iterator. How does one work around this?
Is
there a trick that I am missing?

Without being able to 'split' an iterator, I can't
think of how to do some of the cool things you can do with lazy lists in
Haskell. It seems
quite often you want to be able to do this 'splitting', and if I
had more coffee I would post an example :) The ones I
can think of are also defined recursively, which is another kettle of fish.

But I hope that someone knows
what I am talking about and has thought of a way...
Cheers!
 
G

Gonçalo Rodrigues

The new itertools stuff is pretty cool. One thing that bothers me though is
that
there seems to be no way to copy an iterator. How does one work around this?
Is
there a trick that I am missing?

Without being able to 'split' an iterator, I can't
think of how to do some of the cool things you can do with lazy lists in
Haskell. It seems
quite often you want to be able to do this 'splitting', and if I
had more coffee I would post an example :) The ones I
can think of are also defined recursively, which is another kettle of fish.

Iterables can be more far general than lazy lists: think of a socket
as an iterable churning out lines of text, for example. How would one
go about copying these general iterables?

You can take slices of iterables, though - can't remember the name,
islice? But since you don't know how to take copies of iterables in
general this may not be that useful.
But I hope that someone knows
what I am talking about and has thought of a way...
Cheers!

With my best regards,
G. Rodrigues
 
H

Heiko Wundram

Untested code that does basically this, but beware of any traps (read
and understand the code before trying to use it!!!), uses Python 2.3
functionality, but could be amended:

--- START OF CODE ---

import copy

class IDup(object):
def __init__(self,iterin):
self.__iter = iterin
self.__iterno = 0
self.__iteritems = []
self.__hasstopped = None

def registerIter():
iterno = self.__iterno
self.__iterno += 1
self.__iteritems.append([])

def getNext(self,iterno):
if self.__iteritems[iterno]:
iteritem = self.__iteritems[iterno].pop(0)
elif self.__hasstopped is not None:
raise self.__hasstopped
else:
try:
iteritem = self.__iter.next()
except StopIteration, e:
self.__hasstopped = e
raise
for id, i in enumerate(self.__iteritems):
if id <> iterno:
i.append(copy.deepcopy(iteritem))
return iteritem

class IDupped(object):
def __init__(self,idup):
self.__idup = idup
self.__iterno = idup.registerIter()

def next(self):
return self.__idup.getNext(self.__iterno)

def isplit(iterin,splitno=2):
idup = IDup(iterin)
iduppeds = []
for i in range(splitno):
iduppeds.append(IDupped(idup))
return tuple(iduppeds)

test = ["hello","how","are","you?"]
x, y = isplit(iter(test))

for i in x:
print i
for i in y:
print i

--- END OF CODE ---

Pitfalls:

1) You may not use the original iterator after it has been split.
2) Your iterator must return values which are copyable. If you wish for
the values to be truly equal (even in ID) or they are not copyable, then
just remove the copy.deepcopy from the code.

HTH!

Heiko.
 
H

Heiko Wundram

Just tested the code, amend as stated (remove the - lines, and add
the + lines):

--- START OF CODE ---

import copy

class IDup(object):
def __init__(self,iterin):
self.__iter = iterin
self.__iterno = 0
self.__iteritems = []
self.__hasstopped = None

- def registerIter():
+ def registerIter(self):
iterno = self.__iterno
self.__iterno += 1
self.__iteritems.append([])
+ return iterno

def getNext(self,iterno):
if self.__iteritems[iterno]:
iteritem = self.__iteritems[iterno].pop(0)
elif self.__hasstopped is not None:
raise self.__hasstopped
else:
try:
iteritem = self.__iter.next()
except StopIteration, e:
self.__hasstopped = e
raise
for id, i in enumerate(self.__iteritems):
if id <> iterno:
i.append(copy.deepcopy(iteritem))
return iteritem

class IDupped(object):
def __init__(self,idup):
self.__idup = idup
self.__iterno = idup.registerIter()

def next(self):
return self.__idup.getNext(self.__iterno)

+ def __iter__(self):
+ return self

def isplit(iterin,splitno=2):
idup = IDup(iterin)
iduppeds = []
for i in range(splitno):
iduppeds.append(IDupped(idup))
return tuple(iduppeds)

test = ["hello","how","are","you?"]
x, y = isplit(iter(test))

for i in x:
print i
for i in y:
print i

--- END OF CODE ---

This works as stated.

Heiko.
 
B

Beni Cherniavsky

Ronald Legere wrote on 2003-07-29:
The new itertools stuff is pretty cool. One thing that bothers me
though is that there seems to be no way to copy an iterator. How
does one work around this? Is there a trick that I am missing?

Without being able to 'split' an iterator, I can't think of how to
do some of the cool things you can do with lazy lists in Haskell. It
seems quite often you want to be able to do this 'splitting', and if
I had more coffee I would post an example :) The ones I can think of
are also defined recursively, which is another kettle of fish.
Strange, the need seems to come up more freqeuntly since I implemented
it ;-). Or is it just me paying more attention to it? After a couple
of design iterations, I've got:

http://www.technion.ac.il/~cben/python/streams.py

[The last changes were addition of __nonzero__ methods, clued by
discussion on c.l.py <wink>, and a `.force(count=None)` method to
force computing values.]

I concluded that it's best to separate the two abstractions, providing
ability to convert both ways:

- An iterable immutable (but lazily computed) linked list, with O(1)
tail access and ability to cons in front of it.

- An iterator object whose `.next()` method is necessarily
destructive, on another hand. Turns out the most flexible
implementation of the iterator is simply a pointer to a linked list;
this way you can split the iterator and even "unyield" values back
onto it. See the module for more details.

In general it's most clean to separate iterables from iterators.
Iterables should be changed by iterating.

Any feedback welcome. I'd like to make it as Pythonic as possible.
An perhaps make it standard. Particular open questions:

- Terminology. There are too many names for the same things ;-).

- The linked lists are called "streams" by me because they are lazy
and that's what such beasts are called in functional languages.
But I still have many refernces to them as linked lists,
especially w.r.t. the `Cons` class which happens to be the base
class of `Stream`. So `Stream` is defined as a lazy linked list
node. Should I try to squash refences to "linked lists", treating
them as "stream nodes which are already computed" to reduce
confusion with Python lists?

- car/cdr vs. first/rest vs. head/tail. I chose head/tail.
car/cdr would be opaque to people without lisp background.
first/rest are of different length :).

- `Cons` still requires lisp background but I don't know any name
that would be recognizable to non-lispers anyway. And it's not
a bad name.

- Is `StreamIter` the best name?

- I called the operation for creating new iterator from same place
"forking". Alternative names: `split`, `copy`, `clone`, `dup`,
"lookahead", etc. What's best?

- Consing in front of a StreamIter's stream is called `.unyeild`.
The point is that it's the opposite of the ``yield`` statement
but the name is a brave new invention. Alternatives: `prev`
(antonym of `.next()`, probably misguided since it doesn't
return the previous value but rather takes it from you),
`pushback`, `unread`, `stuff`, etc.

- Functionality:

- Should the operations (`.head`/`.tail`, indexing) on the
underlying stream be proxied by `StreamIter` or should the
programmer have to spell out e.g. ``iterator.stream.head`` as now?

- What to do about __repr__ and __str__? The streams could well be
infinite and forcing lookahead for even a few values is
unacceptable as this could be expensive.

- What to do about __hash__ and __eq__?

- Should I weakly memoize the `Stream` for each given iterable?
That would reduce the risk of wrapping a destructive iterator
twice with distinct streams, so that each produced item is seen by
only one stream, with access-order-dependant interleaving,
violating the functional intent of the laziness. OTOH, this would
make the programmer less aware of the risk. Besides, I'm not sure
what to return on the second attempt to wrap same iterator after
some of it has already been consumed since the first time.

- Would some convenience operations for lookahead, e.g.
`.startswith()` be useful?

- Anything else anybody needs?
 
M

Mike Rovner

Beni said:
Any feedback welcome. I'd like to make it as Pythonic as possible.
An perhaps make it standard. Particular open questions:
- The linked lists are called "streams" by me because they are lazy
and that's what such beasts are called in functional languages.

Stream is used differently in Tcl and Unix.
Minded separation Python from functional approach (deprecation of filter,
map;
doubtfulness of implementing tail recursion optimization) probably it is not
so
inspiring for regular python programmer.

Lazy "streams" are a good thing provided they can be easily used for
organizing data pathes and connections.
- `Cons` still requires lisp background but I don't know any name
that would be recognizable to non-lispers anyway. And it's not
a bad name.


Construct, Glue, Merge?
Taken readability any whole word is better.
- I called the operation for creating new iterator from same place
"forking". Alternative names: `split`, `copy`, `clone`, `dup`,
"lookahead", etc. What's best?

copy is already using construct in python. Why not use it?
- Anything else anybody needs?

If I got the intend of Stream correctly, I wish use them as Tcl streams,
so I'd need combining and filtering and i/o.

Just 0.02

Mike
 
B

Beni Cherniavsky

Mike Rovner wrote on 2003-07-30:
Stream is used differently in Tcl and Unix. Minded separation
Python from functional approach (deprecation of filter, map;
doubtfulness of implementing tail recursion optimization) probably
it is not so inspiring for regular python programmer.
Point taken. But what do you call it then? Lazy linked lists has
only one single-word name that I'm aware of and that is "streams";
there is no name for it in Tcl and Unix becauit is almost never used
in them, as far as I know.
Lazy "streams" are a good thing provided they can be easily used for
organizing data pathes and connections.
Python already provides the minimal construct for connecting producers
and consumers of data: the iterator protocol. It is minimalistic and
becomes inconvenient when you want to use the same value more than one
time, for lookahead, backtracking or simply connecting the same source
to many consumers. All this is quite easy with the head/tail access
to streams; it's a very powerfull abstraction. Since streams are
implemented as linked lists, they don't leak memory for infinite
iterators, unless you keep a reference to a fixed position in the
stream. Over streams I've built an iterator which also provides the
same powers but more in line with Python's iterator protocol.
Construct, Glue, Merge?
Taken readability any whole word is better.
`Construct` is too long. Compare with `str`, `int`, `dict` rather
than `string`, `integer` and `dictionary`.

How would `Glue` and `Merge` be meaningful here? The only
associasions of "glue" are what TeX puts between boxes and "glue
layers" between different languages or libraries. "Merge" sounds like
the process of taking several sequences of values and interlevaing
them in some way (like in merge sort; BTW, it needs a lookahead of 1
for each stream, so it should be convenient to implement with
streams). But this has nothing to do with what `Cons` does: creating
a new stream by prepending given head to given tail. `Cons` is the
only term I know of that unabigously refers to this operation, again
coming from functional languages <wink>.

There is another possibility: rename `Cons` to `Stream` and `Stream`
to `LazyStream`; `Stream` would then be a factory function creating
either depending on whether it got 1 or two arguments. Would this be
better?
copy is already using construct in python. Why not use it?
Copy was adviced against by Erik Max Francis, see:

http://groups.google.com/[email protected]

However, it really copies the iterator, it's just very fast. So I
think I will indeed name this `.copy()` (aliased as `.__copy__`).
If I got the intend of Stream correctly, I wish use them as Tcl streams,
so I'd need combining and filtering and i/o.
You can use a `Stream` or `StreamIter` over a file object to get
lookahead and mulitple passes over it, without requiring the file to
be seekable! That's one of the main uses. It also gives you ability
to "unread" any amount of data. Note that if you simply apply it to a
file the stream operates on a line at a time; you will need to wrap
the file with a custom generator if you wish to operate at character
level; working at token level might be the best compromise.

I'm not familiar with Tcl streams, so I don't understand the
"combining and filtering" reference - can you elaborate?
 
M

Mike Rovner

Beni said:
Mike Rovner wrote on 2003-07-30:

I'm sorry for taking so long to answer (I had a project deadline to meet
;) - done).
Point taken. But what do you call it then? Lazy linked lists has
only one single-word name that I'm aware of and that is "streams";
there is no name for it in Tcl and Unix becauit is almost never used
in them, as far as I know.

I made an intensive search on internet and convinced now that term 'streams'
is globaly associated with i/o. I still think using this term for universal
data
structure will be misleading. However I can't come with a good word
(my best is LazyList).
`Construct` is too long. Compare with `str`, `int`, `dict` rather
than `string`, `integer` and `dictionary`.

But they all are (long-lived) built-ins. AFAIK now there is a tendency
to make language more readable (isinstance, enumerate to name a few
hmm recent additions)
How would `Glue` and `Merge` be meaningful here? The only
associasions of "glue" are what TeX puts between boxes and "glue
layers" between different languages or libraries. "Merge" sounds like
the process of taking several sequences of values and interlevaing
them in some way (like in merge sort; BTW, it needs a lookahead of 1
for each stream, so it should be convenient to implement with
streams). But this has nothing to do with what `Cons` does: creating
a new stream by prepending given head to given tail. `Cons` is the
only term I know of that unabigously refers to this operation, again
coming from functional languages <wink>.

There is another possibility: rename `Cons` to `Stream` and `Stream`
to `LazyStream`; `Stream` would then be a factory function creating
either depending on whether it got 1 or two arguments. Would this be
better?

No. In discussion about list.insert() and/or range() difference in behaivor
dependent on args was prononced A Bad Thing (IIRC).
You can use a `Stream` or `StreamIter` over a file object to get
lookahead and mulitple passes over it, without requiring the file to
be seekable! That's one of the main uses. It also gives you ability
to "unread" any amount of data. Note that if you simply apply it to a
file the stream operates on a line at a time; you will need to wrap
the file with a custom generator if you wish to operate at character
level; working at token level might be the best compromise.

I'm not familiar with Tcl streams, so I don't understand the
"combining and filtering" reference - can you elaborate?

I didn't touch tcl for several years ;)
But the idea is pretty simple: to pass file object throw 'filter' to get
'another' file(-like) object.

import compress as Z
for line in tarfile.open('-','r',Z.open(filename)).extractiter()

instead of

stream = tarfile.open('-','r',Z.open(filename))
for elem in tarfile.open('-','r',Z.open(filename)).getmembers():
for line in stream.extract(elem):
...

or something like that. I admit that later example based on new 2.3 syntax
fulfill my (long being) expectations quite well.

Thanks,
Mike
 
N

Ng Pheng Siong

According to Mike Rovner said:
I made an intensive search on internet and convinced now that term 'streams'
is globaly associated with i/o. I still think using this term for universal
data
structure will be misleading. However I can't come with a good word
(my best is LazyList).

Once upon a time, in the days of 386BSD, certain people claimed to be
reinnovating Sys V streams for 386BSD, which were to be called "currents".

Somehow this episode stuck in my mind. (I think it was the sound of a
balloon bursting from too much hot air. ;-)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top