This is referring to the case where your two iterators get out of sync
by a long way. If you only consume 3 extra items it will just store
those 3 items in a list.
Fair point.
As I understand it the above says that
items = infinite()
a, b = tee(items)
for item in islice(a, 1000):
    pass
for pair in izip(a, b):
    pass
stores 1000 items and can go on forever, but
items = infinite()
a, b = tee(items)
for item in a:
    pass
will consume unbounded memory, and that if items is finite it is more
efficient to use a list instead of tee. The documentation says nothing about
items = infinite()
a, b = tee(items)
del a
for item in b:
    pass
so you have to trust Mr Hettinger or come up with a test case...
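For what it's worth, the first fragment runs as-is if count() stands in
for the hypothetical infinite() (and zip for izip on Python 3): after the
islice loop, a is 1000 items ahead, so tee is buffering exactly those
1000 items for b to replay:

```python
from itertools import count, islice, tee

items = count()               # stand-in for the hypothetical infinite()
a, b = tee(items)

for item in islice(a, 1000):  # advance a; tee buffers 1000 items for b
    pass

# b replays the buffered items while a continues from 1000
pairs = list(islice(zip(a, b), 3))
print(pairs)                  # [(1000, 0), (1001, 1), (1002, 2)]
```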
I'd say you've just devised a nice test to find out
$ cat tee.py
#!/usr/bin/env python
import sys
from itertools import tee

items = iter(range(int(sys.argv[1])))
while True:
    for x in items:
        items, discard = tee(items)
        break
    else:
        break
print(x)
$ time py -3.3 ./tee.py 100000000
99999999

real    1m47.711s
user    0m0.015s
sys     0m0.000s
While the above was running, python.exe was using 6MB of memory (according
to Task Manager). I believe this is because tee() works roughly as follows
(this is my own mental model, not necessarily the real implementation).
When you call tee(iterator) it creates two _tee objects and one
_teelist object. The _teelist object stores all of the items that have
been seen by only one of _tee1 and _tee2, a reference to iterator, and
a flag indicating which _tee object has seen more items. When, say,
_tee2 is deallocated, the _teelist becomes singly owned and no longer
needs to accumulate items (so it doesn't). So deleting discard will not
cause arbitrary growth in memory usage.
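For comparison, the itertools docs give a rough pure-Python equivalent
of tee(), paraphrased below. Note that this version keeps appending to
both deques even after one of the returned generators is dropped, so it
lacks exactly the single-owner optimisation described above; the C
implementation is free to do better:

```python
from collections import deque

def tee_sketch(iterable, n=2):
    # Rough pure-Python equivalent of itertools.tee()
    it = iter(iterable)
    deques = [deque() for _ in range(n)]
    def gen(mydeque):
        while True:
            if not mydeque:            # this tee has caught up
                try:
                    newval = next(it)  # fetch from the shared iterator
                except StopIteration:
                    return
                for d in deques:       # buffer the item for every tee
                    d.append(newval)
            yield mydeque.popleft()
    return tuple(gen(d) for d in deques)

a, b = tee_sketch(range(5))
print(list(a), list(b))  # [0, 1, 2, 3, 4] [0, 1, 2, 3, 4]
```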
There is a separate problem: if you call tee() multiple times you end
up with a chain of tees, and each call to next() would have to pass
through every one of them. That causes linear growth in the time taken
by each next() call, and quadratic time performance overall.
However, this does not occur with the script I showed above. In
principle it's possible for a _tee object to realise that there is a
chain of singly owned _tee and _teelist objects and to bypass them,
calling next() directly on the original iterator, but I don't know
whether that is what happens.
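The chain itself is easy to build; whether CPython flattens it
internally I can't say (if each next() really walked the whole chain,
deep chains would be quadratic), but the values at least come out
right:

```python
from itertools import tee

it = iter(range(10))
for _ in range(100):     # build a chain of 100 nested tees
    it, discard = tee(it)
    del discard          # each discarded half becomes singly owned
result = list(it)
print(result)            # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```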
However, when I ran the above script on Python 2.7 it did consume
massive amounts of memory (1.6GB) and ran more slowly, so maybe this
depends on optimisations that were introduced in 3.x.
Here's an alternate iterator recipe that doesn't depend on these optimisations:
from itertools import islice
from collections import deque

class Peekable(object):
    def __init__(self, iterable):
        self.iterator = iter(iterable)
        self.peeked = deque()

    def __iter__(self):
        while True:
            while self.peeked:
                yield self.peeked.popleft()
            yield next(self.iterator)

    def peek(self):
        for p in self.peeked:
            yield p
        for val in self.iterator:
            self.peeked.append(val)
            yield val

with open("tmp.txt") as f:
    f = Peekable(f)
    for outer in f:
        print outer,
        if "*" in outer:
            for inner in islice(f.peek(), 3):
                print " ", inner,
Oscar