beginning index for generators


kosh

I was wondering if there is, or could be, some way to pass a generator an optional starting index so that, if the generator supported it, slicing could be made more efficient. Right now if you use a generator and do a [100:110] or any other kind of slice, it reads all the values up to 100 as well.
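
For example, with a simple counting generator, even itertools.islice (since a plain generator cannot be sliced with [] directly) still has to pull the first 100 values just to throw them away:

import itertools

def squares():
    # just a stand-in; any generator would show the same behaviour
    n = 0
    while True:
        yield n * n
        n += 1

# the equivalent of a [100:110] slice: islice still advances the
# generator through the first 100 values before yielding anything
for value in itertools.islice(squares(), 100, 110):
    print value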

I know a lot of generators have to do all previous parts before they can do the next part, but for those that can start at any index it would be a nice optimization to take an optional argument telling them what index they are currently working on. This could make strides more efficient also, since the generator would not have to be run just to skip the parts that were not being used.

The reason I would like this to be part of the generator, rather than making a special function to do it, is that it would make it easier to refactor a piece of code into a much more efficient generator from some other sequence type without having to change how it is called.

I have no idea how feasible this is, what it would take to implement it, etc., but I did not get any matches searching for this on Google and wanted to throw it out here to see if others have run into situations in which this would be very useful and would further simplify code and make it more readable.

I would also love to know of other ways that this problem could be solved, but I would like it to be as transparent as possible to existing calling code.
 

Diez B. Roggisch

kosh said:
I was wondering if there is, or could be, some way to pass a generator an optional starting index so that, if the generator supported it, slicing could be made more efficient. Right now if you use a generator and do a [100:110] or any other kind of slice, it reads all the values up to 100 as well.

I have no idea how feasible this is, what it would take to implement it, etc., but I did not get any matches searching for this on Google and wanted to throw it out here to see if others have run into situations in which this would be very useful and would further simplify code and make it more readable.


There is no way of doing this implicitly - only explicitly, with deep knowledge of what actually gets iterated over by the generator. The slicing is a second, independent operation that only knows about the generator implementing the iterator protocol, so it can't jump to arbitrary positions in that generator. And as you said yourself, the generator needs to be invoked those 100 times so that it has consistent internal state.

kosh said:
I would also love to know of other ways that this problem could be solved, but I would like it to be as transparent as possible to existing calling code.

Maybe implementing __getslice__ on your container object helps - that way, calling code does not have to be changed, but the implementation can take advantage of the underlying collection being efficiently sliceable.
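
For example, a minimal sketch of that idea (the SquareSource name and its list attribute are made up here, and __getitem__ with slice support does the same job as __getslice__):

class SquareSource(object):
    # wraps an efficiently sliceable collection but is used as before
    def __init__(self, data):
        self.data = data
    def __iter__(self):
        return iter(self.data)
    def __getitem__(self, s):
        # slices (and plain indices) go straight to the underlying list,
        # so nothing is generated and thrown away
        return self.data[s]

source = SquareSource([n * n for n in xrange(1000)])
print source[100:110]    # ten items, calling code unchanged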
 

Peter Otten

kosh said:
I was wondering if there is, or could be, some way to pass a generator an optional starting index so that, if the generator supported it, slicing could be made more efficient. Right now if you use a generator and do a [100:110] or any other kind of slice, it reads all the values up to 100 as well.

Here is an 'iterate' class that decides whether to iterate over the items or access them by index, based on the existence of a __getitem__() method. It's mostly a proof of concept and not tested beyond what you see below.

Peter

import itertools, sys

def irange(start, stop, step):
    if start is None:
        start = 0
    if step is None:
        step = 1
    index = start
    if stop is None:
        while True:
            yield index
            index += step
    else:
        if step < 0:
            while index > stop:
                yield index
                index += step
        else:
            while index < stop:
                yield index
                index += step

def islice(iterable, start, step, stop):
    if start is None: start = 0
    if step is None: step = 1
    return itertools.islice(iterable, start, stop, step)

class iterate(object):
    def __init__(self, iterable):
        self.iterable = iterable
    def __getitem__(self, s):
        if not isinstance(s, slice):
            raise ValueError("supports only slices")
        start, stop, step = s.start, s.stop, s.step
        if hasattr(self.iterable, "__getitem__"):
            indexable = self.iterable
            def iterate():
                for index in irange(start, stop, step):
                    yield indexable[index]
            return iterate()
        return islice(self.iterable, start, step, stop)

class Sequential(object):
    def __init__(self):
        self.index = 0
    def __iter__(self):
        return self
    def next(self):
        print "sequential[%s]" % self.index
        try:
            return self.index
        finally:
            self.index += 1

class Random(Sequential):
    def __getitem__(self, index):
        print "random[%s]" % index
        return index


def printThem(iterable):
    for i in iterable:
        print i
    print "---"

printThem(iterate(Random())[3:10:2])
printThem(iterate(Random())[3:10])
printThem(iterate(Sequential())[3:10:2])
printThem(iterate(Sequential())[3:10])
# I'd rather not:
# printThem(iterate(Sequential())[sys.maxint-10::3])
printThem(itertools.islice(iterate(Random())[sys.maxint-5::3], 10))
 

Tuure Laurinolli

kosh said:
I was wondering if there is, or could be, some way to pass a generator an optional starting index so that, if the generator supported it, slicing could be made more efficient. Right now if you use a generator and do a [100:110] or any other kind of slice, it reads all the values up to 100 as well.

I know a lot of generators have to do all previous parts before they can do the next part, but for those that can start at any index it would be a nice optimization to take an optional argument telling them what index they are currently working on. This could make strides more efficient also, since the generator would not have to be run just to skip the parts that were not being used.

I think adding an optional .skip(n) to the generator protocol could be useful; it could be used for strides too, and I can't think of a much more generic way to do things.

However, the caller willing to use this optimization would always need
to check if .skip() is defined, although this could naturally be
automated to some degree.

It could be specified as:

If an iterator object has a method skip([n=1]), it can be used to advance the iterator n steps without it having to yield the intermediate values. The end state of the iterator should be the same as after:

for i in range(n):
    iterable.next()
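
A rough sketch of how that could look as a convention today (the skip() method and the fallback helper are just this proposal, nothing Python itself defines), including the "check if .skip() is defined" part:

class Squares(object):
    # an iterator that knows how to jump ahead cheaply
    def __init__(self):
        self.n = 0
    def __iter__(self):
        return self
    def next(self):
        value = self.n * self.n
        self.n += 1
        return value
    def skip(self, n=1):
        # advance n steps without yielding the intermediate values
        self.n += n

def advance(iterator, n):
    # use skip() if the iterator offers it, otherwise fall back
    # to consuming the values one by one
    if hasattr(iterator, "skip"):
        iterator.skip(n)
    else:
        for _ in xrange(n):
            iterator.next()

it = Squares()
advance(it, 100)
print it.next()    # 10000 - the 101st value, reached without 100 next() calls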
 

Andrew Dalke

Peter said:
def irange(start, stop, step):

Isn't that xrange?
def islice(iterable, start, step, stop):

And that's itertools.islice?
class iterate(object):
    def __init__(self, iterable):
        self.iterable = iterable
    def __getitem__(self, s):
        if not isinstance(s, slice):
            raise ValueError("supports only slices")
        start, stop, step = s.start, s.stop, s.step
        if hasattr(self.iterable, "__getitem__"):

And doesn't that test fail for dictionaries?

Andrew
 

Peter Otten

Andrew said:
Isn't that xrange?

No. That's what I would like xrange() to be. It has no artificial (if generous) sys.maxint limit and can be open-ended, i.e. it comprises the itertools.count() functionality, extended for step != 1.
And that's itertools.islice?

No, but it _calls_ itertools.islice(), as you may have noted :)
Its raison d'être is that it allows passing the start, stop, and step attributes of a slice object without looking into the special cases. itertools.islice() works with different signatures instead of None default values. When called with three parameters, only 'stop' may be None. Slice literals cannot be that strict because they may have a negative step, in which case start values of None and 0 have a different meaning.
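
To see those special cases concretely, this is what a few slice literals actually carry:

>>> s = slice(None, 10, None)        # what obj[:10] hands to __getitem__
>>> s.start, s.stop, s.step
(None, 10, None)
>>> slice(None, None, -1).start is None   # obj[::-1]: a None start means "from the end"
True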

I had all checks for special cases in the iterate class first and factored
them out as an afterthought because they made it harder to see how simple
'iterate' really was.
And doesn't that test fail for dictionaries?

Yes, unfortunately. Discriminating between dictionaries and lists has been
deliberately left out. Is there a recommended way to do it? Looking for a
'keys' attribute maybe? Anyway, as long as something like iterate is not
built into the iterator protocol, it is up to the programmer to decide
whether it can help or do harm. Dictionaries may be wrapped somewhere up
the call stack:

>>> hasattr(iter({}), "__getitem__")
False

Peter
 

Andrew Dalke

Peter said:
No. That's what I would like xrange() to be.

Ahh, I had forgotten about that restriction.

BTW, this is kinda cute with your code:

>>> list(irange("A", "AAAAAAAAA", "A"))
['A', 'AA', 'AAA', 'AAAA', 'AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA']
And that's itertools.islice?


No, but it _calls_ itertools.islice(), as you may have noted :)
Its raison d'être is that it allows passing the start, stop, and step attributes of a slice object without looking into the special cases.

Actually I didn't notice. I was on my way out the door and didn't
look that closely. My apologies.

Given the worry you had above about the sys.maxint limit, aren't
you also concerned about the same limits with itertools.islice?
Yes, unfortunately. Discriminating between dictionaries and lists has been
deliberately left out. Is there a recommended way to do it? Looking for a
'keys' attribute maybe?

No, because there are some data structures which act like both a list and a dictionary. E.g., one where the dictionary keys are strings and the items are stored and accessible via integer indices into the sort order.

Andrew
 

Peter Otten

Andrew said:
BTW, this is kinda cute with your code:

>>> list(irange("A", "AAAAAAAAA", "A"))
['A', 'AA', 'AAA', 'AAAA', 'AAAAA', 'AAAAAA', 'AAAAAAA', 'AAAAAAAA']

This is of course completely accidental and strongly discouraged :)
On the other hand I don't like arbitrary limitations like that of the sum()
built-in.
Actually I didn't notice. I was on my way out the door and didn't
look that closely. My apologies.

De nada. It _was_ named suggestively to trigger your perception.
Given the worry you had above about the sys.maxint limit, aren't
you also concerned about the same limits with itertools.islice?

I'm not especially worried about these limits; I just didn't like the look of xrange(start, sys.maxint-step+1, step) (?) as the substitute for an open-ended sequence and decided to do it right - after all, large indices are a corner case where the 'iterate' class shines if the iterable is "indexable", too. As to itertools.islice(), anyone who throws away, say, the first 2**31 values of an iteration has probably other things than Python's limitations to worry about.

Peter
 

Alex Martelli

kosh said:
I was wondering if there is, or could be, some way to pass a generator an optional starting index so that, if the generator supported it, slicing could be made more efficient. Right now if you use a generator and do a [100:110] or any other kind of slice, it reads all the values up to 100 as well.

You can pass any starting arguments you want when you call a generator,
just like you can when you call any other function. You can, if you
wish, make that argument optional, and by implied agreement between the
call site and the generator you can give that argument any meaning that
the generator is willing and able to supply. "This is the starting
index" is in no way different from any other meaning whatsoever.

This is all so obvious that I'm afraid you mean something very different
from what you're actually saying, but, if so, then you'll just have to
try to express yourself differently, because I can't read minds!-)

I know a lot of generators have to do all previous parts before they can do

Right, of course, so presumably such generators would never let you pass that optional "this is the starting index" argument -- with them you have to start from the beginning, period.

kosh said:
the next part, but for those that can start at any index it would be a nice optimization to take an optional argument telling them what index they are currently working on. This could make strides more efficient also, since the generator would not have to be run just to skip the parts that were not being used.

I suspect you don't mean by "stride" the same thing I do. To me, being
an old Fortran/Linpack/etc programmer, a "stride" in a loop is the
amount by which the loop-control variable (an integer) is incremented
after each leg of the loop. In Python, for example, the third argument
to range, xrange or slice is what I think of as "a stride", though
Python tends to call it "a step" instead (but then, in English, a stride
IS nothing more than a long step!-). I don't see what passing an
optional starting index has to do with strides, or what it means to
"make strides more efficient". You could of course design your
generators, if what they're computing is suitable for that, to take an
optional stride argument (presumably defaulting to 1) instead of, or in
addition to, an optional start argument (presumably defaulting to 0). Again, this is so obvious that I suspect we're just talking at cross-purposes...
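
For example (with a made-up squares generator), if the values can be computed from their index, both the start and the stride can simply be arguments of the generator itself:

def squares(start=0, step=1):
    # the generator begins at an arbitrary index, so nothing
    # before 'start' is ever computed
    n = start
    while True:
        yield n * n
        n += step

gen = squares(100, 2)           # plays the role of a [100::2] "slice"
print gen.next(), gen.next()    # 10000 10404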

The reason I would like this to be part of the generator, rather than making a special function to do it, is that it would make it easier to refactor a piece of code into a much more efficient generator from some other sequence type without having to change how it is called.

You can always design and code a generator instead of "some other"
_function_ returning an iterator, or vice versa. The generator is often
more convenient, but whether it's "much more efficient" is of course
dubious. For example, all the stuff in itertools _could_ have been
coded as generators, but in fact they're coded as callables returning
iterators of other kinds (instances of C-coded iterator types), as
_that_ turns out to be the "much more efficient" implementation choice
in this case.
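
A tiny illustration of that distinction (not how itertools is actually written, just the two shapes side by side):

def countdown_gen(n):
    # generator function: the convenient form
    while n > 0:
        yield n
        n -= 1

class countdown(object):
    # a callable returning an iterator of another kind -
    # the shape the C-coded itertools types take
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self
    def next(self):
        if self.n <= 0:
            raise StopIteration
        value = self.n
        self.n -= 1
        return value

assert list(countdown_gen(3)) == list(countdown(3)) == [3, 2, 1]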

I have no idea how feasible this is, what it would take to implement it, etc., but I did not get any matches searching for this on Google and wanted to throw it out here to see if others have run into situations in which this would be very useful and would further simplify code and make it more readable.

I would also love to know of other ways that this problem could be solved, but I would like it to be as transparent as possible to existing calling code.

Perhaps if you can give some strawman examples of how you would like to be able to code something, with exactly what inputs and what results, somebody can understand exactly what you're wishing for, because, as of now, I'm totally incapable of such understanding beyond the obvious points expounded above (and I suspect they ARE obvious enough that, if that was all you were after, you wouldn't have posted -- but this doesn't allow me to divine what it IS you're after, because the obvious points I expounded DO match 100% of what you have literally been writing in this post).


Alex
 

Andrew Dalke

Peter said:
As to itertools.islice(), anyone who throws away, say,
the first 2**31 values of an iteration has probably other things than
Python's limitations to worry about.

True enough.

Personally I'm surprised to see that my two-year-old laptop
can do

for i in xrange(0, sys.maxint):
    pass

in just over 7 minutes. It's like in the 1980s when I
realized that 16 bit numbers were small.


Andrew
 
