Controlling a generator the pythonic way

Discussion in 'Python' started by Thomas Lotze, Jun 11, 2005.

  1. Thomas Lotze

    Thomas Lotze Guest

    Hi,

    I'm trying to figure out what is the most pythonic way to interact with
    a generator.

    The task I'm trying to accomplish is writing a PDF tokenizer, and I want
    to implement it as a Python generator. Suppose all the ugly details of
    tokenizing PDF can be handled (such as embedded streams of arbitrary
    binary content). There remains one problem, though: In order to get
    random file access, the tokenizer should not simply spit out a series of
    tokens read from the file sequentially; it should rather be possible to
    point it at places in the file at random.

    I can see two possibilities to do this: either the current file position
    has to be read from somewhere (say, a mutable object passed to the
    generator) after each yield, or a new generator needs to be instantiated
    every time the tokenizer is pointed to a new file position.

    The first approach has two disadvantages: the pointer value is exposed,
    and because of the complex rules for hacking a PDF into tokens, there
    will be a lot of yield statements in the generator code, which would
    make for a lot of pointer assignments. This seems ugly to me.

    The second approach is cleaner in that respect, but pointing the
    tokenizer to some place has now the added semantics of creating a whole
    new generator instance. The programmer using the tokenizer now needs to
    remember to throw away any references to the generator each time the
    pointer is reset, which is also ugly.

    Does anybody here have a third way of dealing with this? Otherwise,
    which ugliness is the more pythonic one?

    Thanks a lot for any ideas.

    --
    Thomas
     
    Thomas Lotze, Jun 11, 2005
    #1

  2. Thomas Lotze

    Peter Hansen Guest

    Thomas Lotze wrote:
    > I can see two possibilities to do this: either the current file position
    > has to be read from somewhere (say, a mutable object passed to the
    > generator) after each yield, or a new generator needs to be instantiated
    > every time the tokenizer is pointed to a new file position.
    >...
    > Does anybody here have a third way of dealing with this? Otherwise,
    > which ugliness is the more pythonic one?


    The third approach, which is certain to be cleanest for this situation,
    is to have a custom class which stores the state information you need,
    and have the generator simply be a method in that class. There's no
    reason that a generator has to be a standalone function.

    class PdfTokenizer:
        def __init__(self, ...):
            # set up initial state

        def getTokens(self):
            while whatever:
                yield token

        def seek(self, newPosition):
            # change state here

    # usage:
    pdf = PdfTokenizer('myfile.pdf', ...)
    for token in pdf.getTokens():
        # do stuff...

        if I need to change position:
            pdf.seek(...)

    Easy as pie! :)

    -Peter
     
    Peter Hansen, Jun 11, 2005
    #2
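
The class-with-a-generator-method idea from the post above could look roughly like this in current Python. This is only a minimal sketch: the `_TOKEN` pattern, the `data` and `pos` attributes, and the snake_case method names are assumptions made for illustration, and splitting on whitespace stands in for the real PDF tokenizing rules.

```python
import re

# Hypothetical token pattern: a run of whitespace or a run of non-whitespace.
# Real PDF tokenizing rules are far more involved.
_TOKEN = re.compile(r"\s+|\S+")


class PdfTokenizer:
    def __init__(self, data):
        self.data = data
        self.pos = 0  # current read position, stored on the instance

    def get_tokens(self):
        # self.pos is re-read on every pass through the loop, so a seek()
        # between two tokens takes effect on the next token.
        while self.pos < len(self.data):
            match = _TOKEN.match(self.data, self.pos)
            self.pos = match.end()
            yield match.group()

    def seek(self, new_position):
        self.pos = new_position


pdf = PdfTokenizer("1 0 obj << /Type /Page >> endobj")
tokens = pdf.get_tokens()
first = next(tokens)       # token at offset 0
pdf.seek(8)                # jump to the offset where "<<" starts
after_seek = next(tokens)  # token at the new position
```

The sketch only works because every bit of state lives in `self.pos`; anything the generator kept in local variables across a `yield` would not be affected by `seek()`.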

  3. Thomas Lotze

    Thomas Lotze Guest

    Peter Hansen wrote:

    > Thomas Lotze wrote:
    >> I can see two possibilities to do this: either the current file position
    >> has to be read from somewhere (say, a mutable object passed to the
    >> generator) after each yield, [...]

    >
    > The third approach, which is certain to be cleanest for this situation, is
    > to have a custom class which stores the state information you need, and
    > have the generator simply be a method in that class.


    Which is, as far as the generator code is concerned, basically the same as
    passing a mutable object to a (possibly standalone) generator. The object
    will likely be called self, and the value is stored in an attribute of it.

    Probably this is indeed the best way as it doesn't require the programmer
    to remember any side-effects.

    It does, however, require a lot of attribute access, which does cost some
    cycles.

    A related problem is skipping whitespace. Sometimes you don't care about
    whitespace tokens, sometimes you do. Using generators, you can either set
    a state variable, say on the object the generator is an attribute of,
    before each call that requires a deviation from the default, or you can
    have a second generator for filtering the output of the first. Again, both
    solutions are ugly (the second more so than the first). One uses
    side-effects instead of passing parameters, which is what one really
    wants, while the other is dumb and slow (filtering can be done without
    taking a second look at things).

    All of this makes me wonder whether more elaborate generator semantics
    (maybe even allowing for passing arguments in the next() call) would not
    be useful. And yes, I have read the recent postings on PEP 343 - sigh.

    --
    Thomas
     
    Thomas Lotze, Jun 11, 2005
    #3
  4. Thomas Lotze

    Peter Hansen Guest

    Thomas Lotze wrote:
    > Which is, as far as the generator code is concerned, basically the same as
    > passing a mutable object to a (possibly standalone) generator. The object
    > will likely be called self, and the value is stored in an attribute of it.


    Fair enough, but who cares what the generator code thinks? It's what
    the programmer has to deal with that matters, and an object is going to
    have a cleaner interface than a generator-plus-mutable-object.

    > Probably this is indeed the best way as it doesn't require the programmer
    > to remember any side-effects.
    >
    > It does, however, require a lot of attribute access, which does cost some
    > cycles.


    Hmm... "premature optimization" is all I have to say about that.

    -Peter
     
    Peter Hansen, Jun 11, 2005
    #4
  5. Thomas Lotze

    Mike Meyer Guest

    Thomas Lotze <> writes:
    > A related problem is skipping whitespace. Sometimes you don't care about
    > whitespace tokens, sometimes you do. Using generators, you can either set
    > a state variable, say on the object the generator is an attribute of,
    > before each call that requires a deviation from the default, or you can
    > have a second generator for filtering the output of the first. Again, both
    > solutions are ugly (the second more so than the first). One uses
    > side-effects instead of passing parameters, which is what one really
    > wants, while the other is dumb and slow (filtering can be done without
    > taking a second look at things).


    I wouldn't call the first method ugly; I'd say it's *very* OO.

    Think of an object instance as a machine. It has various knobs,
    switches and dials you can use to control its behavior, and displays
    you can use to read data from it, or parts of its state. A switch
    labelled "ignore whitespace" is a perfectly reasonable thing for a
    tokenizing machine to have.

    Yes, such a switch gets the desired behavior as a side effect. Then
    again, a generator that returns tokens has a desired behavior
    (advancing to the next token) as a side effect(*). If you think about
    these things as the state of the object, rather than "side effects",
    it won't seem nearly as ugly. In fact, part of the point of using a
    class is to encapsulate the state required for some activity in one
    place.

    Wanting to do everything via parameters to methods is a very top-down
    way of looking at the problem. It's not necessarily correct in an OO
    environment.

    <mike

    *) It's noticeable that some OO languages/libraries avoid this side
    effect: the read method updates an attribute, so you do the read then
    get the object read from the attribute. That's very OO, but not very
    pythonic.
    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Jun 11, 2005
    #5
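
The "ignore whitespace" switch described in the post above might be sketched as follows. The class name, the `ignore_whitespace` attribute, and the `_TOKEN` pattern are hypothetical; the point is only that the generator consults the object's state each time it produces a token.

```python
import re

# Hypothetical token pattern standing in for real tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


class Tokenizer:
    def __init__(self, data):
        self.data = data
        self.ignore_whitespace = False  # the "switch" on the machine

    def tokens(self):
        for match in _TOKEN.finditer(self.data):
            token = match.group()
            # The switch is read at the moment each token is produced,
            # so flipping it mid-iteration changes what comes next.
            if self.ignore_whitespace and token.isspace():
                continue
            yield token


tok = Tokenizer("a  b c")
gen = tok.tokens()
with_space = [next(gen), next(gen)]  # whitespace tokens are still reported
tok.ignore_whitespace = True
rest = list(gen)                     # whitespace tokens are now dropped
```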
  6. Thomas Lotze

    Thomas Lotze Guest

    Mike Meyer wrote:

    > Yes, such a switch gets the desired behavior as a side effect. Then again,
    > a generator that returns tokens has a desired behavior (advancing to the
    > next token) as a side effect(*).


    That's certainly true.

    > If you think about these things as the
    > state of the object, rather than "side effects", it won't seem nearly as
    > ugly. In fact, part of the point of using a class is to encapsulate the
    > state required for some activity in one place.
    >
    > Wanting to do everything via parameters to methods is a very top-down way
    > of looking at the problem. It's not necessarily correct in an OO
    > environment.


    What worries me about the approach of changing state before making a
    next() call instead of doing it at the same time by passing a parameter is
    that the state change is meant to affect only a single call. The picture
    might fit better (IMO) if it didn't look so much like working around the
    fact that the next() call can't take parameters for some technical reason.

    I agree that decoupling state changes and next() calls would be perfectly
    beautiful if they were decoupled in the problem one wants to model. They
    aren't.

    > *) It's noticeable that some OO languages/libraries avoid this side
    > effect: the read method updates an attribute, so you do the read then
    > get the object read from the attribute. That's very OO, but not very
    > pythonic.


    Just out of curiosity: What makes you state that that behaviour isn't
    pythonic? Is it because Python happens to do it differently, because of a
    gut feeling, or because of some design principle behind Python I fail to
    see right now?

    --
    Thomas
     
    Thomas Lotze, Jun 12, 2005
    #6
  7. Thomas Lotze

    Thomas Lotze Guest

    Peter Hansen wrote:

    > Fair enough, but who cares what the generator code thinks? It's what the
    > programmer has to deal with that matters, and an object is going to have a
    > cleaner interface than a generator-plus-mutable-object.


    That's right, and among the choices discussed, the object is the one I do
    prefer. I just don't feel really satisfied...

    >> It does, however, require a lot of attribute access, which does cost
    >> some cycles.

    >
    > Hmm... "premature optimization" is all I have to say about that.


    But when is the right time to optimize? There's a point when the thing
    runs, does the right thing and - by the token of "make it run, make it
    right, make it fast" - might get optimized. And if there are places in a
    PDF library that might justly be optimized, the tokenizer is certainly one
    of them as it gets called really often.

    Still, I'm going to focus on cleaner code and, first and foremost, a clean
    API if it comes to a decision between these goals and optimization - at
    least as long as I'm talking about pure Python code.

    --
    Thomas
     
    Thomas Lotze, Jun 12, 2005
    #7
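
One way to settle the attribute-access question is to measure it before deciding anything. The sketch below uses the standard `timeit` module to compare updating a position through an attribute with updating a local variable; `Holder`, `via_attribute`, and `via_local` are made-up names, and the timings will vary by machine and interpreter, so no result is claimed here.

```python
import timeit


class Holder:
    def __init__(self):
        self.pos = 0


def via_attribute(holder, n):
    # Every increment goes through an attribute lookup and store.
    for _ in range(n):
        holder.pos += 1
    return holder.pos


def via_local(holder, n):
    # Copy to a local, work on it, write back once at the end.
    pos = holder.pos
    for _ in range(n):
        pos += 1
    holder.pos = pos
    return pos


n = 10_000
t_attr = timeit.timeit(lambda: via_attribute(Holder(), n), number=20)
t_local = timeit.timeit(lambda: via_local(Holder(), n), number=20)
print(f"attribute: {t_attr:.4f}s  local: {t_local:.4f}s")
```

Whether any difference matters depends on how much other work the tokenizer does per token, which is why measuring the real code is more informative than a micro-benchmark like this one.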
  8. Thomas Lotze

    Thomas Lotze Guest

    Thomas Lotze wrote:

    > Does anybody here have a third way of dealing with this?


    Sleeping a night sometimes is an insightful exercise *g*

    I realized that there is a reason why fiddling with the pointer from
    outside the generator defeats much of the purpose of using one. The
    implementation using a simple method call instead of a generator needs
    to store some internal state variables on an object to save them for the
    next call, among them the pointer and a tokenization mode.

    I could make the thing a generator by turning the single return
    statement into a yield statement and adding a loop, leaving all the
    importing and exporting of the pointer intact - after all, someone might
    reset the pointer between next() calls.

    This is, however, hardly using all the possibilities a generator allows.
    I'd rather like to get rid of the mode switches by doing special things
    where I detect the need for them, yielding the result, and proceeding as
    before. But as soon as I move information from explicit (state variables
    that can be reset along with the pointer) to implicit (the point where
    the generator is suspended after yielding a token), resetting the
    pointer will lead to inconsistencies.

    So, it seems to me that if I do want to use generators for any practical
    reason instead of just because generators are way cool, they need to be
    instantiated anew each time the pointer is reset, for simple consistency
    reasons.

    Now a very simple idea struck me: If one is worried about throwing away
    a generator as a side-effect of resetting the tokenization pointer, why
    not define the whole tokenizer as not being resettable? Then the thing
    needs to be re-instantiated very explicitly every time it is pointed
    somewhere. While still feeling slightly awkward, it has lost the threat
    of doing unexpected things.

    Does this sound reasonable?

    --
    Thomas
     
    Thomas Lotze, Jun 12, 2005
    #8
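
The non-resettable design proposed in the post above could be sketched as a plain generator function whose position is a local variable. The function name `tokenize`, its `start` parameter, and the `_TOKEN` pattern are assumptions for illustration only.

```python
import re

# Hypothetical token pattern standing in for real PDF tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


def tokenize(data, start=0):
    """Yield tokens from data beginning at start; cannot be repositioned."""
    pos = start  # purely local, so nothing outside can change it
    while pos < len(data):
        match = _TOKEN.match(data, pos)
        pos = match.end()
        yield match.group()


data = "1 0 obj << /Type /Page >> endobj"
tokens = tokenize(data)
first = next(tokens)

# "Seeking" is explicit: build a fresh generator and rebind the name.
tokens = tokenize(data, start=8)
after_seek = next(tokens)
```

Because the position never leaves the generator, any mode information can safely live in the point where the generator is suspended, which is the consistency argument made in the post.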
  9. Thomas Lotze

    Thomas Lotze Guest

    Thomas Lotze wrote:

    > A related problem is skipping whitespace. Sometimes you don't care about
    > whitespace tokens, sometimes you do. Using generators, you can either set
    > a state variable, say on the object the generator is an attribute of,
    > before each call that requires a deviation from the default, or you can
    > have a second generator for filtering the output of the first.


    Last night's sleep was really productive - I've also found another way
    to tackle this problem, and it's really simple IMO. One could pass the
    parameter at generator instantiation time and simply create two
    generators behaving differently. They work on the same data and use the
    same source code, only with a different parametrization.

    All one has to care about is that they never get out of sync. If the
    data pointer is an object attribute, it's clear how to do it. Otherwise,
    both could acquire their data from a common generator that yields the
    PDF content (or a buffer representing part of it) character by
    character. This is even faster than keeping a pointer and using it as an
    index on the data.

    --
    Thomas
     
    Thomas Lotze, Jun 12, 2005
    #9
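
A minimal sketch of the first variant mentioned above, where the data pointer is an object attribute and two generators differ only in a parameter given at creation time. `TokenSource`, `skip_whitespace`, and the `_TOKEN` pattern are hypothetical names chosen for the example.

```python
import re

# Hypothetical token pattern standing in for real tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


class TokenSource:
    def __init__(self, data):
        self.data = data
        self.pos = 0  # shared by every generator created from this object

    def tokens(self, skip_whitespace=False):
        # Same code for both generators; only the parameter differs.
        while self.pos < len(self.data):
            match = _TOKEN.match(self.data, self.pos)
            self.pos = match.end()
            token = match.group()
            if skip_whitespace and token.isspace():
                continue
            yield token


source = TokenSource("a  b c")
all_tokens = source.tokens()
no_space = source.tokens(skip_whitespace=True)

# Interleave the two generators; they stay in sync through source.pos.
seq = [
    next(all_tokens),
    next(all_tokens),
    next(no_space),
    next(all_tokens),
    next(no_space),
]
```

The generators stay in sync only because each one re-reads `self.pos` before matching and keeps no lookahead of its own.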
  10. Thomas Lotze

    Kent Johnson Guest

    Thomas Lotze wrote:
    > Mike Meyer wrote:
    > What worries me about the approach of changing state before making a
    > next() call instead of doing it at the same time by passing a parameter is
    > that the state change is meant to affect only a single call. The picture
    > might fit better (IMO) if it didn't look so much like working around the
    > fact that the next() call can't take parameters for some technical reason.


    I suggest you make the tokenizer class itself into an iterator. Then you can define additional next() methods with additional parameters. You could wrap an actual generator for the convenience of having multiple yield statements. For example (borrowing Peter's PdfTokenizer):

    class PdfTokenizer:
        def __init__(self, ...):
            # set up initial state
            self._tokenizer = self._getTokens()

        def __iter__(self):
            return self

        def next(self, options=None):
            # set self state according to options, if any
            n = self._tokenizer.next()
            # restore default state
            return n

        def nextIgnoringSpace(self):
            # alternate way of specifying variations
            # ...

        def _getTokens(self):
            while whatever:
                yield token

        def seek(self, newPosition):
            # change state here

    Kent
     
    Kent Johnson, Jun 12, 2005
    #10
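
In current Python the iterator protocol uses `__next__` rather than a `next` method, so the iterator-class idea above might be sketched like this. The `data` and `pos` attributes, the snake_case names, and the `_TOKEN` pattern are assumptions, not part of the original post.

```python
import re

# Hypothetical token pattern standing in for real PDF tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


class PdfTokenizer:
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self._tokens = self._get_tokens()  # wrapped generator

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._tokens)

    def next_ignoring_space(self):
        # An extra "next" variant: the option applies to this call only.
        token = next(self)
        while token.isspace():
            token = next(self)
        return token

    def _get_tokens(self):
        while self.pos < len(self.data):
            match = _TOKEN.match(self.data, self.pos)
            self.pos = match.end()
            yield match.group()

    def seek(self, new_position):
        self.pos = new_position


tok = PdfTokenizer("a  b c")
first = next(tok)
second = tok.next_ignoring_space()  # skips the whitespace run
third = next(tok)                   # whitespace is reported again
```

One limitation of this sketch: once the wrapped generator has run off the end of the data it is finished, and a later `seek()` will not revive it.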
  11. Thomas Lotze

    Terry Reedy Guest

    "news:...
    > Thomas Lotze <> writes:
    >> A related problem is skipping whitespace. Sometimes you don't care about
    >> whitespace tokens, sometimes you do. Using generators, you can either
    >> set
    >> a state variable, say on the object the generator is an attribute of,
    >> before each call that requires a deviation from the default, or you can
    >> have a second generator for filtering the output of the first. Again,
    >> both
    >> solutions are ugly (the second more so than the first).


    Given an application that *only* wanted non-white tokens, or tokens meeting
    any other condition, filtering is, to me, exactly the right thing to do and
    not ugly at all. See itertools or roll your own.

    Given an application that intermittently wanted to skip over whitespace
    tokens, I would use a *function*, not a second generator, that filtered the
    first when, and only when, that was wanted. Given next_tok, the next
    method of a token generator, and iswhite, a predicate for whitespace
    tokens, this is simply

    def next_nonwhite():
        ret = next_tok()
        while iswhite(ret):
            ret = next_tok()
        return ret

    A generic method of sending data to a generator on the fly, without making
    it an attribute of a class, is to give the generator function a mutable
    parameter, a list, dict, or instance, which you mutate from outside as
    desired to change the operation of the generator.

    The pair of statements
    <mutate generator mutable>
    val = gen.next()
    can, of course, be wrapped in various possible gennext(args) functions at
    the cost of an additional function call.

    Terry J. Reedy
     
    Terry Reedy, Jun 12, 2005
    #11
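
The mutable-parameter technique and the `gennext(args)` wrapper mentioned above could be sketched as follows. The `options` dict, its `skip_whitespace` key, the exact `gennext` signature, and the `_TOKEN` pattern are all assumptions made for the example.

```python
import re

# Hypothetical token pattern standing in for real tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


def tokenize(data, options):
    for match in _TOKEN.finditer(data):
        token = match.group()
        # options is mutated from outside between next() calls.
        if options["skip_whitespace"] and token.isspace():
            continue
        yield token


def gennext(gen, options, **changes):
    """Apply changes for one next() call only, then restore the old values."""
    saved = {key: options[key] for key in changes}
    options.update(changes)
    try:
        return next(gen)
    finally:
        options.update(saved)


options = {"skip_whitespace": False}
gen = tokenize("a  b c", options)
first = next(gen)
second = gennext(gen, options, skip_whitespace=True)  # skips the whitespace run
third = next(gen)                                     # default behaviour again
```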
  12. Thomas Lotze

    Steve Holden Guest

    Thomas Lotze wrote:
    > Peter Hansen wrote:
    >
    >
    >>Thomas Lotze wrote:
    >>
    >>>I can see two possibilities to do this: either the current file position
    >>>has to be read from somewhere (say, a mutable object passed to the
    >>>generator) after each yield, [...]

    >>
    >>The third approach, which is certain to be cleanest for this situation, is
    >>to have a custom class which stores the state information you need, and
    >>have the generator simply be a method in that class.

    >
    >
    > Which is, as far as the generator code is concerned, basically the same as
    > passing a mutable object to a (possibly standalone) generator. The object
    > will likely be called self, and the value is stored in an attribute of it.
    >
    > Probably this is indeed the best way as it doesn't require the programmer
    > to remember any side-effects.
    >
    > It does, however, require a lot of attribute access, which does cost some
    > cycles.
    >

    Hmm, you could probably make your program run even quicker if you took
    out all the code :)

    Don't assume that there will be a perceptible impact on performance
    until you have written it the easy way. I'll leave you to Google for
    quotes from Donald Knuth about premature optimization.

    > A related problem is skipping whitespace. Sometimes you don't care about
    > whitespace tokens, sometimes you do. Using generators, you can either set
    > a state variable, say on the object the generator is an attribute of,
    > before each call that requires a deviation from the default, or you can
    > have a second generator for filtering the output of the first. Again, both
    > solutions are ugly (the second more so than the first). One uses
    > side-effects instead of passing parameters, which is what one really
    > wants, while the other is dumb and slow (filtering can be done without
    > taking a second look at things).
    >

    And, again, your obsession with performance obscures the far more
    important issue: which solution is easiest to write and maintain. If the
    user then turns up short of cycles they can always elect to migrate to a
    faster computer: this will almost inevitably be cheaper than paying you
    to speed the program up.

    > All of this makes me wonder whether more elaborate generator semantics
    > (maybe even allowing for passing arguments in the next() call) would not
    > be useful. And yes, I have read the recent postings on PEP 343 - sigh.
    >

    Sigh indeed. But if you allow next() calls to take arguments you are
    effectively arguing for the introduction of full coroutines into the
    language, and I suspect there would be pretty limited support for that.

    regards
    Steve
    --
    Steve Holden +1 703 861 4237 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Python Web Programming http://pydish.holdenweb.com/
     
    Steve Holden, Jun 13, 2005
    #12
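
For readers coming to this thread later: passing a value into a running generator did become possible. Since Python 2.5 (PEP 342), generators have a `send()` method, and the sent value becomes the result of the `yield` expression. A minimal sketch of a per-call whitespace option built on it, with `tokenize` and the `_TOKEN` pattern as hypothetical stand-ins:

```python
import re

# Hypothetical token pattern standing in for real tokenizing rules.
_TOKEN = re.compile(r"\s+|\S+")


def tokenize(data):
    skip_whitespace = False
    for match in _TOKEN.finditer(data):
        token = match.group()
        if skip_whitespace and token.isspace():
            continue
        # send(value) resumes the generator with value as the result of
        # yield; a plain next() resumes it with None.
        option = yield token
        skip_whitespace = bool(option)


gen = tokenize("a  b c")
first = next(gen)        # a generator must be started with next() first
second = gen.send(True)  # ask for the next token, skipping whitespace
third = next(gen)        # back to the default: whitespace is reported
```

The option applies to exactly one call here, which is the coupling between the state change and the request for the next token that was asked for earlier in the thread.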
  13. Thomas Lotze

    Thomas Lotze Guest

    Thomas Lotze wrote:

    > I'm trying to figure out what is the most pythonic way to interact with a
    > generator.


    JFTR, so you don't think I'd suddenly lost interest: I won't be able to
    respond for a couple of days because I've just incurred a nice little
    hospital session... will be back next week.

    --
    Thomas
     
    Thomas Lotze, Jun 14, 2005
    #13
