Possible File iteration bug

Discussion in 'Python' started by Billy Mays, Jul 14, 2011.

  1. Billy Mays

    Billy Mays Guest

    I noticed that if a file is being continuously written to, the file
    generator does not notice it:



    def getLines(f):
        lines = []
        for line in f:
            lines.append(line)
        return lines

    with open('/var/log/syslog', 'rb') as f:
        lines = getLines(f)
        # do some processing with lines
        # /var/log/syslog gets updated in the meantime

        # always returns an empty list, even though f has more data
        lines = getLines(f)




    I found a workaround by adding f.seek(0,1) directly before the last
    getLines() call, but is this the expected behavior? Calling f.tell()
    right after the first getLines() call shows that it isn't reset back to
    0. Is this correct or a bug?

    --
    Bill
     
    Billy Mays, Jul 14, 2011
    #1

  2. Billy Mays

    Ian Kelly Guest

    On Thu, Jul 14, 2011 at 1:46 PM, Billy Mays <> wrote:
    > def getLines(f):
    >    lines = []
    >    for line in f:
    >        lines.append(line)
    >    return lines
    >
    > with open('/var/log/syslog', 'rb') as f:
    >    lines = getLines(f)
    >    # do some processing with lines
    >    # /var/log/syslog gets updated in the meantime
    >
    >    # always returns an empty list, even though f has more data
    >    lines = getLines(f)
    >
    >
    >
    >
    > I found a workaround by adding f.seek(0,1) directly before the last
    > getLines() call, but is this the expected behavior?  Calling f.tell() right
    > after the first getLines() call shows that it isn't reset back to 0.  Is
    > this correct or a bug?


    This is expected. Part of the iterator protocol is that once an
    iterator raises StopIteration, it should continue to raise
    StopIteration on subsequent next() calls.
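    As a minimal illustration (mine, not part of the original post), the
    same rule holds for any generator:

    ```python
    # Once a generator (or any iterator) is exhausted, every further
    # next() call raises StopIteration again, so a second pass yields nothing.
    def gen():
        yield 1
        yield 2

    it = gen()
    print(list(it))  # [1, 2] -- first pass consumes the iterator
    print(list(it))  # []     -- exhausted; list() swallows the StopIteration
    ```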
     
    Ian Kelly, Jul 14, 2011
    #2

  3. Billy Mays

    Billy Mays Guest

    On 07/14/2011 04:00 PM, Ian Kelly wrote:
    > On Thu, Jul 14, 2011 at 1:46 PM, Billy Mays<> wrote:
    >> def getLines(f):
    >>     lines = []
    >>     for line in f:
    >>         lines.append(line)
    >>     return lines
    >>
    >> with open('/var/log/syslog', 'rb') as f:
    >>     lines = getLines(f)
    >>     # do some processing with lines
    >>     # /var/log/syslog gets updated in the meantime
    >>
    >>     # always returns an empty list, even though f has more data
    >>     lines = getLines(f)
    >>
    >>
    >>
    >>
    >> I found a workaround by adding f.seek(0,1) directly before the last
    >> getLines() call, but is this the expected behavior? Calling f.tell() right
    >> after the first getLines() call shows that it isn't reset back to 0. Is
    >> this correct or a bug?

    >
    > This is expected. Part of the iterator protocol is that once an
    > iterator raises StopIteration, it should continue to raise
    > StopIteration on subsequent next() calls.


    Is there any way to just create a new generator that clears its `closed`
    status?

    --
    Bill
     
    Billy Mays, Jul 14, 2011
    #3
  4. Billy Mays <> writes:

    > Is there any way to just create a new generator that clears its
    > `closed` status?


    You can define getLines in terms of the readline file method, which does
    return new data when it is available.

    def getLines(f):
        lines = []
        while True:
            line = f.readline()
            if line == '':
                break
            lines.append(line)
        return lines

    or, more succinctly:

    def getLines(f):
        return list(iter(f.readline, ''))
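    A quick sketch (the io.StringIO harness is mine, in Python 3 syntax,
    just to make this runnable without a real log file) of how the
    two-argument iter() form behaves:

    ```python
    import io

    # iter(callable, sentinel) calls the callable until it returns the
    # sentinel ('' here), then stops -- but the file position is preserved,
    # so a fresh iter() call picks up data appended later.
    f = io.StringIO("one\ntwo\n")
    print(list(iter(f.readline, '')))   # ['one\n', 'two\n']

    pos = f.tell()
    f.write("three\n")                  # simulate another writer appending
    f.seek(pos)                         # restore our read position
    print(list(iter(f.readline, '')))   # ['three\n']
    ```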
     
    Hrvoje Niksic, Jul 14, 2011
    #4
  5. Billy Mays

    Terry Reedy Guest

    On 7/14/2011 3:46 PM, Billy Mays wrote:
    > I noticed that if a file is being continuously written to, the file
    > generator does not notice it:


    Because it does not look, as Ian explained.

    > def getLines(f):
    >     lines = []
    >     for line in f:
    >         lines.append(line)
    >     return lines


    This nearly duplicates .readlines, except for using f as an iterator.
    Try the following (untested):

    with open('/var/log/syslog', 'rb') as f:
        lines = f.readlines()
        # do some processing with lines
        # /var/log/syslog gets updated in the meantime
        lines = f.readlines()

    People regularly do things like this with readline, so it is possible.
    If above does not work, try (untested):

    def getlines(f):
        lines = []
        while True:
            l = f.readline()
            if l: lines.append(l)
            else: return lines

    --
    Terry Jan Reedy
     
    Terry Reedy, Jul 14, 2011
    #5
  6. Billy Mays

    Guest

    On Jul 14, 9:46 pm, Billy Mays <> wrote:
    > I noticed that if a file is being continuously written to, the file
    > generator does not notice it:
    >
    > def getLines(f):
    >      lines = []
    >      for line in f:
    >          lines.append(line)
    >      return lines


    what's wrong with file.readlines() ?
     
    , Jul 15, 2011
    #6
  7. Billy Mays

    Billy Mays Guest

    On 07/15/2011 04:01 AM, wrote:
    > On Jul 14, 9:46 pm, Billy Mays<> wrote:
    >> I noticed that if a file is being continuously written to, the file
    >> generator does not notice it:
    >>
    >> def getLines(f):
    >>     lines = []
    >>     for line in f:
    >>         lines.append(line)
    >>     return lines

    >
    > what's wrong with file.readlines() ?


    Using that will read the entire file into memory which may not be
    possible. In the library reference, it mentions that using the
    generator (which calls file.next()) uses a read ahead buffer to
    efficiently loop over the file. If I call .readline() myself, I forfeit
    that performance gain.

    I was thinking that a convenient solution to this problem would be to
    introduce a new Exception called PauseIteration, which would signal to the
    caller that there is no more data for now, but not to close down the
    generator entirely.

    --
    Bill
     
    Billy Mays, Jul 15, 2011
    #7
  8. On 14.07.2011 21:46, Billy Mays wrote:
    > I noticed that if a file is being continuously written to, the file
    > generator does not notice it:


    Yes. That's why there were alternative suggestions in your last thread
    "How to write a file generator".

    To repeat mine: an object which is not an iterator, but an iterable.

    class Follower(object):
        def __init__(self, file):
            self.file = file
        def __iter__(self):
            while True:
                l = self.file.readline()
                if not l: return
                yield l

    if __name__ == '__main__':
        import time
        f = Follower(open("/var/log/messages"))
        while True:
            for i in f: print i,
            print "all read, waiting..."
            time.sleep(4)

    Here, you iterate over the object until it is exhausted, but you can
    iterate again to get the next entries.

    The difference to the file as iterator is, as you have noticed, that
    once an iterator is exhausted, it will be so forever.

    But if you have an iterable, like the Follower above, you can reuse it
    as you want.
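    A small check of that reuse property (Python 3; the Follower class is
    from the post above, while the io.StringIO harness standing in for the
    real log file is mine):

    ```python
    import io

    class Follower(object):
        def __init__(self, file):
            self.file = file
        def __iter__(self):
            # each for-loop gets a fresh generator; exhaustion is temporary
            while True:
                l = self.file.readline()
                if not l:
                    return
                yield l

    f = io.StringIO("a\n")
    fol = Follower(f)
    print([l for l in fol])   # ['a\n']
    print([l for l in fol])   # []  -- nothing new *yet*

    pos = f.tell(); f.write("b\n"); f.seek(pos)   # the "file" grows
    print([l for l in fol])   # ['b\n'] -- a new loop sees the new data
    ```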
     
    Thomas Rachel, Jul 15, 2011
    #8
  9. Billy Mays

    Billy Mays Guest

    On 07/15/2011 08:39 AM, Thomas Rachel wrote:
    > On 14.07.2011 21:46, Billy Mays wrote:
    >> I noticed that if a file is being continuously written to, the file
    >> generator does not notice it:

    >
    > Yes. That's why there were alternative suggestions in your last thread
    > "How to write a file generator".
    >
    > To repeat mine: an object which is not an iterator, but an iterable.
    >
    > class Follower(object):
    >     def __init__(self, file):
    >         self.file = file
    >     def __iter__(self):
    >         while True:
    >             l = self.file.readline()
    >             if not l: return
    >             yield l
    >
    > if __name__ == '__main__':
    >     import time
    >     f = Follower(open("/var/log/messages"))
    >     while True:
    >         for i in f: print i,
    >         print "all read, waiting..."
    >         time.sleep(4)
    >
    > Here, you iterate over the object until it is exhausted, but you can
    > iterate again to get the next entries.
    >
    > The difference to the file as iterator is, as you have noticed, that
    > once an iterator is exhausted, it will be so forever.
    >
    > But if you have an iterable, like the Follower above, you can reuse it
    > as you want.



    I did see it, but it feels less pythonic than using a generator. I did
    end up using an extra class to get more data from the file, but it seems
    like overhead. Also, in the python docs, file.next() mentions there
    being a performance gain for using the file generator (iterator?) over
    the readline function.

    Really what would be useful is some sort of PauseIteration Exception
    which doesn't close the generator when raised, but indicates to the
    looping header that there is no more data for now.

    --
    Bill
     
    Billy Mays, Jul 15, 2011
    #9
  10. On Fri, Jul 15, 2011 at 10:52 PM, Billy Mays <> wrote:
    > Really what would be useful is some sort of PauseIteration Exception which
    > doesn't close the generator when raised, but indicates to the looping header
    > that there is no more data for now.
    >


    All you need is a sentinel yielded value (eg None).
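    A sketch of that idea (the names and the io.StringIO harness are mine,
    not ChrisA's; Python 3): the generator never terminates, and yields
    None as the "no data right now" sentinel:

    ```python
    import io

    def follow(f):
        # yields lines as they appear, or None when there is nothing new;
        # the generator itself never raises StopIteration
        while True:
            line = f.readline()
            yield line if line else None

    f = io.StringIO("a\n")
    g = follow(f)
    print(next(g))     # 'a\n'
    print(next(g))     # None -- no data *for now*, generator still alive

    pos = f.tell(); f.write("b\n"); f.seek(pos)   # simulate the file growing
    print(next(g))     # 'b\n' -- iteration resumes without a new generator
    ```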

    ChrisA
     
    Chris Angelico, Jul 15, 2011
    #10
  11. On 15.07.2011 14:26, Billy Mays wrote:

    > I was thinking that a convenient solution to this problem would be to
    > introduce a new Exception called PauseIteration, which would signal to the
    > caller that there is no more data for now, but not to close down the
    > generator entirely.


    Alas, an exception thrown causes the generator to stop.


    Thomas
     
    Thomas Rachel, Jul 15, 2011
    #11
  12. On 15.07.2011 14:52, Billy Mays wrote:

    > Also, in the python docs, file.next() mentions there
    > being a performance gain for using the file generator (iterator?) over
    > the readline function.


    Here, the question is whether this performance gain is really relevant
    AKA "feelable". The file object seems to keep an internal buffer for
    readline() that is distinct from the read-ahead buffer used when
    iterating. Why this is not the same buffer is unclear to me.


    > Really what would be useful is some sort of PauseIteration Exception
    > which doesn't close the generator when raised, but indicates to the
    > looping header that there is no more data for now.


    a None or other sentinel value would do this as well (as ChrisA already
    said).


    Thomas
     
    Thomas Rachel, Jul 15, 2011
    #12
  13. Billy Mays

    Billy Mays Guest

    On 07/15/2011 10:28 AM, Thomas Rachel wrote:
    > On 15.07.2011 14:52, Billy Mays wrote:
    >
    >> Also, in the python docs, file.next() mentions there
    >> being a performance gain for using the file generator (iterator?) over
    >> the readline function.

    >
    > Here, the question is whether this performance gain is really relevant
    > AKA "feelable". The file object seems to keep an internal buffer for
    > readline() that is distinct from the read-ahead buffer used when
    > iterating. Why this is not the same buffer is unclear to me.
    >
    >
    >> Really what would be useful is some sort of PauseIteration Exception
    >> which doesn't close the generator when raised, but indicates to the
    >> looping header that there is no more data for now.

    >
    > a None or other sentinel value would do this as well (as ChrisA already
    > said).
    >
    >
    > Thomas


    A sentinel does provide a work around, but it also passes the problem
    onto the caller rather than the callee:

    def getLines(f):
        while True:
            yield f.readline()

    def bar(f):
        for line in getLines(f):
            if not line: # I now have to check here instead of in getLines
                break
            foo(line)

    def baz(f):
        for line in getLines(f) if line: # this would be nice for generators
            foo(line)


    bar() is the correct way to do things, but I think baz looks cleaner. I
    found myself writing baz() first, finding it wasn't syntactically
    correct, and then converting it to bar(). The if portion of the loop
    would be nice for generators, since it seems like the proper place for
    the sentinel to be matched. Also, with potentially infinite (but
    pauseable) data, there needs to be a nice way to catch stuff like this.

    --
    Bill
     
    Billy Mays, Jul 15, 2011
    #13
  14. On 15.07.2011 16:42, Billy Mays wrote:

    > A sentinel does provide a work around, but it also passes the problem
    > onto the caller rather than the callee:


    That is right.


    BTW, there is another, maybe easier way to do this:

    for line in iter(f.readline, ''):
        do_stuff(line)

    This provides an iterator which yields return values from the given
    callable until '' is returned, in which case the iterator stops.

    As the caller, you need to know that you can always try iterating again.

    The functionality which you ask for COULD be accomplished in two ways:

    Firstly, one could simply break the "contract" of an iterator (which
    would be a bad thing): just have your next() raise a StopIteration and
    then continue nevertheless.

    Secondly, one could do a similar thing and have the next() method raise
    a different exception. Then the caller has as well to know about, but I
    cannot find a passage in the docs which prohibit this.

    I just have tested this:
    def r(x): return x
    def y(x): raise x

    def l(f, x): return lambda: f(x)

    class I(object):
        def __init__(self):
            self.l = [l(r, 1), l(r, 2), l(y, Exception), l(r, 3)]
        def __iter__(self):
            return self
        def next(self):
            if not self.l: raise StopIteration
            c = self.l.pop(0)
            return c()

    i = I()
    try:
        for j in i: print j
    except Exception, e: print "E:", e
    print tuple(i)

    and it works.


    So I think it COULD be ok to do this:

    class NotNow(Exception): pass

    class F(object):
        def __init__(self, f):
            self.file = f
        def __iter__(self):
            return self
        def next(self):
            l = self.file.readline()
            if not l: raise NotNow
            return l

    f = F(file("/var/log/messages"))
    import time
    while True:
        try:
            for i in f: print "", i,
        except NotNow, e:
            print "<pause>"
            time.sleep(1)


    HTH,

    Thomas
     
    Thomas Rachel, Jul 15, 2011
    #14
  15. Billy Mays

    Ethan Furman Guest

    Billy Mays wrote:
    > A sentinel does provide a work around, but it also passes the problem
    > onto the caller rather than the callee


    The callee can easily take care of it -- just block until more is ready.
    If blocking is not an option, then the caller has to deal with it no
    matter how callee is implemented -- an exception, a sentinel, or some
    signal that says "nope, nothing for ya! try back later!"
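    One way to sketch the blocking variant (Python 3; the poll interval and
    the names are mine, not Ethan's):

    ```python
    import time

    def follow_blocking(f, poll=0.5):
        # the callee never reports "no data": it sleeps and retries until
        # a complete line arrives, so the caller just iterates forever
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll)
    ```

    The caller's loop then needs no sentinel check, at the cost of never
    being able to do other work while waiting.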

    ~Ethan~
     
    Ethan Furman, Jul 15, 2011
    #15
  16. Billy Mays

    Terry Reedy Guest

    On 7/15/2011 8:26 AM, Billy Mays wrote:
    > On 07/15/2011 04:01 AM, wrote:
    >> On Jul 14, 9:46 pm, Billy Mays<> wrote:
    >>> I noticed that if a file is being continuously written to, the file
    >>> generator does not notice it:
    >>>
    >>> def getLines(f):
    >>>     lines = []
    >>>     for line in f:
    >>>         lines.append(line)
    >>>     return lines

    >>
    >> what's wrong with file.readlines() ?

    >
    > Using that will read the entire file into memory which may not be


    So will getLines.

    > possible. In the library reference, it mentions that using the generator
    > (which calls file.next()) uses a read ahead buffer to efficiently loop
    > over the file. If I call .readline() myself, I forfeit that performance
    > gain.


    Are you sure? Have you measured the difference?

    --
    Terry Jan Reedy
     
    Terry Reedy, Jul 15, 2011
    #16
  17. Billy Mays

    Terry Reedy Guest

    On 7/15/2011 10:42 AM, Billy Mays wrote:
    > On 07/15/2011 10:28 AM, Thomas Rachel wrote:
    >> On 15.07.2011 14:52, Billy Mays wrote:


    >>> Really what would be useful is some sort of PauseIteration Exception
    >>> which doesn't close the generator when raised, but indicates to the
    >>> looping header that there is no more data for now.

    >>
    >> a None or other sentinel value would do this as well (as ChrisA already
    >> said).


    > A sentinel does provide a work around, but it also passes the problem
    > onto the caller rather than the callee:


    No more so than a new exception that the caller has to recognize.

    --
    Terry Jan Reedy
     
    Terry Reedy, Jul 15, 2011
    #17
  18. Billy Mays wrote:

    > I was thinking that a convenient solution to this problem would be to
    > introduce a new Exception called PauseIteration, which would signal to the
    > caller that there is no more data for now, but not to close down the
    > generator entirely.


    It never fails to amuse me how often people consider it "convenient" to add
    new built-in functionality to Python to solve every little issue. As
    pie-in-the-sky wishful-thinking, it can be fun, but people often mean it to
    be taken seriously.

    Okay, we've come up with the solution of a new exception, PauseIteration,
    that the iterator protocol will recognise. Now we have to:

    - write a PEP for it, setting out the case for it;
    - convince the majority of CPython developers that the idea is a good one,
    which might mean writing a proof-of-concept version;
    - avoid having the Jython, IronPython and PyPy developers come back and say
    that it is impossible under their implementations;
    - avoid having Guido veto it;
    - write an implementation or patch adding that functionality;
    - try to ensure it doesn't cause any regressions in the CPython tests;
    - fix the regressions that do occur despite our best efforts;
    - ensure that there are no backwards compatibility issues to be dealt with;
    - write a test suite for it;
    - write documentation for it;
    - unless we're some of the most senior Python developers, have the patch
    reviewed before it is accepted;
    - fix the bugs that have come to light since the first version;
    - make sure copyright is assigned to the Python Software Foundation;
    - wait anything up to a couple of years for the latest version of Python,
    including the patch, to be released as production-ready software;
    - upgrade our own Python installation to use the latest version, if we can
    and aren't forced to stick with an older version

    and now, at long last, we can use this convenient feature in our own code!
    Pretty convenient, yes?

    (If you think I exaggerate, consider the "yield from" construct, which has
    Guido's support and was pretty uncontroversial. Two and a half years later,
    it is now on track to be added to Python 3.3.)

    Or you can look at the various recipes on the Internet for writing tail-like
    file viewers in Python, and solve the problem the boring old fashioned way.
    Here's one that blocks while the file is unchanged:

    http://lethain.com/tailing-in-python/

    Modifying it to be non-blocking should be pretty straightforward -- just add
    a `yield ""` after the `if not line`.
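    The modified recipe might look roughly like this (the structure follows
    Steven's description rather than the linked recipe itself, and the
    names are mine; Python 3):

    ```python
    def tail(f):
        # non-blocking tail: yields each new line, or "" when the file has
        # no complete new line right now; the generator never ends
        while True:
            line = f.readline()
            yield line   # "" means "nothing new yet", not "finished"
    ```

    The caller can then sleep, or go do other work, whenever it sees "".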



    --
    Steven
     
    Steven D'Aprano, Jul 16, 2011
    #18
  19. On Sat, Jul 16, 2011 at 1:42 PM, Steven D'Aprano
    <> wrote:
    > Okay, we've come up with the solution of a new exception, PauseIteration,
    > that the iterator protocol will recognise. Now we have to:
    >
    > - write an implementation or patch adding that functionality;



    - and add it to our own personal builds of Python, thus bypassing the
    entire issue of getting it accepted into Python. Of course, this does
    mean that your brilliant code only works on your particular build of
    Python, but I'd say that this is the first step - before writing up
    the PEP, run it yourself and see whether you even like the way it
    feels.

    THEN, once you've convinced yourself, start convincing others (ie PEP).

    ChrisA
     
    Chris Angelico, Jul 16, 2011
    #19
  20. On 16Jul2011 13:42, Steven D'Aprano <> wrote:
    | Billy Mays wrote:
    | > I was thinking that a convenient solution to this problem would be to
    | > introduce a new Exception called PauseIteration, which would signal to the
    | > caller that there is no more data for now, but not to close down the
    | > generator entirely.
    |
    | It never fails to amuse me how often people consider it "convenient" to add
    | new built-in functionality to Python to solve every little issue. As
    | pie-in-the-sky wishful-thinking, it can be fun, but people often mean it to
    | be taken seriously.
    |
    | Okay, we've come up with the solution of a new exception, PauseIteration,
    | that the iterator protocol will recognise.

    One might suggest that Billy could wrap his generator in a Queue(1) and
    use the .empty() test, and/or raise his own PauseIteration from the
    wrapper.
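    A rough sketch of that wrapper (all names are mine; a background thread
    does the blocking readline, and the consumer can poll q.empty(); an
    io.StringIO stands in for the real, growing file):

    ```python
    import io
    import queue
    import threading

    def reader(f, q):
        # the background thread blocks on the file so the consumer doesn't
        for line in iter(f.readline, ''):
            q.put(line)

    f = io.StringIO("a\nb\n")        # stand-in for the real, growing file
    q = queue.Queue(1)               # the Queue(1) Cameron mentions
    t = threading.Thread(target=reader, args=(f, q), daemon=True)
    t.start()

    print(q.get())    # 'a\n'
    print(q.get())    # 'b\n'
    t.join()          # finite input here; a real tailer would run forever
    print(q.empty())  # True -- the .empty() test: nothing more for now
    ```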
    --
    Cameron Simpson <> DoD#743
    http://www.cskk.ezoshosting.com/cs/

    No team manager will tell you this; but they all want to see you
    come walking back into the pits sometimes, carrying the steering wheel.
    - Mario Andretti
     
    Cameron Simpson, Jul 17, 2011
    #20