Possible File iteration bug


Billy Mays

I noticed that if a file is being continuously written to, the file
generator does not notice it:



def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

with open('/var/log/syslog', 'rb') as f:
    lines = getLines(f)
    # do some processing with lines
    # /var/log/syslog gets updated in the mean time

    # always returns an empty list, even though f has more data
    lines = getLines(f)




I found a workaround by adding f.seek(0,1) directly before the last
getLines() call, but is this the expected behavior? Calling f.tell()
right after the first getLines() call shows that it isn't reset back to
0. Is this correct or a bug?
 

Ian Kelly

def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

with open('/var/log/syslog', 'rb') as f:
    lines = getLines(f)
    # do some processing with lines
    # /var/log/syslog gets updated in the mean time

    # always returns an empty list, even though f has more data
    lines = getLines(f)




I found a workaround by adding f.seek(0,1) directly before the last
getLines() call, but is this the expected behavior?  Calling f.tell() right
after the first getLines() call shows that it isn't reset back to 0.  Is
this correct or a bug?

This is expected. Part of the iterator protocol is that once an
iterator raises StopIteration, it should continue to raise
StopIteration on subsequent next() calls.
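A tiny illustration (Python 3 syntax) of that part of the protocol, using a plain list iterator instead of a file:

```python
it = iter([1, 2])
first, second = next(it), next(it)

# Once exhausted, the iterator raises StopIteration on every
# subsequent call -- it never "resets":
exhausted = []
for _ in range(2):
    try:
        next(it)
    except StopIteration:
        exhausted.append(True)
# first == 1, second == 2, exhausted == [True, True]
```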
 

Billy Mays

def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

with open('/var/log/syslog', 'rb') as f:
    lines = getLines(f)
    # do some processing with lines
    # /var/log/syslog gets updated in the mean time

    # always returns an empty list, even though f has more data
    lines = getLines(f)




I found a workaround by adding f.seek(0,1) directly before the last
getLines() call, but is this the expected behavior? Calling f.tell() right
after the first getLines() call shows that it isn't reset back to 0. Is
this correct or a bug?

This is expected. Part of the iterator protocol is that once an
iterator raises StopIteration, it should continue to raise
StopIteration on subsequent next() calls.

Is there any way to just create a new generator that clears its `closed`
status?
 

Hrvoje Niksic

Billy Mays said:
Is there any way to just create a new generator that clears its
`closed` status?

You can define getLines in terms of the readline file method, which does
return new data when it is available.

def getLines(f):
    lines = []
    while True:
        line = f.readline()
        if line == '':
            break
        lines.append(line)
    return lines

or, more succinctly:

def getLines(f):
    return list(iter(f.readline, ''))
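A quick illustration of how the two-argument iter() form behaves (Python 3 syntax; io.StringIO is a stand-in for the real file):

```python
import io

# iter(callable, sentinel) calls the callable repeatedly and stops as
# soon as it returns the sentinel -- here the '' that readline()
# returns at end-of-file.
f = io.StringIO("alpha\nbeta\n")
lines = list(iter(f.readline, ''))
print(lines)  # ['alpha\n', 'beta\n']
```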
 

Terry Reedy

I noticed that if a file is being continuously written to, the file
generator does not notice it:

Because it does not look, as Ian explained.
def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

This nearly duplicates .readlines, except for using f as an iterator.
Try the following (untested):

with open('/var/log/syslog', 'rb') as f:
    lines = f.readlines()
    # do some processing with lines
    # /var/log/syslog gets updated in the mean time
    lines = f.readlines()

People regularly do things like this with readline, so it is possible.
If the above does not work, try (untested):

def getlines(f):
    lines = []
    while True:
        l = f.readline()
        if l: lines.append(l)
        else: return lines
 

bruno.desthuilliers

I noticed that if a file is being continuously written to, the file
generator does not notice it:

def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

what's wrong with file.readlines() ?
 

Billy Mays

I noticed that if a file is being continuously written to, the file
generator does not notice it:

def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

what's wrong with file.readlines() ?

Using that will read the entire file into memory which may not be
possible. In the library reference, it mentions that using the
generator (which calls file.next()) uses a read ahead buffer to
efficiently loop over the file. If I call .readline() myself, I forfeit
that performance gain.

I was thinking that a convenient solution to this problem would be to
introduce a new exception called PauseIteration, which would signal to the
caller that there is no more data for now, but not to close down the
generator entirely.
 

Thomas Rachel

On 14.07.2011 21:46, Billy Mays wrote:
I noticed that if a file is being continuously written to, the file
generator does not notice it:

Yes. That's why there were alternative suggestions in your last thread
"How to write a file generator".

To repeat mine: an object which is not an iterator, but an iterable.

class Follower(object):
    def __init__(self, file):
        self.file = file
    def __iter__(self):
        while True:
            l = self.file.readline()
            if not l: return
            yield l

if __name__ == '__main__':
    import time
    f = Follower(open("/var/log/messages"))
    while True:
        for i in f: print i,
        print "all read, waiting..."
        time.sleep(4)

Here, you iterate over the object until it is exhausted, but you can
iterate again to get the next entries.

The difference to the file as iterator is, as you have noticed, that
once an iterator is exhausted, it will be so forever.

But if you have an iterable, like the Follower above, you can reuse it
as you want.
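A quick demonstration of that iterable-vs-iterator difference (Python 3 syntax; io.StringIO stands in for the log file, and the seek dance simulates another process appending a line):

```python
import io

class Follower:
    """An iterable, not an iterator: every for-loop calls __iter__
    again and gets a fresh generator, so exhaustion is not permanent."""
    def __init__(self, file):
        self.file = file
    def __iter__(self):
        while True:
            line = self.file.readline()
            if not line:
                return
            yield line

f = io.StringIO("first\n")
fol = Follower(f)
batch1 = list(fol)       # reads everything currently in the "file"

pos = f.tell()
f.write("second\n")      # simulate another process appending a line
f.seek(pos)              # restore our read position

batch2 = list(fol)       # iterating *again* picks up the new data
print(batch1, batch2)
```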
 

Billy Mays

On 14.07.2011 21:46, Billy Mays wrote:

Yes. That's why there were alternative suggestions in your last thread
"How to write a file generator".

To repeat mine: an object which is not an iterator, but an iterable.

class Follower(object):
    def __init__(self, file):
        self.file = file
    def __iter__(self):
        while True:
            l = self.file.readline()
            if not l: return
            yield l

if __name__ == '__main__':
    import time
    f = Follower(open("/var/log/messages"))
    while True:
        for i in f: print i,
        print "all read, waiting..."
        time.sleep(4)

Here, you iterate over the object until it is exhausted, but you can
iterate again to get the next entries.

The difference to the file as iterator is, as you have noticed, that
once an iterator is exhausted, it will be so forever.

But if you have an iterable, like the Follower above, you can reuse it
as you want.


I did see it, but it feels less pythonic than using a generator. I did
end up using an extra class to get more data from the file, but it seems
like overhead. Also, in the python docs, file.next() mentions there
being a performance gain for using the file generator (iterator?) over
the readline function.

Really what would be useful is some sort of PauseIteration exception
which doesn't close the generator when raised, but indicates to the
loop header that there is no more data for now.
 

Chris Angelico

Really what would be useful is some sort of PauseIteration exception which
doesn't close the generator when raised, but indicates to the loop header
that there is no more data for now.

All you need is a sentinel yielded value (eg None).
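A minimal sketch (Python 3 syntax; the follow name is mine) of the sentinel approach:

```python
import io

def follow(f):
    """Yield each new line; yield None (the sentinel) when no data is
    available yet, instead of ending the generator."""
    while True:
        line = f.readline()
        yield line if line else None

gen = follow(io.StringIO("one\ntwo\n"))
a, b, c = next(gen), next(gen), next(gen)
# a == 'one\n', b == 'two\n', c is None -- the generator stays open
# and will yield real lines again if more data arrives.
```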

ChrisA
 

Thomas Rachel

On 15.07.2011 14:26, Billy Mays wrote:
I was thinking that a convenient solution to this problem would be to
introduce a new exception called PauseIteration, which would signal to the
caller that there is no more data for now, but not to close down the
generator entirely.

Alas, an exception thrown causes the generator to stop.
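A short sketch (Python 3 syntax) of what that means in practice: once an exception propagates out of a generator, the generator is finished for good:

```python
def gen():
    yield 1
    raise ValueError("no more for now")
    yield 2   # never reached

g = gen()
first = next(g)

try:
    next(g)           # the exception propagates out of the generator...
    raised = False
except ValueError:
    raised = True

try:
    next(g)           # ...and the generator is now closed permanently
    closed = False
except StopIteration:
    closed = True
# first == 1, raised is True, closed is True
```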


Thomas
 

Thomas Rachel

On 15.07.2011 14:52, Billy Mays wrote:
Also, in the python docs, file.next() mentions there
being a performance gain for using the file generator (iterator?) over
the readline function.

Here, the question is whether this performance gain is really relevant,
AKA "feelable". The readline() function seems to use an internal buffer
distinct from the read-ahead buffer used for iterating. Why these are
not the same buffer is unclear to me.

Really what would be useful is some sort of PauseIteration exception
which doesn't close the generator when raised, but indicates to the
loop header that there is no more data for now.

a None or other sentinel value would do this as well (as ChrisA already
said).


Thomas
 

Billy Mays

On 15.07.2011 14:52, Billy Mays wrote:


Here, the question is whether this performance gain is really relevant,
AKA "feelable". The readline() function seems to use an internal buffer
distinct from the read-ahead buffer used for iterating. Why these are
not the same buffer is unclear to me.



a None or other sentinel value would do this as well (as ChrisA already
said).


Thomas

A sentinel does provide a work around, but it also passes the problem
onto the caller rather than the callee:

def getLines(f):
    lines = []

    while True:
        yield f.readline()

def bar(f):
    for line in getLines(f):
        if not line: # I now have to check here instead of in getLines
            break
        foo(line)


def baz(f):
    for line in getLines(f) if line: # this would be nice for generators
        foo(line)


bar() is the correct way to do things, but I think baz looks cleaner. I
found myself writing baz() first, finding it wasn't syntactically
correct, and then converting it to bar(). The if portion of the loop
would be nice for generators, since it seems like the proper place for
the sentinel to be matched. Also, with potentially infinite (but
pauseable) data, there needs to be a nice way to catch stuff like this.
 

Thomas Rachel

On 15.07.2011 16:42, Billy Mays wrote:
A sentinel does provide a work around, but it also passes the problem
onto the caller rather than the callee:

That is right.


BTW, there is another, maybe easier way to do this:

for line in iter(f.readline, ''):
    do_stuff(line)

This provides an iterator which yields return values from the given
callable until '' is returned, in which case the iterator stops.

As caller, you need to have knowledge about the fact that you can always
continue.

The functionality which you ask for COULD be accomplished in two ways:

Firstly, one could simply break the "contract" of an iterator (which
would be a bad thing): just have your next() raise a StopIteration and
then continue nevertheless.

Secondly, one could do a similar thing and have the next() method raise
a different exception. Then the caller has to know about it as well, but
I cannot find a passage in the docs which prohibits this.

I have just tested this:

def r(x): return x
def y(x): raise x

def l(f, x): return lambda: f(x)

class I(object):
    def __init__(self):
        self.l = [l(r, 1), l(r, 2), l(y, Exception), l(r, 3)]
    def __iter__(self):
        return self
    def next(self):
        if not self.l: raise StopIteration
        c = self.l.pop(0)
        return c()

i = I()
try:
    for j in i: print j
except Exception, e: print "E:", e
print tuple(i)

and it works.


So I think it COULD be ok to do this:

class NotNow(Exception): pass

class F(object):
    def __init__(self, f):
        self.file = f
    def __iter__(self):
        return self
    def next(self):
        l = self.file.readline()
        if not l: raise NotNow
        return l

f = F(file("/var/log/messages"))
import time
while True:
    try:
        for i in f: print "", i,
    except NotNow, e:
        print "<pause>"
    time.sleep(1)


HTH,

Thomas
 

Ethan Furman

Billy said:
A sentinel does provide a work around, but it also passes the problem
onto the caller rather than the callee

The callee can easily take care of it -- just block until more is ready.
If blocking is not an option, then the caller has to deal with it no
matter how callee is implemented -- an exception, a sentinel, or some
signal that says "nope, nothing for ya! try back later!"
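A sketch (Python 3 syntax; the follow_blocking name and poll interval are mine) of the blocking approach Ethan describes, where the callee simply waits for more data:

```python
import io
import time

def follow_blocking(f, poll=0.5):
    """Yield lines forever; when none are available, sleep and retry
    (the callee blocks, so the caller never sees 'no data')."""
    while True:
        line = f.readline()
        if line:
            yield line
        else:
            time.sleep(poll)

# Request only as many lines as exist, so this demo never blocks:
g = follow_blocking(io.StringIO("a\nb\n"), poll=0.01)
x, y = next(g), next(g)
print(x, y)
```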

~Ethan~
 

Terry Reedy

I noticed that if a file is being continuously written to, the file
generator does not notice it:

def getLines(f):
    lines = []
    for line in f:
        lines.append(line)
    return lines

what's wrong with file.readlines() ?

Using that will read the entire file into memory which may not be
possible.

So will getLines.

In the library reference, it mentions that using the generator
(which calls file.next()) uses a read ahead buffer to efficiently loop
over the file. If I call .readline() myself, I forfeit that performance
gain.

Are you sure? Have you measured the difference?
 

Terry Reedy

A sentinel does provide a work around, but it also passes the problem
onto the caller rather than the callee:

No more so than a new exception that the caller has to recognize.
 

Steven D'Aprano

Billy said:
I was thinking that a convenient solution to this problem would be to
introduce a new exception called PauseIteration, which would signal to the
caller that there is no more data for now, but not to close down the
generator entirely.

It never fails to amuse me how often people consider it "convenient" to add
new built-in functionality to Python to solve every little issue. As
pie-in-the-sky wishful-thinking, it can be fun, but people often mean it to
be taken seriously.

Okay, we've come up with the solution of a new exception, PauseIteration,
that the iterator protocol will recognise. Now we have to:

- write a PEP for it, setting out the case for it;
- convince the majority of CPython developers that the idea is a good one,
which might mean writing a proof-of-concept version;
- avoid having the Jython, IronPython and PyPy developers come back and say
that it is impossible under their implementations;
- avoid having Guido veto it;
- write an implementation or patch adding that functionality;
- try to ensure it doesn't cause any regressions in the CPython tests;
- fix the regressions that do occur despite our best efforts;
- ensure that there are no backwards compatibility issues to be dealt with;
- write a test suite for it;
- write documentation for it;
- unless we're some of the most senior Python developers, have the patch
reviewed before it is accepted;
- fix the bugs that have come to light since the first version;
- make sure copyright is assigned to the Python Software Foundation;
- wait anything up to a couple of years for the latest version of Python,
including the patch, to be released as production-ready software;
- upgrade our own Python installation to use the latest version, if we can
and aren't forced to stick with an older version

and now, at long last, we can use this convenient feature in our own code!
Pretty convenient, yes?

(If you think I exaggerate, consider the "yield from" construct, which has
Guido's support and was pretty uncontroversial. Two and a half years later,
it is now on track to be added to Python 3.3.)

Or you can look at the various recipes on the Internet for writing tail-like
file viewers in Python, and solve the problem the boring old fashioned way.
Here's one that blocks while the file is unchanged:

http://lethain.com/tailing-in-python/

Modifying it to be non-blocking should be pretty straightforward -- just add
a `yield ""` after the `if not line`.
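A sketch of that modification (Python 3 syntax; the tail name is mine, and io.StringIO stands in for the watched file): yielding an empty string instead of sleeping makes the generator non-blocking:

```python
import io

def tail(f):
    """Non-blocking tail: yield each new line, or '' when there is
    nothing new yet -- the generator itself never finishes."""
    while True:
        line = f.readline()
        yield line if line else ""

g = tail(io.StringIO("x\n"))
got = next(g)     # the one existing line
empty = next(g)   # no data for now, but the generator stays alive
print(repr(got), repr(empty))
```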
 

Chris Angelico

Okay, we've come up with the solution of a new exception, PauseIteration,
that the iterator protocol will recognise. Now we have to:

- write an implementation or patch adding that functionality;


- and add it to our own personal builds of Python, thus bypassing the
entire issue of getting it accepted into Python. Of course, this does
mean that your brilliant code only works on your particular build of
Python, but I'd say that this is the first step - before writing up
the PEP, run it yourself and see whether you even like the way it
feels.

THEN, once you've convinced yourself, start convincing others (ie PEP).

ChrisA
 

Cameron Simpson

| Billy Mays wrote:
| > I was thinking that a convenient solution to this problem would be to
| > introduce a new exception called PauseIteration, which would signal to the
| > caller that there is no more data for now, but not to close down the
| > generator entirely.
|
| It never fails to amuse me how often people consider it "convenient" to add
| new built-in functionality to Python to solve every little issue. As
| pie-in-the-sky wishful-thinking, it can be fun, but people often mean it to
| be taken seriously.
|
| Okay, we've come up with the solution of a new exception, PauseIteration,
| that the iterator protocol will recognise.

One might suggest that Billy could wrap his generator in a Queue(1) and
use the .empty() test, and/or raise his own PauseIteration from the
wrapper.
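A rough sketch (Python 3 syntax, queue and threading modules; the reader name is mine) of that Queue-wrapper idea:

```python
import io
import queue
import threading
import time

def reader(f, q):
    # Background thread: feed lines into the bounded queue.
    for line in f:
        q.put(line)          # blocks while the queue is full

q = queue.Queue(1)
t = threading.Thread(target=reader, args=(io.StringIO("a\nb\n"), q))
t.start()

lines = []
while len(lines) < 2:
    if not q.empty():        # the non-blocking "is there data?" test
        lines.append(q.get())
    else:
        time.sleep(0.01)     # nothing yet; pause instead of raising
t.join()
print(lines)  # ['a\n', 'b\n']
```

Note that q.empty() is only a snapshot with multiple consumers, but with a single consumer thread, as here, it is safe to act on.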
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

No team manager will tell you this; but they all want to see you
come walking back into the pits sometimes, carrying the steering wheel.
- Mario Andretti
 
