file.readline() after a seek() breaking up lines

fd

I am a newcomer to Python, and I hope someone can point out to me why
my calls to file.readline() (after a seek) are returning mangled lines.
Calling readline() twice after each seek eliminates the problem. Is seek(),
like next(), incompatible with readline()? If so, how should I be doing
random-access line reads?
Thanks
FD

# Sample code for readline() problem

# platform: windows xp
# python version 2.3
# The source file is just a list of words - one word per line,
# saved as ANSI from notepad


from string import rstrip
from random import randrange

words = file('C:\\swap\\english.txt', 'r')
words.seek(-1,2)
endAt = words.tell()
startAt = 1

for w in range(0, 50):
    words.seek(randrange(startAt, endAt), 0)
    #words.readline() #uncomment this and lines are intact
    print words.readline()

words.close()
 
Jeff Epler

When you open a file in text mode, the only offsets that are valid for
'seek()' are ones returned by 'tell()' (or 0, presumably). In practice,
you can seek to arbitrary offsets on most operating systems, though the
results on Windows are confused by the fact that text files store '\n'
as the two-byte sequence '\r\n'. This is what the library reference means
when it says:
    If the file is opened in text mode (mode 't'), only offsets returned
    by tell() are legal. Use of other offsets causes undefined behavior.
    http://python.org/doc/lib/bltin-file-objects.html

When you open a file in binary mode, all offsets less than the file
length are valid, but in a text file most of them will be in the middle
of a line. (They're byte offsets into a file you think of as being made
of individual lines.)

So, anyway, when you seek to a random offset, you are usually in the middle of a
line, and the first readline() returns that partial line.
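
A small sketch of that effect, written for a current Python 3 rather than the 2.3 used in the question. The three-word list and the offset 8 are made up for illustration, and the file is opened in binary mode so that any byte offset is a valid seek target:

```python
import os
import tempfile

# Hypothetical word list, one word per line, with Windows line endings.
data = b"apple\r\nbanana\r\ncherry\r\n"

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(data)

with open(path, "rb") as words:
    words.seek(8)                 # byte 8 is inside "banana"
    partial = words.readline()    # only the rest of the current line
    whole = words.readline()      # the following line, intact

os.remove(path)
print(partial, whole)
```

The first readline() only returns what is left of the line the seek landed in; the second one starts at a real line boundary.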

You can do one of several things:
* Read the file and gather all line offsets, then pick one of them
(requires reading the whole file each time)
* Read the file a line at a time and pick the word as you go: if
this is the nth line, then 1/n of the time replace the "line to be
printed" with this line. At the end of the file, print the line to be
printed.
* Read the file once and write an index of offsets. Then, pick a random
offset from this index, seek to it, and read
* Pick a byte offset, and discard the first line read. You'll never
use the very first line of the file, and longer lines are preferred
over shorter lines (actually, lines *following* longer lines are
preferred...)
* Pick a byte offset and scan backwards until you get to the start of
the file or the start of a line, then readline. Again, longer lines
are preferred over shorter lines by this method.
* Create a record-oriented format, so that you can seek to a multiple
of the record length and read a word. All words must be shorter
than reclen.
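
As one possible sketch of the offset-index idea from the list above (Python 3, binary mode), the following gathers the starting offset of every line and then seeks straight to a randomly chosen one. The helper names build_line_index and read_line_at, and the sample data, are hypothetical and only for illustration:

```python
import os
import random
import tempfile

def build_line_index(path):
    """Return a list with the byte offset at which each line starts."""
    offsets = []
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # offset before reading the line
            line = f.readline()
            if not line:            # empty bytes means end of file
                break
            offsets.append(pos)
    return offsets

def read_line_at(path, offset):
    """Seek to a known line start and return that line without its newline."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.readline().rstrip(b"\r\n")

# Made-up sample file standing in for the real word list.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"apple\r\nbanana\r\ncherry\r\n")

offsets = build_line_index(sample_path)
word = read_line_at(sample_path, random.choice(offsets))
second = read_line_at(sample_path, offsets[1])
os.remove(sample_path)
print(offsets, word)
```

Because every offset in the index is a real line start, each line is equally likely and none comes back truncated. The index could also be saved to its own file so the word list is only scanned once.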

The old Unix "fortune" program used the second method. I'm sure there
are other things you could do as well.
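
The second method (replace the kept line with probability 1/n) might look roughly like this in Python 3. The function name pick_random_line and its rng parameter are made up for this sketch:

```python
import random

def pick_random_line(lines, rng=random):
    """Return one item chosen uniformly from an iterable, in a single pass.

    The n-th item replaces the current choice with probability 1/n,
    so no line count or index is needed up front.
    """
    chosen = None
    for n, line in enumerate(lines, start=1):
        if rng.randrange(n) == 0:   # true 1/n of the time
            chosen = line
    return chosen

sample = ["apple", "banana", "cherry"]
print(pick_random_line(sample))
```

A file object opened for reading can be passed directly as lines, since iterating over it yields one line at a time. The cost is that the whole file is read for every pick.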

Jeff
 
Mark Day

fd said:
I am a newcomer to python, and I hope someone can point out to me why
my calls to file.readline() (after a seek) are returning mangled lines.
Calling readline() twice after each seek eliminates the problem.

seek() positions the file at an arbitrary byte offset (at least on most
OSes). Chances are, you're seeking into the middle of a line. The first
readline() returns the remainder of that line (which is what I assume
you mean by a "mangled" line). Subsequent readline() calls will return
whole lines, since the previous readline() left the current position just
after the end of the previous line.
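
That is also why calling readline() twice works. A rough Python 3 sketch of the same trick follows; the function name random_line_after_seek is hypothetical, and the wrap-around to the first line at end of file is an addition not present in the original code:

```python
import io
import random

def random_line_after_seek(f, size, rng=random):
    """Seek to a random byte, throw away the partial line, return the next one.

    f must be opened in binary mode; size is the file length in bytes.
    """
    f.seek(rng.randrange(size))
    f.readline()                  # discard the remainder of the current line
    line = f.readline()           # this one starts on a line boundary
    if not line:                  # landed in the last line: wrap to the top
        f.seek(0)
        line = f.readline()
    return line.rstrip(b"\r\n")

# In-memory stand-in for the word list.
data = b"apple\r\nbanana\r\ncherry\r\n"
print(random_line_after_seek(io.BytesIO(data), len(data)))
```

The choice is not uniform: a line is picked whenever the seek lands anywhere in the line before it, so words that follow long lines come up more often.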

-Mark
 
