file.readline() after a seek() breaking up lines

Discussion in 'Python' started by fd, Mar 5, 2004.

  1. fd

    fd Guest

    I am a newcomer to Python, and I hope someone can point out to me why
    my calls to file.readline() (after a seek) are returning mangled lines.
    Calling readline twice after each seek eliminates the problem. Is seek(),
    like next(), incompatible with readline()? If so, how should I be doing
    random access line reads?
    Thanks
    FD

    # Sample code for readline() problem

    # platform: windows xp
    # python version 2.3
    # The source file is just a list of words - one word per line,
    # saved as ANSI from notepad


    from string import rstrip
    from random import randrange

    words = file('C:\\swap\\english.txt', 'r')
    words.seek(-1,2)
    endAt = words.tell()
    startAt = 1

    for w in range(0, 50):
        # seek to a random byte offset, measured from the start of the file
        words.seek(randrange(startAt, endAt), 0)
        #words.readline() #uncomment this and lines are intact
        print words.readline()

    words.close()
     
    fd, Mar 5, 2004
    #1

  2. Jeff Epler

    Jeff Epler Guest

    When you open a file in text mode, the only offsets that are valid for
    'seek()' are ones returned by 'tell()' (or 0, presumably). In practice,
    you can seek to arbitrary offsets on most operating systems, though the
    results on Windows are confused by the fact that text files store '\n'
    as a two-byte sequence. This is what the library reference means when
    it says:

        If the file is opened in text mode (mode 't'), only offsets returned
        by tell() are legal. Use of other offsets causes undefined behavior.
        http://python.org/doc/lib/bltin-file-objects.html

    When you open a file in binary mode, all offsets less than the file
    length are valid, but in a text file most of them will be in the middle
    of a line. (They're byte offsets into a file you think of as being made
    of individual lines.)

    So, anyway, when you seek to a random offset, you are usually in the middle of a
    line, and the first readline() returns that partial line.

    You can do one of several things:
    * Read the file and gather all line offsets, then pick one of them
      (requires reading the whole file each time).
    * Read the file a line at a time and pick the word as you go. (If
      this is the n'th line, then 1/n of the time replace the "line to be
      printed" with this line. At the end of the file, print the line to be
      printed.)
    * Read the file once and write an index of offsets to a second file.
      Then pick a random offset from this index, seek to it, and read.
    * Pick a byte offset, and discard the first line read. You'll never
      use the very first line of the file, and longer lines are preferred
      over shorter lines (actually, lines *following* longer lines are
      preferred...).
    * Pick a byte offset and scan backwards until you get to the start of
      the file or the start of a line, then readline. Again, longer lines
      are preferred over shorter lines by this method.
    * Create a record-oriented format, so that you can seek to a multiple
      of the record length and read a word. All words must be shorter
      than reclen.
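
    As a rough illustration of the offset-gathering idea, here is a minimal
    sketch. It assumes Python 3 rather than the 2.3 used in the original
    post, and the function names build_line_index and random_line are made
    up for this example. The file is opened in binary mode so that every
    offset is a plain byte count and seek() is well defined.

```python
# Assumes Python 3; build_line_index and random_line are illustrative names.
import random

def build_line_index(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    with open(path, 'rb') as f:      # binary mode: offsets are byte counts
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)         # len() includes the line terminator
    return offsets

def random_line(path, offsets):
    """Seek to a randomly chosen line start and return that whole line."""
    with open(path, 'rb') as f:
        f.seek(random.choice(offsets))
        # Strip '\n' or '\r\n'; the result is bytes because of binary mode.
        return f.readline().rstrip(b'\r\n')
```

    Because every line start is equally likely to be chosen, this avoids
    the bias toward lines that follow long lines. The index could also be
    saved to disk so the word file is only scanned once.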

    The old unix "fortune" program used the second method. I'm sure there
    are other things you could do as well.
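
    The second method (replace the kept line with probability 1/n) might
    look something like the sketch below. It assumes Python 3, and the name
    pick_random_line is hypothetical. It is not the code of the "fortune"
    program, only the same selection rule.

```python
# Assumes Python 3; pick_random_line is an illustrative name.
import random

def pick_random_line(lines):
    """Choose one item uniformly at random from an iterable in one pass."""
    chosen = None
    for n, line in enumerate(lines, start=1):
        # For the n'th line, replace the kept line 1/n of the time.
        if random.randrange(n) == 0:
            chosen = line
    return chosen    # None if the iterable was empty
```

    Passing an open file object as lines reads the file once from start
    to end and never calls seek(), so it works the same in text mode.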

    Jeff
     
    Jeff Epler, Mar 5, 2004
    #2

  3. Mark Day

    Mark Day Guest

    fd wrote:

    > I am a newcomer to python, and I hope someone can point out to me why
    > my calls to file.readline() (after a seek) are returning mangled lines.
    > Calling readline twice after each seek eliminates the problem.


    Seek positions to an arbitrary byte offset (at least on most OSes).
    Chances are, you're seeking into the middle of a line. The first
    readline() returns the remainder of that line (which is what I assume
    you mean by a "mangled" line). Subsequent readlines will return whole
    lines since the previous readline left the current position just after
    the end of the previous line.
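
    A small self-contained demonstration of this behavior, assuming
    Python 3 and using an in-memory io.BytesIO object in place of a real
    file (the three words are made-up sample data):

```python
# Assumes Python 3; the sample words stand in for the real word list.
import io

data = io.BytesIO(b"apple\r\nbanana\r\ncherry\r\n")

data.seek(9)                # "banana" starts at offset 7, so 9 is mid-line
first = data.readline()     # remainder of the line we landed in
second = data.readline()    # the next complete line

print(first)    # b'nana\r\n'
print(second)   # b'cherry\r\n'
```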

    -Mark
     
    Mark Day, Mar 5, 2004
    #3
