Python 3.2 bug? Reading the last line of a file

Discussion in 'Python' started by tkpmep@hotmail.com, May 25, 2011.

  1. Guest

    The following function, which returns the last line of a file, works
    perfectly well under Python 2.7.1 but fails reliably under Python 3.2.
    Is this a bug, or am I doing something wrong? Any help would be
    greatly appreciated.


    import os

    def lastLine(filename):
        '''
        Returns the last line of a file
        file.seek takes an optional 'whence' argument which allows you to
        start looking at the end, so you can just work back from there till
        you hit the first newline that has anything after it
        Works perfectly under Python 2.7, but not under 3.2!
        '''
        offset = -50
        with open(filename) as f:
            while offset > -1024:
                offset *= 2
                f.seek(offset, os.SEEK_END)
                lines = f.readlines()
                if len(lines) > 1:
                    return lines[-1]

    If I execute this with a valid filename fn, I get the following error
    message:

    >>> lastLine(fn)
    Traceback (most recent call last):
      File "<pyshell#12>", line 1, in <module>
        lastLine(fn)
      File "<pyshell#11>", line 13, in lastLine
        f.seek(offset, os.SEEK_END)
    io.UnsupportedOperation: can't do nonzero end-relative seeks

    Sincerely

    Thomas Philips
     
    , May 25, 2011
    #1

  2. MRAB Guest

    On 25/05/2011 20:33, wrote:
    > If I execute this with a valid filename fn, I get the following error
    > message:
    >
    > >>> lastLine(fn)
    > Traceback (most recent call last):
    >   File "<pyshell#12>", line 1, in <module>
    >     lastLine(fn)
    >   File "<pyshell#11>", line 13, in lastLine
    >     f.seek(offset, os.SEEK_END)
    > io.UnsupportedOperation: can't do nonzero end-relative seeks

    You're opening the file in text mode, and seeking relative to the end
    of the file is not allowed in text mode, presumably because the file
    contents have to be decoded, and, in general, seeking to an arbitrary
    position within a sequence of encoded bytes can have undefined results
    when you attempt to decode to Unicode starting from that position.

    The strange thing is that you _are_ allowed to seek relative to the
    start of the file.

    Try opening the file in binary mode and doing the decoding yourself,
    catching any UnicodeDecodeError exceptions if/when they occur.
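
    A minimal sketch of that suggestion, assuming the file uses an
    ASCII-compatible encoding such as UTF-8 (so that splitting on newline
    bytes is safe). The name last_line and the encoding and chunk
    parameters are illustrative and not from the original post.

```python
import os

def last_line(filename, encoding="utf-8", chunk=100):
    """Return the last line of a file as str ('' for an empty file)."""
    with open(filename, "rb") as f:      # binary mode allows arbitrary seeks
        size = f.seek(0, os.SEEK_END)    # seek() returns the new position, i.e. the size
        window = min(max(chunk, 1), size)
        while True:
            f.seek(size - window)        # start 'window' bytes before the end
            lines = f.readlines()
            # Two or more lines in the window means the last one is complete;
            # so does a window that already covers the whole file.
            if len(lines) > 1 or window == size:
                break
            window = min(window * 2, size)
    if not lines:
        return ""
    try:
        return lines[-1].decode(encoding)
    except UnicodeDecodeError:
        # Assumption: a lossy fallback is acceptable; re-raise instead if it is not.
        return lines[-1].decode(encoding, errors="replace")
```

    Unlike the original, this keeps widening the window until it covers the
    whole file, so a very long last line is still returned rather than None.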
     
    MRAB, May 25, 2011
    #2

  3. Ian Kelly Guest

    On Wed, May 25, 2011 at 2:00 PM, MRAB wrote:
    > You're opening the file in text mode, and seeking relative to the end
    > of the file is not allowed in text mode, presumably because the file
    > contents have to be decoded, and, in general, seeking to an arbitrary
    > position within a sequence of encoded bytes can have undefined results
    > when you attempt to decode to Unicode starting from that position.
    >
    > The strange thing is that you _are_ allowed to seek relative to the
    > start of the file.


    I think that with text files seek() is only really meant to be called
    with values returned from tell(), which may include the decoder state
    in its return value.
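
    A small sketch of that usage pattern, under the assumption that the
    only positions passed to seek() are ones previously returned by tell()
    on the same file. The helper name reread_rest and the sample text are
    made up for illustration.

```python
import os
import tempfile

def reread_rest(path, encoding="utf-8"):
    """Read one line, bookmark the position with tell(), then re-read from it."""
    with open(path, "r", encoding=encoding, newline="") as f:
        first = f.readline()
        bookmark = f.tell()      # opaque value; not necessarily a plain byte offset
        rest = f.read()
        f.seek(bookmark)         # fine in text mode: the value came from tell()
        return first, rest, f.read()

# Usage with a throwaway file containing non-ASCII characters.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write("h\u00e9llo\nw\u00f6rld\n")
first, rest, again = reread_rest(tmp.name)
os.remove(tmp.name)
```

    Here rest and again hold the same text, since the bookmark puts the
    reader back exactly where it was after the first line.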
     
    Ian Kelly, May 25, 2011
    #3
  4. MRAB Guest

    On 25/05/2011 21:54, Ian Kelly wrote:
    > On Wed, May 25, 2011 at 2:00 PM, MRAB wrote:
    >> You're opening the file in text mode, and seeking relative to the end
    >> of the file is not allowed in text mode, presumably because the file
    >> contents have to be decoded, and, in general, seeking to an arbitrary
    >> position within a sequence of encoded bytes can have undefined results
    >> when you attempt to decode to Unicode starting from that position.
    >>
    >> The strange thing is that you _are_ allowed to seek relative to the
    >> start of the file.

    >
    > I think that with text files seek() is only really meant to be called
    > with values returned from tell(), which may include the decoder state
    > in its return value.


    What do you mean by "may include the decoder state in its return value"?

    It does make sense that the values returned from tell() won't be in the
    middle of an encoded sequence of bytes.
     
    MRAB, May 25, 2011
    #4
  5. Guest

    Thanks for the guidance - it was indeed an issue with reading in
    binary vs. text, and I do now succeed in reading the last line,
    except that I now seem unable to split it, as I demonstrate below.
    Here's what I get when I read the last line in text mode using 2.7.1
    and in binary mode using 3.2 respectively under IDLE:

    2.7.1
    Name 31/12/2009 0 0 0

    3.2
    b'Name\t31/12/2009\t0\t0\t0\r\n'

    If, under 2.7.1, I read the file in text mode and write
    >>> x = lastLine(fn)

    I can then cleanly split the line to get its contents
    >>> x.split('\t')
    ['Name', '31/12/2009', '0', '0', '0\n']

    but under 3.2, with its binary read, I get
    >>> x.split('\t')
    Traceback (most recent call last):
      File "<pyshell#26>", line 1, in <module>
        x.split('\t')
    TypeError: Type str doesn't support the buffer API

    If I remove the '\t', the split now works and I get a list of bytes
    literals
    >>> x.split()
    [b'Name', b'31/12/2009', b'0', b'0', b'0']

    Looking through the docs did not clarify my understanding of the
    issue. Why can I not split on '\t' when reading in binary mode?

    Sincerely

    Thomas Philips
     
    , May 26, 2011
    #5
  6. MRAB Guest

    On 26/05/2011 00:25, wrote:
    > but under 3.2, with its binary read, I get
    > >>> x.split('\t')
    > Traceback (most recent call last):
    >   File "<pyshell#26>", line 1, in <module>
    >     x.split('\t')
    > TypeError: Type str doesn't support the buffer API
    >
    > If I remove the '\t', the split now works and I get a list of bytes
    > literals
    > >>> x.split()
    > [b'Name', b'31/12/2009', b'0', b'0', b'0']
    >
    > Looking through the docs did not clarify my understanding of the
    > issue. Why can I not split on '\t' when reading in binary mode?

    x.split('\t') tries to split on '\t', a string (str), but x is a
    bytestring (bytes).

    Do x.split(b'\t') instead.
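
    As a quick illustration using the line from the post above (the
    variable names are just for this sketch): a str separator raises
    TypeError on a bytes object, while a bytes separator works. The
    rstrip() call is an extra step, not something the thread asked for,
    to drop the trailing line ending.

```python
line = b'Name\t31/12/2009\t0\t0\t0\r\n'   # the bytes value shown in the post

# Mixing bytes and str is rejected in Python 3.
try:
    line.split('\t')
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

# Separator and subject are both bytes, so this works.
fields = line.rstrip(b'\r\n').split(b'\t')
```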
     
    MRAB, May 26, 2011
    #6
  7. Ethan Furman Guest

    wrote:
    > Thanks for the guidance - it was indeed an issue with reading in
    > binary vs. text, and I do now succeed in reading the last line,
    > except that I now seem unable to split it, as I demonstrate below.
    > Here's what I get when I read the last line in text mode using 2.7.1
    > and in binary mode using 3.2 respectively under IDLE:
    >
    > 3.2
    > b'Name\t31/12/2009\t0\t0\t0\r\n'
    >
    > under 3.2, with its binary read, I get
    > --> x.split('\t')
    > Traceback (most recent call last):
    >   File "<pyshell#26>", line 1, in <module>
    >     x.split('\t')
    > TypeError: Type str doesn't support the buffer API


    You are trying to split a bytes object with a str object -- the two are
    not compatible. Try splitting with the bytes object b'\t'.

    ~Ethan~
     
    Ethan Furman, May 26, 2011
    #7
  8. Ethan Furman Guest

    MRAB wrote:
    > x.split('\t') tries to split on '\t', a string (str), but x is a
    > bytestring (bytes).
    >
    > Do x.split(b'\t') instead.


    <nitpick>
    Instances of the bytes class are more appropriately called 'bytes
    objects' rather than 'bytestrings' as they behave like immutable
    sequences of integers. Accessing a single element of a bytes object
    does not return a bytes object, but rather the integer at that
    location; i.e.

    --> b'xyz'[1]
    121

    Contrast that with the str type where

    --> 'xyz'[1]
    'y'
    </nitpick>
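
    A short sketch of the same point, plus two ways to get a one-byte
    bytes object back when that is what you want (the variable names are
    only for illustration):

```python
data = b'xyz'

as_int = data[1]                   # indexing a bytes object yields an int
as_bytes = data[1:2]               # slicing yields a bytes object of length 1
rebuilt = bytes([as_int])          # wrap the int back into a one-byte bytes object
as_str = data.decode('ascii')[1]   # decode first to get str behaviour
```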

    ~Ethan~
     
    Ethan Furman, May 26, 2011
    #8
  9. Ian Kelly Guest

    On Wed, May 25, 2011 at 3:52 PM, MRAB wrote:
    > What do you mean by "may include the decoder state in its return value"?
    >
    > It does make sense that the values returned from tell() won't be in the
    > middle of an encoded sequence of bytes.


    If you take a look at the source code, tell() returns an integer that
    includes decoder state data in its upper bits. For example:

    >>> data = b' ' + '\u0302a'.encode('utf-16')
    >>> data
    b' \xff\xfe\x02\x03a\x00'
    >>> f = open('test.txt', 'wb')
    >>> f.write(data)
    7
    >>> f.close()
    >>> f = open('test.txt', 'r', encoding='utf-16')
    >>> f.read()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\python32\lib\codecs.py", line 300, in decode
        (result, consumed) = self._buffer_decode(data, self.errors, final)
      File "c:\python32\lib\encodings\utf_16.py", line 61, in _buffer_decode
        codecs.utf_16_ex_decode(input, errors, 0, final)
    UnicodeDecodeError: 'utf16' codec can't decode bytes in position 6-6: truncated data

    The problem of course is the initial space, throwing off the decoder.
    We can try to seek past it:

    >>> f.seek(1)
    1
    >>> f.read()
    '\ufeff\u0302a'

    But notice that since we're not reading from the beginning of the
    file, the BOM has now been interpreted as data. However:

    >>> f.seek(1 + (2 << 65))
    73786976294838206465
    >>> f.read()
    '\u0302a'

    And you can see that instead of reading from position
    73786976294838206465 it has read from position 1 starting in the "read
    a BOM" state. Note that I wouldn't recommend doing anything remotely
    like this in production code, not least because the value that I
    passed into seek() is platform-dependent. This is just a
    demonstration of how the seek() value can include decoder state.

    Cheers,
    Ian
     
    Ian Kelly, May 26, 2011
    #9
  10. Jussi Piitulainen Guest

    writes:

    > Looking through the docs did not clarify my understanding of the
    > issue. Why can I not split on '\t' when reading in binary mode?


    You can split on b'\t' to get a list of bytes objects, which you can
    then decode if you want them as strings.

    Alternatively, you can decode the bytes to get a string and then split
    on '\t' to get strings.

    >>> b'tic\ttac\ttoe'.split(b'\t')
    [b'tic', b'tac', b'toe']
    >>> b'tic\ttac\ttoe'.decode('utf-8').split('\t')
    ['tic', 'tac', 'toe']
     
    Jussi Piitulainen, May 26, 2011
    #10
  11. Guest

    This is exactly what I want to do - I can then pick up various
    elements of the list and turn them into floats, ints, etc. I have
    never used decode, and will look it up in the docs to better understand
    it. I can't thank everyone enough for the generous serving of help and
    guidance - I certainly would not have discovered all this on my own.
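
    For the record, a rough sketch of that last step, applied to the
    sample line from earlier in the thread. The column meanings (a name,
    a day/month/year date, then integer values), the UTF-8 encoding, and
    the helper name parse_last_line are all assumptions made for
    illustration; adjust them to the real file.

```python
from datetime import datetime

def parse_last_line(raw, encoding="utf-8"):
    """Decode a tab-separated bytes line and convert its fields."""
    text = raw.decode(encoding).rstrip("\r\n")    # bytes -> str, drop the line ending
    name, date_text, *numbers = text.split("\t")  # str separator on a str: fine
    # Assumption: the date column is day/month/year.
    date = datetime.strptime(date_text, "%d/%m/%Y").date()
    return name, date, [int(n) for n in numbers]

record = parse_last_line(b'Name\t31/12/2009\t0\t0\t0\r\n')
```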

    Sincerely


    Thomas Philips
     
    , May 27, 2011
    #11