Re: UTF-16-LE and split() under MS-Windows XP

Discussion in 'Python' started by Martin v. Löwis, Jul 9, 2003.

  1. "Colin S. Miller" <> writes:

    > Where have I gone wrong, and what is the correct method
    > to verify the BOM mark?


    readline is not supported in the UTF-16 codec. You have to read the
    entire file, and perform .split. Looking at the BOM should not be
    necessary, as the UTF-16 codec will do so on its own.
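
A minimal sketch of the read-everything-then-split approach described above, written for a current Python 3 interpreter (an assumption; the thread predates it). The helper name `read_utf16_lines` and the temporary demo file are made up for illustration. The "utf-16" codec reads the BOM itself and removes it from the decoded text, so no manual BOM check appears here.

```python
import codecs
import os
import tempfile


def read_utf16_lines(path):
    """Read a whole UTF-16 file and return its lines without line endings.

    Hypothetical helper for illustration only.
    """
    # The "utf-16" codec inspects the BOM to pick the byte order and
    # strips it from the decoded text.
    with codecs.open(path, "r", encoding="utf-16") as handle:
        text = handle.read()
    # splitlines() handles "\r\n", "\n" and "\r" alike.
    return text.splitlines()


# Demo: write a small UTF-16-LE file with a BOM, then read it back.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(demo_path, "wb") as out:
    out.write(codecs.BOM_UTF16_LE + "first\r\nsecond\r\n".encode("utf-16-le"))

demo_lines = read_utf16_lines(demo_path)
os.remove(demo_path)
```

The whole file is held in memory at once, which is the trade-off raised later in the thread.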

    Regards,
    Martin
    Martin v. Löwis, Jul 9, 2003
    #1

  2. Martin v. Löwis wrote:
    > "Colin S. Miller" <> writes:
    >
    >
    >>Where have I gone wrong, and what is the correct method
    >>to verify the BOM mark?

    >
    >
    > readline is not supported in the UTF-16 codec. You have to read the
    > entire file, and perform .split. Looking at the BOM should not be
    > necessary, as the UTF-16 codec will do so on its own.

    Is there any reason why readline() isn't supported?
    AFAIK,
    the preferred Unicode standard line endings are
    0x2028 (Line Separator)
    0x2029 (Paragraph Separator)
    but 0x0A (line feed) and 0x0D (carriage return) are
    also supported for legacy compatibility.
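
A quick check of the line-ending characters mentioned here, assuming a current Python 3 interpreter. Line feed is U+000A and carriage return is U+000D (decimal 10 and 13), and `str.splitlines()` treats them, as well as U+2028 and U+2029, as line boundaries. The sample text is made up for illustration.

```python
# Code points of the legacy line endings.
line_feed = ord("\n")
carriage_return = ord("\r")

# One string that uses each separator once, plus a "\r\n" pair.
sample = "alpha\u2028beta\u2029gamma\ndelta\repsilon\r\nzeta"

# splitlines() breaks on every one of them and drops the separators.
parts = sample.splitlines()
```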


    I'm using
    file.read().splitlines() now, but am slightly worried
    about performance/memory when there are a few hundred lines.

    TIA,
    Colin S. Miller


    >
    > Regards,
    > Martin
    >
    Colin S. Miller, Jul 10, 2003
    #2

  3. "Colin S. Miller" <> writes:

    > Is there any reason why readline() isn't supported?


    Because it hasn't been implemented. The naive approach of calling the
    readline of the underlying stream (as all other codecs do) does not
    work for UTF-16.
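
To illustrate why a byte-level readline breaks on UTF-16, here is a small sketch assuming a current Python 3 interpreter, with `io.BytesIO` standing in for the underlying stream. A byte-oriented readline stops after the first 0x0A byte, which in UTF-16-LE is only half of the two-byte newline, and it also stops inside unrelated characters whose encoding happens to contain 0x0A.

```python
import io

# "ab\ncd\n" in UTF-16-LE: every character occupies two bytes.
data = "ab\ncd\n".encode("utf-16-le")
raw = io.BytesIO(data)

first = raw.readline()   # ends right after the 0x0A byte: odd length
second = raw.readline()  # starts with the stray 0x00 half of the newline

# The first chunk is cut in the middle of a code unit, so decoding fails.
try:
    first.decode("utf-16-le")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# U+010A encodes as bytes 0A 01 in UTF-16-LE, so a byte-level readline
# sees a "newline" where the text has none.
false_break = io.BytesIO("x\u010ay".encode("utf-16-le")).readline()
```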

    > AFAIK, the preferred Unicode standard line endings are 0x2028 (Line
    > Separator) 0x2029 (Paragraph Separator) but 0x0A (line feed) and
    > 0x0D (carriage return) are also supported for legacy compatibility.


    That adds to the problem: one should support all line-breaking
    characters for UTF-16, at least in Universal Newline (U) mode.

    > I'm using file.read().splitlines() now, but am slightly worried
    > about performance/memory when there are a few hundred lines.


    Feel free to implement and contribute a patch. It has been that way
    for some years now, and it likely will stay the same for the coming
    years unless somebody contributes a patch.
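
As a rough idea of what line-by-line reading could look like without loading the whole file, here is a sketch assuming a current Python 3 interpreter and its incremental decoders. It is not the patch discussed here and not how the standard library does it. The generator name `iter_utf16_lines` and the `chunk_size` parameter are hypothetical.

```python
import codecs
import io


def iter_utf16_lines(stream, chunk_size=4096):
    """Yield decoded lines (without line endings) from a binary UTF-16 stream.

    Hypothetical helper: decodes chunk by chunk, so only the current
    chunk and one unfinished line are kept in memory.
    """
    decoder = codecs.getincrementaldecoder("utf-16")()
    pending = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        # The incremental decoder buffers split code units and the BOM.
        pending += decoder.decode(chunk, final=at_eof)
        pieces = pending.splitlines(True)  # keep the line endings
        pending = ""
        if pieces and not at_eof:
            last = pieces[-1]
            unterminated = last.splitlines()[0] == last
            # Hold back an unfinished line, or one ending in "\r",
            # because a "\n" may still follow in the next chunk.
            if unterminated or last.endswith("\r"):
                pending = pieces.pop()
        for piece in pieces:
            yield piece.splitlines()[0]
        if at_eof:
            return


# Demo with a deliberately tiny, odd chunk size to exercise the buffering.
demo_bytes = codecs.BOM_UTF16_LE + "one\r\ntwo\u2028three\nfour".encode("utf-16-le")
demo_result = list(iter_utf16_lines(io.BytesIO(demo_bytes), chunk_size=3))
```

Because the line splitting happens on decoded text rather than on bytes, U+2028 and U+2029 are handled the same way as "\n" and "\r".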

    Regards,
    Martin
    Martin v. Löwis, Jul 10, 2003
    #3
