UTF16, BOM, and Windows Line endings

Discussion in 'Python' started by Fuzzyman, Feb 6, 2006.

  1. Fuzzyman

    Fuzzyman Guest

    Hello all,

    I'm handling some text files where I don't (necessarily) know the
    encoding beforehand. Because I use regular expressions to parse the
    text I *must* decode UTF16 encoded text (otherwise the regexes split on
    byte boundaries).

    I can recognise a UTF8 BOM and remove it (but not necessarily decode the text).
    For UTF16 it seems that the Python codec will automatically remove the
    BOM. Having detected it (to trigger a decode) is it considered
    *invalid* to remove it ? The codec certainly handles the text without a
    BOM - I just don't want this part of the code to break later.

    Because I don't know the encoding until I've checked for the BOM I have
    to read in binary mode. Similarly I have to write in binary mode.
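
    A minimal sketch of the BOM check described above, written in Python 3 syntax rather than the Python 2 of this thread. The function name `decode_with_bom`, the `_BOMS` table, and the `latin-1` fallback are illustrative assumptions, not anything from the original post. UTF-32 is deliberately ignored here, even though its little-endian BOM begins with the same two bytes as the UTF-16 one. Note that the byte-order-specific codecs (`utf-16-le`, `utf-16-be`) do not consume a BOM, so the sketch strips it by hand; the generic `utf-16` codec consumes it itself.

```python
import codecs

# BOMs checked in this order; the table itself is an illustrative choice.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, default="latin-1"):
    """Return (text, encoding); strip the BOM if one is present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Remove the BOM bytes ourselves, then decode the remainder.
            return data[len(bom):].decode(encoding), encoding
    # No BOM found: fall back to a caller-supplied guess (an assumption).
    return data.decode(default), None
```

    The data still has to be read in binary mode first, for example `decode_with_bom(open(path, "rb").read())`.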

    How should I handle line-endings for UTF16 ? Is it possible that other
    programs (on windows) will have line endings as u'\r\n' ? When saving
    files for that platform should I make the line endings u'\r\n' ? (This
    sequence obviously encodes to four bytes in UTF16). I would only do
    this to ensure compatibility with other programs the user may use to
    create the text files.

    All the best,

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml
    Fuzzyman, Feb 6, 2006
    #1

  2. Neil Hodgson

    Neil Hodgson Guest

    Fuzzyman:

    > How should I handle line-endings for UTF16 ? Is it possible that other
    > programs (on windows) will have line endings as u'\r\n' ?


    Yes, try Notepad and save as Unicode. For the text

    Fuzzy
    End of lines

    >>> contents = open("C:\\fuzzy.txt", "rb").read()
    >>> contents
    '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
    >>>


    The '\r\x00\n\x00' is a u'\r\n'.
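
    To make that concrete, here is a small sketch (Python 3 syntax, so the bytes literal carries a `b` prefix) that decodes the same bytes Notepad wrote. The variable names are illustrative only.

```python
# The bytes shown in the session above: BOM, "Fuzzy", CR LF, "End of lines".
contents = (b'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00'
            b'E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00')

# The generic "utf-16" codec reads the BOM to pick the byte order and drops it.
text = contents.decode("utf-16")

# Once decoded, the line end is an ordinary two-character "\r\n".
lines = text.splitlines()
```

    Splitting the undecoded bytes on `b'\r\n'` would find nothing, because in UTF-16 a zero byte sits between the `\r` and the `\n`.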

    > When saving
    > files for that platform should I make the line endings u'\r\n' ? (This
    > sequence obviously encodes to four bytes in UTF16). I would only do
    > this to ensure compatibility with other programs the user may use to
    > create the text files.


    Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
    applications are OK with other line ends, but '\r\n' and u'\r\n' are
    safest on Windows.
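
    A hedged sketch of saving for Windows along those lines, in Python 3 syntax. The helper name `encode_for_windows` is hypothetical, and the choice of little-endian UTF-16 with a BOM simply mirrors the Notepad output shown earlier; it is not the only valid choice.

```python
import codecs


def encode_for_windows(text):
    """Normalise line ends to CRLF, then encode as UTF-16 LE with a BOM."""
    # Collapse every existing line-end style to "\n" first, so that an
    # existing "\r\n" does not turn into "\r\r\n".
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf = unified.replace("\n", "\r\n")
    # "utf-16-le" writes no BOM of its own, so prepend one explicitly.
    return codecs.BOM_UTF16_LE + crlf.encode("utf-16-le")
```

    The result is bytes, so it should be written to a file opened in binary mode (`"wb"`), which also stops the platform from translating the line ends a second time.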

    Neil
    Neil Hodgson, Feb 6, 2006
    #2

  3. Fuzzyman

    Fuzzyman Guest

    Neil Hodgson wrote:
    > Fuzzyman:
    >
    > > How should I handle line-endings for UTF16 ? Is it possible that other
    > > programs (on windows) will have line endings as u'\r\n' ?

    >
    > Yes, try Notepad and save as Unicode. For the text
    >
    > Fuzzy
    > End of lines
    >
    > >>> contents = open("C:\\fuzzy.txt", "rb").read()
    > >>> contents
    > '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
    > >>>

    >
    > The '\r\x00\n\x00' is a u'\r\n'.
    >
    > > When saving
    > > files for that platform should I make the line endings u'\r\n' ? (This
    > > sequence obviously encodes to four bytes in UTF16). I would only do
    > > this to ensure compatibility with other programs the user may use to
    > > create the text files.

    >
    > Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
    > applications are OK with other line ends, but '\r\n' and u'\r\n' are
    > safest on Windows.
    >


    Thanks - so I need to decode to unicode and *then* split on line
    endings. Problem is, that means I can't use Python to handle line
    endings where I don't know the encoding in advance.

    In another thread I've posted a small function that *guesses* line
    endings in use.

    All the best,


    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml

    > Neil
    Fuzzyman, Feb 6, 2006
    #3
  4. Neil Hodgson

    Neil Hodgson Guest

    Fuzzyman:

    > Thanks - so I need to decode to unicode and *then* split on line
    > endings. Problem is, that means I can't use Python to handle line
    > endings where I don't know the encoding in advance.
    >
    > In another thread I've posted a small function that *guesses* line
    > endings in use.


    You can normalise line endings:

    >>> x = "a\r\nb\rc\nd\n\re"
    >>> y = x.replace("\r\n", "\n").replace("\r", "\n")
    >>> y
    'a\nb\nc\nd\n\ne'
    >>> print y
    a
    b
    c
    d

    e

    The empty line is because "\n\r" is 2 line ends.

    Neil
    Neil Hodgson, Feb 7, 2006
    #4
  5. Fuzzyman

    Fuzzyman Guest

    Neil Hodgson wrote:
    > Fuzzyman:
    >
    > > Thanks - so I need to decode to unicode and *then* split on line
    > > endings. Problem is, that means I can't use Python to handle line
    > > endings where I don't know the encoding in advance.
    > >
    > > In another thread I've posted a small function that *guesses* line
    > > endings in use.

    >
    > You can normalise line endings:
    >
    > >>> x = "a\r\nb\rc\nd\n\re"
    > >>> y = x.replace("\r\n", "\n").replace("\r", "\n")
    > >>> y
    > 'a\nb\nc\nd\n\ne'
    > >>> print y
    > a
    > b
    > c
    > d
    >
    > e
    >
    > The empty line is because "\n\r" is 2 line ends.
    >


    Thanks - that works, but it replaces *all* instances of '\r' with '\n',
    even if they aren't used as line terminators. (Unlikely perhaps). It
    also doesn't tell me what line ending was used.
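
    One way to find out which line ending was used is to count each style in the decoded text before normalising anything. The sketch below (Python 3 syntax) is only an illustration; `count_line_endings` is a hypothetical helper, not the guessing function mentioned earlier in the thread, and it cannot tell whether a lone '\r' was really meant as a terminator.

```python
import re


def count_line_endings(text):
    """Count each line-end style in already-decoded text."""
    counts = {"\r\n": 0, "\r": 0, "\n": 0}
    # Alternation order matters: "\r\n" must be tried before a lone "\r".
    for match in re.finditer(r"\r\n|\r|\n", text):
        counts[match.group()] += 1
    return counts
```

    Picking the style with the highest count gives a reasonable guess for what to write back out, under the assumption that the file uses one style predominantly.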

    Apparently files opened in universal mode - 'rU' - have a newlines
    attribute. That makes it a bit easier. :)
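
    For reference, a sketch of the same idea in Python 3, where universal newlines are the default for text-mode `open()` and the attribute is spelled `newlines`. It assumes the encoding has already been determined (for instance from the BOM), which is exactly the part that is unknown in advance here; the temporary file and its contents are made up for the demonstration.

```python
import os
import tempfile

# Write a small UTF-16 file with CRLF line ends to a temporary path.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write("Fuzzy\r\nEnd of lines\r\n".encode("utf-16"))

# newline=None (the default) turns on universal newlines in text mode.
with open(path, encoding="utf-16", newline=None) as f:
    text = f.read()      # line ends arrive translated to "\n"
    seen = f.newlines    # the line-end style(s) met while reading

os.remove(path)
```

    `newlines` is a single string when only one style was seen and a tuple when the file mixes styles, so the caller has to handle both cases.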

    Fuzzyman
    http://www.voidspace.org.uk/python/index.shtml


    > Neil
    Fuzzyman, Feb 7, 2006
    #5
