UTF16, BOM, and Windows Line endings

Fuzzyman · Feb 6, 2006

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise a UTF8 BOM and remove it (but not necessarily decode the text).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode) is it considered
*invalid* to remove it ? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.
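
A minimal sketch of the BOM check described above, written for current Python 3 (where decoded text is `str` rather than `unicode`). The name `decode_with_bom` and the `latin-1` fallback are assumptions for illustration, not anything from this thread. Stripping the BOM yourself is fine as long as you then decode with an endian-specific codec such as `utf-16-le`; the plain `utf-16` codec consumes the BOM on its own.

```python
import codecs

# (BOM bytes, codec name) pairs. Only UTF-8 and UTF-16 are handled here;
# a UTF-32 LE file also starts with b'\xff\xfe' and would be misread.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom(data, default="latin-1"):
    """Return (text, encoding) for raw bytes read in binary mode.

    The encoding is None when no BOM was found and the (assumed)
    default encoding was used instead.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            # Drop the BOM, then decode with the matching codec.
            return data[len(bom):].decode(encoding), encoding
    return data.decode(default), None


text, encoding = decode_with_bom(b"\xff\xfeF\x00u\x00z\x00z\x00y\x00")
```

Returning the detected encoding alongside the text lets the caller write the file back out in the same encoding later.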

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Neil Hodgson · Feb 6, 2006

Fuzzyman:

> How should I handle line-endings for UTF16 ? Is it possible that other
> programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
The '\r\x00\n\x00' is a u'\r\n'.

> When saving
> files for that platform should I make the line endings u'\r\n' ? (This
> sequence obviously encodes to four bytes in UTF16). I would only do
> this to ensure compatibility with other programs the user may use to
> create the text files.

Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends, but '\r\n' and u'\r\n' are
safest on Windows.
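
As a rough illustration of writing a file that Notepad should accept, here is a sketch in current Python 3. The helper name `encode_for_windows` is made up for this example, and the choice of little-endian UTF-16 with a BOM simply mirrors the Notepad output shown above. The result is meant to be written to a file opened in binary mode.

```python
import codecs


def encode_for_windows(text):
    """Encode text as UTF-16 LE with a BOM and '\\r\\n' line ends."""
    # Normalise to '\n' first so existing '\r\n' pairs are not doubled.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    crlf = normalised.replace("\n", "\r\n")
    # Each of '\r' and '\n' becomes two bytes, so a line end is four bytes.
    return codecs.BOM_UTF16_LE + crlf.encode("utf-16-le")


data = encode_for_windows("Fuzzy\nEnd of lines")
```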

Neil

Fuzzyman · Feb 6, 2006

Neil said:

> Yes, try Notepad and save as Unicode. For the text
>
> Fuzzy
> End of lines
>
> '\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00 \x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
>
> The '\r\x00\n\x00' is a u'\r\n'.
>
> Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
> applications are OK with other line ends, but '\r\n' and u'\r\n' are
> safest on Windows.

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.
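
A small sketch of why the decode has to come first, using current Python 3 and a shortened version of the Notepad bytes above. Splitting the raw bytes on `b"\n"` cuts a two-byte UTF-16 code unit in half, while splitting the decoded text does not. Note that `str.splitlines()` also splits on a few other separators (form feed, U+2028 and so on), which may or may not be what you want.

```python
raw = b"\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00"

# Splitting the undecoded bytes separates '\n' from its trailing '\x00',
# so the second piece starts with a stray zero byte.
broken = raw.split(b"\n")

# Decoding first (the 'utf-16' codec consumes the BOM) gives clean lines.
text = raw.decode("utf-16")
lines = text.splitlines()
```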

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Neil Hodgson · Feb 7, 2006

Fuzzyman:

> Thanks - so I need to decode to unicode and *then* split on line
> endings. Problem is, that means I can't use Python to handle line
> endings where I don't know the encoding in advance.
>
> In another thread I've posted a small function that *guesses* line
> endings in use.

You can normalise line endings:
a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.
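
The code that produced the output above is missing from this copy of the thread. One way to get the same result, offered only as a sketch with an input string chosen for illustration, is a pair of replacements that collapse '\r\n' first and then any remaining '\r':

```python
# Sample input mixing '\r\n', '\r', '\n' and the reversed pair '\n\r'.
sample = "a\r\nb\rc\nd\n\re"

# Handle '\r\n' before lone '\r', otherwise each pair becomes two '\n'.
normalised = sample.replace("\r\n", "\n").replace("\r", "\n")

print(normalised)
```

The reversed pair '\n\r' is not matched by the first replacement, so it turns into two '\n' characters and shows up as the blank line before "e".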

Neil

Fuzzyman · Feb 7, 2006

Neil said:

> You can normalise line endings:
>
> a
> b
> c
> d
>
> e
>
> The empty line is because "\n\r" is 2 line ends.

Thanks - that works, but it replaces *all* instances of '\r' with '\n',
even if they aren't used as line terminators (unlikely, perhaps). It
also doesn't tell me which line ending was used.

Apparently files opened in universal newlines mode - 'rU' - have a
newlines attribute. That makes it a bit easier.
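
For readers on current Python 3, where the 'rU' mode is no longer accepted, a rough equivalent is to open the file in text mode with `newline=None` and check the file object's `newlines` attribute after reading. The function name `detect_newlines` and the temporary file are assumptions made for this sketch, and the encoding still has to be known (for example from a BOM check) before opening.

```python
import os
import tempfile


def detect_newlines(path, encoding):
    """Return the line ending(s) the file object saw while reading."""
    # newline=None enables universal newlines translation on input.
    with open(path, "r", encoding=encoding, newline=None) as f:
        f.read()  # the attribute only covers what has been read so far
        return f.newlines


# Demonstration with a throwaway UTF-16 file using Windows line ends.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write("Fuzzy\r\nEnd of lines\r\n".encode("utf-16"))
    found = detect_newlines(path, "utf-16")
finally:
    os.remove(path)
```

The attribute is a single string when one kind of line ending was seen, a tuple when several kinds were seen, and None when no line ending was read.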

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
