UTF16, BOM, and Windows Line endings

F

Fuzzyman

Hello all,

I'm handling some text files where I don't (necessarily) know the
encoding beforehand. Because I use regular expressions to parse the
text I *must* decode UTF16 encoded text (otherwise the regexes split on
byte boundaries).

I can recognise UTF8 and BOM and remove (but not necessarily decode).
For UTF16 it seems that the Python codec will automatically remove the
BOM. Having detected it (to trigger a decode) is it considered
*invalid* to remove it ? The codec certainly handles the text without a
BOM - I just don't want this part of the code to break later.

Because I don't know the encoding until I've checked for the BOM I have
to read in binary mode. Similarly I have to write in binary mode.

How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ? When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 
N

Neil Hodgson

Fuzzyman:
How should I handle line-endings for UTF16 ? Is it possible that other
programs (on windows) will have line endings as u'\r\n' ?

Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines
'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00
\x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'
The '\r\x00\n\x00' is a u'\r\n'.
> When saving
files for that platform should I make the line endings u'\r\n' ? (This
sequence obviously encodes to four bytes in UTF16). I would only do
this to ensure compatibility with other programs the user may use to
create the text files.

Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends by '\r\n' and u'\r\n' are
safest on Windows.

Neil
 
F

Fuzzyman

Neil said:
Fuzzyman:


Yes, try Notepad and save as Unicode. For the text

Fuzzy
End of lines

'\xff\xfeF\x00u\x00z\x00z\x00y\x00\r\x00\n\x00E\x00n\x00d\x00
\x00o\x00f\x00 \x00l\x00i\x00n\x00e\x00s\x00'

The '\r\x00\n\x00' is a u'\r\n'.


Notepad will read u'\r\n'. It doesn't like '\n' or u'\n'. Some
applications are OK with other line ends by '\r\n' and u'\r\n' are
safest on Windows.

Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.

All the best,


Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 
N

Neil Hodgson

Fuzzyman:
Thanks - so I need to decode to unicode and *then* split on line
endings. Problem is, that means I can't use Python to handle line
endings where I don't know the encoding in advance.

In another thread I've posted a small function that *guesses* line
endings in use.

You can normalise line endings:
a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.

Neil
 
F

Fuzzyman

Neil said:
Fuzzyman:


You can normalise line endings:

a
b
c
d

e

The empty line is because "\n\r" is 2 line ends.

Thanks - that works, but replaces *all* instances of '\r' to '\n' -
even if they aren't used as line terminators. (Unlikely perhaps). It
also doesn't tell me what line ending was used.

Apparently files opened in universal mode - 'rU' - have a newline
attribute. That makes it a bit easier. :)

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,010
Latest member
MerrillEic

Latest Threads

Top