Wrong default endianness in utf-16 and utf-32!?

jmfauth

I hope my understanding is correct and I'm not dreaming.

When no endianness is specified (the unmarked forms, as opposed to
the BE and LE forms), the Unicode Consortium specifies that the
default byte serialization should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

(+ technical papers)

It appears Python works the opposite way.

sys.version
2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
repr(u'abc'.encode('utf-16-le'))
'a\x00b\x00c\x00'
repr(u'abc'.encode('utf-16-be'))
'\x00a\x00b\x00c'
repr(u'abc'.encode('utf-16'))
'\xff\xfea\x00b\x00c\x00'
repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

The same happens with utf-32, and with both utf-16 and utf-32 in Python 3.1.2.
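
For anyone who wants to repeat the comparison for both codecs at once, here is a minimal sketch for Python 3. It is not part of the original session: the helper name bom_and_payload is made up for illustration, and no expected output is shown because the result can differ from one machine to another.

```python
def bom_and_payload(text, family):
    """Split the output of the unmarked codec into (BOM, payload).

    `family` is 'utf-16' or 'utf-32'; the BOM is 2 or 4 bytes long.
    Hypothetical helper, for illustration only.
    """
    bom_len = 2 if family == 'utf-16' else 4
    data = text.encode(family)
    return data[:bom_len], data[bom_len:]


for family in ('utf-16', 'utf-32'):
    bom, payload = bom_and_payload('abc', family)
    print(family, 'BOM:', bom)
    # Compare the bytes after the BOM with both explicit variants.
    print('  payload == -le:', payload == 'abc'.encode(family + '-le'))
    print('  payload == -be:', payload == 'abc'.encode(family + '-be'))
```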

I tried to find a precise discussion of this subject
and failed.

Any thoughts?
 
Antoine Pitrou

I hope my understanding is correct and I'm not dreaming.

When an endianess is not specified, (BE, LE, unmarked forms),
the Unicode Consortium specifies, the default byte serialization
should be big-endian.
[...]

It appears Python is just working in the opposite way.
[...]
repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

Python uses the host's endianness by default. So, on a little-endian
machine, utf-16 and utf-32 will use little-endian encoding.
While decoding, though, the BOM is read by both of these codecs, so
there should be no interoperability problems:
u'abc'


(do note, though, that the explicit utf*-be and utf*-le variants do not
add a BOM)
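
The three points above (host byte order on encoding, BOM read on decoding, no BOM from the explicit variants) can be checked with a short script. This is a sketch for Python 3, not the code from the original reply; it relies only on sys.byteorder and the BOM constants in the standard codecs module.

```python
import codecs
import sys

text = 'abc'
encoded = text.encode('utf-16')

# The unmarked codec writes a BOM in the host's byte order...
native_bom = codecs.BOM_UTF16_LE if sys.byteorder == 'little' else codecs.BOM_UTF16_BE
assert encoded.startswith(native_bom)

# ...followed by the text in that same byte order.
native_codec = 'utf-16-le' if sys.byteorder == 'little' else 'utf-16-be'
assert encoded[2:] == text.encode(native_codec)

# Decoding with the unmarked codec consumes the BOM, whichever order it signals.
assert encoded.decode('utf-16') == text
assert (codecs.BOM_UTF16_BE + text.encode('utf-16-be')).decode('utf-16') == text
assert (codecs.BOM_UTF16_LE + text.encode('utf-16-le')).decode('utf-16') == text

# The explicit variants write no BOM.
assert not text.encode('utf-16-le').startswith(codecs.BOM_UTF16_LE)
assert not text.encode('utf-16-be').startswith(codecs.BOM_UTF16_BE)
```

If every assertion passes silently, the script prints nothing, which is the expected outcome on both little-endian and big-endian hosts.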

Regards

Antoine.
 
John Machin

jmfauth said:
When an endianess is not specified, (BE, LE, unmarked forms),
the Unicode Consortium specifies, the default byte serialization
should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

Sometimes it is necessary to read right to the end of an answer:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: [snip] the unmarked form uses big-endian byte serialization by default, but
may include a byte order mark at the beginning to indicate the actual byte
serialization used.
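
The rule in that answer can be sketched as a small decoder for Python 3: a leading byte order mark decides the byte order, and big-endian is assumed only when no mark is present. The function name decode_utf16_unmarked is hypothetical and is not part of Python or of any Unicode specification; it only illustrates the wording of the FAQ.

```python
import codecs


def decode_utf16_unmarked(data):
    """Decode bytes labelled plain "UTF-16" as the FAQ answer describes.

    Hypothetical helper: a leading BOM selects the byte order,
    otherwise big-endian is assumed.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode('utf-16-le')
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode('utf-16-be')
    return data.decode('utf-16-be')


print(decode_utf16_unmarked(b'\xff\xfea\x00b\x00c\x00'))  # BOM signals little-endian
print(decode_utf16_unmarked(b'\xfe\xff\x00a\x00b\x00c'))  # BOM signals big-endian
print(decode_utf16_unmarked(b'\x00a\x00b\x00c'))          # no BOM, big-endian assumed
```

All three calls yield the same text, abc, from three different byte sequences.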
 
jmfauth

jmfauth said:
When an endianess is not specified, (BE, LE, unmarked forms),
the Unicode Consortium specifies, the default byte serialization
should be big-endian.
See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

Sometimes it is necessary to read right to the end of an answer:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: [snip] the unmarked form uses big-endian byte serialization by default, but
may include a byte order mark at the beginning to indicate the actual byte
serialization used.



Well, English is not my native language, but I think I read it
correctly.

My question had nothing to do with the BOM, with encoding/decoding,
or with whether a BOM is included. My question was:

"What should I understand by 'utf-16': 'utf-16-le' or 'utf-16-be'?"

And Antoine gave an answer.
 
