Wrong default endianness in utf-16 and utf-32 !?

jmfauth · Oct 12, 2010

I hope my understanding is correct and I'm not dreaming.

When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

(+ technical papers)

It appears Python is just working in the opposite way.

>>> sys.version
2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
>>> repr(u'abc'.encode('utf-16-le'))
'a\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16-be'))
'\x00a\x00b\x00c'
>>> repr(u'abc'.encode('utf-16'))
'\xff\xfea\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

Ditto with utf-32, and with utf-16/utf-32 in Python 3.1.2.

I tried to find a precise discussion of this subject
and failed.

Any thoughts?

Antoine Pitrou · Oct 12, 2010

jmfauth said:
I hope my understanding is correct and I'm not dreaming.

When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.
[...]

It appears Python is just working in the opposite way.
[...]

>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True

Python uses the host's endianness by default. So, on a little-endian
machine, utf-16 and utf-32 will use little-endian encoding.
While decoding, though, the BOM is read by both of these codecs, so
there should be no interoperability problems:
u'abc'

(do note, though, that the explicit utf*-be and utf*-le variants do not
add a BOM)
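The behaviour described above can be checked with a short script. This is only a sketch, written for Python 3 (the session in the original post used 2.7), and the variable names are illustrative. It compares the plain `utf-16` codec against the explicit variant that matches `sys.byteorder`, then round-trips the result.

```python
import codecs
import sys

text = "abc"

# The plain codec writes a BOM first, then the data in the host's byte order.
encoded = text.encode("utf-16")

# Pick the BOM and the explicit codec that correspond to this machine.
if sys.byteorder == "little":
    native_bom, native_codec = codecs.BOM_UTF16_LE, "utf-16-le"
else:
    native_bom, native_codec = codecs.BOM_UTF16_BE, "utf-16-be"

has_native_bom = encoded.startswith(native_bom)
payload_matches_native = encoded[2:] == text.encode(native_codec)

# Decoding with the plain codec reads and drops the BOM.
round_trip = encoded.decode("utf-16")

# The explicit variants never write a BOM.
explicit_le = text.encode("utf-16-le")
explicit_be = text.encode("utf-16-be")

print(sys.byteorder, has_native_bom, payload_matches_native, round_trip)
```

On a little-endian host this reproduces the comparison from the original post: the bytes after the BOM equal the `utf-16-le` encoding.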

Regards

Antoine.

jmfauth · Oct 12, 2010

Antoine Pitrou said:

Python uses the host's endianness by default. So, on a little-endian
machine, utf-16 and utf-32 will use little-endian encoding.

Thanks. I was not aware of this.

John Machin · Oct 12, 2010

jmfauth said:
When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

Sometimes it is necessary to read right to the end of an answer:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: [snip] the unmarked form uses big-endian byte serialization by default, but
may include a byte order mark at the beginning to indicate the actual byte
serialization used.
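The rule in that FAQ answer can be written out as a small decoder. This is a sketch of the rule as quoted, not of what Python's built-in `utf-16` codec does with BOM-less input (that may differ); the function name `decode_unmarked_utf16` is hypothetical.

```python
import codecs


def decode_unmarked_utf16(data: bytes) -> str:
    """Decode 'unmarked' UTF-16: honour a BOM if present, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # BOM says little-endian; skip its two bytes.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        # BOM says big-endian; skip its two bytes.
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the big-endian default from the FAQ answer.
    return data.decode("utf-16-be")


print(decode_unmarked_utf16(b"\x00a\x00b\x00c"))
print(decode_unmarked_utf16(b"\xff\xfea\x00b\x00c\x00"))
```

Both calls yield the same string: the first relies on the big-endian default, the second on the little-endian BOM.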

jmfauth · Oct 13, 2010

John Machin said:
jmfauth said:
When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.

See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?

Sometimes it is necessary to read right to the end of an answer:

Q: Why do some of the UTFs have a BE or LE in their label, such as UTF-16LE?

A: [snip] the unmarked form uses big-endian byte serialization by default, but
may include a byte order mark at the beginning to indicate the actual byte
serialization used.

Well, English is not my native language; however, I think I read it
correctly.

My question had nothing to do with the BOM, the encoding/decoding
or the BOM inclusion. My question was:

What should I understand by "utf-16": "utf-16-le" or "utf-16-be"?

And Antoine gave an answer.
