jmfauth
I hope my understanding is correct and I'm not dreaming.
When an endianness is not specified (BE, LE, unmarked forms),
the Unicode Consortium specifies that the default byte serialization
should be big-endian.
See http://www.unicode.org/faq//utf_bom.html
Q: Which of the UTFs do I need to support?
and
Q: Why do some of the UTFs have a BE or LE in their label,
such as UTF-16LE?
(+ technical papers)
It appears Python is just working in the opposite way.
>>> sys.version
2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
>>> repr(u'abc'.encode('utf-16-le'))
'a\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16-be'))
'\x00a\x00b\x00c'
>>> repr(u'abc'.encode('utf-16'))
'\xff\xfea\x00b\x00c\x00'
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-be'))
False
>>> repr(u'abc'.encode('utf-16')[2:]) == repr(u'abc'.encode('utf-16-le'))
True
Ditto with utf-32, and with utf-16/utf-32 in Python 3.1.2.
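For Python 3, a minimal sketch of the same check. It assumes, as the utf-16 output suggests, that the unmarked codecs prepend a BOM and then serialize in the machine's native byte order rather than in a fixed big-endian order. The helper name `bom_order` is made up for illustration; the BOM constants come from the standard `codecs` module.

```python
import codecs
import sys


def bom_order(encoding, bom_le, bom_be):
    """Report which byte order an unmarked codec used, judged by its BOM.

    Hypothetical helper, not part of the standard library.
    """
    data = 'abc'.encode(encoding)
    if data.startswith(bom_le):
        return 'little'
    if data.startswith(bom_be):
        return 'big'
    return 'no BOM'


utf16_order = bom_order('utf-16', codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
utf32_order = bom_order('utf-32', codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)

# sys.byteorder is 'little' or 'big' depending on the machine.
print('native:', sys.byteorder)
print('utf-16:', utf16_order)
print('utf-32:', utf32_order)
```

If the assumption holds, all three lines print the same byte order, so a little-endian result on an Intel machine would reflect the platform and not a fixed little-endian default.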
I tried to find a precise discussion of this subject
and failed.
Any thoughts?