python-unicode doesn't support >65535 symbols?

gabor

hi,

today i made some tests...

i tested some unicode symbols that are above the 16-bit limit
(gothic: http://www.unicode.org/charts/PDF/U10330.pdf)
..

i played around with iconv and so on,
so at the end i created an utf8 encoded text file,
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode value: 0x10330).

(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).
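[editor's note: the same round trip can be sketched in pure Python — a hypothetical equivalent of the iconv/hexedit steps above, not the commands gabor actually ran:]

```python
# build the UTF-32BE byte stream for "Marrakesh" (9 chars -> 36 bytes, no BOM)
data = 'Marrakesh'.encode('utf-32-be')

# the second 'a' is character index 4, i.e. bytes 16..20 of the UTF-32 stream;
# overwrite that 4-byte cell with U+10330 (the hexedit step)
data = data[:16] + (0x10330).to_bytes(4, 'big') + data[20:]

# convert back (the second iconv step)
text = data.decode('utf-32-be')
utf8 = text.encode('utf-8')   # the final utf-8 file contents
```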

now i started python:

>>> text
u'Marr\U00010330kesh'

so far it seemed ok.
then i did:

>>> len(text)
10

this is wrong. the length should be 9.
and why?

>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'

so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).

i don't understand.
if the representation of 'text' is correct, why is the length wrong?

btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.

thanks,
gabor
 
Michael Hudson

gabor said:
i played around with iconv and so on,
so at the end i created an utf8 encoded text file,
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode value: 0x10330).

(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32big-endian, and replaced the character in hexedit, then converted
with iconv back to utf8).

now i started python:

>>> text
u'Marr\U00010330kesh'

so far it seemed ok.
then i did:

>>> len(text)
10

this is wrong. the length should be 9.

I suspect you have a "narrow unicode" build of Python. You can make
yourself a "wide unicode" build easily enough.
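[editor's note: you can check which build you are running via sys.maxunicode — 0xFFFF on a narrow build, 0x10FFFF on a wide build (and on every Python since 3.3, where PEP 393 made the distinction obsolete):]

```python
import sys

# 0xFFFF on a "narrow" (UTF-16) build; 0x10FFFF on a "wide" (UCS-4) build
# and on all Python versions since 3.3 (PEP 393).
if sys.maxunicode == 0xFFFF:
    print("narrow build: len(u'\\U00010330') == 2")
else:
    print("wide build: len(u'\\U00010330') == 1")
```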
and why?

>>> text[0]
u'M'
>>> text[1]
u'a'
>>> text[2]
u'r'
>>> text[3]
u'r'
>>> text[4]
u'\ud800'
>>> text[5]
u'\udf30'
>>> text[6]
u'k'

so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).

i don't understand.
if the representation of 'text' is correct, why is the length wrong?

I expect that this has to do with surrogates or some other unicode
thing that's beyond my understanding...

Cheers,
mwh
 
Andrew Clover

gabor said:
so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).

The default encoding for native Unicode strings in Python is UTF-16, which
cannot hold the extended planes beyond 0xFFFF in a single character. Instead,
it uses two 'surrogate' characters. Bit of a nasty hack, but that's what
Unicode does and there's nothing that can be done about it now.
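[editor's note: the surrogate mechanics can be reproduced directly — a sketch of the UTF-16 encoding rule, not Python's internal code:]

```python
def surrogate_pair(cp):
    """Split a supplementary code point (> 0xFFFF) into its UTF-16 pair."""
    v = cp - 0x10000                  # 20 bits remain after the offset
    high = 0xD800 + (v >> 10)         # top 10 bits -> high (lead) surrogate
    low = 0xDC00 + (v & 0x3FF)        # low 10 bits -> low (trail) surrogate
    return high, low

high, low = surrogate_pair(0x10330)
# gives 0xD800 and 0xDF30 -- exactly the two code units in gabor's output
```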

Python can be compiled to use UCS-4 for native Unicode strings if you prefer.
Then every conceptual 'character' from the Unicode repertoire will be one
item long. It'll eat more memory too of course.
if the representation of 'text' is correct, why is the length wrong?

The representation of 'text' you are seeing is just the nicely-readable
version output by Python 2.2+. Despite the \U sequence, it is actually still
stored internally as two UTF-16 surrogates. You'll see this if you enter
'\U00012345' into Python 2.0 or 2.1, which don't use the \U form to output
strings.
 
Rainer Deyke

Andrew said:
gabor said:
so text[4] (which should be \U00010330)
was split into two 16-bit values (text[4] and text[5]).

The default encoding for native Unicode strings in Python is UTF-16,
which cannot hold the extended planes beyond 0xFFFF in a single
character.

That's not quite right. UTF-16 encodes unicode characters as either single
16-bit values or pairs of 16-bit values. However, one character is still
one character.
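[editor's note: that the pair really denotes a single character can be shown by recombining it explicitly, using Python 3's surrogatepass error handler — a sketch, not part of the original discussion:]

```python
# the two code units from the narrow-build output
pair = '\ud800\udf30'

# encode the lone surrogates verbatim, then let the UTF-16 decoder join them
char = pair.encode('utf-16-be', 'surrogatepass').decode('utf-16-be')
# char is now the single character U+10330
```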

Python makes the mistake of exposing the internal representation instead of
the logical value of unicode objects. This means that, aside from space
optimization, unicode objects have no advantage over UTF-8 encoded plain
strings for storing unicode text.
 
Martin v. Löwis

Rainer Deyke said:
Python makes the mistake of exposing the internal representation instead of
the logical value of unicode objects. This means that, aside from space
optimization, unicode objects have no advantage over UTF-8 encoded plain
strings for storing unicode text.

That is not true. First, it is not "Python", but a specific Python
configuration - in "wide Unicode" builds, it uses UCS-4 internally.

In either build, len() and indexing address code units, not
characters: that is true.

However, it is not true that there is no advantage over UTF-8 encoded
byte strings. Instead, there are several advantages:
- In a UCS-4 build, Unicode characters and code units are in a 1:1
relationship
- In a UCS-2 build, Unicode characters and code units are in a 1:1
relationship as long as the application only ever processes BMP
characters.
- In either case, a Unicode object has inherent information about the
character set, which a UTF-8 byte string does not have. IOW, you know
what a Unicode object is, but you don't know (inherently) whether a
byte string is UTF-8.
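[editor's note: Martin's last point can be made concrete — Python 3 syntax, where every build is effectively "wide":]

```python
import unicodedata

text = '\U00010330'            # a Unicode object: one known character
utf8 = text.encode('utf-8')    # a byte string: four opaque bytes

# the Unicode object carries character-set semantics the bytes do not
assert len(text) == 1          # characters and code units are 1:1 here
assert len(utf8) == 4          # four bytes; nothing says they are UTF-8
print(unicodedata.name(text))  # the object knows what character it holds
```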

Regards,
Martin
 
