gabor
hi,
today i made some tests...
i tested some unicode symbols, that are above the 16bit limit
(gothic:http://www.unicode.org/charts/PDF/U10330.pdf)
..
i played around with iconv and so on,
so at the end i created a utf8 encoded text file,
with the text "Marrakesh",
where the second 'a' was replaced with
GOTHIC_LETTER_AHSA (unicode-value:0x10330).
(i simply wrote the text file "Marrakesh", used iconv to convert it to
utf32 big-endian, replaced the character in hexedit, then converted it
back to utf8 with iconv).
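for reference, the iconv + hexedit steps above can be sketched in python itself. this is only a sketch and it assumes python 3 (not the interpreter used below); the function name replace_code_point is made up for illustration:

```python
import struct

def replace_code_point(text, index, code_point):
    """Mimic the iconv + hexedit round trip: utf8 -> utf32be -> patch -> utf8."""
    # utf-32-be uses exactly 4 bytes per character and writes no BOM
    utf32 = bytearray(text.encode("utf-32-be"))
    # overwrite the 4 bytes of the character at 'index' (the hexedit step)
    utf32[index * 4:(index + 1) * 4] = struct.pack(">I", code_point)
    # convert back to utf8
    return bytes(utf32).decode("utf-32-be").encode("utf-8")

# the second 'a' of "Marrakesh" sits at index 4
utf8_data = replace_code_point("Marrakesh", 4, 0x10330)
print(utf8_data)  # b'Marr\xf0\x90\x8c\xb0kesh'
```

the gothic letter takes 4 bytes in utf8 (f0 90 8c b0), so the file should be 12 bytes long.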
now i started python:
>>> text
u'Marr\U00010330kesh'
so far it seemed ok.
then i did:
>>> len(text)
10
this is wrong. the length should be 9.
and why?
text[0] u'M'
text[1] u'a'
text[2] u'r'
text[3] u'r'
text[4] u'\ud800'
text[5] u'\udf30'
text[6] u'k'
so text[4] (which should be \U00010330)
was split into two 16bit values (text[4] and text[5]).
i don't understand.
if the representation of 'text' is correct, why is the length wrong?
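for what it's worth, the two values look like a utf16 surrogate pair for 0x10330. a small sketch of that arithmetic, assuming python 3 and assuming the interpreter above stores unicode strings as 16bit units (the helper name to_surrogates is made up):

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into a utf16 (high, low) surrogate pair."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogates(0x10330)
print(hex(high), hex(low))  # 0xd800 0xdf30

# counting 16bit units instead of characters gives 10, not 9
units = len("Marr\U00010330kesh".encode("utf-16-be")) // 2
print(units)  # 10
```

the pair matches the two values in the listing, and counting 16bit units gives the 10 that len() reported.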
btw. i understand that it's a very exotic character, but i tried for
example kwrite and gedit, and neither of them was able to display the
symbol, but both successfully identified it as ONE unknown symbol.
thanks,
gabor