Rhamphoryncus
[u'\udbff', u'\udfff']
Rhamphoryncus said:
Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).
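
To put rough numbers on the memory side of that trade-off, here is a small sketch (Python 3, not from the original post). It only compares encoded sizes; the helper name `storage_cost` is made up for illustration and says nothing about how the interpreter stores strings internally.

```python
def storage_cost(text):
    """Return (utf16_bytes, utf32_bytes) for text, encoded without a BOM."""
    utf16 = len(text.encode("utf-16-le"))
    utf32 = len(text.encode("utf-32-le"))
    return utf16, utf32

bmp_text = "hello"               # every character fits in one 16-bit unit
astral_text = "\U0010ffff" * 5   # every character needs a surrogate pair

print(storage_cost(bmp_text))     # (10, 20)
print(storage_cost(astral_text))  # (20, 20)
```

UTF-16 only wins when most characters are in the Basic Multilingual Plane; for text made entirely of surrogate pairs it costs the same as UTF-32.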
I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, isn't s[17] the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.
So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
[u'\U0010ffff']
The unicode type's repr hides the distinction, but you can see it with
list. On a UTF-16 build, your "single character" is actually two surrogate
code points, so s[s.index(c)] would only give you the first surrogate, not c.
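
Python 3.3 and later no longer have separate UTF-16 and UTF-32 builds, so the failure can't be reproduced directly there. As a rough sketch, the code below simulates a UTF-16 build by indexing over 16-bit code units instead of code points. The helper `utf16_units` and the sample string are assumptions for illustration, not anything from the interpreter itself.

```python
import struct

def utf16_units(text):
    """Split text into 16-bit code units, roughly as a UTF-16 build stores it."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

c = "\U0010ffff"
s = "ab" + c + "cd"

units = utf16_units(s)
needle = utf16_units(c)

# Naive search for the needle's code units, standing in for s.index(c).
start = next(i for i in range(len(units))
             if units[i:i + len(needle)] == needle)

print([hex(u) for u in needle])   # ['0xdbff', '0xdfff']
print(hex(units[start]))          # 0xdbff, the high surrogate only
print([units[start]] == needle)   # False: indexing returns half of c
print(s[s.index(c)] == c)         # True on Python 3.3+, which indexes by code point
```

The search finds c, but a single index at that position returns one code unit, so it can never equal the two-unit c.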