"index" method only for mutable sequences??

Rhamphoryncus

Rhamphoryncus said:
Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
either some sort of costly search when indexing (O(log n) probably) or
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
list(u'\U0010ffff')
[u'\U0010ffff'] # UTF-32 output
[u'\udbff', u'\udfff'] # UTF-16 output

The unicode type's repr hides the distinction but you can see it with
list. Your "single character" is actually two surrogate code points.
s[s.index(c)] would only give you the first surrogate character.
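
For anyone reading this on Python 3.3 or later, where str is always indexed by code point, the narrow-build behaviour can only be emulated. Here is a minimal sketch, assuming we model a UTF-16 build's string as a list of 16-bit code units. utf16_code_units and find_units are hypothetical helpers made up for this illustration, not part of any Python API:

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text, roughly what a UTF-16 build indexed."""
    data = text.encode("utf-16-le")  # little-endian, no byte-order mark
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def find_units(haystack, needle):
    """Index of the first occurrence of the list needle in haystack, or -1."""
    for i in range(len(haystack) - len(needle) + 1):
        if haystack[i:i + len(needle)] == needle:
            return i
    return -1


c = utf16_code_units("\U0010ffff")      # one character, two code units
s = utf16_code_units("ab\U0010ffff")

i = find_units(s, c)
print([hex(u) for u in c])   # ['0xdbff', '0xdfff']
print(i)                     # 2
print([s[i]] == c)           # False: s[i] is only the first surrogate
```

Under this model the search finds c at index 2, but the single element at that index is just the high surrogate, so c == s[s.index(c)] cannot hold.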
 
Rhamphoryncus

Paul Rubin wrote:
I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.
So where does the invariant c==s[s.index(c)] fail, assuming s contains c?

I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation) but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):

You're confusing \u, which is followed by 4 digits, and \U, which is
followed by eight:
list(u'\u1d400')
[u'\u1d40', u'0']
list(u'\U0001d400')
[u'\U0001d400'] # UTF-32 output, sys.maxunicode == 1114111
[u'\ud835', u'\udc00'] # UTF-16 output, sys.maxunicode == 65535
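
The two surrogates in the UTF-16 output follow from the standard UTF-16 arithmetic. A rough sketch in Python 3 syntax (no u prefix needed), where to_surrogate_pair is a hypothetical helper written only for this illustration:

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need surrogates")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


# \u takes exactly four hex digits, so the trailing "0" is a separate character.
four_digit = "\u1d400"
# \U takes exactly eight hex digits and names a single code point.
eight_digit = "\U0001d400"

print(len(four_digit), len(eight_digit))               # 2 1
print([hex(u) for u in to_surrogate_pair(0x1D400)])    # ['0xd835', '0xdc00']
```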
 
Paul Rubin

Roel Schroeven said:
In this case s[0] is not the full Unicode scalar, but instead just the
first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
0x0000 (in s[1]).

Arrrrgggh. After much head scratching I think I now understand what
you are saying. This appears to me to be absolutely nuts. What is
the purpose of having a unicode string type, if its sequence elements
are not guaranteed to be the unicode characters in the string? Might
as well use byte strings for everything.

Come to think of it, I don't understand why we have this plethora of
encodings like utf-16. utf-8 I can sort of understand on pragmatic
grounds, but aside from that I'd think UCS-4 should be used for everything,
and when a space-saving compressed representation is desired, then use
a general purpose data compression algorithm such as gzip.
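
To put rough numbers on that trade-off, here is a small sketch that compares encoded sizes. The sample text is an arbitrary assumption, and zlib stands in for gzip since both use DEFLATE. Real text will give different ratios:

```python
import zlib

# Hypothetical sample text; substitute any string of interest.
sample = "The quick brown fox jumps over the lazy dog. " * 20

sizes = {
    "utf-8": len(sample.encode("utf-8")),
    "utf-16": len(sample.encode("utf-16-le")),    # no byte-order mark
    "utf-32": len(sample.encode("utf-32-le")),    # UCS-4: four bytes per code point
    "utf-32 + zlib": len(zlib.compress(sample.encode("utf-32-le"))),
}

for name, size in sizes.items():
    print(name, size)
```

One thing the sizes don't show: a compressed buffer has to be decompressed before s[i] can be read, which is the indexing cost mentioned at the start of the thread.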
 
Donn Cave

Hendrik van Rooyen said:
Donn Cave said:
Well, yes - consider for example the "tm" tuple returned
from time.localtime() - it's all integers, but heterogeneous
as could be - tm[0] is Year, tm[1] is Month, etc., and it
turns out that not one of them is alike. The point is exactly
that we can't discover these differences from the items themselves -
so it isn't about Python types - but rather from the position
of the item in the struct/tuple. (For the person who is about
to write to me that localtime() doesn't exactly return a tuple: QED)

This is the point where the whole thing falls apart in my head and
I get real confused - I can't find a reason why, list or tuple, the first
item can't be something, the second something else, etc...

Of course, you may do what you like. Don't forget, though,
that there's no "index" method for a tuple.

Donn Cave
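
The localtime() point can be tried directly. A minimal sketch, using only the standard time module, showing that each position carries its own meaning even though every item is an integer:

```python
import time

tm = time.localtime()

# Position decides meaning: every item is an int, but each means something else.
year, month, day = tm[0], tm[1], tm[2]

# The same fields are also reachable by name on the struct_time result.
print(year == tm.tm_year, month == tm.tm_mon, day == tm.tm_mday)  # True True True
```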
 
