"index" method only for mutable sequences??

Rhamphoryncus · Apr 15, 2007

Rhamphoryncus said:

Indexing cost, memory efficiency, and canonical representation: pick
two. You can't use a canonical representation (scalar values) without
some sort of costly search when indexing (O(log n) probably) or by
expanding to the worst-case size (UTF-32). Python has taken the
approach of always providing efficient indexing (O(1)), but you can
compile it with either UTF-16 (better memory efficiency) or UTF-32
(canonical representation).
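
To make the indexing cost concrete, here is a minimal sketch, assuming Python 3 (the thread itself is about Python 2 builds). It treats a string as a sequence of UTF-16 code units and looks up the n-th scalar value with a naive scan from the start. That scan is linear, not the O(log n) structure mentioned above. The names `utf16_units` and `code_point_at` are made up for illustration.

```python
import struct

def utf16_units(text):
    """Return the UTF-16 code units of text as a tuple of integers."""
    raw = text.encode("utf-16-be")
    return struct.unpack(">%dH" % (len(raw) // 2), raw)

def code_point_at(units, index):
    """Return the scalar value at position index, scanning from the start."""
    position = 0
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one scalar value.
            value = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            value = unit
            width = 1
        if position == index:
            return value
        position += 1
        i += width
    raise IndexError("code point index out of range")

units = utf16_units("a\U0001d400b")
print(len(units))                    # 4 code units for 3 scalar values
print(hex(code_point_at(units, 1)))  # 0x1d400
print(hex(code_point_at(units, 2)))  # 0x62
```

Indexing `units` directly is O(1) but counts code units; indexing by scalar value has to walk past every surrogate pair before the target.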


I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.

So where does the invariant c==s[s.index(c)] fail, assuming s contains c?
[u'\U0010ffff']

[u'\udbff', u'\udfff']

The unicode type's repr hides the distinction but you can see it with
list. Your "single character" is actually two surrogate code points.
s[s.index(c)] would only give you the first surrogate character.
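
A rough way to see the same failure without a UTF-16 build is to simulate one. The sketch below assumes Python 3 and models a narrow-build string as a list of UTF-16 code units; `utf16_code_units` and `find_units` are hypothetical helpers, not part of any library.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

def find_units(haystack, needle):
    """Return the first offset where needle occurs in haystack, like index."""
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    raise ValueError("subsequence not found")

s = utf16_code_units("ab\U0010ffff")
c = utf16_code_units("\U0010ffff")

i = find_units(s, c)
print([hex(u) for u in c])  # ['0xdbff', '0xdfff']
print(i)                    # 2
print(hex(s[i]))            # 0xdbff, only the first surrogate
print([s[i]] == c)          # False: the invariant does not hold
```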

Rhamphoryncus · Apr 15, 2007

Paul Rubin wrote:

I still don't get it. UTF-16 is just a data compression scheme, right?
I mean, s[17] isn't the 17th character of the (unicode) string regardless
of which memory byte it happens to live at? It could be that accessing
it takes more than constant time, but that's hidden by the implementation.


So where does the invariant c==s[s.index(c)] fail, assuming s contains c?


I didn't get it either, but now I understand. Like you, I thought Python
Unicode strings contain a canonical representation (in interface, not
necessarily in implementation) but apparently that is not true; see
Neil's post and the reference manual
(http://docs.python.org/ref/types.html#l2h-22).

A simple example on my Python installation, apparently compiled to use
UTF-16 (sys.maxunicode == 65535):

You're confusing \u, which is followed by 4 digits, and \U, which is
followed by eight:

list(u'\u1d400')
[u'\u1d40', u'0']
list(u'\U0001d400')

[u'\U0001d400'] # UTF-32 output, sys.maxunicode == 1114111
[u'\ud835', u'\udc00'] # UTF-16 output, sys.maxunicode == 65535
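
The two surrogates in the UTF-16 output follow from simple arithmetic. Here is a small sketch, assuming Python 3 (where strings index by code point regardless of build); `to_surrogate_pair` is an illustrative helper name, not a standard function.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

# \u consumes exactly four hex digits, so the trailing "0" is a second character.
four_digit = "\u1d400"
# \U consumes exactly eight hex digits and names a single code point.
eight_digit = "\U0001d400"

print([hex(ord(ch)) for ch in four_digit])           # ['0x1d40', '0x30']
print(len(eight_digit))                              # 1
print([hex(u) for u in to_surrogate_pair(0x1D400)])  # ['0xd835', '0xdc00']
```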

Paul Rubin · Apr 15, 2007

Roel Schroeven said:
In this case s[0] is not the full Unicode scalar, but instead just the
first part of the surrogate pair consisting of 0x1D40 (in s[0]) and
0x0000 (in s[1]).

Arrrrgggh. After much head scratching I think I now understand what
you are saying. This appears to me to be absolutely nuts. What is
the purpose of having a unicode string type, if its sequence elements
are not guaranteed to be the unicode characters in the string? Might
as well use byte strings for everything.

Come to think of it, I don't understand why we have this plethora of
encodings like utf-16. utf-8 I can sort of understand on pragmatic
grounds, but aside from that I'd think UCS-4 should be used for everything,
and when a space-saving compressed representation is desired, then use
a general purpose data compression algorithm such as gzip.

Donn Cave · Apr 16, 2007

Hendrik van Rooyen said:
Donn Cave said:

Well, yes - consider for example the "tm" tuple returned
from time.localtime() - it's all integers, but heterogeneous
as could be - tm[0] is Year, tm[1] is Month, etc., and it
turns out that not one of them is alike. The point is exactly
that we can't discover these differences from the items themselves -
so it isn't about Python types - but rather from the position
of the item in the struct/tuple. (For the person who is about
to write to me that localtime() doesn't exactly return a tuple: QED)


This is the point where the whole thing falls apart in my head and
I get real confused - I can't find a reason why, list or tuple, the first
item can't be something, the second something else, etc...

Of course, you may do what you like. Don't forget, though,
that there's no "index" method for a tuple.

Donn Cave, (e-mail address removed)
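
For readers on later Python versions: as far as I know, tuples gained an `index` method in Python 2.6, after this thread was written. The sketch below assumes Python 3 and illustrates the positional-meaning point with `time.localtime()`; the variable names are illustrative only.

```python
import time

tm = time.localtime()

# Each position carries its own meaning, also exposed as a named attribute.
print(tm[0] == tm.tm_year)  # True
print(tm[1] == tm.tm_mon)   # True

# On Python 2.6 and later a plain tuple has index(); the year sits at position 0.
fields = tuple(tm)
position = fields.index(tm.tm_year)
print(position)             # 0
```

Searching such a tuple by value is rarely useful, since the value alone does not say which field it came from.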
