John said:
So as it turns out, Unicode and UTF-8 are not the same thing?
Well yes. UTF-8 is one scheme in which the whole Unicode character
repertoire can be represented as bytes.
Confusion arises because Windows uses the name 'Unicode' in character
encoding lists to mean UTF-16LE, which is another encoding that can
store the whole Unicode character repertoire as bytes. However,
UTF-16LE is not any more definitively 'Unicode' than UTF-8 is.
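To make that concrete, here is a minimal sketch (the sample text is
just an illustration, and 'utf-8' / 'utf-16-le' are Python's codec
names for the two encodings). The same Unicode string comes out as two
different byte sequences, and both decode back to identical characters:

```python
text = u'caf\u00e9 \u20ac'  # six characters: c, a, f, e-acute, space, euro sign

utf8_bytes = text.encode('utf-8')
utf16le_bytes = text.encode('utf-16-le')

# Same characters, different bytes on the wire
print(repr(utf8_bytes))
print(repr(utf16le_bytes))

# Either encoding round-trips the full string
assert utf8_bytes.decode('utf-8') == text
assert utf16le_bytes.decode('utf-16-le') == text
```

Neither byte sequence is "the Unicode"; each is just one way of
serialising it.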
Further confusion arises because the encoding 'UTF-16' can actually
mean two things that are deceptively different:
- Unicode characters stored natively in 16-bit units (using two
16-bit units, a surrogate pair, to represent each character outside
of the Basic Multilingual Plane)
- Either of the byte-oriented encodings UTF-16LE and UTF-16BE,
detected automatically using a Byte Order Mark when loaded, or chosen
arbitrarily when saving
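A rough sketch of the second meaning, using Python's 'utf-16',
'utf-16-le' and 'utf-16-be' codecs (the sample characters are my own
choice). The plain 'utf-16' codec writes a Byte Order Mark and picks
the machine's byte order when encoding, and reads the mark to choose
the order when decoding:

```python
import codecs

text = u'A'

le = text.encode('utf-16-le')     # b'A\x00', no BOM
be = text.encode('utf-16-be')     # b'\x00A', no BOM
with_bom = text.encode('utf-16')  # BOM first, then native byte order

# Which BOM you get depends on the machine doing the encoding
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# On decoding, 'utf-16' uses the BOM to pick the byte order
assert (codecs.BOM_UTF16_LE + le).decode('utf-16') == text
assert (codecs.BOM_UTF16_BE + be).decode('utf-16') == text

# A character outside the BMP takes two 16-bit units (a surrogate pair)
clef = u'\U0001D11E'  # MUSICAL SYMBOL G CLEF
assert clef.encode('utf-16-be') == b'\xd8\x34\xdd\x1e'
```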
Yet more confusion arises because UTF-32 (which can reference any
Unicode character directly) has the same problem. And though
wide-unicode builds of Python understand the first meaning (unicode()
strings are stored natively as UTF-32), they don't support the
byte-oriented encodings UTF-32LE and UTF-32BE. Phew!
To summarise: confusion.
Am I right to say that UTF-8 stores the first 128 Unicode code points
in a single byte, and then stores higher code points in however many
bytes they may need?
That is correct.
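If you want to see it for yourself, a quick sketch along these lines
(the sample characters are arbitrary picks on either side of each
length boundary) shows how many bytes UTF-8 spends per code point:

```python
chars = [
    u'A',           # U+0041, plain ASCII
    u'\u007f',      # U+007F, the last code point that fits in one byte
    u'\u0080',      # U+0080, the first one that needs two bytes
    u'\u20ac',      # U+20AC, euro sign
    u'\U0001D11E',  # U+1D11E, outside the Basic Multilingual Plane
]

lengths = [len(ch.encode('utf-8')) for ch in chars]
print(lengths)  # [1, 1, 2, 3, 4]
```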
To answer the original question, we're always going to need byte
strings. They're a fundamental part of computing and the need to
process them isn't going to go away. However, as Unicode text
manipulation becomes more common than byte string processing,
it makes sense to change the default kind of string you get when you
type a literal.
Personally I would like to see byte strings available under an easy
syntax like b'...' and UTF-32 strings available as w'...', or something
like that - currently having u'...' mean either UTF-16 or UTF-32
depending on compile-time options is very very annoying to the few
kinds of programs that really do need to know the difference. But
whatever is chosen, it's all tasty Python 3000 future-soup and not
worth worrying about for the moment.