willie
Martin v. Löwis:
Thanks for the thorough explanation. One last question
about terminology, then I'll go away.
What is the proper way to describe "ustr" below?
<type 'unicode'>
Is it a "unicode object that contains a UTF-8 encoded
string object?"
>willie wrote:
>
>
>Well, to get to the enlightenment, you have to understand
>that Unicode and UTF-8 are *not* synonyms.
>
>A Python Unicode string is an abstract sequence of
>characters. It does have an in-memory representation,
>but that is irrelevant and depends on what microprocessor
>you use. A byte string is a sequence of quantities with
>8 bits each (called bytes).
>
>For each of them, the notion of "length" exists: For
>a Unicode string, it's the number of characters; for
>a byte string, the number of bytes.
>
>UTF-8 is a character encoding; it is only meaningful
>to say that byte strings have an encoding (where
>"UTF-8", "cp1252", "iso-2022-jp" are really very
>similar). For a character encoding, "what is the
>number of bytes?" is a meaningful question. For
>a Unicode string, this question is not meaningful:
>you have to specify the encoding first.
>
>Now, there is no len(unicode_string, encoding) function:
>len takes a single argument. To specify both the string
>and the encoding, you have to write
>len(unicode_string.encode(encoding)). This, as a
>side effect, actually computes the encoding.
>
>While it would be possible to answer the question
>"how many bytes has Unicode string S in encoding E?"
>without actually encoding the string, doing so would
>require codecs to implement their algorithm twice:
>once to count the number of bytes, and once to
>actually perform the encoding. Since this operation
>is not that frequent, it was chosen not to put the
>burden of implementing the algorithm twice (actually,
>doing so was never even considered).
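To make the distinction in the quoted text concrete, here is a minimal sketch. It assumes Python 3, where `str` plays the role of the `unicode` type discussed above and `bytes` is the byte string. The sample string and the helper `utf8_length` are illustrative only and do not come from the thread or the standard library; the helper just shows what "counting without encoding" could look like for one codec, which is exactly the duplicated logic the quoted text says codecs are not asked to provide.

```python
# Assumption: Python 3 (str = Unicode string, bytes = byte string).
text = "L\u00f6wis"  # "Löwis", five characters

# Length of the Unicode string: the number of characters.
print(len(text))  # 5

# A byte count only exists once an encoding has been chosen.
for encoding in ("utf-8", "cp1252", "utf-16-le"):
    encoded = text.encode(encoding)  # actually performs the encoding
    print(encoding, len(encoded))    # utf-8 6, cp1252 5, utf-16-le 10


def utf8_length(s):
    """Hypothetical helper: count UTF-8 bytes without building the bytes.

    It repeats the size rules of the UTF-8 codec, based on the code point
    of each character.
    """
    total = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            total += 1
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3
        else:
            total += 4
    return total


print(utf8_length(text))  # 6, same as len(text.encode("utf-8"))
```

The same string has three different byte lengths here, which is why asking for "the length in bytes" of a Unicode string has no answer until the encoding is named.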