byte count unicode string

willie

Martin v. Löwis:
>willie wrote:
>
>
>Well, to get to the enlightenment, you have to understand
>that Unicode and UTF-8 are *not* synonyms.
>
>A Python Unicode string is an abstract sequence of
>characters. It does have an in-memory representation,
>but that is irrelevant and depends on what microprocessor
>you use. A byte string is a sequence of quantities with
>8 bits each (called bytes).
>
>For each of them, the notion of "length" exists: For
>a Unicode string, it's the number of characters; for
>a byte string, the number of bytes.
>
>UTF-8 is a character encoding; it is only meaningful
>to say that byte strings have an encoding (where
>"UTF-8", "cp1252", "iso-2022-jp" are really very
>similar). For a character encoding, "what is the
>number of bytes?" is a meaningful question. For
>a Unicode string, this question is not meaningful:
>you have to specify the encoding first.
>
>Now, there is no len(unicode_string, encoding) function:
>len takes a single argument. To specify both the string
>and the encoding, you have to write
>len(unicode_string.encode(encoding)). This, as a
>side effect, actually computes the encoding.
>
>While it would be possible to answer the question
>"how many bytes has Unicode string S in encoding E?"
>without actually encoding the string, doing so would
>require codecs to implement their algorithm twice:
>once to count the number of bytes, and once to
>actually perform the encoding. Since this operation
>is not that frequent, it was chosen not to put the
>burden of implementing the algorithm twice on codec
>authors (actually, doing so was never even considered).
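
To make the len(unicode_string.encode(encoding)) idiom concrete, here is a small sketch. It is written for Python 3, where the two types from this thread are named str (the Unicode string) and bytes (the byte string); the sample text and variable names are only illustrative.

```python
# Python 3 sketch: str is the Unicode type, bytes is the byte string.
text = "L\u00f6wis"  # 5 characters, one of them non-ASCII (o with diaeresis)

# Length of the Unicode string: number of characters.
char_count = len(text)

# Length in bytes only makes sense once an encoding is chosen.
byte_counts = {
    encoding: len(text.encode(encoding))
    for encoding in ("utf-8", "cp1252", "utf-16-le")
}

print(char_count)   # 5
print(byte_counts)  # {'utf-8': 6, 'cp1252': 5, 'utf-16-le': 10}
```

The same string has three different byte counts here, which is why "how many bytes?" cannot be answered without naming the encoding.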


Thanks for the thorough explanation. One last question
about terminology then I'll go away :)
What is the proper way to describe "ustr" below?
<type 'unicode'>


Is it a "unicode object that contains a UTF-8 encoded
string object?"
 
John Machin

willie said:
>Thanks for the thorough explanation. One last question
>about terminology then I'll go away :)
>What is the proper way to describe "ustr" below?
>
><type 'unicode'>
>
>Is it a "unicode object that contains a UTF-8 encoded
>string object?"

No. It is a Python unicode object, period.

1. If it did contain another object you would be (quite justifiably)
screaming your peripherals off about the waste of memory :)
2. You don't need to concern yourself with the internals of a unicode
object; however, rest assured that it is *not* stored as UTF-8, so if
you are hoping for a quick "number of UTF-8 bytes without actually
producing a str object" method, you are out of luck.
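
For illustration only, a byte count can be derived from the code points without building the encoded string, but you have to write it yourself and it only covers one encoding. The sketch below is Python 3 (str is the Unicode type); utf8_len is a hypothetical helper, not something Python or its codecs provide, and it assumes the text contains no lone surrogates.

```python
def utf8_len(text):
    """Hypothetical helper: count UTF-8 bytes without encoding the string.

    Assumes `text` holds no lone surrogate code points (those cannot be
    encoded as UTF-8 at all).
    """
    total = 0
    for ch in text:
        code_point = ord(ch)
        if code_point < 0x80:        # ASCII range: 1 byte
            total += 1
        elif code_point < 0x800:     # up to U+07FF: 2 bytes
            total += 2
        elif code_point < 0x10000:   # rest of the BMP: 3 bytes
            total += 3
        else:                        # supplementary planes: 4 bytes
            total += 4
    return total


sample = "L\u00f6wis \u20ac \U0001F600"
# Cross-check against the straightforward approach.
assert utf8_len(sample) == len(sample.encode("utf-8"))
```

This duplicates logic the UTF-8 codec already contains, and in pure Python it is unlikely to be faster than simply calling len(text.encode("utf-8")).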

Consider this example: you have a str object which contains some
Russian text, encoded in cp1251.

str1 = russian_text
unicode1 = str1.decode('cp1251')
str2 = unicode1.encode('utf-8')
unicode2 = str2.decode('utf-8')

Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1). There is
no way (without the above history) of determining how unicode2 was created,
and you don't need to care how it was created.
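
A runnable version of the round trip above, as a sketch in Python 3 (bytes plays the role of the thread's str, and str plays the role of unicode). The sample word and the names bytes1 and bytes2 are assumptions for the sake of the example, standing in for russian_text, str1 and str2.

```python
# Python 3 sketch of the cp1251 -> Unicode -> UTF-8 -> Unicode round trip.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a six-letter Russian word

bytes1 = russian.encode("cp1251")    # stand-in for str1 (cp1251-encoded bytes)
unicode1 = bytes1.decode("cp1251")   # byte string -> Unicode string
bytes2 = unicode1.encode("utf-8")    # Unicode string -> UTF-8 bytes
unicode2 = bytes2.decode("utf-8")    # UTF-8 bytes -> Unicode string

# The two Unicode strings are indistinguishable.
assert unicode2 == unicode1
assert repr(unicode2) == repr(unicode1)

# The byte strings differ, because the encodings differ.
print(len(unicode1), len(bytes1), len(bytes2))  # 6 6 12
```

The character count stays at 6 throughout; only the encoded forms differ in length (one byte per letter in cp1251, two in UTF-8 for these letters).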

HTH,
John
 
Paul Rubin

willie said:
><type 'unicode'>
>Is it a "unicode object that contains a UTF-8 encoded
>string object?"

No, it's just Unicode, which is a string over a certain character set.
UTF-8 is a way to encode Unicode strings as byte strings.

You should read the Wikipedia article about Unicode; it will help you
understand.

http://en.wikipedia.org/wiki/Unicode
 
