byte count unicode string

willie · Sep 20, 2006

John Machin:

>Good luck!

Thank you for your patience and for educating me.
(Though I still have a long way to go before enlightenment)
I thought Python might have a small weakness in
lacking an efficient way to get the number of bytes
in a "UTF-8 encoded Python string object" (proper?),
but I've been disabused of that notion.
It's always a nice feeling when my language of choice
withstands my nitpicking.

Guest · Sep 20, 2006

willie said:
Thank you for your patience and for educating me.
(Though I still have a long way to go before enlightenment)
I thought Python might have a small weakness in
lacking an efficient way to get the number of bytes
in a "UTF-8 encoded Python string object" (proper?),
but I've been disabused of that notion.

Well, to get to the enlightenment, you have to understand
that Unicode and UTF-8 are *not* synonyms.

A Python Unicode string is an abstract sequence of
characters. It does have an in-memory representation,
but that is irrelevant here: it is an implementation detail
of your Python build and platform. A byte string is a sequence of quantities with
8 bits each (called bytes).

For each of them, the notion of "length" exists: For
a Unicode string, it's the number of characters; for
a byte string, the number of bytes.
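
To make the two notions of length concrete, here is a small sketch. It assumes Python 3, where `str` is the Unicode string type and `bytes` is the byte string type (in the Python 2 of this thread the corresponding types were `unicode` and `str`). The sample word is only an illustration.

```python
# Assumes Python 3: str holds characters, bytes holds bytes.
text = "na\u00efve"            # 5 characters; the third is U+00EF
data = text.encode("utf-8")    # the same text as a UTF-8 byte string

print(len(text))   # number of characters
print(len(data))   # number of bytes
```

The two results differ because U+00EF takes two bytes in UTF-8 while the other four characters take one byte each.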

UTF-8 is a character encoding; it is only meaningful
to say that byte strings have an encoding (in that respect
"UTF-8", "cp1252", and "iso-2022-jp" are really very
similar). For a byte string in some encoding, "what is the
number of bytes?" is a meaningful question. For
a Unicode string, this question is not meaningful
until you specify the encoding.

Now, there is no len(unicode_string, encoding) function:
len takes a single argument. To specify both the string
and the encoding, you have to write
len(unicode_string.encode(encoding)). This, as a
side effect, actually computes the encoding.
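
As a rough sketch of that idiom, again assuming Python 3: `encoded_length` is a hypothetical helper name, not something Python provides, and the sample string is chosen only so that the encodings disagree.

```python
def encoded_length(text, encoding):
    """Return how many bytes `text` occupies once encoded with `encoding`.

    Hypothetical helper, not a built-in: it encodes the whole string
    and then measures the result.
    """
    return len(text.encode(encoding))

sample = "\u20ac5"  # euro sign followed by the digit 5: two characters

for name in ("utf-8", "utf-16-le", "cp1252"):
    print(name, encoded_length(sample, name))
```

The same two-character string gives different byte counts depending on the encoding, which is why the encoding has to be part of the question.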

While it would be possible to answer the question
"how many bytes does Unicode string S take in encoding E?"
without actually encoding the string, doing so would
require codecs to implement their algorithm twice:
once to count the number of bytes, and once to
actually perform the encoding. Since this operation
is not that frequent, codec authors were not burdened
with implementing the algorithm twice (actually,
doing so was never even considered).
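
For illustration only, this is roughly what such a second, count-only implementation could look like for UTF-8. It is a sketch assuming Python 3, where iterating a `str` yields whole code points; `utf8_length` is a hypothetical name, and nothing like it exists in the codec machinery.

```python
def utf8_length(text):
    """Count UTF-8 bytes for `text` without building the encoded string.

    Hypothetical helper. It repeats the size rule of the UTF-8 encoder,
    which is exactly the duplication described above.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            total += 1      # ASCII range
        elif cp < 0x800:
            total += 2
        elif cp < 0x10000:
            total += 3      # lone surrogates land here, though encode() rejects them
        else:
            total += 4      # code points beyond the Basic Multilingual Plane
    return total

sample = "a\u00e9\u20ac\U0001f600"
print(utf8_length(sample), len(sample.encode("utf-8")))
```

Every other codec would need its own counting rule like this one, kept in step with its encoder.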

HTH,
Martin
