byte count unicode string

Discussion in 'Python' started by willie, Sep 20, 2006.

  1. willie

    willie Guest

    John Machin:

    >Good luck!



    Thank you for your patience and for educating me.
    (Though I still have a long way to go before enlightenment)
    I thought Python might have a small weakness in
    lacking an efficient way to get the number of bytes
    in a "UTF-8 encoded Python string object" (proper?),
    but I've been disabused of that notion.
    It's always a nice feeling when my language of choice
    withstands my nitpicking.
     
    willie, Sep 20, 2006
    #1

  2. willie wrote:
    > Thank you for your patience and for educating me.
    > (Though I still have a long way to go before enlightenment)
    > I thought Python might have a small weakness in
    > lacking an efficient way to get the number of bytes
    > in a "UTF-8 encoded Python string object" (proper?),
    > but I've been disabused of that notion.


    Well, to get to the enlightenment, you have to understand
    that Unicode and UTF-8 are *not* synonyms.

    A Python Unicode string is an abstract sequence of
    characters. It does have an in-memory representation,
    but that is irrelevant and depends on what microprocessor
    you use. A byte string is a sequence of quantities with
    8 bits each (called bytes).

    For each of them, the notion of "length" exists: For
    a Unicode string, it's the number of characters; for
    a byte string, the number of bytes.
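
    To make the two notions of length concrete, here is a minimal
    sketch. It assumes Python 3 syntax, where `str` plays the role
    of the Unicode string and `bytes` that of the byte string; the
    sample text is made up for illustration.

```python
# Assumption: Python 3, where str is a sequence of characters
# (code points) and bytes is a sequence of 8-bit values.
text = "L\u00f6wis"            # hypothetical sample: "Löwis", 5 characters
data = text.encode("utf-8")    # the same text as a UTF-8 byte string

char_count = len(text)         # length of the Unicode string: characters
byte_count = len(data)         # length of the byte string: bytes

# "ö" needs two bytes in UTF-8, so the two lengths differ.
print(char_count, byte_count)
```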

    UTF-8 is a character encoding; it is only meaningful
    to say that byte strings have an encoding (in this
    respect "UTF-8", "cp1252", and "iso-2022-jp" are all
    very similar). Once an encoding is given, "what is the
    number of bytes?" is a meaningful question. For
    a Unicode string on its own, the question is not
    meaningful: you have to specify the encoding first.

    Now, there is no len(unicode_string, encoding) function:
    len takes a single argument. To specify both the string
    and the encoding, you have to write
    len(unicode_string.encode(encoding)). This, as a
    side effect, actually computes the encoding.
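
    As a rough illustration of that idiom, the sketch below wraps
    it in a small helper. The name `byte_length` and the sample
    string are hypothetical, and Python 3 syntax is assumed; the
    point is only that the answer changes with the encoding.

```python
def byte_length(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in `encoding`.

    Hypothetical helper: it simply encodes the string and
    measures the resulting byte string.
    """
    return len(text.encode(encoding))


sample = "na\u00efve caf\u00e9"   # "naïve café", 10 characters

utf8_len = byte_length(sample, "utf-8")        # non-ASCII letters take 2 bytes each
cp1252_len = byte_length(sample, "cp1252")     # one byte per character here
utf16_len = byte_length(sample, "utf-16-le")   # two bytes per character here

print(len(sample), utf8_len, cp1252_len, utf16_len)
```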

    While it would be possible to answer the question
    "how many bytes has Unicode string S in encoding E?"
    without actually encoding the string, doing so would
    require codecs to implement their algorithm twice:
    once to count the number of bytes, and once to
    actually perform the encoding. Since this operation
    is not that frequent, codec authors were not asked
    to implement the algorithm twice (actually,
    doing so was never even considered).
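
    For UTF-8 specifically, counting without encoding could look
    roughly like the sketch below. It is not anything Python or
    its codecs provide; the name `utf8_length` is hypothetical and
    Python 3 is assumed. Note how it repeats the size rules that
    the codec already knows, which is the duplication described
    above.

```python
def utf8_length(text: str) -> int:
    """Count the UTF-8 bytes of `text` without building the encoded string.

    Hypothetical sketch. It does not reject lone surrogates
    (U+D800..U+DFFF), which the strict UTF-8 codec refuses to encode.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:          # ASCII range: 1 byte
            total += 1
        elif cp < 0x800:       # up to U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:     # rest of the Basic Multilingual Plane: 3 bytes
            total += 3
        else:                  # supplementary planes: 4 bytes
            total += 4
    return total


# Compare against the encode-then-measure idiom on a few samples.
samples = ["abc", "L\u00f6wis", "\u20ac", "\U0001F600"]
results = [(utf8_length(s), len(s.encode("utf-8"))) for s in samples]
print(results)
```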

    HTH,
    Martin
     
    Martin v. Löwis, Sep 20, 2006
    #2

