byte count unicode string

Discussion in 'Python' started by willie, Sep 20, 2006.

  1. willie

    willie Guest

    Martin v. Löwis:

    >willie schrieb:
    >
    >> Thank you for your patience and for educating me.
    >> (Though I still have a long way to go before enlightenment)
    >> I thought Python might have a small weakness in
    >> lacking an efficient way to get the number of bytes
    >> in a "UTF-8 encoded Python string object" (proper?),
    >> but I've been disabused of that notion.
    >
    >Well, to get to the enlightenment, you have to understand
    >that Unicode and UTF-8 are *not* synonyms.
    >
    >A Python Unicode string is an abstract sequence of
    >characters. It does have an in-memory representation,
    >but that is irrelevant and depends on what microprocessor
    >you use. A byte string is a sequence of quantities with
    >8 bits each (called bytes).
    >
    >For each of them, the notion of "length" exists: For
    >a Unicode string, it's the number of characters; for
    >a byte string, the number of bytes.
    >
    >UTF-8 is a character encoding; it is only meaningful
    >to say that byte strings have an encoding (where
    >"UTF-8", "cp1252", "iso-2022-jp" are really very
    >similar). For a character encoding, "what is the
    >number of bytes?" is a meaningful question. For
    >a Unicode string, this question is not meaningful:
    >you have to specify the encoding first.
    >
    >Now, there is no len(unicode_string, encoding) function:
    >len takes a single argument. To specify both the string
    >and the encoding, you have to write
    >len(unicode_string.encode(encoding)). This, as a
    >side effect, actually computes the encoding.
    >
    >While it would be possible to answer the question
    >"how many bytes has Unicode string S in encoding E?"
    >without actually encoding the string, doing so would
    >require codecs to implement their algorithm twice:
    >once to count the number of bytes, and once to
    >actually perform the encoding. Since this operation
    >is not that frequent, it was chosen not to put the
    >burden of implementing the algorithm twice on codec
    >authors (actually, doing so was never even considered).
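
The two points above can be sketched roughly as follows. This is a minimal illustration written for Python 3, where `str` plays the role of the `unicode` type discussed in this thread and `bytes` plays the role of the old byte string. The helper names `byte_length` and `utf8_length_without_encoding` are hypothetical and not part of Python; the second one only shows what a count-only duplicate of the UTF-8 algorithm might look like.

```python
def byte_length(text, encoding):
    """Number of bytes `text` occupies once encoded with `encoding`."""
    # The encoded byte string is built as a side effect and then discarded.
    return len(text.encode(encoding))


def utf8_length_without_encoding(text):
    """Count UTF-8 bytes from code points alone (hypothetical helper).

    Assumes `text` contains no lone surrogates, which UTF-8 cannot encode.
    """
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:          # ASCII range: 1 byte
            total += 1
        elif cp < 0x800:       # 2-byte sequences
            total += 2
        elif cp < 0x10000:     # rest of the Basic Multilingual Plane: 3 bytes
            total += 3
        else:                  # supplementary planes: 4 bytes
            total += 4
    return total


sample = "L\u00f6wis \u20ac"  # "Löwis €", 7 characters

print(len(sample))  # length in characters
for enc in ("utf-8", "cp1252", "utf-16-le"):
    # Same characters, different byte counts depending on the encoding.
    print(enc, byte_length(sample, enc))
print(utf8_length_without_encoding(sample))
```

The character count stays the same while the byte count changes with the encoding, which is why the question only makes sense once an encoding is named. The count-only function repeats the size rules a UTF-8 encoder already applies, and every other codec would need its own equivalent.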



    Thanks for the thorough explanation. One last question
    about terminology then I'll go away :)
    What is the proper way to describe "ustr" below?

    >>> ustr = buf.decode('UTF-8')
    >>> type(ustr)
    <type 'unicode'>


    Is it a "unicode object that contains a UTF-8 encoded
    string object?"
     
    willie, Sep 20, 2006
    #1

  2. willie

    John Machin Guest

    willie wrote:
    >
    > Thanks for the thorough explanation. One last question
    > about terminology then I'll go away :)
    > What is the proper way to describe "ustr" below?
    >
    > >>> ustr = buf.decode('UTF-8')
    > >>> type(ustr)
    > <type 'unicode'>
    >
    >
    > Is it a "unicode object that contains a UTF-8 encoded
    > string object?"


    No. It is a Python unicode object, period.

    1. If it did contain another object you would be (quite justifiably)
    screaming your peripherals off about the waste of memory :)
    2. You don't need to concern yourself with the internals of a unicode
    object; however, rest assured that it is *not* stored as UTF-8 -- so if
    you are hoping for a quick "number of UTF-8 bytes without actually
    producing a str object" method, you are out of luck.

    Consider this example: you have a str object which contains some
    Russian text, encoded in cp1251.

    str1 = russian_text
    unicode1 = str1.decode('cp1251')
    str2 = unicode1.encode('utf-8')
    unicode2 = str2.decode('utf-8')

    Then unicode2 == unicode1 and repr(unicode2) == repr(unicode1); there is
    no way (without the above history) of determining how it was created --
    and you don't need to care how it was created.
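
A runnable version of this round trip might look like the sketch below. It assumes Python 3, where `bytes` stands in for the `str` type used in this post and `str` stands in for `unicode`; the sample word is made up for illustration.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Привет", 6 Cyrillic characters

bytes1 = text.encode("cp1251")       # stands in for the cp1251-encoded str1
unicode1 = bytes1.decode("cp1251")
bytes2 = unicode1.encode("utf-8")
unicode2 = bytes2.decode("utf-8")

# The two text objects are indistinguishable, whatever route produced them.
print(unicode2 == unicode1)
print(repr(unicode2) == repr(unicode1))

# The byte strings differ in content and in length.
print(len(bytes1), len(bytes2), len(unicode1))
```

The encoded forms differ (one byte per character in cp1251, two per character in UTF-8 for this text), yet the decoded objects compare equal and carry no trace of either encoding.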

    HTH,
    John
     
    John Machin, Sep 21, 2006
    #2

  3. willie

    Paul Rubin Guest

    willie writes:

    > >>> ustr = buf.decode('UTF-8')
    > >>> type(ustr)
    > <type 'unicode'>
    > Is it a "unicode object that contains a UTF-8 encoded
    > string object?"


    No, it's just unicode, which is a string over a certain character set.
    UTF-8 is a way to encode Unicode strings as byte strings.

    You should read the Wikipedia article about Unicode; it will help you
    understand.

    http://en.wikipedia.org/wiki/Unicode
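
To make the distinction concrete, here is a small sketch that prints the code points of a string next to its UTF-8 bytes. It assumes Python 3 (`str` in place of `unicode`), and the sample word is chosen only for illustration.

```python
text = "na\u00efve"  # "naïve", 5 characters

# The text itself: a sequence of code points from the Unicode character set.
code_points = [ord(ch) for ch in text]

# One particular encoding of that text: a sequence of byte values.
utf8_bytes = list(text.encode("utf-8"))

print(code_points)
print(utf8_bytes)
```

The single character U+00EF (239) shows up as two bytes (195, 175) in the UTF-8 form, while the ASCII characters keep the same value in both lists.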
     
    Paul Rubin, Sep 22, 2006
    #3
