Re: Question on Strings

Discussion in 'Python' started by Chris Rebert, Feb 6, 2009.

  1. Chris Rebert

    Chris Rebert Guest

    On Fri, Feb 6, 2009 at 1:49 AM, Kalyankumar Ramaseshan
    <> wrote:
    >
    > Hi,
    >
    > Excuse me if this is a repeat question!
    >
    > I just wanted to know how are strings represented in python?
    >
    > I need to know in terms of:
    >
    > a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?


    IIRC, it depends on what the build settings were when CPython was
    compiled. UTF-16 is the default.

    > b) They are converted to utf-8 format when it is needed for e.g. when storing the string to disk or sending it through a socket (tcp/ip)?


    No. They are implicitly converted to ASCII in such cases. To properly
    handle non-ASCII Unicode characters, you need to encode/decode the
    strings to/from bytes manually by specifying the encoding.
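
A minimal sketch of the explicit encode/decode step described above, written in Python 3 syntax (an assumption; the thread dates from the 2.x era, where `unicode` plays the role of `str` here). The sample text and variable names are illustrative only.

```python
# Text containing non-ASCII characters (written with escapes to keep the
# source file itself ASCII-only).
text = "na\u00efve caf\u00e9"

# str -> bytes: name the encoding explicitly before writing to disk
# or sending over a socket.
data = text.encode("utf-8")

# bytes -> str: the receiving side decodes with the same encoding.
round_trip = data.decode("utf-8")

# ASCII cannot represent these characters, so a strict ASCII encode fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The same bytes could then be passed to a file opened in binary mode or to a socket's send call; which encoding to use is a choice both ends must agree on.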

    Cheers,
    Chris

    --
    Follow the path of the Iguana...
    http://rebertia.com
    Chris Rebert, Feb 6, 2009
    #1

  2. John Machin

    John Machin Guest

    On Feb 6, 9:24 pm, Chris Rebert <> wrote:
    > On Fri, Feb 6, 2009 at 1:49 AM, Kalyankumar Ramaseshan
    >
    > <> wrote:
    >
    > > Hi,

    >
    > > Excuse me if this is a repeat question!

    >
    > > I just wanted to know how are strings represented in python?

    >
    > > I need to know in terms of:

    >
    > > a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?


    Neither.

    >
    > IIRC, Depends on what the build settings were when CPython was
    > compiled. UTF-16 is the default.


    Unicode strings are held as arrays of 16-bit numbers or 32-bit numbers
    [of which only 21 are used]. If you must use an acronym, use UCS-2 or
    UCS-4.
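
One way to see which width a given interpreter uses is `sys.maxunicode`. This is only a sketch: on the 2.x builds discussed here the value depends on the build settings, while CPython 3.3 and later report the full range regardless of how strings are stored internally.

```python
import sys

# Highest code point a string can hold on this interpreter.
max_cp = sys.maxunicode

# 0xFFFF indicates 16-bit units (a "narrow" build); 0x10FFFF indicates
# that the full Unicode range is available.
is_wide = max_cp > 0xFFFF

# The largest Unicode code point, U+10FFFF, fits in 21 bits, which is
# why only 21 of the 32 bits are used.
bits_needed = (0x10FFFF).bit_length()

print(hex(max_cp), is_wide, bits_needed)
```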

    The UTF-n siblings are *external* representations.
    2.x: a_unicode_object.decode('UTF-16') -> an_str_object
    3.x: an_str_object.decode('UTF-16') -> a_bytes_object

    By the way, has anyone come up with a name for the shifting effect
    observed above on str, and also with repr, range, and the iter*
    family? If not, I suggest that the language's association with the
    best of English humour be widened so that it be dubbed the "Mad
    Hatter's Tea Party" effect.
    John Machin, Feb 6, 2009
    #2

  3. MRAB

    MRAB Guest

    John Machin wrote:
    > On Feb 6, 9:24 pm, Chris Rebert <> wrote:
    >> On Fri, Feb 6, 2009 at 1:49 AM, Kalyankumar Ramaseshan
    >>
    >> <> wrote:
    >>
    >>> Hi,
    >>> Excuse me if this is a repeat question!
    >>> I just wanted to know how are strings represented in python?
    >>> I need to know in terms of:
    >>> a) Strings are stored as UTF-16 (LE/BE) or UTF-32 characters?

    >
    > Neither.
    >
    >> IIRC, Depends on what the build settings were when CPython was
    >> compiled. UTF-16 is the default.

    >
    > Unicode strings are held as arrays of 16-bit numbers or 32-bit numbers
    > [of which only 21 are used]. If you must use an acronym, use UCS-2 or
    > UCS-4.
    >
    > The UTF-n siblings are *external* representations.
    > 2.x: a_unicode_object.decode('UTF-16') -> an_str_object
    > 3.x: an_str_object.decode('UTF-16') -> a_bytes_object
    >
    > By the way, has anyone come up with a name for the shifting effect
    > observed above on str, and also with repr, range, and the iter*
    > family? If not, I suggest that the language's association with the
    > best of English humour be widened so that it be dubbed the "Mad
    > Hatter's Tea Party" effect.
    >

    Bitwise shifts and rotates are collectively referred to as skew
    operations. I therefore suggest the term "skewing". :)
    MRAB, Feb 6, 2009
    #3
  4. "John Machin" <s..n@le..n.net> wrote:

    >By the way, has anyone come up with a name for the shifting effect
    >observed above on str, and also with repr, range, and the iter*
    >family? If not, I suggest that the language's association with the
    >best of English humour be widened so that it be dubbed the "Mad
    >Hatter's Tea Party" effect.


    The MHTP effect.

    Sounds educated, almost like
    a network protocol.

    +1

    - Hendrik
    Hendrik van Rooyen, Feb 6, 2009
    #4
  5. Terry Reedy

    Terry Reedy Guest

    John Machin wrote:

    > The UTF-n siblings are *external* representations.
    > 2.x: a_unicode_object.decode('UTF-16') -> an_str_object
    > 3.x: an_str_object.decode('UTF-16') -> a_bytes_object


    That should be .encode() to bytes, which is the coded form.
    .decode is bytes => str/unicode
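
The direction of the two methods can be checked with a short Python 3 sketch. The `utf-16-le` codec is chosen here only so the output has no byte-order mark and does not depend on the platform; the variable names are illustrative.

```python
text = "abc"

# encode: str -> bytes (the coded, external form)
encoded = text.encode("utf-16-le")

# decode: bytes -> str
decoded = encoded.decode("utf-16-le")

# In Python 3 the methods exist only in the sensible direction:
# str has no .decode() and bytes has no .encode().
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")
```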
    Terry Reedy, Feb 6, 2009
    #5
  6. John Machin

    John Machin Guest

    On Feb 7, 5:23 am, Terry Reedy <> wrote:
    > John Machin wrote:
    > > The UTF-n siblings are *external* representations.
    > > 2.x: a_unicode_object.decode('UTF-16') -> an_str_object
    > > 3.x: an_str_object.decode('UTF-16') -> a_bytes_object

    >
    > That should be .encode() to bytes, which is the coded form.
    > .decode is bytes => str/unicode


    True. I guess that makes me the Dohmouse :)
    John Machin, Feb 6, 2009
    #6