Need debugging knowhow for my creeping Unicodephobia

Discussion in 'Python' started by kj, Feb 10, 2010.

  1. kj

    kj Guest

    Some people have mathphobia. I'm developing a wicked case of
    Unicodephobia.

    I have read a *ton* of stuff on Unicode. It doesn't even seem all
    that hard. Or so I think. Then I start writing code, and WHAM:

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

    (There, see? My Unicodephobia just went up a notch.)

    Here's the thing: I don't even know how to *begin* debugging errors
    like this. This is where I could use some help.

    In the past I've gone for the method of choice of the clueless:
    "programming by trial and error", i.e. try random crap until something
    "works." And if that "strategy" fails, I come begging for help to
    c.l.p. (And thanks for the very effective pointers for getting rid
    of the errors.)

    But afterwards I remain as clueless as ever... It's the old "give
    a man a fish" vs. "teach a man to fish" story.

    I need a systematic approach to troubleshooting and debugging these
    Unicode errors. I don't know what. Some tools maybe. Some useful
    modules or builtin commands. A diagnostic flowchart? I don't
    think that any more RTFM on Unicode is going to help (I've done it
    in spades), but if there's a particularly good write-up on Unicode
    debugging, please let me know.

    Any suggestions would be much appreciated.

    FWIW, I'm using Python 2.6. The example above happens to come from
    a script that extracts data from HTML files, which are all in
    English, but errors like this are a daily occurrence when I write code to
    process non-English text. The script uses Beautiful Soup. I won't
    post a lot of code because, as I said, what I'm after is not so
    much a way around this specific error as much as the tools and
    techniques to troubleshoot it and fix it on my own. But to ground
    the problem a bit I'll say that the exception above happens during
    the execution of a statement of the form:

    x = '%s %s' % (y, z)

    Also, I found that, with the exact same values y and z as above,
    all of the following statements work perfectly fine:

    x = '%s' % y
    x = '%s' % z
    print y
    print z
    print y, z

    TIA!

    ~K
     
    kj, Feb 10, 2010
    #1

  2. On Feb 10, 11:09 am, kj <> wrote:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >


    You'll have to understand some terminology first.

    "codec" is a description of how to encode and decode unicode data to a
    stream of bytes.

    "decode" means you are taking a series of bytes and converting it to
    unicode.

    "encode" is the opposite---take a unicode string and convert it to a
    stream of bytes.

    "ascii" is a codec that can only describe 0-127 with bytes 0-127.
    "utf-8", "utf-16", etc... are other codecs. There's a lot of them.
    Only some of them (ie, utf-8, utf-16) can encode all unicode. Most
    (ie, ascii) can only do a subset of unicode.

    In this case, you've fed the decoder a stream of bytes in which one
    of the bytes is 128 or greater (0xc2 is 194). Since the decoder
    thinks it's working with ascii, it doesn't know what to do with that
    byte. There are a number of ways to fix this:

    (1) Feed it unicode instead, so it doesn't try to decode it.

    (2) Tell it what encoding you are using, because it's obviously not
    ascii.
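
    For illustration, here's a minimal Python 2 session showing the same
    failure and the fix. The two bytes '\xc2\xa0' are just an arbitrary
    non-ASCII example (UTF-8 for a no-break space), not taken from your
    actual data:

    >>> raw = '\xc2\xa0'                  # a str (bytes); not valid ASCII
    >>> u'%s' % raw                       # mixing unicode and str triggers an implicit ASCII decode
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >>> u'%s' % raw.decode('utf-8')       # decode it yourself, naming the real encoding
    u'\xa0'
    >>> '%s' % raw                        # or keep *both* operands as str: no decode happens
    '\xc2\xa0'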

    >
    > FWIW, I'm using Python 2.6.  The example above happens to come from
    > a script that extracts data from HTML files, which are all in
    > English, but errors like this are a daily occurrence when I write code to
    > process non-English text.  The script uses Beautiful Soup.  I won't
    > post a lot of code because, as I said, what I'm after is not so
    > much a way around this specific error as much as the tools and
    > techniques to troubleshoot it and fix it on my own.  But to ground
    > the problem a bit I'll say that the exception above happens during
    > the execution of a statement of the form:
    >
    >   x = '%s %s' % (y, z)
    >
    > Also, I found that, with the exact same values y and z as above,
    > all of the following statements work perfectly fine:
    >
    >   x = '%s' % y
    >   x = '%s' % z
    >   print y
    >   print z
    >   print y, z
    >


    What are y and z? Are they unicode or strings? What are their values?

    It sounds like someone, probably Beautiful Soup, is trying to turn
    your strings into unicode. A full stack trace would be useful to see
    who did what where.
     
    Jonathan Gardner, Feb 10, 2010
    #2

  3. kj

    MRAB Guest

    kj wrote:
    >
    > Some people have mathphobia. I'm developing a wicked case of
    > Unicodephobia.
    >
    > I have read a *ton* of stuff on Unicode. It doesn't even seem all
    > that hard. Or so I think. Then I start writing code, and WHAM:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >
    > [snip]

    Decode all text input; encode all text output; do all text processing
    in Unicode, which also means making all text literals Unicode (prefixed
    with 'u').

    Note: I'm talking about when you're working with _text_, as distinct
    from when you're working with _binary data_, ie bytes.
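
    A minimal sketch of that pattern in Python 2 (the file names are made
    up, and I'm assuming the files happen to be UTF-8):

    import codecs

    f = codecs.open('input.html', 'r', encoding='utf-8')    # decode text input
    text = f.read()                  # 'text' is a unicode object from here on
    f.close()

    result = u'%s %s' % (u'id', text)     # do all processing on unicode objects

    out = codecs.open('output.txt', 'w', encoding='utf-8')  # encode text output
    out.write(result)
    out.close()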
     
    MRAB, Feb 10, 2010
    #3
  4. On Feb 10, 2:09 pm, kj <> wrote:
    > Some people have mathphobia.  I'm developing a wicked case of
    > Unicodephobia.
    > [snip]


    Some general advice (Looks like I am reiterating what MRAB said -- I
    type slower :):

    1. If possible, use unicode strings for everything. That is, don't
    use both str and unicode within the same project.

    2. If that isn't possible, convert strings to unicode as early as
    possible, work with them that way, then convert them back as late as
    possible.

    3. Know what type of string you are working with! If a function
    returns or accepts a string value, verify whether the expected type is
    unicode or str.

    4. Consider switching to Python 3.x, since there is only one string
    type (unicode).
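
    A quick sketch of points 2 and 3 in Python 2 (the values and the
    'utf-8' choice are just assumptions for illustration):

    y = u'mainTable'                    # already unicode
    z = '<th>\xc2\xa0Tags</th>'         # a str (bytes)

    for name, value in [('y', y), ('z', z)]:
        print name, type(value), repr(value)   # know what you have

    if isinstance(z, str):
        z = z.decode('utf-8')           # convert to unicode as early as possible

    x = u'%s %s' % (y, z)               # now both operands are unicode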

    --
     
    Anthony Tolle, Feb 10, 2010
    #4
  5. kj

    kj Guest

    In <> Jonathan Gardner <> writes:

    >On Feb 10, 11:09 am, kj <> wrote:
    >> FWIW, I'm using Python 2.6. The example above happens to come from
    >> a script that extracts data from HTML files, which are all in
    >> English, but errors like this are a daily occurrence when I write code to
    >> process non-English text. The script uses Beautiful Soup. I won't
    >> post a lot of code because, as I said, what I'm after is not so
    >> much a way around this specific error as much as the tools and
    >> techniques to troubleshoot it and fix it on my own. But to ground
    >> the problem a bit I'll say that the exception above happens during
    >> the execution of a statement of the form:
    >>
    >> x = '%s %s' % (y, z)
    >>
    >> Also, I found that, with the exact same values y and z as above,
    >> all of the following statements work perfectly fine:
    >>
    >> x = '%s' % y
    >> x = '%s' % z
    >> print y
    >> print z
    >> print y, z
    >>


    >What are y and z?


    x = "%s %s" % (table['id'], table.tr.renderContents())

    where the variable table represents a BeautifulSoup.Tag instance.

    >Are they unicode or strings?


    The first item (table['id']) is unicode, and the second is str.

    >What are their values?


    The only easy way I know to examine the values of these strings is
    to print them, which, I know, is very crude. (IOW, to answer this
    question usefully, in the context of this problem, more Unicode
    knowhow is needed than I have.) If I print them, the output for
    the first one on my screen is "mainTable", and for the second it is

    <th class="mainTableHeader" colspan="2"> Tags</th>
    <th class="mainTableHeader"> Id</th>


    >It sounds like someone, probably beautiful soup, is trying to turn
    >your strings into unicode. A full stacktrace would be useful to see
    >who did what where.


    Unfortunately, there's not much in the stacktrace:

    Traceback (most recent call last):
      File "./download_tt.py", line 427, in <module>
        x = "%s %s" % (table['id'], table.tr.renderContents())
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 41: ordinal not in range(128)

    (NB: the difference between this error message and the one I
    originally posted, namely the position of the unrecognized byte,
    is because I simplified the code for the purpose of posting it
    here, eliminating an additional processing step applied to the second
    entry of the tuple above.)

    ~K
     
    kj, Feb 10, 2010
    #5
  6. On Wed, 2010-02-10 at 12:17 -0800, Anthony Tolle wrote:
    > On Feb 10, 2:09 pm, kj <> wrote:
    > > Some people have mathphobia. I'm developing a wicked case of
    > > Unicodephobia.
    > > [snip]

    >
    > Some general advice (Looks like I am reiterating what MRAB said -- I
    > type slower :):
    >
    > 1. If possible, use unicode strings for everything. That is, don't
    > use both str and unicode within the same project.
    >
    > 2. If that isn't possible, convert strings to unicode as early as
    > possible, work with them that way, then convert them back as late as
    > possible.
    >
    > 3. Know what type of string you are working with! If a function
    > returns or accepts a string value, verify whether the expected type is
    > unicode or str.
    >
    > 4. Consider switching to Python 3.x, since there is only one string
    > type (unicode).


    Some further nasty gotchas:

    5. Be wary of the encoding of sys.stdout (and stderr/stdin), e.g. when
    issuing a "print" statement: they can change on Unix depending on
    whether the python process is directly connected to a tty or not.

    (a) If they're directly connected to a tty, their encoding is taken from
    the locale, UTF-8 on my machine:
    [david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
    αβγ
    (prints alpha, beta, gamma to terminal, though these characters might
    not survive being sent in this email)

    (b) If they're not (e.g. cronjob, daemon, within a shell pipeline, etc)
    their encoding is the default encoding, which is typically ascii;
    rerunning the same command, but piping into "cat":
    [david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | cat
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

    (c) These problems can lurk in sources and only manifest themselves
    during _deployment_ of code. You can set PYTHONIOENCODING=ascii in the
    environment to force (a) to behave like (b), so that your code will fail
    whilst you're _developing_ it, rather than on your servers at midnight:
    [david@brick ~]$ PYTHONIOENCODING=ascii python -c 'print u"\u03b1\u03b2\u03b3"'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

    (Given the above, it could be argued perhaps that one should never
    "print" unicode instances, and instead should write the data to
    file-like objects, specifying an encoding. Not sure).
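
    One way to do that explicitly (a sketch; UTF-8 is just an assumed
    choice of output encoding here):

    import sys, codecs

    out = codecs.getwriter('utf-8')(sys.stdout)   # wrap stdout with an explicit encoder
    out.write(u'\u03b1\u03b2\u03b3\n')            # encodes to UTF-8 whether stdout is a tty or a pipe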

    6. If you're using pygtk (specifically the "pango" module, typically
    implicitly imported), be warned that it abuses the C API to set the
    default encoding inside python, which probably breaks any unicode
    instances in memory at the time, and is likely to cause weird side
    effects:
    [david@brick ~]$ python
    Python 2.6.2 (r262:71600, Jan 25 2010, 13:22:47)
    [GCC 4.4.2 20100121 (Red Hat 4.4.2-28)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'
    >>> import pango
    >>> sys.getdefaultencoding()
    'utf-8'
    (the above is on Fedora 12, though I'd expect to see the same weirdness
    on any linux distro running gnome 2)

    Python 3 will probably make this all much easier; you'll still have to
    care about encodings when dealing with files/sockets/etc, but it should
    be much more clear what's going on. I hope.

    Hope this is helpful
    Dave
     
    David Malcolm, Feb 10, 2010
    #6
  7. kj

    kj Guest

    In <Xns9D1BCAD3B50E1duncanbooth@127.0.0.1> Duncan Booth <> writes:

    >kj <> wrote:


    >> But to ground
    >> the problem a bit I'll say that the exception above happens during
    >> the execution of a statement of the form:
    >>
    >> x = '%s %s' % (y, z)
    >>
    >> Also, I found that, with the exact same values y and z as above,
    >> all of the following statements work perfectly fine:
    >>
    >> x = '%s' % y
    >> x = '%s' % z
    >> print y
    >> print z
    >> print y, z
    >>


    >One of y or z is unicode, the other is str.


    Yes, that was the root of the problem.

    >1. Print the repr of each value so you can see which is which.


    Thanks for pointing out repr; it's really useful when dealing with
    Unicode headaches.
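
    For anyone hitting the same thing, here's roughly what that check
    looks like (the values below are reconstructed for illustration, not
    the actual data):

    >>> y = u'mainTable'
    >>> z = '<th class="mainTableHeader" colspan="2">\xc2\xa0Tags</th>'
    >>> print repr(y)           # the u'' prefix tells you it's unicode
    u'mainTable'
    >>> print repr(z)           # no prefix, and the raw bytes show up escaped
    '<th class="mainTableHeader" colspan="2">\xc2\xa0Tags</th>'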



    Thanks for all the replies!

    ~K
     
    kj, Feb 11, 2010
    #7
  8. kj

    mk Guest

    kj wrote:
    > I have read a *ton* of stuff on Unicode. It doesn't even seem all
    > that hard. Or so I think. Then I start writing code, and WHAM:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
    >
    > (There, see? My Unicodephobia just went up a notch.)
    >
    > Here's the thing: I don't even know how to *begin* debugging errors
    > like this. This is where I could use some help.


    >>> a = u'\u0104'
    >>> type(a)
    <type 'unicode'>
    >>> nu = a.encode('utf-8')
    >>> type(nu)
    <type 'str'>


    See what I mean? You encode INTO string, and decode OUT OF string.

    To make matters more complicated, str.encode() internally DECODES from
    string into unicode:

    >>> nu
    '\xc4\x84'
    >>> type(nu)
    <type 'str'>
    >>> nu.encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)

    There's logic to this, although it makes my brain want to explode. :)

    Regards,
    mk
     
    mk, Feb 11, 2010
    #8
  9. kj

    kj Guest

    In <> mk <> writes:

    >To make matters more complicated, str.encode() internally DECODES from
    >string into unicode:


    > >>> nu
    > '\xc4\x84'
    > >>> type(nu)
    > <type 'str'>
    > >>> nu.encode()
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)


    >There's logic to this, although it makes my brain want to explode. :)



    Thanks for pointing this one out! It could have easily pushed my
    Unicodephobia into the incurable zone...

    ~K
     
    kj, Feb 11, 2010
    #9
  10. kj

    MRAB Guest

    mk wrote:
    > kj wrote:
    >> I have read a *ton* of stuff on Unicode. It doesn't even seem all
    >> that hard. Or so I think. Then I start writing code, and WHAM:
    >>
    >> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
    >> 0: ordinal not in range(128)
    >>
    >> (There, see? My Unicodephobia just went up a notch.)
    >>
    >> Here's the thing: I don't even know how to *begin* debugging errors
    >> like this. This is where I could use some help.

    >
    > >>> a = u'\u0104'
    > >>> type(a)
    > <type 'unicode'>
    > >>> nu = a.encode('utf-8')
    > >>> type(nu)
    > <type 'str'>
    >
    >
    > See what I mean? You encode INTO string, and decode OUT OF string.
    >

    Traditionally strings were strings of byte-sized characters. Because they
    were byte-sized they could also be used to contain binary data.

    Then along came Unicode.

    When working with Unicode in Python 2, you should use the 'unicode' type
    for text (Unicode strings) and limit the 'str' type to binary data
    (bytestrings, ie bytes) only.

    In Python 3 they've been renamed to 'str' for Unicode _strings_ and
    'bytes' for binary data (bytes!).

    > To make matters more complicated, str.encode() internally DECODES from
    > string into unicode:
    >
    > >>> nu
    > '\xc4\x84'
    > >>> type(nu)
    > <type 'str'>
    > >>> nu.encode()
    > Traceback (most recent call last):
    >   File "<stdin>", line 1, in <module>
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
    >
    > There's logic to this, although it makes my brain want to explode. :)
    >

    Strictly speaking, only Unicode can be encoded.

    What Python 2 is doing here is trying to be helpful: if it's already a
    bytestring then decode it first to Unicode and then re-encode it to a
    bytestring.

    Unfortunately, the default encoding is ASCII, and the bytestring isn't
    valid ASCII. Python 2 is being 'helpful' in a bad way!
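
    Spelled out, the implicit step and the explicit, working alternative
    look like this (same UTF-8 bytestring as in the session above):

    >>> nu = '\xc4\x84'                      # UTF-8 bytes for U+0104
    >>> nu.encode('utf-8')                   # implicitly does nu.decode('ascii') first, and fails
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
    >>> nu.decode('utf-8').encode('utf-8')   # decode with the right codec, then re-encode
    '\xc4\x84'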
     
    MRAB, Feb 11, 2010
    #10
  11. kj

    mk Guest

    MRAB wrote:

    > When working with Unicode in Python 2, you should use the 'unicode' type
    > for text (Unicode strings) and limit the 'str' type to binary data
    > (bytestrings, ie bytes) only.


    Well OK, always use u'something', that's simple -- but isn't str what I
    get from files and sockets and the like?

    > In Python 3 they've been renamed to 'str' for Unicode _strings_ and
    > 'bytes' for binary data (bytes!).


    Neat, except that the process of porting most projects and external
    libraries to P3 seems to be, how should I put it, standing still? Or am
    I wrong? But that's the impression I get?

    Take web frameworks for example. Does any of them have serious plans and
    work in place to port to P3?

    > Strictly speaking, only Unicode can be encoded.


    How so? Can't bytestrings containing characters of, say, koi8r encoding
    be encoded?

    > What Python 2 is doing here is trying to be helpful: if it's already a
    > bytestring then decode it first to Unicode and then re-encode it to a
    > bytestring.


    It's really cumbersome sometimes, even if two libraries are written by
    one author: for instance, Mako and SQLAlchemy are written by the same
    guy. They are both top-of-the line in my humble opinion, but when you
    connect them you get things like this:

    1. you query SQLAlchemy object, that happens to have string fields in
    relational DB.

    2. Corresponding Python attributes of those objects then have type str,
    not unicode.

    3. then I pass those objects to Mako for HTML rendering.

    Typically, it works: but only if no character in there happens to be
    outside the ASCII range. If one is, you get a UnicodeDecodeError
    thrown at an unsuspecting user.

    Sure, I wrote myself a helper that iterates over the keyword dictionary
    to convert all str values to unicode and only then passes the
    dictionary to render_unicode. It's an overhead, though. It would be
    nicer to get it all as unicode from the db and then just pass it on for
    rendering and have it work. (Unless there's something in the filters
    that I missed; there's encoding of templates and tags, but I didn't
    find anything about automatic conversion of the objects passed to the
    template-rendering method.)
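
    Roughly this kind of thing, for what it's worth (the helper name is
    made up, and 'utf-8' is just the encoding I'm assuming for the db
    bytes):

    def unicodify(kwargs, encoding='utf-8'):
        """Return a copy of kwargs with all str values decoded to unicode."""
        out = {}
        for key, value in kwargs.items():
            if isinstance(value, str):
                value = value.decode(encoding)
            out[key] = value
        return out

    # template.render_unicode(**unicodify(fields))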

    But maybe I'm whining.


    > Unfortunately, the default encoding is ASCII, and the bytestring isn't
    > valid ASCII. Python 2 is being 'helpful' in a bad way!


    And the default encoding is set up in such a way that it cannot be
    changed in sitecustomize (without modifying the code, that is).

    Regards,
    mk
     
    mk, Feb 11, 2010
    #11
  12. kj

    Robert Kern Guest

    On 2010-02-11 15:43 PM, mk wrote:
    > MRAB wrote:


    >> Strictly speaking, only Unicode can be encoded.

    >
    > How so? Can't bytestrings containing characters of, say, koi8r encoding
    > be encoded?


    I think he means that only unicode objects can be encoded using the .encode()
    method, as clarified by his next sentence:

    >> What Python 2 is doing here is trying to be helpful: if it's already a
    >> bytestring then decode it first to Unicode and then re-encode it to a
    >> bytestring.


    --
    Robert Kern

    "I have come to believe that the whole world is an enigma, a harmless enigma
    that is made terrible by our own mad attempt to interpret it as though it had
    an underlying truth."
    -- Umberto Eco
     
    Robert Kern, Feb 11, 2010
    #12
  13. kj

    Steve Holden Guest

    mk wrote:
    > MRAB wrote:
    >
    >> When working with Unicode in Python 2, you should use the 'unicode' type
    >> for text (Unicode strings) and limit the 'str' type to binary data
    >> (bytestrings, ie bytes) only.

    >
    > Well OK, always use u'something', that's simple -- but isn't str what I
    > get from files and sockets and the like?
    >

    Yes, which is why you need to know what encoding was used to create it.

    >> In Python 3 they've been renamed to 'str' for Unicode _strings_ and
    >> 'bytes' for binary data (bytes!).

    >
    > Neat, except that the process of porting most projects and external
    > libraries to P3 seems to be, how should I put it, standing still? Or am
    > I wrong? But that's the impression I get?
    >

    No, it's probably not going as quickly as you would like, but it's
    certainly not standing still. Some of these libraries are substantial
    works, and there were changes to the C API that take quite a bit of work
    to adapt existing code to.

    > Take web frameworks for example. Does any of them have serious plans and
    > work in place to port to P3?
    >

    There have already been demonstrations of partially-working Python 3
    Django. I can't speak to the rest.

    >> Strictly speaking, only Unicode can be encoded.

    >
    > How so? Can't bytestrings containing characters of, say, koi8r encoding
    > be encoded?
    >

    It's just terminology. If a bytestring contains koi8r characters then
    (as you unconsciously recognized by your use of the word "encoding") it
    already *has* been encoded.

    >> What Python 2 is doing here is trying to be helpful: if it's already a
    >> bytestring then decode it first to Unicode and then re-encode it to a
    >> bytestring.

    >
    > It's really cumbersome sometimes, even if two libraries are written by
    > one author: for instance, Mako and SQLAlchemy are written by the same
    > guy. They are both top-of-the line in my humble opinion, but when you
    > connect them you get things like this:
    >
    > 1. you query SQLAlchemy object, that happens to have string fields in
    > relational DB.
    >
    > 2. Corresponding Python attributes of those objects then have type str,
    > not unicode.
    >

    Yes, a relational database will often return ASCII, but nowadays people
    are increasingly using encoded Unicode. In that case you need to be
    aware of the encoding that has been used to render the Unicode values
    into the byte strings (which in Python 2 are of type str) so that you
    can decode them into Unicode.

    > 3. then I pass those objects to Mako for HTML rendering.
    >
    > Typically, it works: but only if no character in there happens to be
    > outside the ASCII range. If one is, you get a UnicodeDecodeError
    > thrown at an unsuspecting user.
    >

    Well first you need to be clear what you are passing to Mako.

    > Sure, I wrote myself a helper that iterates over the keyword dictionary
    > to convert all str values to unicode and only then passes the
    > dictionary to render_unicode. It's an overhead, though. It would be
    > nicer to get it all as unicode from the db and then just pass it on for
    > rendering and have it work. (Unless there's something in the filters
    > that I missed; there's encoding of templates and tags, but I didn't
    > find anything about automatic conversion of the objects passed to the
    > template-rendering method.)
    >

    Some database modules will distinguish between fields of type varchar
    and nvarchar, returning Unicode objects for the latter. You will need to
    ensure that the module knows which encoding is used in the database.
    This is usually automatic.

    > But maybe I'm whining.
    >

    Nope, just struggling with a topic that is far from straightforward the
    first time you encounter it.
    >
    >> Unfortunately, the default encoding is ASCII, and the bytestring isn't
    >> valid ASCII. Python 2 is being 'helpful' in a bad way!

    >
    > And the default encoding is set up in such a way that it cannot be
    > changed in sitecustomize (without modifying the code, that is).
    >

    Yes, the default encoding is not always convenient.

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    PyCon is coming! Atlanta, Feb 2010 http://us.pycon.org/
    Holden Web LLC http://www.holdenweb.com/
    UPCOMING EVENTS: http://holdenweb.eventbrite.com/
     
    Steve Holden, Feb 11, 2010
    #13
  14. kj

    Terry Reedy Guest

    On 2/11/2010 4:43 PM, mk wrote:

    > Neat, except that the process of porting most projects and external
    > libraries to P3 seems to be, how should I put it, standing still?


    What is important are the libraries, so that more new projects can start
    in 3.x. There is a slow trickle of 3.x support announcements.

    > But maybe I'm whining.


    Or perhaps explaining why 3.x unicode improvements are needed.

    tjr
     
    Terry Reedy, Feb 11, 2010
    #14
  15. kj

    Nobody Guest

    On Wed, 10 Feb 2010 12:17:51 -0800, Anthony Tolle wrote:

    > 4. Consider switching to Python 3.x, since there is only one string
    > type (unicode).


    However, one drawback of Python 3.x is that the repr() of a Unicode string
    is no longer restricted to ASCII. There is an ascii() function which
    behaves like the 2.x repr(), but the interpreter uses repr() for
    displaying the value of an expression typed at the interactive prompt,
    which results in "can't encode" errors if the string cannot be converted
    to your locale's encoding.
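
    For example (a Python 3 session; the Greek letters are the same ones
    used earlier in the thread):

    >>> s = '\u03b1\u03b2\u03b3'
    >>> s                   # interactive echo uses repr(), which may emit non-ASCII
    'αβγ'
    >>> print(ascii(s))     # 2.x-style escaped form, safe on any terminal
    '\u03b1\u03b2\u03b3'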
     
    Nobody, Feb 12, 2010
    #15
  16. kj

    John Nagle Guest

    kj wrote:

    >>> x = '%s' % y
    >>> x = '%s' % z
    >>> print y
    >>> print z
    >>> print y, z


    Bear in mind that most Python implementations assume the "console"
    only handles ASCII. So "print" output is converted to ASCII, which
    can fail. (Actually, all modern Windows and Linux systems support
    Unicode consoles, but Python somehow doesn't get this.)

    John Nagle
     
    John Nagle, Feb 13, 2010
    #16
  17. kj

    John Nagle Guest

    kj wrote:
    > Some people have mathphobia. I'm developing a wicked case of
    > Unicodephobia.
    >
    > I have read a *ton* of stuff on Unicode. It doesn't even seem all
    > that hard. Or so I think. Then I start writing code, and WHAM:
    >
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)


    First, you haven't told us what platform you're on. Windows? Linux?
    Something else?

    If you're on Windows, and running Python from the command line, try
    "cmd /u" before running Python. This will get you a Windows console that
    will print Unicode. Python recognizes this, and "print" calls will
    go out to the console in Unicode, which will then print the correct
    characters if they're in the font being used by the Windows console.
    Most European languages are covered in the standard font.

    If you're using IDLE, or some Python debugger, it may need to be
    told to have its window use Unicode.

    John Nagle
     
    John Nagle, Feb 13, 2010
    #17
