Unicode and exception strings

Discussion in 'Python' started by Rune Froysa, Jan 9, 2004.

  1. Rune Froysa

    Rune Froysa Guest

    Assuming an exception like:

    x = ValueError(u'\xf8')

    AFAIK the common way to get a string representation of the exception
    as a message is to simply cast it to a string: str(x). This will
    result in an "UnicodeError: ASCII encoding error: ordinal not in
    range(128)".

    The common way to fix this is with something like
    u'\xf8'.encode("ascii", 'replace'). However I can't find any way to
    tell ValueError's __str__ method which encoding to use.

    Is it possible to solve this without using sys.setdefaultencoding()
    from sitecustomize?
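
    To illustrate the workaround mentioned above, here is a minimal sketch of what the different error handlers do when a non-ASCII message is forced into ASCII. The variable names are made up for the example; the u'' prefix is kept to match the notation used in this thread (Python 3.3 and later accept it as well).

```python
# Illustrative only: error-handler choices when encoding text to ASCII.
message = u'\xf8'  # the character from the example exception

# 'replace' substitutes a question mark for anything ASCII cannot represent.
replaced = message.encode('ascii', 'replace')

# 'backslashreplace' keeps the code point visible as an escape sequence.
escaped = message.encode('ascii', 'backslashreplace')

# 'ignore' silently drops the character.
dropped = message.encode('ascii', 'ignore')

print(replaced, escaped, dropped)
```

    For logging, 'backslashreplace' is probably the most useful of the three, since the original character can still be recovered from the output.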

    Regards,
    Rune Frøysa
     
    Rune Froysa, Jan 9, 2004
    #1

  2. On 09 Jan 2004 13:18:39 +0100, Rune Froysa wrote:

    >Assuming an exception like:
    >
    > x = ValueError(u'\xf8')
    >
    >AFAIK the common way to get a string representation of the exception
    >as a message is to simply cast it to a string: str(x). This will
    >result in an "UnicodeError: ASCII encoding error: ordinal not in
    >range(128)".
    >
    >The common way to fix this is with something like
    >u'\xf8'.encode("ascii", 'replace'). However I can't find any way to
    >tell ValueError's __str__ method which encoding to use.


    Rune, I'm not understanding what your problem is.

    Is there any reason you're not using, for example, just repr(u'\xf8')?

    In one program I have that occasionally runs into a line that includes
    some (UTF-8) Unicode-encoded Chinese characters, I have something like
    this:

    try:
        _display_text = _display_text + "%s\n" % line
    except UnicodeDecodeError:
        try:
            # decode those UTF8 nasties
            _display_text = _display_text + "%s\n" % line.decode('utf-8'))
        except UnicodeDecodeError:
            # if that still doesn't work, punt
            # (I don't think we'll ever reach this, but just in case)
            _display_text = _display_text + "%s\n" % repr(line)

    I don't know if this will help you or not.
     
    Terry Carroll, Jan 9, 2004
    #2

  3. On Fri, 09 Jan 2004 19:44:21 GMT, Terry Carroll wrote:

    >In one program I have that occasionally runs into a line that includes
    >some (UTF-8) Unicode-encoded Chinese characters, I have something like
    >this:


    Sorry, a stray parenthesis crept in here (since this is a pared down
    version of my actual code). It should read:


    try:
        _display_text = _display_text + "%s\n" % line
    except UnicodeDecodeError:
        try:
            # decode those UTF8 nasties
            _display_text = _display_text + "%s\n" % line.decode('utf-8')
        except UnicodeDecodeError:
            # if that still doesn't work, punt
            # (I don't think we'll ever reach this, but just in case)
            _display_text = _display_text + "%s\n" % repr(line)
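
    The same fallback chain can be wrapped in a small helper. This is only a sketch: the function name to_display_text is hypothetical, and it assumes the input is either a byte string or already-decoded text.

```python
def to_display_text(line):
    """Return text suitable for display.

    `line` may be a byte string or already-decoded text (an assumption
    made for this sketch, not something stated in the original program).
    """
    if not isinstance(line, bytes):
        # Already text: nothing to decode.
        return line
    # Try plain ASCII first, then UTF-8, mirroring the nested try blocks.
    for encoding in ('ascii', 'utf-8'):
        try:
            return line.decode(encoding)
        except UnicodeDecodeError:
            continue
    # If that still doesn't work, punt and show the escaped form.
    return repr(line)


print(to_display_text(b'plain'))
```

    Looping over candidate encodings keeps the fallback order in one place, so adding another encoding later does not require another level of nesting.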

    > I don't know if this will help you or not.
     
    Terry Carroll, Jan 9, 2004
    #3
  4. Rune Froysa

    Rune Froysa Guest

    Terry Carroll writes:

    > On 09 Jan 2004 13:18:39 +0100, Rune Froysa wrote:
    >
    > >Assuming an exception like:
    > >
    > > x = ValueError(u'\xf8')
    > >
    > >AFAIK the common way to get a string representation of the exception
    > >as a message is to simply cast it to a string: str(x). This will
    > >result in an "UnicodeError: ASCII encoding error: ordinal not in
    > >range(128)".
    > >
    > >The common way to fix this is with something like
    > >u'\xf8'.encode("ascii", 'replace'). However I can't find any way to
    > >tell ValueError's __str__ method which encoding to use.

    >
    > Rune, I'm not understanding what your problem is.
    >
    > Is there any reason you're not using, for example, just repr(u'\xf8')?


    The problem is that I have little control over the message string that
    is passed to ValueError(). All my program knows is that it has caught
    one such error, and that its message string is in unicode format. I
    need to access the message string (for logging etc.).

    > _display_text = _display_text + "%s\n" % line.decode('utf-8'))


    This does not work, as I'm unable to get at the 'line', which is
    stored internally in the ValueError class (and generated by its __str__
    method).

    Regards,
    Rune Frøysa
     
    Rune Froysa, Jan 12, 2004
    #4
  5. On 12 Jan 2004 08:41:43 +0100, Rune Froysa wrote:

    >Terry Carroll writes:
    >
    >> On 09 Jan 2004 13:18:39 +0100, Rune Froysa wrote:
    >>
    >> >Assuming an exception like:
    >> >
    >> > x = ValueError(u'\xf8')
    >> >
    >> >AFAIK the common way to get a string representation of the exception
    >> >as a message is to simply cast it to a string: str(x). This will
    >> >result in an "UnicodeError: ASCII encoding error: ordinal not in
    >> >range(128)".
    >> >
    >> >The common way to fix this is with something like
    >> >u'\xf8'.encode("ascii", 'replace'). However I can't find any way to
    >> >tell ValueError's __str__ method which encoding to use.

    >>
    >> Rune, I'm not understanding what your problem is.
    >>
    >> Is there any reason you're not using, for example, just repr(u'\xf8')?

    >
    >The problem is that I have little control over the message string that
    >is passed to ValueError(). All my program knows is that it has caught
    >one such error, and that its message string is in unicode format. I
    >need to access the message string (for logging etc.).
    >
    >> _display_text = _display_text + "%s\n" % line.decode('utf-8'))

    >
    >This does not work, as I'm unable to get at the 'line', which is
    >stored internally in the ValueError class (and generated by its __str__
    >method).


    You should be able to get at it via x.args[0]:

    >>> x = ValueError(u'\xf8')
    >>> x.args[0]
    u'\xf8'

    The only thing is, what to do with it once you get there. I don't think
    0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
    multibyte character.

    You can try to extract it as above, and then decode it with the codecs
    module, but if it's only the first byte, it won't decode correctly:

    >>> import codecs
    >>> d = codecs.getdecoder('utf-8')
    >>> x.args[0]
    u'\xf8'
    >>> d.decode(x.args[0])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    AttributeError: 'builtin_function_or_method' object has no attribute 'decode'


    But, still, if all you want is to have *something* to print out explaining
    the exception, you can use repr():

    >>> repr(x.args[0])
    "u'\\xf8'"


    Is this helping any, or am I just flailing around?
     
    Terry Carroll, Jan 14, 2004
    #5
  6. Terry Carroll wrote in message ...
    >On 12 Jan 2004 08:41:43 +0100, Rune Froysa wrote:
    >The only thing is, what to do with it once you get there. I don't think
    >0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
    >multibyte character.


    Yes, about that.

    What are the semantics of hexadecimal literals in unicode literals? It
    seems to me that it is meaningless, if not dangerous, to allow hexadecimal
    literals in unicode. What code point would it correspond to?

    Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> u'\xf8\u00f8'.encode('unicode-internal')
    '\xf8\x00\xf8\x00'

    I get the same on linux with Python 2.2.1, x86.

    So, is a hexadecimal literal a shorthand for \u00XX, i.e., unicode code
    point XX? Or does it bypass the code point abstraction entirely, preserving
    the raw bits unchanged for any encoding of the unicode string (thus
    rendering unicode useless)?
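
    A quick check along the following lines points to the first reading: inside a unicode literal, \xXX appears to name the code point U+00XX rather than a raw byte. The variable names are only illustrative.

```python
# Compare the \x and \u escape forms inside a unicode literal.
s = u'\xf8'

same_as_u_escape = (s == u'\u00f8')   # both escapes name the same character
code_point = ord(s)                   # numeric code point of that character

# The bytes only appear once an encoding is chosen, and they differ per encoding.
utf8_bytes = s.encode('utf-8')
latin1_bytes = s.encode('latin-1')

print(same_as_u_escape, hex(code_point), utf8_bytes, latin1_bytes)
```

    In other words, the literal stays a one-character string, and the byte 0xF8 shows up only in the Latin-1 encoding of it.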

    Once again, I don't see why hexadecimal literals should be allowed at all,
    except maybe for compatibility when moving to Python -U behavior. But I
    submit that all such code is broken, and should be fixed. If you're using
    hexadecimal literals, what you have is not a unicode string but a byte
    sequence.

    This whole unicode/bytestring mess is going to have to be sorted out
    eventually. It seems to me that it would be best to have all bare string
    literals be unicode objects (henceforth called 'str' or 'string' objects?),
    drop the unicode literal, and make a new type and literal prefix for byte
    sequences, possibly dropping the traditional str methods or absorbing more
    appropriate ones. Perhaps some struct functionality could be folded in?

    Of course, this breaks absolutely everything.

    --
    Francis Avila
     
    Francis Avila, Jan 14, 2004
    #6
  7. Rune Froysa

    Rune Froysa Guest

    Terry Carroll writes:

    > On 12 Jan 2004 08:41:43 +0100, Rune Froysa wrote:
    >
    > >Terry Carroll writes:
    > >
    > >> On 09 Jan 2004 13:18:39 +0100, Rune Froysa wrote:
    > >>
    > >> >Assuming an exception like:
    > >> >
    > >> > x = ValueError(u'\xf8')
    > >> >
    > >> >AFAIK the common way to get a string representation of the exception
    > >> >as a message is to simply cast it to a string: str(x). This will
    > >> >result in an "UnicodeError: ASCII encoding error: ordinal not in
    > >> >range(128)".

    ....
    > >>> x = ValueError(u'\xf8')
    > >>> x.args[0]

    > u'\xf8'


    I was aware of the args variable in Exception, though I could not find
    any documentation for its usage, thus I wanted to rely on its internal
    __str__ method, rather than constructing the message myself. But,
    after a quick look at Python/exceptions.c, it seems that this is a
    feasible way :)

    > The only thing is, what to do with it once you get there. I don't think
    > 0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
    > multibyte character.


    Python gives me this, so I think it is correct:
    >>> unicode('ø', 'latin-1')
    u'\xf8'

    For my usage, "u'\xf8'.encode('latin-1', 'replace')" is sufficient.
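
    Put together, the approach discussed in this thread might look like the sketch below. The helper name exception_message and its 'latin-1' default are assumptions for illustration; it reads exc.args directly instead of calling str(exc), so the encoding and error handler are chosen explicitly.

```python
def exception_message(exc, encoding='latin-1'):
    """Build a loggable byte string from an exception's args.

    Hypothetical helper: it avoids str(exc) and encodes the message
    with an explicit encoding, replacing anything that does not fit.
    """
    text_type = type(u'')
    parts = []
    for arg in exc.args:
        # Text arguments are used as-is; anything else goes through repr().
        parts.append(arg if isinstance(arg, text_type) else repr(arg))
    return u', '.join(parts).encode(encoding, 'replace')


try:
    raise ValueError(u'\xf8')
except ValueError as caught:
    logged = exception_message(caught)

print(logged)
```

    Characters outside the chosen encoding come out as '?', which loses information but never raises while an error is already being handled.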

    > Is this helping any, or am I just flailing around?


    It does, thanks a lot for your help.

    Regards,
    Rune Frøysa
     
    Rune Froysa, Jan 14, 2004
    #7
  8. On Wed, 14 Jan 2004 01:32:36 GMT, Terry Carroll wrote:

    >You can try to extract it as above, and then decode it with the codecs
    >module, but if it's only the first byte, it won't decode correctly:
    >
    >>>> import codecs
    >>>> d = codecs.getdecoder('utf-8')
    >>>> x.args[0]

    >u'\xf8'
    >>>> d.decode(x.args[0])

    >Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    >AttributeError: 'builtin_function_or_method' object has no attribute
    >'decode'
    >>>>


    Oops. Copy-and-pasted the wrong line here. Let's try that again:

    >>> x = ValueError(u'\xf8')
    >>> import codecs
    >>> d = codecs.getdecoder('utf-8')
    >>> d(x.args[0])
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)


    *That's* the exception I was trying to show, not the AttributeError you
    get when you use the decoder wrongly!
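
    For comparison, here is a sketch of how the decoder behaves when it is handed bytes rather than a unicode object. The function returned by codecs.getdecoder() is called directly and returns a (text, bytes_consumed) pair; the names decode_utf8 and lone_byte_ok are only illustrative.

```python
import codecs

# getdecoder() returns a plain function, not an object with a .decode method.
decode_utf8 = codecs.getdecoder('utf-8')

# The two-byte UTF-8 sequence for U+00F8 decodes to a single character.
text, consumed = decode_utf8(b'\xc3\xb8')

# A lone 0xF8 byte is not valid UTF-8, so decoding it is rejected.
try:
    decode_utf8(b'\xf8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

print(repr(text), consumed, lone_byte_ok)
```

    The UnicodeEncodeError in the session above presumably arises because the decoder was given a unicode object, which Python 2 first tries to turn into bytes with the default ASCII codec before any UTF-8 decoding happens.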
     
    Terry Carroll, Jan 14, 2004
    #8
