Unicode and exception strings

Rune Froysa · Jan 9, 2004

Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in an "UnicodeError: ASCII encoding error: ordinal not in
range(128)".

The common way to fix this is with something like
u'\xf8'.encode("ascii", 'replace'). However, I can't find any way to
tell ValueError's __str__ method which encoding to use.

Is it possible to solve this without using sys.setdefaultencoding()
from sitecustomize?

Regards,
Rune Frøysa

Terry Carroll · Jan 9, 2004

Rune Froysa said:
Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in an "UnicodeError: ASCII encoding error: ordinal not in
range(128)".

The common way to fix this is with something like
u'\xf8'.encode("ascii", 'replace'). However, I can't find any way to
tell ValueError's __str__ method which encoding to use.

Rune, I'm not understanding what your problem is.

Is there any reason you're not using, for example, just repr(u'\xf8')?

In one program I have that occasionally runs into a line that includes
some (UTF-8) Unicode-encoded Chinese characters, I have something like
this:

try:
    _display_text = _display_text + "%s\n" % line
except UnicodeDecodeError:
    try:
        # decode those UTF8 nasties
        _display_text = _display_text + "%s\n" % line.decode('utf-8'))
    except UnicodeDecodeError:
        # if that still doesn't work, punt
        # (I don't think we'll ever reach this, but just in case)
        _display_text = _display_text + "%s\n" % repr(line)

I don't know if this will help you or not.

Terry Carroll · Jan 9, 2004

Terry Carroll said:
In one program I have that occasionally runs into a line that includes
some (UTF-8) Unicode-encoded Chinese characters, I have something like
this:

Sorry, a stray parenthesis crept in here (since this is a pared down
version of my actual code). It should read:

try:
    _display_text = _display_text + "%s\n" % line
except UnicodeDecodeError:
    try:
        # decode those UTF8 nasties
        _display_text = _display_text + "%s\n" % line.decode('utf-8')
    except UnicodeDecodeError:
        # if that still doesn't work, punt
        # (I don't think we'll ever reach this, but just in case)
        _display_text = _display_text + "%s\n" % repr(line)
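
The same fallback idea can be sketched as a small standalone helper. This is only an illustration under stated assumptions: the name to_display_text is hypothetical and not taken from the program above, each line is assumed to arrive as a byte string, and the code is written so that it also runs on Python 3, where byte strings and text are never mixed implicitly.

```python
def to_display_text(line):
    """Return text for a raw byte-string line (hypothetical helper).

    Falls back to repr() when the bytes are not valid UTF-8.
    """
    try:
        # ASCII is a subset of UTF-8, so plain ASCII lines decode here too.
        return line.decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: punt and show the escaped bytes instead.
        return repr(line)


# Example input: an ASCII line, a UTF-8 encoded Chinese line, invalid bytes.
display_text = u''
for raw in [b'plain ascii', b'\xe4\xb8\xad\xe6\x96\x87', b'\xff\xfe']:
    display_text += to_display_text(raw) + u'\n'
```

Decoding explicitly up front, rather than waiting for a UnicodeDecodeError during string formatting, keeps the fallback in one place.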

Rune Froysa · Jan 12, 2004

Terry Carroll said:
Rune, I'm not understanding what your problem is.

Is there any reason you're not using, for example, just repr(u'\xf8')?

The problem is that I have little control over the message string that
is passed to ValueError(). All my program knows is that it has caught
one such error, and that its message string is in unicode format. I
need to access the message string (for logging etc.).

_display_text = _display_text + "%s\n" % line.decode('utf-8'))

This does not work, as I'm unable to get at the 'line', which is
stored internally in the ValueError class (and generated by its __str__
method).

Regards,
Rune Frøysa

Terry Carroll · Jan 14, 2004

Rune Froysa said:
The problem is that I have little control over the message string that
is passed to ValueError(). All my program knows is that it has caught
one such error, and that its message string is in unicode format. I
need to access the message string (for logging etc.).

This does not work, as I'm unable to get at the 'line', which is
stored internally in the ValueError class (and generated by its __str__
method).

You should be able to get at it via x.args[0]:

>>> x = ValueError(u'\xf8')
>>> x.args[0]
u'\xf8'

The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

You can try to extract it as above, and then decode it with the codecs
module, but if it's only the first byte, it won't decode correctly:

>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> x.args[0]
u'\xf8'
>>> d.decode(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

But, still, if all you want is to have *something* to print out explaining
the exception, you can use repr():

>>> repr(x.args[0])
"u'\\xf8'"

Is this helping any, or am I just flailing around?

Francis Avila · Jan 14, 2004

Terry Carroll wrote in message ...

On 12 Jan 2004 08:41:43 +0100, Rune Froysa wrote:
The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

Yes, about that.

What are the semantics of hexadecimal literals in unicode literals? It
seems to me that it is meaningless, if not dangerous, to allow hexadecimal
literals in unicode. What code point would it correspond to?

Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
'\xf8\x00\xf8\x00'

I get the same on linux with Python 2.2.1, x86.

So, is a hexadecimal literal a shorthand for \u00XX, i.e., unicode code
point XX? Or does it bypass the code point abstraction entirely, preserving
the raw bits unchanged for any encoding of the unicode string (thus
rendering unicode useless)?

Once again, I don't see why hexadecimal literals should be allowed at all,
except maybe for compatibility when moving to Python -U behavior. But I
submit that all such code is broken, and should be fixed. If you're using
hexadecimal literals, what you have is not a unicode string but a byte
sequence.
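
On the question of what \xf8 means inside a unicode literal: to the best of my knowledge the escape names the code point with that number, so it behaves like \u00f8 rather than like a raw byte, and the actual bytes only appear once an encoding is chosen. A small check along those lines, written with the u prefix so it should run on Python 2 as well as Python 3 (the variable names are just for illustration):

```python
s = u'\xf8'

# The \xhh escape in a unicode literal denotes code point U+00hh.
same_as_u_escape = (s == u'\u00f8')
code_point = ord(s)

# The byte representation depends entirely on the chosen encoding.
latin1_bytes = s.encode('latin-1')            # one byte: F8
utf8_bytes = s.encode('utf-8')                # two bytes: C3 B8
utf16le_bytes = (s + s).encode('utf-16-le')   # two code units: F8 00 F8 00
```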

This whole unicode/bytestring mess is going to have to be sorted out
eventually. It seems to me that it would be best to have all bare string
literals be unicode objects (henceforth called 'str' or 'string' objects?),
drop the unicode literal, and make a new type and literal prefix for byte
sequences, possibly dropping the traditional str methods or absorbing more
appropriate ones. Perhaps some struct functionality could be folded in?

Of course, this breaks absolutely everything.

Rune Froysa · Jan 14, 2004

Terry Carroll said:

On 09 Jan 2004 13:18:39 +0100, Rune Froysa wrote:

Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in an "UnicodeError: ASCII encoding error: ordinal not in
range(128)". ....

>>> x = ValueError(u'\xf8')
>>> x.args[0]
u'\xf8'

I was aware of the args variable in Exception, though I could not find
any documentation for its usage, thus I wanted to rely on its internal
__str__ method, rather than constructing the message myself. But,
after a quick look at Python/exceptions.c, it seems that this is a
feasible approach.

The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

Python gives me this, so I think it is correct: u'\xf8'

For my usage, "u'\xf8'.encode('latin-1', 'replace')" is sufficient.
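
A minimal sketch of that approach for logging, assuming the message is the first element of args. The helper name exception_message and its encoding parameter are hypothetical, chosen only for illustration; the code is written to run on Python 3 as well as later 2.x releases (the "except ... as" syntax needs Python 2.6 or newer).

```python
def exception_message(exc, encoding='latin-1'):
    """Return the first exception argument as a byte string.

    Hypothetical helper: it reads exc.args directly instead of calling
    str(exc), then encodes explicitly so that no implicit ASCII
    conversion takes place.
    """
    if not exc.args:
        return b''
    message = exc.args[0]
    if isinstance(message, bytes):
        return message                    # already a byte string
    if not isinstance(message, type(u'')):
        message = u'%r' % (message,)      # non-string argument: use its repr
    # Characters the target encoding cannot represent become '?'.
    return message.encode(encoding, 'replace')


try:
    raise ValueError(u'\xf8')
except ValueError as err:
    logged = exception_message(err)                  # Latin-1 has U+00F8
    logged_ascii = exception_message(err, 'ascii')   # ASCII does not
```

Passing 'replace' as the error handler means the logging path itself cannot raise a UnicodeEncodeError, at the cost of losing characters the target encoding lacks.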

Is this helping any, or am I just flailing around?

It does, thanks a lot for your help.

Regards,
Rune Frøysa

Terry Carroll · Jan 14, 2004

Terry Carroll said:
You can try to extract it as above, and then decode it with the codecs
module, but if it's only the first byte, it won't decode correctly:

>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> x.args[0]
u'\xf8'
>>> d.decode(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

Oops. Copy-and-pasted the wrong line here. Let's try that again:

>>> x = ValueError(u'\xf8')
>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> d(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)

*That's* the exception I was trying to show, not the AttributeError you
get when you use the decoder wrongly!
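
For comparison, a hedged sketch of what the decoder is meant to be fed. As far as I understand it, the function returned by codecs.getdecoder takes a byte string and returns a (text, bytes_consumed) pair; u'\xf8' is already decoded text, so the error above most likely comes from Python 2 first trying to turn it into bytes with the ASCII codec. The variable names here are illustrative only.

```python
import codecs

decode_utf8 = codecs.getdecoder('utf-8')

# U+00F8 encoded as UTF-8 is the two-byte sequence C3 B8.
text, consumed = decode_utf8(b'\xc3\xb8')

# A lone F8 byte is not valid UTF-8, so decoding it fails.
try:
    decode_utf8(b'\xf8')
    lone_byte_ok = True
except UnicodeDecodeError:
    lone_byte_ok = False

# In Latin-1, by contrast, the single byte F8 maps directly to U+00F8.
latin1_text = b'\xf8'.decode('latin-1')
```

This separates the two things that were being conflated: the code point U+00F8, which is a complete character on its own, and the byte 0xF8, which is only meaningful relative to an encoding.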
