Unicode and exception strings

Rune Froysa

Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in a "UnicodeError: ASCII encoding error: ordinal not in
range(128)".

The common way to fix this is with something like
u'\xf8'.encode("ascii", 'replace'). However, I can't find any way to
tell ValueError's __str__ method which encoding to use.

Is it possible to solve this without using sys.setdefaultencoding()
from sitecustomize?

Regards,
Rune Frøysa
 
Terry Carroll

Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in an "UnicodeError: ASCII encoding error: ordinal not in
range(128)".

The common way to fix this is with something like
u'\xf8'.encode("ascii", 'replace'). However I can't find any way to
tell ValueErrors __str__ method which encoding to use.

Rune, I'm not understanding what your problem is.

Is there any reason you're not using, for example, just repr(u'\xf8')?

In one program I have that occasionally runs into a line that includes
some (UTF-8) Unicode-encoded Chinese characters, I have something like
this:

try:
    _display_text = _display_text + "%s\n" % line
except UnicodeDecodeError:
    try:
        # decode those UTF8 nasties
        _display_text = _display_text + "%s\n" % line.decode('utf-8'))
    except UnicodeDecodeError:
        # if that still doesn't work, punt
        # (I don't think we'll ever reach this, but just in case)
        _display_text = _display_text + "%s\n" % repr(line)

I don't know if this will help you or not.
 
Terry Carroll

In one program I have that occasionally runs into a line that includes
some (UTF-8) Unicode-encoded Chinese characters, I have something like
this:

Sorry, a stray parenthesis crept in here (since this is a pared down
version of my actual code). It should read:


try:
    _display_text = _display_text + "%s\n" % line
except UnicodeDecodeError:
    try:
        # decode those UTF8 nasties
        _display_text = _display_text + "%s\n" % line.decode('utf-8')
    except UnicodeDecodeError:
        # if that still doesn't work, punt
        # (I don't think we'll ever reach this, but just in case)
        _display_text = _display_text + "%s\n" % repr(line)
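
The same decode-then-fall-back idea can be sketched for Python 3, where the incoming line would be a bytes object. This is only an illustration under that assumption: the helper name to_display and the sample inputs are hypothetical and not taken from the program described above.

```python
def to_display(line):
    """Return text for display, falling back to repr() for undecodable bytes."""
    if isinstance(line, str):
        return line  # already text, nothing to decode
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: show an escaped representation instead of failing
        return repr(line)


display_text = ""
for raw in [b"plain ascii", "\u4e2d\u6587".encode("utf-8"), b"\xf8"]:
    display_text += "%s\n" % to_display(raw)
```

The repr() fallback never raises, so the last branch always yields something printable even for bytes that are not valid UTF-8.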
 
Rune Froysa

Terry Carroll said:
Rune, I'm not understanding what your problem is.

Is there any reason you're not using, for example, just repr(u'\xf8')?

The problem is that I have little control over the message string that
is passed to ValueError(). All my program knows is that it has caught
one such error, and that its message string is in unicode format. I
need to access the message string (for logging etc.).

_display_text = _display_text + "%s\n" % line.decode('utf-8'))

This does not work, as I'm unable to get at the 'line', which is
stored internally in the ValueError class (and generated by its __str__
method).

Regards,
Rune Frøysa
 
Terry Carroll

The problem is that I have little control over the message string that
is passed to ValueError(). All my program knows is that it has caught
one such error, and that its message string is in unicode format. I
need to access the message string (for logging etc.).


This does not work, as I'm unable to get at the 'line', which is
stored internally in the ValueError class (and generated by its __str__
method).

You should be able to get at it via x.args[0]:

>>> x = ValueError(u'\xf8')
>>> x.args[0]
u'\xf8'
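
Building on x.args, here is a minimal sketch of a helper that assembles a loggable message without going through str(x). It is written for Python 3 (where str(x) itself no longer raises for this input), and the name exception_text and its encoding parameter are assumptions made for illustration only.

```python
def exception_text(exc, encoding="ascii"):
    """Build a loggable message from exc.args without calling str(exc)."""
    parts = []
    for arg in exc.args:
        text = arg if isinstance(arg, str) else repr(arg)
        # 'replace' substitutes '?' for characters the target encoding lacks
        parts.append(text.encode(encoding, "replace").decode(encoding))
    return ", ".join(parts)


x = ValueError("\xf8")
ascii_msg = exception_text(x)               # ø is not ASCII, so it is replaced
latin1_msg = exception_text(x, "latin-1")   # ø exists in Latin-1, so it survives
```

Passing the encoding explicitly is the point of the sketch: the caller, not the exception class, decides how unencodable characters are handled.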

The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

You can try to extract it as above, and then decode it with the codecs
module, but if it's only the first byte, it won't decode correctly:
>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> x.args[0]
u'\xf8'
>>> d.decode(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

But, still, if all you want is to have *something* to print out explaining
the exception, you can use repr():

Is this helping any, or am I just flailing around?
 
Francis Avila

Terry Carroll wrote in message ...
On 12 Jan 2004 08:41:43 +0100, Rune Froysa <[email protected]>
wrote:
The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

Yes, about that.

What are the semantics of hexadecimal literals in unicode literals? It
seems to me that it is meaningless, if not dangerous, to allow hexadecimal
literals in unicode. What code point would it correspond to?

Python 2.3.2 (#49, Oct 2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
'\xf8\x00\xf8\x00'

I get the same on linux with Python 2.2.1, x86.

So, is a hexadecimal literal a shorthand for \u00XX, i.e., unicode code
point XX? Or does it bypass the code point abstraction entirely, preserving
the raw bits unchanged for any encoding of the unicode string (thus
rendering unicode useless)?
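
For what it's worth, the first reading is the one Python implements: inside a Unicode string literal, \xXX names the code point U+00XX, not a raw byte. A short check, written for Python 3 where every str literal is Unicode (the variable names are just for illustration):

```python
import unicodedata

ch = "\xf8"  # in Python 3 every str literal is Unicode, like u'\xf8' in Python 2

same_as_u_escape = ch == "\u00f8"    # \xf8 and \u00f8 are the same character
code_point = ord(ch)                 # the code point number, 0xF8
name = unicodedata.name(ch)          # official Unicode character name
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = ch.encode("latin-1")  # one byte in Latin-1
```

The bytes only appear once an encoding is chosen, which is why the same one-character string gives different byte sequences in UTF-8 and Latin-1.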

Once again, I don't see why hexadecimal literals should be allowed at all,
except maybe for compatibility when moving to Python -U behavior. But I
submit that all such code is broken, and should be fixed. If you're using
hexadecimal literals, what you have is not a unicode string but a byte
sequence.

This whole unicode/bytestring mess is going to have to be sorted out
eventually. It seems to me that it would be best to have all bare string
literals be unicode objects (henceforth called 'str' or 'string' objects?),
drop the unicode literal, and make a new type and literal prefix for byte
sequences, possibly dropping the traditional str methods or absorbing more
appropriate ones. Perhaps some struct functionality could be folded in?

Of course, this breaks absolutely everything.
 
Rune Froysa

Terry Carroll said:
On 09 Jan 2004 13:18:39 +0100, Rune Froysa <[email protected]>
wrote:

Assuming an exception like:

x = ValueError(u'\xf8')

AFAIK the common way to get a string representation of the exception
as a message is to simply cast it to a string: str(x). This will
result in an "UnicodeError: ASCII encoding error: ordinal not in
range(128)". ....
x = ValueError(u'\xf8')
x.args[0]
u'\xf8'

I was aware of the args variable in Exception, though I could not find
any documentation for its usage, thus I wanted to rely on its internal
__str__ method, rather than constructing the message myself. But,
after a quick look at Python/exceptions.c, it seems that this is a
feasible way :)

The only thing is, what to do with it once you get there. I don't think
0xF8 is a valid unicode encoding on its own. IIRC, it's part of a
multibyte character.

Python gives me this, so I think it is correct: u'\xf8'

For my usage, "u'\xf8'.encode('latin-1', 'replace')" is sufficient.

Is this helping any, or am I just flailing around?

It does, thanks a lot for your help.

Regards,
Rune Frøysa
 
Terry Carroll

You can try to extract it as above, and then decode it with the codecs
module, but if it's only the first byte, it won't decode correctly:
>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> x.args[0]
u'\xf8'
>>> d.decode(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
AttributeError: 'builtin_function_or_method' object has no attribute 'decode'

Oops. Copy-and-pasted the wrong line here. Let's try that again:
>>> x = ValueError(u'\xf8')
>>> import codecs
>>> d = codecs.getdecoder('utf-8')
>>> d(x.args[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in position 0: ordinal not in range(128)

*That's* the exception I was trying to show, not the AttributeError you
get when you use the decoder wrongly!
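
The UnicodeEncodeError above appears to come from Python 2 implicitly encoding the unicode object to ASCII before handing it to the decoder, since a decoder expects bytes. A small sketch of the intended direction, written for Python 3 and using only codecs.getdecoder and bytes.decode (the variable names are illustrative):

```python
import codecs

decode_utf8 = codecs.getdecoder("utf-8")  # a plain function, not an object with .decode

# Valid UTF-8 for U+00F8 is the two-byte sequence C3 B8;
# the decoder returns the text plus the number of bytes consumed
text, consumed = decode_utf8(b"\xc3\xb8")

# A lone F8 byte is not valid UTF-8, so decoding it fails
try:
    b"\xf8".decode("utf-8")
    lone_byte_error = None
except UnicodeDecodeError as err:
    lone_byte_error = err
```

So u'\xf8' is a perfectly good one-character Unicode string; it is only the single byte F8 that cannot stand alone in UTF-8.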
 
