unicode(obj, errors='foo') raises TypeError - bug?


Mike Brown

This works as expected (this is on an ASCII terminal):

>>> unicode('asdf\xff', errors='replace')
u'asdf\ufffd'


This does not work as I expect it to:

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found



Shouldn't it work the same as calling unicode(str(self), errors='replace')?

It doesn't matter what value you use for 'errors' (ignore, replace, strict);
you'll get the same TypeError.

What am I doing wrong? Is this a bug in Python?
 

Steven Bethard

Mike said:
>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found
[snip]

What am I doing wrong? Is this a bug in Python?

No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call this
method without arguments to create a Unicode string. For all other
objects, the 8-bit string version or representation is requested and
then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.
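
For example, a minimal sketch of both approaches (the u'\ufffd' output
assumes the default ASCII encoding):

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...     def __unicode__(self):
...         # decode the str() form ourselves, with the errors we want
...         return unicode(str(self), errors='replace')
...
>>> unicode(C())                          # uses __unicode__
u'asdf\ufffd'
>>> unicode(str(C()), errors='replace')   # or convert manually
u'asdf\ufffd'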

STeVe

[1] http://docs.python.org/lib/built-in-funcs.html
 

Kent Johnson

Steven said:
Mike said:

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found
[snip]


What am I doing wrong? Is this a bug in Python?


No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call this
method without arguments to create a Unicode string. For all other
objects, the 8-bit string version or representation is requested and
then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.

Not a bug, I guess, since it is documented, but it seems a bit bizarre that the
encoding and errors parameters are ignored when the object does not have a
__unicode__ method.

Kent
 

Steven Bethard

Kent said:
Steven said:
No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call
this method without arguments to create a Unicode string. For all
other objects, the 8-bit string version or representation is requested
and then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.

Not a bug, I guess, since it is documented, but it seems a bit bizarre
that the encoding and errors parameters are ignored when the object does not
have a __unicode__ method.

Yeah, I agree it's weird. I suspect if someone supplied a patch for
this behavior it would be accepted -- I don't think this should break
backwards compatibility (much).

STeVe
 

Guest

Steven said:
Yeah, I agree it's weird. I suspect if someone supplied a patch for
this behavior it would be accepted -- I don't think this should break
backwards compatibility (much).

Notice that the "right" thing to do would be to pass encoding and errors
to __unicode__. If the string object needs to be told what encoding it
is in, why not any other object as well?

Unfortunately, this apparently was overlooked, and now it is too late
to change it (or else the existing __unicode__ methods would all break
if they suddenly get an encoding argument).

As for using encoding and errors on the result of str() conversion
of the object: how can the caller know what encoding the result of
str() is in, reasonably? It seems more correct to assume that the
str() result is in the system default encoding.

If you can follow so far(*): if it is the right thing to ignore the
encoding argument for the case that the object was str() converted,
why should the errors argument not be ignored? It is inconsistent
to ignore one parameter to the decoding but not the other.
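
In other words, the current behavior for non-string objects amounts to
roughly this sketch (unicode_of is a hypothetical name, not the actual
C implementation):

import sys

def unicode_of(obj):
    # __unicode__ wins, and is called without arguments; otherwise
    # the str() form is decoded with the system default encoding in
    # 'strict' mode, so the caller's encoding/errors never reach
    # this path
    if hasattr(obj, '__unicode__'):
        return obj.__unicode__()
    return unicode(str(obj), sys.getdefaultencoding(), 'strict')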

Regards,
Martin

(*) I admit that the reasoning for ignoring the encoding is
somewhat flawed. There are some types (e.g. numbers) where
str() always uses the system encoding (i.e. ASCII - actually,
it always uses ASCII, no matter what the system encoding is).
There may be types where the encoding of the str() result
is not ASCII, and the caller happens to know what it is,
but I'm not aware of any such type.
 

Kent Johnson

Martin said:
Notice that the "right" thing to do would be to pass encoding and errors
to __unicode__. If the string object needs to be told what encoding it
is in, why not any other object as well?

Unfortunately, this apparently was overlooked, and now it is too late
to change it (or else the existing __unicode__ methods would all break
if they suddenly get an encoding argument).

Could this be handled with a try / except in unicode()? Something like this:

>>> class A:
...     def u(self):  # __unicode__ with no args
...         print 'A.u()'
...
>>> class B:
...     def u(self, enc, err):  # __unicode__ with two args
...         print 'B.u()', enc, err
...
>>> obj = B()
>>> enc, err = 'utf-8', 'replace'
>>> try:
...     obj.u(enc, err)
... except TypeError:
...     obj.u()
...
B.u() utf-8 replace

As for using encoding and errors on the result of str() conversion
of the object: how can the caller know what encoding the result of
str() is in, reasonably?

The same way that the caller will know the encoding of a byte string, or of the result of
str(some_object) - in my experience, usually by careful detective work on the source of the string
or object followed by attempts to better understand and control the encoding used throughout the
application.

It seems more correct to assume that the
str() result is in the system default encoding.

To assume that in absence of any guidance, sure, that is consistent. But to ignore the guidance the
programmer attempts to provide?


One thing that hasn't been pointed out in this thread yet is that the OP could just define
__unicode__() on his class to do what he wants...
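
For instance (a sketch, using latin-1 purely as an example policy):

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...     def __unicode__(self):
...         # pick whatever encoding/errors policy is appropriate
...         return unicode(str(self), 'latin-1')
...
>>> unicode(C())
u'asdf\xff'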

Kent
 

Guest

Kent said:
Could this be handled with a try / except in unicode()? Something like
this:

Perhaps. However, this would cause a significant performance hit, and
possibly undesired side effects. So due process would require changing
the interface of __unicode__ first, and then changing the actual calls
to it.

One thing that hasn't been pointed out in this thread yet is that the OP
could just define __unicode__() on his class to do what he wants...

Actually, Steven Bethard wrote "You should either define __unicode__ or
call str() manually on the object."

Regards,
Martin
 
