unicode(obj, errors='foo') raises TypeError - bug?


Mike Brown

This works as expected (this is on an ASCII terminal):

>>> unicode('asdf\xff', errors='replace')
u'asdf\ufffd'


This does not work as I expect it to:

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found



Shouldn't it work the same as calling unicode(str(self), errors='replace')?

It doesn't matter what value you use for 'errors' (ignore, replace, strict);
you'll get the same TypeError.

What am I doing wrong? Is this a bug in Python?
 

Steven Bethard

Mike said:
>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found
[snip]

What am I doing wrong? Is this a bug in Python?

No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call this
method without arguments to create a Unicode string. For all other
objects, the 8-bit string version or representation is requested and
then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.
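
For example, a minimal sketch of both approaches (the u'\ufffd' output
assumes the default ASCII encoding):

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...     def __unicode__(self):
...         # decode the str() form ourselves, with the errors we want
...         return unicode(str(self), errors='replace')
...
>>> unicode(C())                          # uses __unicode__
u'asdf\ufffd'
>>> unicode(str(C()), errors='replace')   # or convert manually
u'asdf\ufffd'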

STeVe

[1] http://docs.python.org/lib/built-in-funcs.html
 

Kent Johnson

Steven said:
Mike said:

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...
>>> o = C()
>>> unicode(o, errors='replace')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: coercing to Unicode: need string or buffer, instance found
[snip]


What am I doing wrong? Is this a bug in Python?


No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call this
method without arguments to create a Unicode string. For all other
objects, the 8-bit string version or representation is requested and
then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.

Not a bug, I guess, since it is documented, but it seems a bit bizarre that the
encoding and errors parameters are ignored when the object does not have a
__unicode__ method.

Kent
 

Steven Bethard

Kent said:
Steven said:
No, this is documented behavior[1]:

"""
unicode([object[, encoding [, errors]]])
...
For objects which provide a __unicode__() method, it will call
this method without arguments to create a Unicode string. For all
other objects, the 8-bit string version or representation is requested
and then converted to a Unicode string using the codec for the default
encoding in 'strict' mode.
"""

Note that the documentation basically says that it will call str() on
your object, and then convert it in 'strict' mode. You should either
define __unicode__ or call str() manually on the object.

Not a bug, I guess, since it is documented, but it seems a bit bizarre
that the encoding and errors parameters are ignored when the object does not
have a __unicode__ method.

Yeah, I agree it's weird. I suspect if someone supplied a patch for
this behavior it would be accepted -- I don't think this should break
backwards compatibility (much).

STeVe
 

Guest

Steven said:
Yeah, I agree it's weird. I suspect if someone supplied a patch for
this behavior it would be accepted -- I don't think this should break
backwards compatibility (much).

Notice that the "right" thing to do would be to pass encoding and errors
to __unicode__. If the string object needs to be told what encoding it
is in, why not any other object as well?

Unfortunately, this apparently was overlooked, and now it is too late
to change it (or else the existing __unicode__ methods would all break
if they suddenly get an encoding argument).

As for using encoding and errors on the result of str() conversion
of the object: how can the caller know what encoding the result of
str() is in, reasonably? It seems more correct to assume that the
str() result is in the system default encoding.

If you can follow so far(*): if it is the right thing to ignore the
encoding argument for the case that the object was str() converted,
why should the errors argument not be ignored? It is inconsistent
to ignore one parameter to the decoding but not the other.
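
In other words, the current behavior for non-string objects amounts to
roughly this sketch (unicode_of is a hypothetical name, not the actual
C implementation):

import sys

def unicode_of(obj):
    # __unicode__ wins, and is called without arguments; otherwise
    # the str() form is decoded with the system default encoding in
    # 'strict' mode, so the caller's encoding/errors never reach
    # this path
    if hasattr(obj, '__unicode__'):
        return obj.__unicode__()
    return unicode(str(obj), sys.getdefaultencoding(), 'strict')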

Regards,
Martin

(*) I admit that the reasoning for ignoring the encoding is
somewhat flawed. There are some types (e.g. numbers) where
str() always uses the system encoding (i.e. ASCII - actually,
it always uses ASCII, no matter what the system encoding is).
There may be types where the encoding of the str() result
is not ASCII, and the caller happens to know what it is,
but I'm not aware of any such type.
 

Kent Johnson

Martin said:
Notice that the "right" thing to do would be to pass encoding and errors
to __unicode__. If the string object needs to be told what encoding it
is in, why not any other object as well?

Unfortunately, this apparently was overlooked, and now it is too late
to change it (or else the existing __unicode__ methods would all break
if they suddenly get an encoding argument).

Could this be handled with a try / except in unicode()? Something like this:

>>> class A:
...     def u(self):  # __unicode__ with no args
...         print 'A.u()'
...
>>> class B:
...     def u(self, enc, err):  # __unicode__ with two args
...         print 'B.u()', enc, err
...
>>> obj = B()
>>> enc, err = 'utf-8', 'replace'
>>> try:
...     obj.u(enc, err)
... except TypeError:
...     obj.u()
...
B.u() utf-8 replace

As for using encoding and errors on the result of str() conversion
of the object: how can the caller know what encoding the result of
str() is in, reasonably?

The same way that the caller will know the encoding of a byte string, or of the result of
str(some_object) - in my experience, usually by careful detective work on the source of the string
or object followed by attempts to better understand and control the encoding used throughout the
application.

It seems more correct to assume that the
str() result is in the system default encoding.

To assume that in absence of any guidance, sure, that is consistent. But to ignore the guidance the
programmer attempts to provide?


One thing that hasn't been pointed out in this thread yet is that the OP could just define
__unicode__() on his class to do what he wants...
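
For instance (a sketch, using latin-1 purely as an example policy):

>>> class C:
...     def __str__(self):
...         return 'asdf\xff'
...     def __unicode__(self):
...         # pick whatever encoding/errors policy is appropriate
...         return unicode(str(self), 'latin-1')
...
>>> unicode(C())
u'asdf\xff'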

Kent
 

Guest

Kent said:
Could this be handled with a try / except in unicode()? Something like
this:

Perhaps. However, this would cause a significant performance hit, and
possibly undesired side effects. So due process would require changing
the interface of __unicode__ first, and then changing the actual calls
to it.

One thing that hasn't been pointed out in this thread yet is that the OP
could just define __unicode__() on his class to do what he wants...

Actually, Steven Bethard wrote "You should either define __unicode__ or
call str() manually on the object."

Regards,
Martin
 
