unicode bit me

anuraguniyal · May 8, 2009

# How can I print a list of objects which may return a unicode representation?
# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    __str__ = __repr__ = __unicode__

a = A()

try:
    print a # doesn't work?
except UnicodeEncodeError,e:
    print e
try:
    print unicode(a) # works, ok fine, great
except UnicodeEncodeError,e:
    print e
try:
    print unicode([a]) # what!!!! doesn't work?
except UnicodeEncodeError,e:
    print e
"""
Now how can I print a list of objects which may return a unicode
representation?
A loop/map is not an option, as it goes much deeper in my real code.
And can anyone explain what is happening here under the hood?
"""

Diez B. Roggisch · May 8, 2009

# How can I print a list of objects which may return a unicode representation?
# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    __str__ = __repr__ = __unicode__

__str__ and __repr__ are supposed to return *byte*strings. Yet you return
unicode here.

Diez

Terry Reedy · May 8, 2009

Scott said:
<rant>It would be a bit easier if people would bother to mention
their Python version, as we regularly get questions from people
running 2.3, 2.4, 2.5, 2.6, 2.7a, 3.0, and 3.1b. They run computers
with differing operating systems and versions such as: Windows 2000,
OS/X Leopard, ubuntu Hardy Heron, SuSE, ....

And if they copy and paste the actual error messages instead of saying
'It doesn't work'

J. Cliff Dyer · May 8, 2009

# How can I print a list of objects which may return a unicode representation?
# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    __str__ = __repr__ = __unicode__

Your __str__ and __repr__ methods don't return strings. You should
encode your unicode to the encoding you want before you try to print it.

class A(object):
    def __unicode__(self):
        return u"©au"

    def get_utf8_repr(self):
        return self.__unicode__().encode('utf-8')

    def get_koi8_repr(self):
        # 'koi8-r' is the codec name Python recognizes
        return self.__unicode__().encode('koi8-r')

    # no "self" at class scope; refer to the function directly
    __str__ = __repr__ = get_utf8_repr


Piet van Oostrum · May 8, 2009

JCD> Your __str__ and __repr__ methods don't return strings. You should
JCD> encode your unicode to the encoding you want before you try to print it.

JCD> class A(object):
JCD>     def __unicode__(self):
JCD>         return u"©au"

JCD>     def get_utf8_repr(self):
JCD>         return self.__unicode__().encode('utf-8')

JCD>     def get_koi8_repr(self):
JCD>         return self.__unicode__().encode('koi8-r')

JCD>     __str__ = __repr__ = get_utf8_repr

It might be nicer to have a method that specifies the encoding to be
used in order to make switching encodings easier:

*untested code*

class A(object):
    _encoding = 'utf-8'   # default, so __repr__ works before set_encoding is called

    def __unicode__(self):
        return u"©au"

    def set_encoding(self, encoding):
        self._encoding = encoding

    def __repr__(self):
        return self.__unicode__().encode(self._encoding)

    __str__ = __repr__

Of course this feels very wrong because the encoding should be chosen when
the string goes to the output channel, i.e. outside of the object.
Unfortunately this is one of the leftovers from Python's pre-unicode
heritage. Hopefully in Python3 this will work without problems. Anyway,
in Python 3 the string type is unicode, so at least __repr__ can return
unicode.
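
To illustrate the point about choosing the encoding at the output channel rather than inside the object, here is a minimal sketch in Python 3 syntax (the thread itself concerns Python 2.5). The class name Copyright and the in-memory BytesIO channel are assumptions made up for the example, not anything from the posts above.

```python
import io

class Copyright(object):      # hypothetical stand-in for class A
    def __str__(self):
        return "\u00a9au"     # text only; the object decides no encoding

# The output channel owns the encoding. BytesIO stands in for a file or socket.
raw = io.BytesIO()
channel = io.TextIOWrapper(raw, encoding="utf-8")

channel.write(str(Copyright()))
channel.flush()

encoded = raw.getvalue()      # the bytes that actually left the program
print(encoded)
```

Switching the channel to another encoding then needs no change to the class at all.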

Steven D'Aprano · May 8, 2009

And if they copy and paste the actual error messages instead of saying
'It doesn't work'

"I tried to copy and paste the actual error message, but it doesn't
work..."

*grin*

anuraguniyal · May 9, 2009

Sorry for not being specific and not giving all the info.

"""
Python 2.5.2 (r252:60911, Jul 31 2008, 17:28:52)
[GCC 4.2.3 (Ubuntu 4.2.3-2ubuntu7)] on linux2
'Linux-2.6.24-19-generic-i686-with-debian-lenny-sid'
"""

My question does not have much to do with stdout, because I am able to
print unicode. So:

print unicode(a)      # works
print unicode([a])    # doesn't

Without print, too:

s1 = u"%s"%a              # works
s2 = u"%s"%[a]            # doesn't
s3 = u"%s"%unicode([a])   # doesn't either

The error is UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in
position 1: ordinal not in range(128)

So the question is: how can I use a list of objects whose representation
contains unicode in another unicode string?

I am now using __repr__ = unicode(self).encode("utf-8"),
but it gives the error anyway.

anuraguniyal · May 9, 2009

Also, I am not sure why (Python 2.5):
print a # works
print unicode(a) # works
print [a] # works
print unicode([a]) # doesn't work

Piet van Oostrum · May 9, 2009

anuraguniyal said:
ac> Also, I am not sure why (Python 2.5):
ac> print a # works
ac> print unicode(a) # works
ac> print [a] # works
ac> print unicode([a]) # doesn't work

Which code do you use now?

And what does this print?

import sys
print sys.stdout.encoding

J. Clifford Dyer · May 9, 2009

You're still not asking questions in a way that we can answer them.

Define "Doesn't work." Define "a".

anuraguniyal · May 9, 2009

Sorry for being unclear again; hmm, I am becoming an expert in it.

I pasted that code as a continuation of my old code from the start,
i.e.

class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")
    __str__ = __repr__

"Doesn't work" means it throws a unicode error.
My question boils down to: what is the difference between these two,
and why does one throw an error while the other doesn't?

print unicode(a)
vs
print unicode([a])

Steven D'Aprano · May 9, 2009

Sorry being unclear again, hmm I am becoming an expert in it.

I pasted that code as continuation of my old code at start i.e
class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")
    __str__ = __repr__

doesn't work means throws unicode error my question

What unicode error?

Stop asking us to GUESS what the error is, and please copy and paste the
ENTIRE TRACEBACK that you get. When you ask for free help, make it easy
for the people trying to help you. If you expect them to copy and paste
your code and run it just to answer the smallest questions, most of them
won't bother.

rurpy · May 9, 2009

What unicode error?

Stop asking us to GUESS what the error is, and please copy and paste the
ENTIRE TRACEBACK that you get. When you ask for free help, make it easy
for the people trying to help you. If you expect them to copy and paste
your code and run it just to answer the smallest questions, most of them
won't bother.

Creua H Jiest!

It took me less than 45 seconds to open a terminal window, start
Python, and paste the OP's code to get:

>>> class A(object):
...     def __unicode__(self):
...         return u"©au"
...     def __repr__(self):
...         return unicode(self).encode("utf-8")
...     __str__ = __repr__
...
>>> a=A()
>>> print unicode(a)
©au
>>> print unicode([a])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
1: ordinal not in range(128)

Which is the same error he had already posted!

I am all for encouraging posters to provide a good description
but let's not be ridiculous.

Anecdote:
My sister always gives her dogs the table scraps after eating
dinner. One day when I ate there, I tossed the dogs a piece
of meat I hadn't eaten. "No", she cried! "You mustn't give
him anything without making him do a trick first! Otherwise
he'll forget that you are the boss!".

Mark Tolonen · May 9, 2009

Sorry being unclear again, hmm I am becoming an expert in it.

I pasted that code as continuation of my old code at start
i.e
class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")
    __str__ = __repr__

doesn't work means throws unicode error
my question boils down to
what is diff between, why one doesn't throws error and another does
print unicode(a)
vs
print unicode([a])

That is still an incomplete example. Your results depend on your source
code's encoding and your system's stdout encoding. Assuming a=A(),
unicode(a) returns u'©au', but then is converted to stdout's encoding for
display. An encoding such as cp437 (U.S. Windows console) will fail. The
repr of [a] is a byte string in the encoding of your source file. The
unicode() function, given a byte string of unspecified encoding, uses the
ASCII codec. Assuming your source encoding was utf-8,
unicode(repr([a]),'utf-8') will correctly convert it to unicode, and then
printing that unicode string will attempt to convert it to stdout encoding.
On a utf-8 console, it will work; on a cp437 console it will not.

Here's a new one:

In PythonWin (from pywin32-313), stdout is utf-8, so:
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0:
ordinal not in range(128)
©

This gives different results when the stdout encoding is different. Here's
a couple of the same instructions on my Windows console with cp437 encoding,
which doesn't support the copyright character:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\dev\python\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xa9' in
position 0: character maps to <undefined>
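
The dependence on the output encoding can be reproduced without any console at all. The following is a small sketch in Python 3 syntax (an assumption; the thread uses Python 2), encoding the same text by hand for a utf-8 and a cp437 channel. The variable names are made up for the example.

```python
text = "\u00a9au"                      # COPYRIGHT SIGN followed by "au"

utf8_bytes = text.encode("utf-8")      # utf-8 can represent every character

try:
    text.encode("cp437")               # cp437 has no copyright sign
    cp437_ok = True
except UnicodeEncodeError as exc:
    cp437_ok = False
    print(exc)

# An explicit error handler trades the exception for a lossy placeholder.
fallback = text.encode("cp437", errors="replace")
print(utf8_bytes, cp437_ok, fallback)
```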

Hope that helps your understanding,
Mark

Piet van Oostrum · May 9, 2009

Mark Tolonen said:

Sorry being unclear again, hmm I am becoming an expert in it.

I pasted that code as continuation of my old code at start
i.e
class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")
    __str__ = __repr__

doesn't work means throws unicode error
my question boils down to
what is diff between, why one doesn't throws error and another does
print unicode(a)
vs
print unicode([a])

MT> That is still an incomplete example. Your results depend on your source
MT> code's encoding and your system's stdout encoding. Assuming a=A(),
MT> unicode(a) returns u'©au', but then is converted to stdout's encoding for
MT> display.

You are confusing the issue. It does not depend on the source code's
encoding (supposing that the encoding declaration in the source is
correct). repr returns unicode(self).encode("utf-8"), so it is utf-8
encoded even when the source code had a different encoding. The u"©au"
string is not dependent on the source encoding.

Mark Tolonen · May 9, 2009

Piet van Oostrum said:
"Mark Tolonen" (MT) wrote:

Sorry being unclear again, hmm I am becoming an expert in it.

I pasted that code as continuation of my old code at start
i.e
class A(object):
    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")
    __str__ = __repr__

doesn't work means throws unicode error
my question boils down to
what is diff between, why one doesn't throws error and another does
print unicode(a)
vs
print unicode([a])


MT> That is still an incomplete example. Your results depend on your source
MT> code's encoding and your system's stdout encoding. Assuming a=A(),
MT> unicode(a) returns u'©au', but then is converted to stdout's encoding for
MT> display.

You are confusing the issue. It does not depend on the source code's
encoding (supposing that the encoding declaration in the source is
correct). repr returns unicode(self).encode("utf-8"), so it is utf-8
encoded even when the source code had a different encoding. The u"©au"
string is not dependent on the source encoding.

Sorry about that. I'd forgotten that the OP'd forced __repr__ to utf-8.
You bring up a good point, though, that the encoding the file is actually
saved in and the encoding declaration in the source have to match. Many
people get that wrong as well.
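
What such a mismatch looks like can be sketched with explicit bytes, here in Python 3 syntax (an assumption; the thread is about Python 2). The bytes are written as utf-8 but read back as latin-1, which stands in for a file saved in one encoding and declared as another. The variable names are made up for the example.

```python
text = "\u00a9au"

saved = text.encode("utf-8")       # bytes as they sit on disk: b'\xc2\xa9au'

wrong = saved.decode("latin-1")    # reader assumes the wrong encoding
right = saved.decode("utf-8")      # reader and writer agree

# ascii() keeps the output printable on any console
print(ascii(wrong), ascii(right))
```

The wrong decoding does not raise; it silently yields an extra "\u00c2" character in front of the copyright sign, which is why this kind of error tends to go unnoticed.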

-Mark

anuraguniyal · May 10, 2009

First of all thanks everybody for putting time with my confusing post
and I apologize for not being clear after so many efforts.

here is my last try (you are free to ignore my request for free
advice)

# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")

    __str__ = __repr__

a = A()
u1 = unicode(a)
u2 = unicode([a])

Now I am not using print, so it doesn't matter whether stdout can print
unicode or not.
My naive question is why the line u2 = unicode([a]) throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
1: ordinal not in range(128)

Shouldn't the list class call unicode on its elements? I was expecting
that. So instead, do I have to do this?
u3 = "["+u",".join(map(unicode,[a]))+"]"
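
Since the original post says the lists are nested more deeply in the real code, the join-based workaround could be generalized with a small recursive helper. This is only a sketch in Python 3 syntax, where str() plays the role that unicode() plays in Python 2; the helper name to_text and the "<A>" repr are hypothetical.

```python
def to_text(obj):
    """Render nested lists/tuples, calling str() (not repr()) on the leaves.

    Hypothetical helper; in Python 2 the str() call would be unicode().
    Tuples are rendered with square brackets too, to keep the sketch short.
    """
    if isinstance(obj, (list, tuple)):
        return "[" + ", ".join(to_text(item) for item in obj) + "]"
    return str(obj)

class A(object):
    def __str__(self):
        return "\u00a9au"
    def __repr__(self):
        return "<A>"              # hypothetical repr, distinct from str

nested = [A(), [A(), (A(),)]]
rendered = to_text(nested)
print(ascii(rendered))            # ascii() keeps the output console-safe
```

The price is that elements containing commas or brackets become ambiguous in the output, which is what the built-in list display avoids by using repr().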

anuraguniyal · May 10, 2009

And yes, replace the string with u'\N{COPYRIGHT SIGN}au';
as mentioned earlier, non-ASCII chars may not come through correctly when posted here.

Piet van Oostrum · May 10, 2009

anuraguniyal said:
ac> and yes replace string by u'\N{COPYRIGHT SIGN}au'
ac> as mentioned earlier non-ascii char may not come correct posted here.

That shouldn't be a problem for any decent news agent when there is a
proper charset declaration in the headers.

Peter Otten · May 10, 2009

First of all thanks everybody for putting time with my confusing post
and I apologize for not being clear after so many efforts.

here is my last try (you are free to ignore my request for free
advice)

Finally! This is the first of your posts that makes sense to me

# -*- coding: utf-8 -*-

class A(object):

    def __unicode__(self):
        return u"©au"

    def __repr__(self):
        return unicode(self).encode("utf-8")

    __str__ = __repr__

a = A()
u1 = unicode(a)
u2 = unicode([a])

now I am not using print so that doesn't matter stdout can print
unicode or not
my naive question is line u2 = unicode([a]) throws
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position
1: ordinal not in range(128)

list doesn't have a __unicode__ method. unicode() therefore converts the
list to str as a fallback and then uses sys.getdefaultencoding() to convert
the result to unicode.
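
The two steps described above can be imitated with explicit bytes. This is a sketch in Python 3 syntax (an assumption, since Python 3 has no implicit decoding): repr_bytes stands in for the byte string that str([a]) produces in Python 2, and "ascii" stands in for the usual value of sys.getdefaultencoding() there.

```python
# Stand-in for Python 2's str([a]): "[" + utf-8 encoded repr of a + "]"
repr_bytes = b"[" + "\u00a9au".encode("utf-8") + b"]"

try:
    repr_bytes.decode("ascii")        # the implicit step that fails
    error = None
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Naming the real encoding makes the same bytes decode cleanly.
decoded = repr_bytes.decode("utf-8")
print(ascii(decoded))
```

The failing byte is 0xc2 at position 1, right after the opening bracket, which matches the traceback quoted in the thread.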

shouldn't list class call unicode on its elements?

No, it calls repr() on its elements. This is done to avoid confusing output:

>>> items = ["a, b", "[c]"]
>>> items
['a, b', '[c]']
>>> "[%s]" % ", ".join(map(str, items))
'[a, b, [c]]'
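
That the list display uses repr() and not str() on its elements can be checked with a class whose two methods return different text. A minimal sketch; the class name Tag and its return values are made up for the example, and the code behaves the same in Python 2 and Python 3.

```python
class Tag(object):                # hypothetical class for the demonstration
    def __str__(self):
        return "str-form"
    def __repr__(self):
        return "repr-form"

items = [Tag()]

as_list = str(items)                               # list asks each element for repr()
by_hand = "[%s]" % ", ".join(map(str, items))      # explicit str() on each element

print(as_list)
print(by_hand)
```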

I was expecting that so instead do i had to do this
u3 = "["+u",".join(map(unicode,[a]))+"]"

Peter
