q: how to output a unicode string?

Frank Stajano

A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)


What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

héllô wórld

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried


import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

So I seem to be stuck.

I have checked several online python+unicode pages, including

http://boodebr.org/main/python/all-about-python-and-unicode#WHYNOPRINT
http://evanjones.ca/python-utf8.html
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://www.amk.ca/python/howto/unicode
http://www.example-code.com/python/python-charset.asp
http://docs.python.org/lib/csv-examples.html

but none of them was sufficient to make me understand how to deal with
this simple problem. I'm sure it's easy, maybe too easy to be worth
explaining in a tutorial...

Help gratefully received.
 
Diez B. Roggisch

Frank said:
A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)


What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

héllô wórld

Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried


import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

The error you get stems from f.write wanting a unicode object, but s2 is a
bytestring (you explicitly encoded it before), so Python first tries to decode
the bytestring with the default encoding (ascii) into a unicode string.
This of course fails.

Diez
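
The point about writing s1 rather than s2 can be sketched as follows. This is written for Python 3, which is an assumption on my part (the thread uses Python 2): there `str` plays the role of the unicode object and `bytes` the role of the bytestring. `io.open` with an `encoding` argument stands in for `codecs.open`, and the temporary path is purely illustrative. In Python 2 the failing step is an implicit ascii decode of s2; Python 3 would instead refuse to write bytes to a text file, so the sketch performs that decode explicitly to show the same error.

```python
import io
import os
import tempfile

s1 = "h\u00e9ll\u00f4 w\u00f3rld"   # héllô wórld as text (code points)
s2 = s1.encode("utf-8")              # the same text as UTF-8 bytes

# What Python 2 attempted behind the scenes in f.write(s2):
# decode the bytes as ascii before re-encoding them as UTF-8.
try:
    s2.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The working approach: hand the text (s1) to an encoding-aware file
# and let the file object do the encoding.
path = os.path.join(tempfile.mkdtemp(), "out.txt")  # illustrative path
with io.open(path, "w", encoding="utf-8") as f:
    f.write(s1)

# Reading the file back in binary mode shows the UTF-8 bytes on disk.
with io.open(path, "rb") as f:
    on_disk = f.read()
```

The bytes on disk are the same as s2, which is why encoding by hand and then writing through an encoding-aware file amounts to encoding twice.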
 
Frank Stajano

Diez said:
Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.

Aha! Thanks for spotting this. You are right about the terminal
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with a

cat t2.py

(t2.py being the program above) which displays the source code garbled
in the same way.
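
The garbling can be reproduced without a terminal. A minimal sketch, written for Python 3 (an assumption; the thread uses Python 2) and assuming the terminal interprets incoming bytes as latin-1, as suggested above:

```python
# Text is spelled with escapes so the example does not depend on
# the encoding of this source file.
s1 = "h\u00e9ll\u00f4 w\u00f3rld"         # héllô wórld

utf8_bytes = s1.encode("utf-8")           # what the program sends
as_seen = utf8_bytes.decode("latin-1")    # what a latin-1 terminal displays

# Reversing the wrong interpretation recovers the original text,
# which shows that no data was lost, only misread.
recovered = as_seen.encode("latin-1").decode("utf-8")
```

Each accented letter occupies two bytes in UTF-8, and a latin-1 terminal renders each of those bytes as a separate character, hence the two-character sequences in the garbled output.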

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!

The error you get stems from f.write wanting a unicode object, but s2 is a
bytestring (you explicitly encoded it before), so Python first tries to decode
the bytestring with the default encoding (ascii) into a unicode string.
This of course fails.

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Thanks again
 
Diez B. Roggisch

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

I'm a bit confused about what you did when... The error appears if you try to
output a unicode object without encoding it first: then the default encoding
(ascii) is used.

OK, many thanks, I got this to work!


I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Yes. That is exactly the difference.
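
A small sketch of that difference, written for Python 3 (an assumption; there `str` is the sequence of code points and `bytes` is the bytestring):

```python
s1 = "h\u00e9ll\u00f4 w\u00f3rld"   # héllô wórld as a sequence of code points
s2 = s1.encode("utf-8")              # external representation in UTF-8

code_points = [ord(ch) for ch in s1]  # one integer per character
byte_values = list(s2)                # one integer per byte

# The two lengths differ because each accented letter is a single
# code point but takes two bytes in UTF-8.
n_chars = len(s1)
n_bytes = len(s2)

# Decoding with the right codec gets the code points back.
round_trip = s2.decode("utf-8")
```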

Diez
 
Frank Stajano

Diez said:
I'm a bit confused about what you did when... The error appears if you try to
output a unicode object without encoding it first: then the default encoding
(ascii) is used.

Here's a minimal example for you.
I put these four lines into a utf-8 file.

# -*- coding: utf-8 -*-
# this file is called t3.py
s1 = u"héllô wórld"
print s1


If I invoke "python t3.py" at the cygwin/rxvt/bash prompt, I get:

Traceback (most recent call last):
  File "t3.py", line 4, in <module>
    print s1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128)

If I load the exact same file in Idle and press F5 (for Run), I get:

héllô wórld

So obviously "the system" is not behaving in the same way in the two
cases. Maybe Python senses that it can do utf-8 when it's inside Idle
and sets the default to utf-8 without me asking for it, and senses that
it can't do (or more precisely output) utf-8 when it's in
cygwin/rxvt/bash so there it sets the default codec to ascii. That's my
best guess so far...
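
That guess is consistent with how output encoding is chosen: the interpreter picks an encoding for stdout from the environment, and the result is visible as `sys.stdout.encoding`. The sketch below, written for Python 3 (an assumption; the thread uses Python 2), inspects that value and then simulates two differently configured streams. `render` is a hypothetical helper introduced only for this illustration, and the in-memory buffer stands in for a real terminal.

```python
import io
import sys

s1 = "h\u00e9ll\u00f4 w\u00f3rld"   # héllô wórld

# Encoding the interpreter selected for stdout in this environment.
# It can differ between a terminal, an IDE shell, and a pipe.
detected = getattr(sys.stdout, "encoding", None)

def render(text, encoding):
    """Return the bytes a text stream with this encoding would emit."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    stream.write(text + "\n")
    stream.flush()
    return raw.getvalue()

# A stream configured for UTF-8 accepts the text.
utf8_out = render(s1, "utf-8")

# A stream configured for ascii fails on the first accented letter,
# the same failure as in the traceback above.
try:
    render(s1, "ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

Printing `detected` in both environments is a quick way to confirm whether the two runs really see different stdout encodings.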

I find the encode/decode terminology somewhat confusing, because
arguably both sides are "encoded". For example, a unicode-encoded string
(I mean a sequence of unicode code points) should count as "decoded" in
the terminology of this framework, right?

Anyway, thanks again for your help, for deepening my modest
understanding of the issue and for solving my original problem!
 
Richard Brodie

I find the encode/decode terminology somewhat confusing, because arguably both sides are
"encoded". For example, a unicode-encoded string (I mean a sequence of unicode code
points) should count as "decoded" in the terminology of this framework, right?

Yes. Unicode is the one true Universal Character Set, and everything else
(including ASCII and UTF-8) is a mere encoding. Once you've got your head
round that, things may make more sense.
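
To make that concrete, here is a sketch written for Python 3 (an assumption; the thread uses Python 2) that takes a single code point and represents it in several encodings. The choice of encodings is only illustrative.

```python
e_acute = "\u00e9"   # one character: LATIN SMALL LETTER E WITH ACUTE

# The same code point has a different byte representation per encoding.
encoded = {
    name: e_acute.encode(name)
    for name in ("utf-8", "latin-1", "utf-16-le")
}

# ascii has no representation for this code point at all.
try:
    e_acute.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding each byte sequence with its own codec yields the same character.
decoded = {name: data.decode(name) for name, data in encoded.items()}
```

In the thread's terminology, the `str` value is the "decoded" side and every entry in `encoded` is an "encoded" side.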
 
