q: how to output a unicode string?

Frank Stajano · Apr 24, 2007

A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)

What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

hÃ©llÃ´ wÃ³rld

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried

import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

So I seem to be stuck.

I have checked several online python+unicode pages, including

http://boodebr.org/main/python/all-about-python-and-unicode#WHYNOPRINT
http://evanjones.ca/python-utf8.html
http://www.reportlab.com/i18n/python_unicode_tutorial.html
http://www.amk.ca/python/howto/unicode
http://www.example-code.com/python/python-charset.asp
http://docs.python.org/lib/csv-examples.html

but none of them was sufficient to make me understand how to deal with
this simple problem. I'm sure it's easy, maybe too easy to be worth
explaining in a tutorial...

Help gratefully received.

Diez B. Roggisch · Apr 24, 2007

Frank said:
A simple unicode question. How do I print?

Sample code:

# -*- coding: utf-8 -*-
s1 = u"héllô wórld"
print s1
# Gives UnicodeEncodeError: 'ascii' codec can't encode character
# u'\xe9' in position 1: ordinal not in range(128)

What I actually want to do is slightly more elaborate: read from a text
file which is in utf-8, do some manipulations of the text and print the
result on stdout. I understand I must open the file with

f = codecs.open("input.txt", "r", "utf-8")

but then I get stuck as above.

I tried

s2 = s1.encode("utf-8")
print s2

but got

hÃ©llÃ´ wÃ³rld

Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.
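To see how this kind of garbling arises, here is a small sketch. It is my own illustration, not code from the thread, and it assumes the terminal interprets the bytes as Latin-1, as suggested above. It encodes the string to UTF-8 and then decodes the same bytes with the wrong codec:

```python
# -*- coding: utf-8 -*-
# Illustration only: assumes the terminal interprets bytes as Latin-1.
s1 = u"h\xe9ll\xf4 w\xf3rld"             # héllô wórld, as code points

utf8_bytes = s1.encode("utf-8")          # each accented letter becomes two bytes
garbled = utf8_bytes.decode("latin-1")   # what a Latin-1 terminal would display

# Decoding the same bytes with the right codec recovers the original text.
restored = utf8_bytes.decode("utf-8")
```

The bytes themselves are correct in both cases; only the codec used to interpret them differs.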

Then, in the hope of being able to write the string to a file if not to
stdout, I also tried

import codecs
f = codecs.open("out.txt", "w", "utf-8")
f.write(s2)

but got

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

The error you get stems from f.write wanting a unicode object, but s2 is a
byte string (you explicitly encoded it before), so Python first tries to
decode the byte string with the default encoding, ascii, into a unicode
object. This of course fails.
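A minimal sketch of the working version follows. The temporary output path is hypothetical, and io.open is my substitution for codecs.open so that the same lines run on both Python 2 and Python 3. The last part reproduces the failing step explicitly: decoding the UTF-8 byte string as ASCII.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

s1 = u"h\xe9ll\xf4 w\xf3rld"      # unicode object (code points)
s2 = s1.encode("utf-8")          # byte string (UTF-8 encoded)

# Hypothetical output path inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# Write the unicode object; the file object does the encoding.
with io.open(out_path, "w", encoding="utf-8") as f:
    f.write(s1)

# The bytes on disk are the same as the explicit encoding s2.
with io.open(out_path, "rb") as f:
    on_disk = f.read()

# Decoding the byte string as ASCII fails on the first non-ASCII byte (0xc3),
# which is the kind of failure reported in the traceback above.
try:
    s2.decode("ascii")
    ascii_decode_failed = False
except UnicodeDecodeError:
    ascii_decode_failed = True
```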

Diez

Frank Stajano · Apr 25, 2007

Diez said:
Which is perfectly alright - it's just that your terminal isn't prepared to
decode UTF-8, but some other encoding, like latin1.

Aha! Thanks for spotting this. You are right about the terminal
(rxvt/cygwin) not being ready to handle utf-8, as I can now confirm with a

cat t2.py

(t2.py being the program above) which displays the source code garbled
in the same way.

If I do

s1 = u"héllô wórld"
print s1

at the interactive prompt of Idle, I get the proper output

héllô wórld

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

Instead of writing s2 (which is a byte-string!!!), write s1. It will work.

OK, many thanks, I got this to work!

The error you get stems from f.write wanting a unicode object, but s2 is a
byte string (you explicitly encoded it before), so Python first tries to
decode the byte string with the default encoding, ascii, into a unicode
object. This of course fails.

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Thanks again

Diez B. Roggisch · Apr 25, 2007

So why is it that in the first case I got UnicodeEncodeError: 'ascii'
codec can't encode? Seems as if, within Idle, a utf-8 codec is being
selected automagically... why should that be so there and not in the
first case?

I'm a bit confused about what you did when. The error appears if you try to
output a unicode object without encoding it first; then the default encoding
(ascii) is used.

OK, many thanks, I got this to work!

I think I have a better understanding of it now. If the terminal hadn't
fooled me, I probably wouldn't have assumed that the code I originally
wrote (following the first examples I found) was wrong! I assume that
when you say "bytestring" you mean "a string of bytes in a certain
encoding (here utf-8) that can be used as an external representation for
the unicode string which is instead a sequence of code points".

Yes. That is exactly the difference.

Diez

Frank Stajano · Apr 25, 2007

Diez said:
I'm a bit confused about what you did when. The error appears if you try to
output a unicode object without encoding it first; then the default encoding
(ascii) is used.

Here's a minimal example for you.
I put these four lines into a utf-8 file.

# -*- coding: utf-8 -*-
# this file is called t3.py
s1 = u"héllô wórld"
print s1

If I invoke "python t3.py" at the cygwin/rxvt/bash prompt, I get:

Traceback (most recent call last):
  File "t3.py", line 4, in <module>
    print s1
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in
position 1: ordinal not in range(128)

If I load the exact same file in Idle and press F5 (for Run), I get:

héllô wórld

So obviously "the system" is not behaving in the same way in the two
cases. Maybe Python senses that it can do utf-8 when it's inside Idle
and sets the default to utf-8 without me asking for it, and senses that
it can't do (or more precisely output) utf-8 when it's in
cygwin/rxvt/bash so there it sets the default codec to ascii. That's my
best guess so far...
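One way to test this guess is to ask Python what encoding it associates with the output stream in each environment. The sketch below is only an illustration: stream_encoding and encode_for are hypothetical helper names, and falling back to UTF-8 when the stream reports no encoding is my assumption, not something the thread prescribes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys


def stream_encoding(stream, fallback="utf-8"):
    """Return the encoding the stream reports, or the fallback if it reports none."""
    return getattr(stream, "encoding", None) or fallback


def encode_for(stream, text, fallback="utf-8"):
    """Encode text for the stream; characters it cannot represent become '?'."""
    return text.encode(stream_encoding(stream, fallback), "replace")


# Run this both in Idle and at the terminal and compare the reported value.
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
```

If the terminal run reports None or an ASCII-type encoding while Idle reports something else, that would account for the different behaviour of the same print statement.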

I find the encode/decode terminology somewhat confusing, because
arguably both sides are "encoded". For example, a unicode-encoded string
(I mean a sequence of unicode code points) should count as "decoded" in
the terminology of this framework, right?

Anyway, thanks again for your help, for deepening my modest
understanding of the issue and for solving my original problem!

Richard Brodie · Apr 25, 2007

I find the encode/decode terminology somewhat confusing, because arguably both sides are
"encoded". For example, a unicode-encoded string (I mean a sequence of unicode code
points) should count as "decoded" in the terminology of this framework, right?

Yes. Unicode is the one true Universal Character Set, and everything else
(including ASCII and UTF-8) is a mere encoding. Once you've got your head
round that, things may make more sense.
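A short sketch of that distinction, as my own illustration rather than code from the thread: one sequence of code points, several byte encodings of it, and ASCII unable to represent it at all.

```python
# -*- coding: utf-8 -*-
s = u"h\xe9ll\xf4"                # five code points: h é l l ô

as_utf8 = s.encode("utf-8")       # 7 bytes: é and ô take two bytes each
as_latin1 = s.encode("latin-1")   # 5 bytes: one byte per character

# Different byte strings, same text once decoded with the matching codec.
same_text = as_utf8.decode("utf-8") == as_latin1.decode("latin-1") == s

# ASCII cannot represent é or ô at all.
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```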
