error when printing a UTF-8 string (python 2.6.2)

Guest · Apr 21, 2010

Hello.

I read a string from a UTF-8 file:

fichierLaTeX = codecs.open(sys.argv[1], "r", "utf-8")
s = fichierLaTeX.read()
fichierLaTeX.close()

I can then print the string without error with 'print s'.

Next I parse this string:

def parser(s):
    i = 0
    while i < len(s):
        if s[i:i+1] == '\\':
            i += 1
            if s[i:i+1] == '\\':
                print "backslash"
            elif s[i:i+1] == '%':
                print "pourcentage"
            else:
                if estUnCaractere(s[i:i+1]):
                    motcle = ""
                    while estUnCaractere(s[i:i+1]):
                        motcle += s[i:i+1]
                        i += 1
                    print "mot-clé '"+motcle+"'"

but when I run this code, I get this error:

Traceback (most recent call last):
  File "./versOO.py", line 115, in <module>
    parser(s)
  File "./versOO.py", line 105, in parser
    print "mot-clé '"+motcle+"'"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

What must I do to solve this?

Thanks!

Peter Otten · Apr 21, 2010

What must I do to solve this?

'mot-cl\xc3\xa9mot-cl\xc3\xa9'
u'mot-cl\xe9mot-cl\xe9'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

codecs.open().read() returns unicode, but your literals are all bytestrings.
When you mix unicode and str, Python tries to convert the bytestring
to unicode using the ASCII codec, and of course fails for non-ASCII
characters.

Change your string literals to unicode by adding the u-prefix and you should
be OK.

Peter
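
The implicit conversion described above can be reproduced by hand. The following is only a sketch: the helper name `decode_as_ascii` and the sample text are made up for illustration, and the explicit `.decode("ascii")` stands in for what Python 2 does silently when a byte string is concatenated with a unicode string. It should run on Python 2.6+ and Python 3.

```python
# Sketch only: performs explicitly the ASCII decode that Python 2 applies
# implicitly when a byte string is combined with a unicode string.

text = u"mot-cl\xe9"          # unicode string: 7 code points
raw = text.encode("utf-8")    # byte string: 8 bytes, the accented e is 0xc3 0xa9


def decode_as_ascii(data):
    """Try to decode bytes as ASCII; return (ok, detail).

    On failure, detail is (offset, offending_byte).
    """
    try:
        return True, data.decode("ascii")
    except UnicodeDecodeError as exc:
        return False, (exc.start, data[exc.start:exc.start + 1])


ok, detail = decode_as_ascii(raw)
print(ok)  # the decode fails at the first non-ASCII byte

# The fix from the thread: make every literal unicode (u-prefix),
# so no implicit decode is ever attempted.
message = u"mot-cl\xe9 '" + text + u"'"
print(len(message))
```

The failing offset matches the "position 6" in the traceback: the six ASCII characters of "mot-cl" come first, and the first byte of the encoded accented character follows.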

Guest · Apr 21, 2010

Change your string literals to unicode by adding the u-prefix and you should
be OK.

Thanks, it solved the problem... for a while!

I need now to know if s[i] gives the next byte or the next character,
when I scan the string s. I've googled pages about python and unicode,
but didn't find a solution to that. I scan the string read from the
file char by char to construct s, but now get the same error when just
trying 'print s'.

Is there a way to tell python that all strings and characters are to
be treated as UTF-8? I have LC_ALL=en_GB.utf-8 in my shell, but python
doesn't seem to use this variable?

Thanks!

Chris Rebert · Apr 21, 2010

Change your string literals to unicode by adding the u-prefix and you should
be OK.

Thanks, it solved the problem... for a while!

I need now to know if s[i] gives the next byte or the next character,
when I scan the string s. I've googled pages about python and unicode,
but didn't find a solution to that. I scan the string read from the
file char by char to construct s, but now get the same error when just
trying 'print s'.

Assuming s = fichierLaTeX.read() as in your code snippet, s[i] gives the next
character. When in doubt, check what `type(s)` is; if it's <type
'str'>, indices are in bytes; if it's <type 'unicode'>, indices are in
code points.

Please give the full stack traceback for your error.

Cheers,
Chris
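
The `type(s)` check suggested above can be tried on a small sample. This is a minimal sketch with made-up names (`describe`, `text`, `raw`); it assumes the text is UTF-8 encoded, as in the original question. On Python 2 the two types are reported as `unicode` and `str`, on Python 3 as `str` and `bytes`.

```python
# Sketch: the same text held as a unicode string and as UTF-8 bytes.

text = u"mot-cl\xe9"          # indexed by code point
raw = text.encode("utf-8")    # indexed by byte


def describe(value):
    """Return the type name and length of a string-like value."""
    return "{0}: length {1}".format(type(value).__name__, len(value))


print(describe(text))  # one entry per character
print(describe(raw))   # the accented character takes two bytes

# Slicing one position out of the unicode string yields a whole character...
last_char = text[6:7]
# ...while the same slice of the byte string yields only the first byte
# of the two-byte UTF-8 sequence.
first_byte = raw[6:7]
```

A parser that walks the string with `s[i:i+1]`, like the one in the question, therefore sees whole characters only if `s` is a unicode string.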

Guest · Apr 21, 2010

Thanks for your insights.

I have taken the easy way out, I read on a page that python 3 worked
by default in UTF-8, so I downloaded and installed it.

Apart from a few surprises (print is now a function, and rules about
mixing spaces and tabs in indentation are much more strict, and I
guess more is to come :^) everything now works transparently.

Thanks again.

Peter Otten · Apr 21, 2010

I have taken the easy way out, I read on a page that python 3 worked
by default in UTF-8, so I downloaded and installed it.

Just a quick reminder: UTF-8 is not the same as unicode. Python3 works in
unicode and by default uses UTF-8 to read from or write into files.

Peter
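
The distinction between unicode text and the UTF-8 encoding can be illustrated with a short sketch. The variable names and the choice of the three encodings are only for illustration; the point is that one unicode string has several possible byte representations, and UTF-8 is just one of them.

```python
# Sketch: one unicode string, several byte encodings of it.

text = u"cl\xe9"  # three code points: c, l, and e-acute (U+00E9)

encodings = ["utf-8", "latin-1", "utf-16-le"]
encoded = dict((name, text.encode(name)) for name in encodings)

for name in encodings:
    print("{0}: {1} bytes".format(name, len(encoded[name])))

# Decoding with the matching codec gives the original text back.
round_trips = all(encoded[name].decode(name) == text for name in encodings)

# Decoding with the wrong codec does not fail here, but gives garbage:
# the two UTF-8 bytes of e-acute turn into two separate Latin-1 characters.
mojibake = encoded["utf-8"].decode("latin-1")
print(len(mojibake))
```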

python · Apr 21, 2010

Hi Peter,

Just a quick reminder: UTF-8 is not the same as unicode. Python3 works in unicode and by default uses UTF-8 to read from or write into files.

I'm not the OP, but wanted to make sure I was fully understanding your
point.

Are you saying all open() calls in Python that read text files,
automatically convert UTF-8 content to Unicode in the same manner as the
following might when using Python 2.6?

codecs.open( fileName, mode='r', encoding='UTF8', ... )

Thanks for your feedback,

Malcolm

Peter Otten · Apr 21, 2010

Are you saying all open() calls in Python that read text files,
automatically convert UTF-8 content to Unicode in the same manner as the
following might when using Python 2.6?

codecs.open( fileName, mode='r', encoding='UTF8', ... )

That's what I meant to say, but it's not actually true.

Quoting http://docs.python.org/py3k/library/functions.html#open

"""
open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)

[...]

encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent
(whatever locale.getpreferredencoding() returns), but any encoding supported
by Python can be used. See the codecs module for the list of supported
encodings.
"""

So it just happened to be UTF-8 on my machine.

Peter
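
To see what the quoted documentation means in practice, the platform default can be printed and then bypassed by naming the encoding explicitly. This is a sketch under a few assumptions: the temporary file and the variable names are made up, and `io.open` is used because it exists in Python 2.6 and later and is the same function as the built-in `open()` in Python 3.

```python
import io
import locale
import os
import tempfile

# The encoding Python 3's open() falls back to when none is given.
# It depends on the platform and locale settings.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

text = u"mot-cl\xe9"

# Hypothetical scratch file, removed again at the end.
fd, path = tempfile.mkstemp(suffix=".tex")
os.close(fd)
try:
    # Naming the encoding makes the result independent of the platform default.
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    with io.open(path, "r", encoding="utf-8") as handle:
        read_back = handle.read()   # unicode text, decoded from UTF-8
    with io.open(path, "rb") as handle:
        raw = handle.read()         # the bytes actually stored on disk
finally:
    os.remove(path)

print(read_back == text)
```

Passing `encoding="utf-8"` explicitly is the safer habit whenever the file format is known, since the default may differ between machines.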

python · Apr 21, 2010

Hi Peter,

That's what I meant to say, but it's not actually true.

Thanks for the clarification.

It sounds like Python 3 has unified the standard library open() function
and the codecs.open() into a single function?

In other words, would it be accurate to say that in Python 3, there is
no longer a need to use codecs.open()?

Any idea if the above applies to Python 2.7?

Thank you Peter!

Malcolm

Terry Reedy · Apr 21, 2010

That's what I meant to say, but it's not actually true.

I wish it were, though.

So it just happened to be UTF-8 on my machine.

Unfortunately, it is not on US Windows.

Terry Jan Reedy
