error when printing a UTF-8 string (python 2.6.2)

Guest

Hello.

I read a string from a UTF-8 file:

import codecs
import sys

fichierLaTeX = codecs.open(sys.argv[1], "r", "utf-8")
s = fichierLaTeX.read()
fichierLaTeX.close()

I can then print the string without error with 'print s'.

Next I parse this string:

def parser(s):
    i = 0
    while i < len(s):
        if s[i:i+1] == '\\':
            i += 1
            if s[i:i+1] == '\\':
                print "backslash"
            elif s[i:i+1] == '%':
                print "pourcentage"
            else:
                if estUnCaractere(s[i:i+1]):
                    motcle = ""
                    while estUnCaractere(s[i:i+1]):
                        motcle += s[i:i+1]
                        i += 1
                    print "mot-clé '"+motcle+"'"

but when I run this code, I get this error:

Traceback (most recent call last):
  File "./versOO.py", line 115, in <module>
    parser(s)
  File "./versOO.py", line 105, in parser
    print "mot-clé '"+motcle+"'"
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

What must I do to solve this?

Thanks!
 
Peter Otten

>>> "mot-clé" + "mot-clé"
'mot-cl\xc3\xa9mot-cl\xc3\xa9'
>>> u"mot-clé" + u"mot-clé"
u'mot-cl\xe9mot-cl\xe9'
>>> "mot-clé" + u"mot-clé"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)

codecs.open().read() returns unicode, but your literals are all bytestrings.
When you are mixing unicode and str Python tries to convert the bytestring
to unicode using the ascii codec, and of course fails for non-ascii
characters.

Change your string literals to unicode by adding the u-prefix and you should
be OK.

Peter
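
A minimal sketch of the failure Peter describes, written so that it also runs on Python 3. It performs by hand the decoding that Python 2 attempts implicitly when a byte string meets a unicode string. The helper name try_decode and the keyword "section" are made up for illustration and do not come from the original script.

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes for "mot-clé": what a non-u literal holds in Python 2
# when the source file is saved as UTF-8.
raw = b"mot-cl\xc3\xa9"


def try_decode(data, encoding):
    """Decode data with the given codec; return None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The ASCII codec rejects byte 0xc3 at position 6, as in the traceback.
ascii_result = try_decode(raw, "ascii")

# The UTF-8 codec turns the 8 bytes into a 7-character unicode string.
utf8_result = try_decode(raw, "utf-8")

# With unicode on both sides, concatenation needs no implicit decoding.
motcle = u"section"  # hypothetical keyword, for illustration only
message = u"mot-cl\xe9 '" + motcle + u"'"
```

The last two lines correspond to the suggested fix: once every literal carries the u-prefix, nothing is left for the ascii codec to decode.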
 
Guest

Thanks, it solved the problem... for a while!

I now need to know whether s[i] gives the next byte or the next
character when I scan the string s. I've googled pages about Python and
unicode, but didn't find a solution to that. I scan the string read from
the file char by char to construct s, but now get the same error when
just trying 'print s'.

Is there a way to tell Python that all strings and characters are to
be treated as UTF-8? I have LC_ALL=en_GB.utf-8 in my shell, but Python
doesn't seem to use this variable?

Thanks!
 
Chris Rebert

Assuming s = fichierLaTeX.read() as in your code snippet, indexing s
gives the next character. When in doubt, check what `type(s)` is: if it's
<type 'str'>, indices are in bytes; if it's <type 'unicode'>, indices are
in code points.

Please give the full stack traceback for your error.

Cheers,
Chris
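
To make the byte versus code point distinction concrete, here is a small sketch that runs on both Python 2 and Python 3. The sample word is taken from the thread; the variable names are only illustrative.

```python
# -*- coding: utf-8 -*-
text = u"mot-cl\xe9"           # unicode text: 7 code points
data = text.encode("utf-8")    # the same text as UTF-8 bytes: 8 bytes

char_count = len(text)         # counts code points
byte_count = len(data)         # counts bytes

# Slicing at index 6 yields the whole character in the unicode string,
# but only the first of its two bytes in the encoded byte string.
last_char = text[6:7]
byte_slice = data[6:7]
```

Slices are used rather than plain indexing because data[6] is a one-byte string on Python 2 but an integer on Python 3.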
 
Guest

Thanks for your insights.

I have taken the easy way out: I read on a page that Python 3 works
in UTF-8 by default, so I downloaded and installed it.

Apart from a few surprises (print is now a function, the rules about
mixing spaces and tabs in indentation are much stricter, and I guess
more is to come :^) everything now works transparently.

Thanks again.
 
Peter Otten

Just a quick reminder: UTF-8 is not the same as unicode. Python3 works in
unicode and by default uses UTF-8 to read from or write into files.

Peter
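
One way to see the difference between unicode and UTF-8 is to encode the same one-character string with several codecs. This is only an illustrative sketch: the choice of the three codecs is arbitrary.

```python
# -*- coding: utf-8 -*-
text = u"\xe9"  # one code point: LATIN SMALL LETTER E WITH ACUTE

# The same abstract character has a different byte form in each encoding;
# UTF-8 is just one of them.
encodings = ["utf-8", "latin-1", "utf-16-le"]
encoded = dict((name, text.encode(name)) for name in encodings)

# Decoding with the matching codec always gives back the same text.
roundtrip_ok = all(encoded[name].decode(name) == text for name in encodings)
```

The unicode string is what the program works with; the byte forms matter only at the boundary, when reading from or writing to a file or terminal.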
 
python

Hi Peter,
I'm not the OP, but wanted to make sure I was fully understanding your
point.

Are you saying that in Python 3 all open() calls that read text files
automatically convert UTF-8 content to Unicode, in the same manner as
the following might when using Python 2.6?

codecs.open( fileName, mode='r', encoding='UTF8', ... )

Thanks for your feedback,

Malcolm
 
Peter Otten

That's what I meant to say, but it's not actually true.

Quoting http://docs.python.org/py3k/library/functions.html#open

"""
open(file, mode='r', buffering=None, encoding=None, errors=None,
newline=None, closefd=True)

[...]

encoding is the name of the encoding used to decode or encode the file. This
should only be used in text mode. The default encoding is platform dependent
(whatever locale.getpreferredencoding() returns), but any encoding supported
by Python can be used. See the codecs module for the list of supported
encodings.
"""

So it just happened to be UTF-8 on my machine.

Peter
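
Since the default depends on the platform, one hedged way to avoid surprises is to pass the encoding explicitly. The sketch below uses io.open, which is available from Python 2.6 onward and is the same function as the built-in open() in Python 3. The helper name roundtrip and the temporary file are assumptions made for the example.

```python
# -*- coding: utf-8 -*-
import io
import locale
import os
import tempfile

# The codec open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)


def roundtrip(text, encoding="utf-8"):
    """Write text to a temporary file and read it back with one codec."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with io.open(path, "w", encoding=encoding) as handle:
            handle.write(text)
        with io.open(path, "r", encoding=encoding) as handle:
            return handle.read()
    finally:
        os.remove(path)


result = roundtrip(u"mot-cl\xe9")
```

Naming the encoding on both the write and the read side makes the result independent of whatever locale.getpreferredencoding() reports on a given machine.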
 
python

Hi Peter,
Thanks for the clarification.

It sounds like Python 3 has unified the standard library open() function
and the codecs.open() into a single function?

In other words, would it be accurate to say that in Python 3, there is
no longer a need to use codecs.open()?

Any idea if the above applies to Python 2.7?

Thank you Peter!

Malcolm
 
Terry Reedy

Peter Otten wrote: "That's what I meant to say, but it's not actually
true."

I wish it were, though.

Peter Otten wrote: "So it just happened to be UTF-8 on my machine."

Unfortunately, it is not on US Windows.

Terry Jan Reedy
 
