Python UTF-8 and codecs

Mike Currie · Jun 27, 2006

I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
them. With every configuration I try, I get a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
and that doesn't work. I've also tried wrapping the file in a utf8_writer
obtained from codecs.lookup('utf8').

Any clues?

Thanks
Mike

Dennis Benzinger · Jun 27, 2006

Mike said:
> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
> them. With every configuration I try, I get a UnicodeError: ascii codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
>
> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict'),
> and that doesn't work
> [...]

You want to write to a file but you used the 'rU' mode. This should be
'wU'. Don't know if this is the only reason it doesn't work. Could you
show more of your code?

Bye,
Dennis

Serge Orlov · Jun 27, 2006

Mike Currie said:
> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
> them. With every configuration I try, I get a UnicodeError: ascii codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
> [...]
> Any clues?

Use unicode strings for non-ascii characters. The following program "works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control character. Are you sure you want it?
What is the encoding of your data?
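A minimal sketch of the approach above, for anyone who wants to check what actually ends up on disk. It assumes plain 'w' mode (the 'U' flag is documented as an input option) and a hypothetical scratch path created with tempfile. The u'' and b'' literals need Python 2.6+ or 3.3+, so this is not the exact 2.4 session from the thread.

```python
import codecs
import os
import tempfile

# U+0085 (NEXT LINE) as a text string, not a byte string.
c1 = u'\x85'

# Hypothetical scratch file so the sketch does not touch a real foo.txt.
path = os.path.join(tempfile.mkdtemp(), 'foo.txt')

# The writer encodes text to UTF-8 on the way out.
f = codecs.open(path, 'w', 'utf-8')
f.write(c1)
f.close()

# Read the raw bytes back to see the encoded form.
with open(path, 'rb') as fh:
    raw = fh.read()

print(repr(raw))
```

The single character U+0085 becomes the two bytes C2 85 in UTF-8, so the file never contains a lone 0x85 byte.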

Mike Currie · Jun 27, 2006

I did make a mistake; it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab characters
inside quoted fields. The idea is to convert all the new line and tab
characters to 0x85 and 0x88 respectively, then process the files. Finally,
right before importing them into a database, I convert them back to new lines
and tabs, thus preserving the field values.

Will Python not handle the control characters correctly?
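The round trip described above can be sketched in memory with text strings and translate(). The table names ESCAPE and UNESCAPE and the two helper functions are hypothetical, introduced only for illustration. The sketch assumes the placeholders U+0085 and U+0088 never occur in the original data.

```python
# Map the code points of tab and newline to the placeholder characters.
ESCAPE = {0x09: 0x88, 0x0A: 0x85}
# Inverse table used to restore the original characters.
UNESCAPE = dict((v, k) for k, v in ESCAPE.items())


def escape_field(text):
    """Replace tab and newline with U+0088 and U+0085."""
    return text.translate(ESCAPE)


def unescape_field(text):
    """Turn the placeholders back into tab and newline."""
    return text.translate(UNESCAPE)


field = u'this has\n\ttabs and line\nbreaks'
escaped = escape_field(field)
restored = unescape_field(escaped)

print(repr(escaped))
print(restored == field)
```

One caveat when processing the escaped text: Python 3's str.splitlines() treats U+0085 as a line boundary, so splitting on an explicit u'\n' is the safer choice for the intermediate files.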

Mike Currie · Jun 27, 2006

Okay,

Here is a sample of what I'm doing:

Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
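The traceback mentions a decode error even though the code is writing. A likely reading is that, in Python 2, a byte string handed to the UTF-8 writer is first decoded implicitly with the default 'ascii' codec, and that step fails on 0x88. The sketch below reproduces that decode step explicitly; the sample bytes are an assumption chosen to put 0x88 at position 4, and the latin-1 decode is only one possible way to turn the bytes into text.

```python
# Byte string with the non-ASCII byte 0x88 at index 4.
data = b'this\x88has'

# The step that Python 2 performs implicitly before encoding to UTF-8.
try:
    data.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Decoding explicitly with a codec that covers 0x88 yields text,
# which can then be encoded as UTF-8 without error.
text = data.decode('latin-1')
encoded = text.encode('utf-8')
print(repr(encoded))
```

The same failure can be avoided by building the string from text characters in the first place, so that no implicit decode is needed.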

Serge Orlov · Jun 27, 2006

Mike Currie said:
> filterMap = {}
> for i in range(0,255):
>     filterMap[chr(i)] = chr(i)
>
> filterMap[chr(9)] = chr(136)
> filterMap[chr(10)] = chr(133)
> filterMap[chr(136)] = chr(9)
> filterMap[chr(133)] = chr(10)

This part is incorrect, it should be:

filterMap = {}
for i in range(0,128):
    filterMap[chr(i)] = chr(i)

filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)

Mike Currie · Jun 28, 2006

Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?

Serge Orlov · Jun 28, 2006

Mike Currie said:
> Well, not really. It doesn't affect the result. I still get the error
> message. Did you get a different result?

Yes, the program successfully wrote the text file. Without magic abilities
to read the screen of your computer, I guess you now get an exception in the
print statement. That is because you use the legacy Windows console (I use
the unicode-capable console of Lightning Compiler
<http://www.python.org/pypi/Lightning Compiler> to run snippets of
code). You can either change the console, comment out the print statement,
or change your program to print the unicode representation: print
repr(filteredLine)
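As a sketch of the last suggestion, the text can be turned into an ASCII-only form before printing, so the console never has to display U+0085 or U+0088. The sample string is an assumption standing in for filteredLine, and the 'backslashreplace' error handler is used here as an alternative to repr().

```python
# Stand-in for filteredLine after the tab/newline substitution.
filtered = u'this\x88has\x85breaks'

# Encode to ASCII, replacing anything outside ASCII with a backslash escape.
safe = filtered.encode('ascii', 'backslashreplace')

# The result contains only ASCII, so any console can show it.
print(safe.decode('ascii'))
```

The file on disk is unaffected by this; only the console output is changed.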
