Python UTF-8 and codecs

Mike Currie

I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
them. Every configuration I try gives me a UnicodeError: ascii codec can't
decode byte 0x85 in position 255: ordinal not in range(128)

I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
and that doesn't work, and I've also tried wrapping the file in a utf8_writer
using codecs.lookup('utf8').

Any clues?

Thanks
Mike
 
Dennis Benzinger

Mike said:
> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
> them. Every configuration I try gives me a UnicodeError: ascii codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
>
> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
> and that doesn't work
> [...]

You want to write to a file but you used the 'rU' mode. This should be
'wU'. Don't know if this is the only reason it doesn't work. Could you
show more of your code?


Bye,
Dennis
 
Serge Orlov

> I'm trying to write out files that have utf-8 characters 0x85 and 0x88 in
> them. Every configuration I try gives me a UnicodeError: ascii codec can't
> decode byte 0x85 in position 255: ordinal not in range(128)
>
> I've tried using codecs.open('foo.txt', 'rU', 'utf-8', errors='strict')
> and that doesn't work, and I've also tried wrapping the file in a utf8_writer
> using codecs.lookup('utf8').
>
> Any clues?

Use unicode strings for non-ascii characters. The following program "works":

import codecs

c1 = unichr(0x85)
f = codecs.open('foo.txt', 'wU', 'utf-8')
f.write(c1)
f.close()

But unichr(0x85) is a control character. Are you sure you want it?
What is the encoding of your data?
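
For readers on Python 3, where every str is already a unicode string, the same idea might be sketched as below. This is an illustration under assumptions rather than the code from the thread: write_utf8 and demo_path are hypothetical names, the built-in open() stands in for codecs.open(), and plain 'w' is used because the 'U' flag only ever concerned reading.

```python
import os
import tempfile


def write_utf8(path, text):
    """Write text to path as UTF-8 (hypothetical helper, not from the thread)."""
    # newline="" turns off newline translation, so the text is written as given.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Hypothetical scratch file in a fresh temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "foo.txt")
write_utf8(demo_path, "\u0085")

# Read the file back as raw bytes to see what UTF-8 produced on disk.
with open(demo_path, "rb") as f:
    raw = f.read()

print(raw)
```

The point of reading the bytes back is that U+0085 does not become the single byte 0x85 in a UTF-8 file; it is stored as a two-byte sequence.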
 
Mike Currie

I did make a mistake, it should have been 'wU'.

The starting data is ASCII.

What I'm doing is data processing on files with new line and tab characters
inside quoted fields. The idea is to convert all the new line and tab
characters to 0x85 and 0x88 respectively, then process the files. Finally,
right before importing them into a database, convert them back to new lines
and tabs, thus preserving the field values.

Will python not handle the control characters correctly?
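
The round trip described above could look roughly like this in Python 3. It is a sketch only: protect, restore, PROTECT and RESTORE are hypothetical names, and it assumes the input is pure ASCII (as stated), so the placeholder code points cannot already occur in the data.

```python
# Placeholder code points follow the thread: 0x88 stands in for tab,
# 0x85 stands in for newline.
PROTECT = str.maketrans({"\t": "\u0088", "\n": "\u0085"})
RESTORE = str.maketrans({"\u0088": "\t", "\u0085": "\n"})


def protect(field):
    """Replace tabs and newlines in a field value with the placeholders."""
    return field.translate(PROTECT)


def restore(field):
    """Turn the placeholders back into tabs and newlines."""
    return field.translate(RESTORE)


original = "this\thas\n\ttabs\tand\tline\nbreaks"
protected = protect(original)
restored = restore(protected)
```

Using str.translate keeps both directions in one lookup table each, which makes it easy to check that restore(protect(x)) gives back x before the data goes to the database.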
 
Mike Currie

Okay,

Here is a sample of what I'm doing:


Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> filterMap = {}
>>> for i in range(0,255):
...     filterMap[chr(i)] = chr(i)
...
>>> filterMap[chr(9)] = chr(136)
>>> filterMap[chr(10)] = chr(133)
>>> filterMap[chr(136)] = chr(9)
>>> filterMap[chr(133)] = chr(10)
>>> line = '''this has
... tabs and line
... breaks'''
>>> filteredLine = ''.join([ filterMap[a] for a in line])
>>> import codecs
>>> f = codecs.open('foo.txt', 'wU', 'utf-8')
>>> print filteredLine
thisêhasêàtabsêandêlineàbreaks
>>> f.write(filteredLine)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "C:\Python24\lib\codecs.py", line 501, in write
    return self.writer.write(data)
  File "C:\Python24\lib\codecs.py", line 178, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x88 in position 4: ordinal not in range(128)
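
A note on why an encoding call reports a decode error: in Python 2, handing a byte string to a UTF-8 writer makes Python first convert it to unicode with the default 'ascii' codec, and that hidden step is what fails on byte 0x88. The Python 3 sketch below reproduces that step explicitly. The sample bytes are a shortened stand-in for filteredLine, and decoding as latin-1 is only one assumed way to turn each byte into the code point with the same number.

```python
# Shortened stand-in for the byte string built by the chr()-based filter.
raw = b"this\x88has"

# The implicit step Python 2 performs before encoding: decode as ASCII.
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding explicitly (latin-1 maps byte 0x88 to U+0088) yields text
# that can then be encoded as UTF-8 without complaint.
text = raw.decode("latin-1")
encoded = text.encode("utf-8")
```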
 
Serge Orlov

> Okay,
>
> Here is a sample of what I'm doing:
>
> Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
> Type "help", "copyright", "credits" or "license" for more information.
> >>> filterMap = {}
> >>> for i in range(0,255):
> ...     filterMap[chr(i)] = chr(i)
> ...
> >>> filterMap[chr(9)] = chr(136)
> >>> filterMap[chr(10)] = chr(133)
> >>> filterMap[chr(136)] = chr(9)
> >>> filterMap[chr(133)] = chr(10)

This part is incorrect, it should be:

filterMap = {}
for i in range(0,128):
    filterMap[chr(i)] = chr(i)

filterMap[chr(9)] = unichr(136)
filterMap[chr(10)] = unichr(133)
filterMap[unichr(136)] = chr(9)
filterMap[unichr(133)] = chr(10)
 
Mike Currie

Well, not really. It doesn't affect the result. I still get the error
message. Did you get a different result?


Serge Orlov said:
> > Okay,
> >
> > Here is a sample of what I'm doing:
> >
> > Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)] on win32
> > Type "help", "copyright", "credits" or "license" for more information.
> > >>> filterMap = {}
> > >>> for i in range(0,255):
> > ...     filterMap[chr(i)] = chr(i)
> > ...
> > >>> filterMap[chr(9)] = chr(136)
> > >>> filterMap[chr(10)] = chr(133)
> > >>> filterMap[chr(136)] = chr(9)
> > >>> filterMap[chr(133)] = chr(10)
>
> This part is incorrect, it should be:
>
> filterMap = {}
> for i in range(0,128):
>     filterMap[chr(i)] = chr(i)
>
> filterMap[chr(9)] = unichr(136)
> filterMap[chr(10)] = unichr(133)
> filterMap[unichr(136)] = chr(9)
> filterMap[unichr(133)] = chr(10)
 
Serge Orlov

> Well, not really. It doesn't affect the result. I still get the error
> message. Did you get a different result?

Yes, the program successfully wrote the text file. Without magic abilities
to read the screen of your computer, I guess you now get an exception in
the print statement. That is because you use the legacy Windows console (I use
the unicode-capable console of Lightning Compiler
<http://www.python.org/pypi/Lightning Compiler> to run snippets of
code). You can either change the console, comment out the print statement, or
change your program to print the unicode representation: print
repr(filteredLine)
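
As a small illustration of that last suggestion, here is how printing an escaped representation behaves in Python 3. This is a sketch with a made-up sample string; ascii() is used because it escapes every non-ASCII character, so the output is safe for any console.

```python
# Made-up sample containing the two placeholder control characters.
filtered = "this\u0088has\u0085tabs"

# ascii() returns a quoted, escaped form made only of ASCII characters,
# so printing it cannot trip over the console's encoding.
shown = ascii(filtered)
print(shown)
```

The escaped form also makes the otherwise invisible control characters easy to spot when debugging.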
 
