UnicodeDecodeError quick question

patrick.waldo

Hi Everyone,

I am using Python 2.4 and I am converting an Excel spreadsheet to a
pipe-delimited text file, and some of the cells contain non-ASCII
characters. I solved this problem in a very unintuitive way and I
wanted to ask why. If I do,

csvfile.write(cell.encode("utf-8"))

I get a UnicodeDecodeError. However, if I do,

c = unicode(cell.encode("utf-8"),"utf-8")
csvfile.write(c)

it works. Why should I have to encode the cell to utf-8 and then make
it unicode in order to write to a text file? Is there a more intuitive
way to get around these bothersome unicode errors?

Thanks for any advice,
Patrick

Code:

# -*- coding: utf-8 -*-
import xlrd,codecs,os

xls_file = "/home/pwaldo2/work/docpool_plone/2008-12-4/EU-2008-12-4.xls"
book = xlrd.open_workbook(xls_file)
bibliography_sheet = book.sheet_by_index(0)

csv = os.path.split(xls_file)[0] + '/' + os.path.split(xls_file)[1][:-4] + '.csv'
csvfile = codecs.open(csv,'w',encoding='utf-8')

rowcount = 0
data = []
while rowcount<bibliography_sheet.nrows:
    data.append(bibliography_sheet.row_values(rowcount,
        start_colx=0,end_colx=None))
    rowcount+=1
for row in data:
    for cell in row:
        #csvfile.write(cell.encode("utf-8")) This causes the UnicodeDecodeError
        c = unicode(cell.encode("utf-8"),"utf-8")
        csvfile.write(c)
        csvfile.write('|')
    csvfile.write('\r\n')
csvfile.close()
 
Tim Golden

The short answer is that you're writing to a file
you've opened with the codecs module. Any write to
this file expects unicode data and will automatically
encode it to the encoding you specified. You're trying
to send it UTF-8-encoded data -- i.e. a string of bytes,
*not* unicode -- and it presumably tries to decode it
to a unicode object before encoding it as UTF-8 like
you asked it to. Without looking at the implementation,
it probably just does unicode(x) on what you've passed
in, which will use the default ascii codec and fail in
the way you saw.

(Honestly, that was the short answer).
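
A minimal sketch of that failure, with the implicit step written out
by hand. The cell value is made up for illustration, and the explicit
decode("ascii") call stands in for the conversion I'm guessing the
codecs writer performs; it is not copied from the implementation.
Spelling it out this way also lets the snippet run on later Pythons.

```python
# -*- coding: utf-8 -*-
# Hypothetical cell value containing one non-ASCII character (e-acute).
cell = u"caf\xe9"

# Encoding gives a byte string: the e-acute becomes the two bytes C3 A9.
encoded = cell.encode("utf-8")

# Assumption: handing those bytes to a codecs-opened file makes Python 2
# decode them with the default ascii codec first. Done explicitly here:
error_name = None
try:
    encoded.decode("ascii")
except UnicodeDecodeError:
    # 0xC3 is outside the ASCII range, so the decode step fails.
    error_name = "UnicodeDecodeError"

print(error_name)
```

Note that it is a *decode* error even though you only asked for an
encode, which is what makes the traceback look so odd.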

To solve it, assuming cell is already unicode, just pass
it unadulterated to csvfile.write.
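
Something along these lines, as a sketch only. The rows, the temporary
output path and the use of join are my own illustration rather than
your code, and it assumes every cell is already a unicode string.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

# Hypothetical stand-in for the rows read from the spreadsheet.
rows = [[u"caf\xe9", u"M\xfcnchen"], [u"plain", u"text"]]

# Hypothetical output location in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.csv")

csvfile = codecs.open(path, "w", encoding="utf-8")
try:
    for row in rows:
        # Pass unicode straight through; the file object does the encoding.
        csvfile.write(u"|".join(row))
        csvfile.write(u"\r\n")
finally:
    csvfile.close()

# Read the raw bytes back to see what actually landed on disk.
f = open(path, "rb")
try:
    raw = f.read()
finally:
    f.close()
```

Unlike your loop, join puts the pipe between cells only, so there is no
trailing '|' at the end of each line. Keep your own loop if you need
the trailing delimiter.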

The reason the other thing works is that you're in
control of the -- unnecessary -- unicode conversion, and
you're telling Python what encoding to use for decoding
and encoding.
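
To see that the round trip is a no-op, here is a small check with a
made-up cell value. It uses .decode("utf-8"), which on Python 2 does the
same job as unicode(x, "utf-8"), so the snippet also runs on later
Pythons.

```python
# -*- coding: utf-8 -*-
# Hypothetical cell value, already unicode.
cell = u"caf\xe9"

# Encode to UTF-8 bytes, then decode with the same codec named explicitly.
encoded = cell.encode("utf-8")
round_trip = encoded.decode("utf-8")

# The result equals the original, so the conversion adds nothing.
unchanged = (round_trip == cell)
print(unchanged)
```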

TJG
 
