csv and mixed lists of unicode and numbers


Sibylle Koczian

Hello,

I want to put data from a database into a tab separated text file. This
looks like a typical application for the csv module, but there is a
snag: the rows I get from the database module (kinterbasdb in this case)
contain unicode objects and numbers. And of course the unicode objects
contain lots of non-ascii characters.

If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
the UnicodeWriter from the module documentation, I get TypeErrors with
the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
would cause other complications.)

So do I have to process the rows myself and treat numbers and text
fields differently? Or what's the best way?

Here is a small example:

########################################################################
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import csv, codecs, cStringIO
import tempfile

cData = [u'Ärger', u'Ödland', 5, u'Süßigkeit', u'élève', 6.9, u'forêt']

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

def writewithcsv(outfile, datalist):
    wrt = csv.writer(outfile, dialect=csv.excel)
    wrt.writerow(datalist)

def writeunicode(outfile, datalist):
    wrt = UnicodeWriter(outfile)
    wrt.writerow(datalist)

def main():
    with tempfile.NamedTemporaryFile() as csvfile:
        print "CSV file:", csvfile.name
        print "Try with csv.writer"
        try:
            writewithcsv(csvfile, cData)
        except UnicodeEncodeError as e:
            print e
        print "Try with UnicodeWriter"
        writeunicode(csvfile, cData)
        print "Ready."

if __name__ == "__main__":
    main()


##############################################################################
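[Editor's note: the UnicodeEncodeError above is the usual implicit-ASCII failure; it can be reproduced in isolation. Python 3 syntax is used here so the snippet runs standalone, but the 2.x exception is the same.]

```python
# Reproduce the core failure on its own: non-ASCII text cannot be
# encoded to ASCII, which is what Python 2's csv.writer implicitly
# attempts for unicode fields.
try:
    "Ärger".encode("ascii")
    msg = None
except UnicodeEncodeError as exc:
    msg = str(exc)

print(msg)
```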

Hoping for advice,

Sibylle
 

Benjamin Kaplan

Hello,

I want to put data from a database into a tab separated text file. This
looks like a typical application for the csv module, but there is a
snag: the rows I get from the database module (kinterbasdb in this case)
contain unicode objects and numbers. And of course the unicode objects
contain lots of non-ascii characters.

If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
the UnicodeWriter from the module documentation, I get TypeErrors with
the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
would cause other complications.)

So do I have to process the rows myself and treat numbers and text
fields differently? Or what's the best way?

[example code snipped]
   def writerow(self, row):
       self.writer.writerow([s.encode("utf-8") for s in row])

Try doing:

    [s.encode("utf-8") if isinstance(s, unicode) else s for s in row]

That way, you'll only encode the unicode strings.
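[Editor's note: Benjamin's conditional comprehension, sketched standalone. Python 3 spelling is used so it runs as-is, with str standing in for 2.x unicode; the point is the same — numbers pass through untouched while text fields get encoded.]

```python
row = ["Ärger", 5, 6.9]

# Encode only the text fields; leave numbers alone so csv can
# stringify them itself.
encoded = [s.encode("utf-8") if isinstance(s, str) else s for s in row]
print(encoded)
```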
 

Peter Otten

Sibylle said:
I want to put data from a database into a tab separated text file. This
looks like a typical application for the csv module, but there is a
snag: the rows I get from the database module (kinterbasdb in this case)
contain unicode objects and numbers. And of course the unicode objects
contain lots of non-ascii characters.

If I try to use csv.writer as is, I get UnicodeEncodeErrors. If I use
the UnicodeWriter from the module documentation, I get TypeErrors with
the numbers. (I'm using Python 2.6 - upgrading to 3.1 on this machine
would cause other complications.)

So do I have to process the rows myself and treat numbers and text
fields differently? Or what's the best way?

I'd preprocess the rows as I tend to prefer the simplest approach I can come
up with. Example:

def recode_rows(rows, source_encoding, target_encoding):
    def recode(field):
        if isinstance(field, unicode):
            return field.encode(target_encoding)
        elif isinstance(field, str):
            return unicode(field, source_encoding).encode(target_encoding)
        return unicode(field).encode(target_encoding)

    return (map(recode, row) for row in rows)

rows = [[1.23], [u"äöü"], [u"ÄÖÜ".encode("latin1")], [1, 2, 3]]
writer = csv.writer(sys.stdout)
writer.writerows(recode_rows(rows, "latin1", "utf-8"))

The only limitation I can see: target_encoding probably has to be a superset
of ASCII.
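[Editor's note: for readers on 3.x, here is a sketch of Peter's recode_rows translated to Python 3, where str and bytes take the roles that unicode and str play in 2.x. The names are kept from his version; everything else is an assumption of this translation.]

```python
def recode_rows(rows, source_encoding, target_encoding):
    """Python 3 rendering of the 2.x recode_rows sketch: text is
    encoded, raw bytes are transcoded, anything else is stringified."""
    def recode(field):
        if isinstance(field, str):
            return field.encode(target_encoding)
        elif isinstance(field, bytes):
            return field.decode(source_encoding).encode(target_encoding)
        return str(field).encode(target_encoding)
    return ([recode(field) for field in row] for row in rows)

rows = [[1.23], ["äöü"], ["ÄÖÜ".encode("latin1")], [1, 2, 3]]
recoded = [row for row in recode_rows(rows, "latin1", "utf-8")]
print(recoded)
```

Note that in 3.x the csv module itself wants text, not bytes, so the recoded rows would go to a binary stream directly rather than through csv.writer; the sketch only demonstrates the per-field type dispatch.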

Peter
 

Sibylle Koczian

Peter said:
I'd preprocess the rows as I tend to prefer the simplest approach I can come
up with. Example:

def recode_rows(rows, source_encoding, target_encoding):
    def recode(field):
        if isinstance(field, unicode):
            return field.encode(target_encoding)
        elif isinstance(field, str):
            return unicode(field, source_encoding).encode(target_encoding)
        return unicode(field).encode(target_encoding)

    return (map(recode, row) for row in rows)

For this case isinstance really seems to be quite reasonable. And it was
silly of me not to think of sys.stdout as file object for the example!
rows = [[1.23], [u"äöü"], [u"ÄÖÜ".encode("latin1")], [1, 2, 3]]
writer = csv.writer(sys.stdout)
writer.writerows(recode_rows(rows, "latin1", "utf-8"))

The only limitation I can see: target_encoding probably has to be a superset
of ASCII.

Coping with umlauts and accents is quite enough for me.

This problem really goes away with Python 3 (tried it on another
machine), but something else changes too: in Python 2.6 the
documentation for the csv module explicitly says "If csvfile is a file
object, it must be opened with the ‘b’ flag on platforms where that
makes a difference." The documentation for Python 3.1 doesn't have this
sentence, and if I do that in Python 3.1 I get for all sorts of data,
even for a list with only one integer literal:

TypeError: must be bytes or buffer, not str

I don't really understand that.

Regards,
Sibylle
 

Peter Otten

Sibylle said:
This problem really goes away with Python 3 (tried it on another
machine), but something else changes too: in Python 2.6 the
documentation for the csv module explicitly says "If csvfile is a file
object, it must be opened with the ‘b’ flag on platforms where that
makes a difference." The documentation for Python 3.1 doesn't have this
sentence, and if I do that in Python 3.1 I get for all sorts of data,
even for a list with only one integer literal:

TypeError: must be bytes or buffer, not str

Read the documentation for open() at

http://docs.python.org/3.1/library/functions.html#open

There are significant changes with respect to 2.x; you won't even get a
file object anymore: in 3.x open() returns an object from the io module
instead.

If you specify the "b" flag in 3.x the write() method expects bytes, not
str. The translation of newlines is now controlled by the "newline"
argument.
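[Editor's note: Peter's point can be checked directly. A Python 3 file opened in binary mode rejects str outright, which is exactly what csv.writer hands it.]

```python
import tempfile

# A file opened with "b" in Python 3 accepts only bytes; passing str
# raises TypeError, the same error csv.writer triggers against a
# binary file.
with tempfile.TemporaryFile(mode="wb") as f:
    try:
        f.write("not bytes")
        error = None
    except TypeError as exc:
        error = str(exc)

print(error)
```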

Peter
 

Terry Reedy

Sibylle said:
[earlier discussion snipped]

This problem really goes away with Python 3 (tried it on another
machine), but something else changes too: in Python 2.6 the
documentation for the csv module explicitly says "If csvfile is a file
object, it must be opened with the ‘b’ flag on platforms where that
makes a difference." The documentation for Python 3.1 doesn't have this
sentence, and if I do that in Python 3.1 I get for all sorts of data,
even for a list with only one integer literal:

TypeError: must be bytes or buffer, not str

I don't really understand that.

In Python 3, a file opened in 'b' mode is for reading and writing bytes
with no encoding/decoding. I believe csv works with files in text mode
as it returns and expects strings/text for reading and writing. Perhaps
the csv doc should say it must not be opened in 'b' mode. Not sure.

tjr
 

Sibylle Koczian

Terry said:
In Python 3, a file opened in 'b' mode is for reading and writing bytes
with no encoding/decoding. I believe csv works with files in text mode
as it returns and expects strings/text for reading and writing. Perhaps
the csv doc should say it must not be opened in 'b' mode. Not sure.

I think that might really be better, because for version 2.6 they
explicitly stated 'b' mode was necessary. The results I couldn't
understand, even after reading the documentation for open():
>>> import csv
>>> acsv = open(r"d:\home\sibylle\temp\tmp.csv", "wb")
>>> row = [b"abc", b"def", b"ghi"]
>>> wtr = csv.writer(acsv)
>>> wtr.writerow(row)
Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    wtr.writerow(row)
TypeError: must be bytes or buffer, not str

Same error message with row = [5].

But I think I understand it now: csv.writer takes mixed lists of
text and numbers - that's exactly why I like to use it - so it has to
convert them before writing. And it converts them into text, never
bytes, even for a file opened in 'b' mode. Right?
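[Editor's note: that reading matches what 3.x actually does. Given a text stream, the mixed list from the start of the thread goes through unchanged, with csv.writer stringifying the numbers itself.]

```python
import csv
import io

# csv.writer converts every field to text before writing, so mixed
# lists of strings and numbers need no preprocessing in Python 3.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t")
writer.writerow(["Ärger", 5, 6.9])
result = buf.getvalue()
print(result)
```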

Thank you, everybody, for explaining.

Sibylle
 
