csv reader

Emmanuel

I have a problem with csv.reader from the csv library. I'm not able to
import accented characters. For example, I'm trying to import a
simple file containing the single word "equação" using the following
code:

import csv
arquivoCSV='test'
a=csv.reader(open(arquivoCSV),delimiter=',')
tab=[]
for row in a:
   tab.append(row)
print tab

As a result, I get:

[['equa\xe7\xe3o']]

How can I solve this problem?
 
Chris Rebert

"""
Note:
This version of the csv module doesn’t support Unicode input. Also,
there are currently some issues regarding ASCII NUL characters.
Accordingly, all input should be UTF-8 or printable ASCII to be safe;
see the examples in section Examples. These restrictions will be
removed in the future.
"""

Thus, you'll have to decode the results into Unicode manually; this
will require knowing what encoding your file is using. Files in some
encodings may not parse correctly due to the aforementioned NUL
problem.
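
A minimal sketch of that manual decoding step, assuming the rows hold byte strings as csv.reader returns them in Python 2. The helper name decode_rows and the sample row are hypothetical, and the encoding has to be the one the file was really saved in:

```python
def decode_rows(rows, encoding):
    """Return a copy of rows with every byte-string cell decoded to unicode."""
    return [[cell.decode(encoding) for cell in row] for row in rows]


# Stand-in for what csv.reader yields for a UTF-8 encoded file (assumed data).
rows = [[b'equa\xc3\xa7\xc3\xa3o', b'PVR 1']]
decoded = decode_rows(rows, 'utf-8')
```

Passing the wrong encoding either raises UnicodeDecodeError or silently produces the wrong characters, which is why the encoding must be known in advance.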

Cheers,
Chris
 
Jerry Hill

I don't think it is a problem. \xe7 is the character ç encoded in
Windows-1252, which is probably the encoding of your csv file. If you
want to convert that to a unicode string, do something like the
following.

s = 'equa\xe7\xe3o'
uni_s = s.decode('Windows-1252')
print uni_s
 
Emmanuel

Then my problem is different!

In fact I'm reading a csv file saved from OpenOffice Calc using
UTF-8 encoding. I get a list of lists (let's call it tab) with the csv
data.
If I do:

print tab[2][4]
In ipython, I get:
equação de Toricelli. Tarefa exercícios PVR 1 e 2 ; PVP 1

If I only do:
tab[2][4]

In ipython, I get:
'equa\xc3\xa7\xc3\xa3o de Toricelli. Tarefa exerc\xc3\xadcios PVR 1 e
2 ; PVP 1'

Does that mean that my problem is not the one I thought it was?

My real problem appears when I use that kind of UTF-8 encoded (?) string
with selenium.
Here is a small example of a non-working case that gives the same
error as my bigger program:


#!/usr/bin/env python
# -*- coding: utf-8 -*-

from selenium import selenium
import sys,os,csv,re


class test:
    '''class for interacting with the academic system'''
    def __init__(self):
        self.webpage=''
        self.arquivo=''
        self.script=[]
        self.sel = selenium('localhost', 4444, '*firefox', 'http://www.google.com.br')
        self.sel.start()
        self.sel.open('/')
        self.sel.wait_for_page_to_load(30000)
        self.sel.type("q", "equação")
        #self.sel.type("q", u"equacao")
        self.sel.click("btnG")
        self.sel.wait_for_page_to_load("30000")


def main():
    teste=test()


if __name__ == "__main__":
    main()



If I just replace the following line:
self.sel.type("q", "equação")

with:
self.sel.type("q", u"equação")


It works fine!
The problem is that csv.reader gives me "equação" and not
u"equação".


Here is the error given by the failing code (with "equação"):
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (1202, 0))

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call
last)

/home/manu/Labo/Cefetes_Colatina/Scripts/
20091215_test_acentuated_caracters.py in <module>()
27
28 if __name__ == "__main__":
---> 29 main()
30
31

/home/manu/Labo/Cefetes_Colatina/Scripts/
20091215_test_acentuated_caracters.py in main()
23
24 def main():
---> 25 teste=test()
26
27

/home/manu/Labo/Cefetes_Colatina/Scripts/
20091215_test_acentuated_caracters.py in __init__(self)
16 self.sel.open('/')
17 self.sel.wait_for_page_to_load(30000)
---> 18 self.sel.type("q", "equação")
19 #self.sel.type("q", u"equacao")
20 self.sel.click("btnG")

/home/manu/Labo/Cefetes_Colatina/Scripts/selenium.pyc in type(self,
locator, value)
588 'value' is the value to type
589 """
--> 590 self.do_command("type", [locator,value,])
591
592

/home/manu/Labo/Cefetes_Colatina/Scripts/selenium.pyc in do_command
(self, verb, args)
201 body = u'cmd=' + urllib.quote_plus(unicode(verb).encode
('utf-8'))
202 for i in range(len(args)):
--> 203 body += '&' + unicode(i+1) + '=' +
urllib.quote_plus(unicode(args).encode('utf-8'))
204 if (None != self.sessionId):
205 body += "&sessionId=" + unicode(self.sessionId)

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
4: ordinal not in range(128)
WARNING: Failure executing file:
<20091215_test_acentuated_caracters.py>
Python 2.6.4 (r264:75706, Oct 27 2009, 06:16:59)
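
The final UnicodeDecodeError is consistent with Python 2 implicitly decoding a byte string as ASCII when unicode() is called on it. A small sketch that reproduces the same message outside selenium, assuming the argument holds the UTF-8 bytes of "equação" (the variable names are only illustrative):

```python
raw = b'equa\xc3\xa7\xc3\xa3o'  # UTF-8 bytes of the word, as a byte string

try:
    # Explicit version of the implicit ASCII decode that unicode(raw) performs
    # in Python 2.
    raw.decode('ascii')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the real encoding of the bytes succeeds.
text = raw.decode('utf-8')
```

Byte 0xc3 at position 4 is the first byte of the UTF-8 sequence for "ç", which ASCII cannot represent.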
 
Emmanuel

As csv.reader does not support utf-8 encoded files, I'm using:

import codecs

# read the file as unicode text and split each line by hand
fp = codecs.open(arquivoCSV, "r", "utf-8")
self.tab=[]
for l in fp:
    l=l.replace('\"','').strip()
    self.tab.append(l.split(','))

It works much better. The remaining problem is that when I do
self.sel.type("q", ustring), where ustring is a unicode string obtained
from the file using the code shown above, I get <sp> instead of a
regular space...
 
Gabriel Genellina

Then my problem is different!

In fact I'm reading a csv file saved from OpenOffice Calc using
UTF-8 encoding. I get a list of lists (let's call it tab) with the csv
data.
If I do:

print tab[2][4]
In ipython, I get:
equação de Toricelli. Tarefa exercícios PVR 1 e 2 ; PVP 1

If I only do:
tab[2][4]

In ipython, I get:
'equa\xc3\xa7\xc3\xa3o de Toricelli. Tarefa exerc\xc3\xadcios PVR 1 e
2 ; PVP 1'

Does that mean that my problem is not the one I'm thinking?

Yes. You have a real problem, but not this one. When you say `print
something`, you get a nice view of `something`, basically the result of
doing `str(something)`. When you type `something` alone in the interpreter,
you get a more formal representation, the result of calling
`repr(something)`:

py> x = "ecuação"
py> print x
ecuação
py> x
'ecua\x87\xc6o'
py> print repr(x)
'ecua\x87\xc6o'

The quotes around the text and the \xNN notation allow for an unambiguous
representation. Two strings may look the same but be different, and
repr shows that.
('ecua\x87\xc6o' is encoded in cp850, a DOS/Windows console code page; you
should see 'equa\xc3\xa7\xc3\xa3o' in utf-8)
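
To illustrate how the same text gives different bytes, and so a different repr, depending on the encoding, here is a short sketch. The three encodings are picked only for comparison and the dictionary name is hypothetical:

```python
word = u'ecua\xe7\xe3o'  # the word "ecuação", written with escapes

# Encode the same unicode text with three different encodings.
encoded = dict((enc, word.encode(enc)) for enc in ('cp850', 'cp1252', 'utf-8'))

for enc in sorted(encoded):
    # repr shows the raw bytes unambiguously, whatever the terminal displays.
    print('%s: %r' % (enc, encoded[enc]))
```

The cp850 result is the byte string shown in the session above, the cp1252 result uses \xe7 and \xe3, and the utf-8 result uses two bytes for each accented letter.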

My real problem appears when I use that kind of UTF-8 encoded (?) string
with selenium.
If I just replace the following line:
self.sel.type("q", "equação")

by:
self.sel.type("q", u"equação")


It works fine!

Yes: you should work with unicode most of the time. The "recipe" for
having as few unicode problems as possible says:

- convert the input data (read from external sources, like a file) from
bytes to unicode, using the (known) encoding of those bytes

- handle unicode internally everywhere in your program

- and convert from unicode to bytes as late as possible, when writing
output (to screen, other files, etc) using the encoding expected by those
external files.

See the Unicode How To: http://docs.python.org/howto/unicode.html
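
The three steps of that recipe could be sketched as follows, assuming a UTF-8 encoded input file. The temporary file and the upper-casing step are only stand-ins for real input and real processing:

```python
import io
import os
import tempfile

# Create a small UTF-8 encoded file to play the role of the external source.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with io.open(path, 'wb') as f:
    f.write(b'equa\xc3\xa7\xc3\xa3o\n')

# 1. Input: convert bytes to unicode using the known encoding.
with io.open(path, 'rb') as f:
    text = f.read().decode('utf-8').strip()

# 2. Internally: work on unicode only (7 characters here, not 9 bytes).
shouted = text.upper()

# 3. Output: convert back to bytes as late as possible.
out = shouted.encode('utf-8')
```

Because the processing in step 2 sees characters rather than bytes, operations such as len() and upper() behave as expected on the accented letters.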
The problem is that the csv.reader does give a "equação" and not a
u"equação"

The csv module cannot handle unicode text directly, but see the last
example in the csv documentation for a simple workaround:
http://docs.python.org/library/csv.html
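
As a rough sketch of such a workaround (not the example from the documentation), the hypothetical helper below lets csv.reader parse the raw UTF-8 bytes and decodes every cell afterwards. The Python 3 branch is an addition beyond the Python 2.6 used in this thread, where the csv module reads text directly:

```python
import csv
import io
import os
import sys
import tempfile


def read_csv_unicode(path, encoding='utf-8', delimiter=','):
    """Read a CSV file and return its rows as lists of unicode strings."""
    if sys.version_info[0] >= 3:
        # Python 3: csv works on text, so let io.open do the decoding.
        with io.open(path, 'r', encoding=encoding, newline='') as f:
            return list(csv.reader(f, delimiter=delimiter))
    # Python 2: csv works on byte strings, so decode each cell afterwards.
    with open(path, 'rb') as f:
        return [[cell.decode(encoding) for cell in row]
                for row in csv.reader(f, delimiter=delimiter)]


# Usage with a throwaway UTF-8 file (assumed sample data).
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'wb') as f:
    f.write(b'equa\xc3\xa7\xc3\xa3o,"PVR 1, 2"\r\n')

tab = read_csv_unicode(path)
```

Unlike splitting each line on commas by hand, this keeps the csv module's handling of quoted fields, so a comma inside quotes stays within one cell.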
 
