csv reader

Discussion in 'Python' started by Emmanuel, Dec 15, 2009.

  1. Emmanuel

    Emmanuel Guest

    I have a problem with csv.reader from the csv library. I'm not able to
    import accented characters. For example, I'm trying to import a
    simple file containing the single word "equação" using the following
    code:

    import csv
    arquivoCSV='test'
    a=csv.reader(open(arquivoCSV),delimiter=',')
    tab=[]
    for row in a:
        tab.append(row)
    print tab

    As a result, I get:

    [['equa\xe7\xe3o']]

    How can I solve this problem?
     
    Emmanuel, Dec 15, 2009
    #1

  2. Emmanuel

    Chris Rebert Guest

    On Tue, Dec 15, 2009 at 1:24 PM, Emmanuel <> wrote:
    > I have a problem with csv.reader from the csv library. I'm not able to
    > import accented characters. For example, I'm trying to import a
    > simple file containing a single word "equação" using the following
    > code:
    >
    > import csv
    > arquivoCSV='test'
    > a=csv.reader(open(arquivoCSV),delimiter=',')
    > tab=[]
    > for row in a:
    >    tab.append(row)
    > print tab
    >
    > As a result, I get:
    >
    > [['equa\xe7\xe3o']]
    >
    > How can I solve this problem?


    From http://docs.python.org/library/csv.html :

    """
    Note:
    This version of the csv module doesn’t support Unicode input. Also,
    there are currently some issues regarding ASCII NUL characters.
    Accordingly, all input should be UTF-8 or printable ASCII to be safe;
    see the examples in section Examples. These restrictions will be
    removed in the future.
    """

    Thus, you'll have to decode the results into Unicode manually; this
    will require knowing what encoding your file is using. Files in some
    encodings may not parse correctly due to the aforementioned NUL
    problem.
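
For readers on Python 3 (an assumption; the thread itself uses Python 2.6), the decoding step can be handed to `open()` by naming the encoding. The sketch below writes a hypothetical one-cell file in Windows-1252 to a temporary path and reads it back; the sample data and the encoding choice are illustrative only.

```python
import csv
import os
import tempfile

# Hypothetical sample file: one cell, stored as Windows-1252 bytes.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("equação\r\n".encode("windows-1252"))
    path = tmp.name

try:
    # The encoding must match how the file was saved.
    # newline="" lets the csv module handle line endings itself.
    with open(path, encoding="windows-1252", newline="") as handle:
        rows = list(csv.reader(handle, delimiter=","))
finally:
    os.remove(path)

print(rows)
```

Opening the same file with the wrong encoding would either raise `UnicodeDecodeError` or silently produce wrong characters, which is why the encoding has to be known in advance.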

    Cheers,
    Chris
    --
    http://blog.rebertia.com
     
    Chris Rebert, Dec 15, 2009
    #2

  3. Emmanuel

    Jerry Hill Guest

    On Tue, Dec 15, 2009 at 4:24 PM, Emmanuel <> wrote:
    > I have a problem with csv.reader from the csv library. I'm not able to
    > import accented characters. For example, I'm trying to import a
    > simple file containing a single word "equação" using the following
    > code:
    >
    > import csv
    > arquivoCSV='test'
    > a=csv.reader(open(arquivoCSV),delimiter=',')
    > tab=[]
    > for row in a:
    >    tab.append(row)
    > print tab
    >
    > As a result, I get:
    >
    > [['equa\xe7\xe3o']]
    >
    > How can I solve this problem?


    I don't think it is a problem. \xe7 is the character ç encoded in
    Windows-1252, which is probably the encoding of your csv file. If you
    want to convert that to a unicode string, do something like the
    following.

    s = 'equa\xe7\xe3o'
    uni_s = s.decode('Windows-1252')
    print uni_s
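
The same decode step as a Python 3 sketch (an assumption; the snippet above is Python 2, where `'...'` is already a byte string). It also shows that these particular bytes are not valid UTF-8, so a wrong guess about the encoding fails loudly here.

```python
raw = b"equa\xe7\xe3o"  # the bytes shown in the csv.reader output

# Decoding with the matching codec gives readable text.
text = raw.decode("windows-1252")
print(text)

# Decoding with a codec that does not match raises an error for these bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```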

    --
    Jerry
     
    Jerry Hill, Dec 15, 2009
    #3
  4. Emmanuel

    Emmanuel Guest

    Then my problem is different!

    In fact I'm reading a csv file saved from OpenOffice Calc (oocalc) using
    UTF-8 encoding. I get a list of lists (let's call it tab) with the csv
    data.
    If I do:

    print tab[2][4]
    In ipython, I get:
    equação de Toricelli. Tarefa exercícios PVR 1 e 2 ; PVP 1

    If I only do:
    tab[2][4]

    In ipython, I get:
    'equa\xc3\xa7\xc3\xa3o de Toricelli. Tarefa exerc\xc3\xadcios PVR 1 e
    2 ; PVP 1'

    Does that mean that my problem is not the one I think it is?

    My real problem is when I use that kind of UTF-8 encoded (?) string with
    selenium.
    Here is a small code example of a non-working case that gives the same
    error as my bigger program:


    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    from selenium import selenium
    import sys,os,csv,re


    class test:
        '''classe para interagir com o sistema acadêmico'''
        def __init__(self):
            self.webpage=''
            self.arquivo=''
            self.script=[]
            self.sel = selenium('localhost', 4444, '*firefox', 'http://www.google.com.br')
            self.sel.start()
            self.sel.open('/')
            self.sel.wait_for_page_to_load(30000)
            self.sel.type("q", "equação")
            #self.sel.type("q", u"equacao")
            self.sel.click("btnG")
            self.sel.wait_for_page_to_load("30000")


    def main():
        teste=test()


    if __name__ == "__main__":
        main()



    If I just replace the following line:
    self.sel.type("q", "equação")

    with:
    self.sel.type("q", u"equação")


    It works fine!
    The problem is that csv.reader gives me "equação" and not
    u"equação".


    Here is the error produced by the failing code (with "equação"):
    ERROR: An unexpected error occurred while tokenizing input
    The following traceback may be corrupted or invalid
    The error message is: ('EOF in multi-line statement', (1202, 0))

    ---------------------------------------------------------------------------
    UnicodeDecodeError Traceback (most recent call last)

    /home/manu/Labo/Cefetes_Colatina/Scripts/20091215_test_acentuated_caracters.py in <module>()
    27
    28 if __name__ == "__main__":
    ---> 29 main()
    30
    31

    /home/manu/Labo/Cefetes_Colatina/Scripts/20091215_test_acentuated_caracters.py in main()
    23
    24 def main():
    ---> 25 teste=test()
    26
    27

    /home/manu/Labo/Cefetes_Colatina/Scripts/20091215_test_acentuated_caracters.py in __init__(self)
    16 self.sel.open('/')
    17 self.sel.wait_for_page_to_load(30000)
    ---> 18 self.sel.type("q", "equação")
    19 #self.sel.type("q", u"equacao")
    20 self.sel.click("btnG")

    /home/manu/Labo/Cefetes_Colatina/Scripts/selenium.pyc in type(self, locator, value)
    588 'value' is the value to type
    589 """
    --> 590 self.do_command("type", [locator,value,])
    591
    592

    /home/manu/Labo/Cefetes_Colatina/Scripts/selenium.pyc in do_command(self, verb, args)
    201 body = u'cmd=' + urllib.quote_plus(unicode(verb).encode('utf-8'))
    202 for i in range(len(args)):
    --> 203 body += '&' + unicode(i+1) + '=' + urllib.quote_plus(unicode(args[i]).encode('utf-8'))
    204 if (None != self.sessionId):
    205 body += "&sessionId=" + unicode(self.sessionId)

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)
    WARNING: Failure executing file: <20091215_test_acentuated_caracters.py>
    Python 2.6.4 (r264:75706, Oct 27 2009, 06:16:59)
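
The error message can be reproduced in isolation. In Python 2, calling `unicode()` on a byte string decodes it with the ASCII codec by default, and the UTF-8 bytes for "ç" start with 0xc3, which is outside ASCII. The sketch below uses Python 3 (an assumption, chosen so it runs on current interpreters), where the same decode has to be written out explicitly.

```python
utf8_bytes = "equação".encode("utf-8")  # b'equa\xc3\xa7\xc3\xa3o'

# Decoding as ASCII fails at the first non-ASCII byte (index 4).
try:
    utf8_bytes.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)
print(message)

# Decoding with the codec that matches the bytes succeeds.
decoded = utf8_bytes.decode("utf-8")
print(decoded)
```

This is consistent with the `u"equação"` variant working: the value is already text, so no implicit ASCII decode takes place.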
     
    Emmanuel, Dec 15, 2009
    #4
  5. Emmanuel

    Emmanuel Guest

    As csv.reader does not decode UTF-8 encoded files into unicode strings,
    I'm using:

    import codecs

    fp = codecs.open(arquivoCSV, "r", "utf-8")
    self.tab = []
    for l in fp:
        # drop the quote characters and surrounding whitespace, then split on commas
        l = l.replace('"', '').strip()
        self.tab.append(l.split(','))

    It works much better. The remaining problem is that when I call
    self.sel.type("q", ustring), where ustring is a unicode string obtained
    from the file with the code shown above, I get <sp> instead of a
    regular space...
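
One caveat about the manual replace-and-split approach above: it treats every comma as a separator, including commas inside quoted fields. The sketch below (Python 3, with a made-up sample line) compares it with `csv.reader` on such a field. It does not address the `<sp>` issue.

```python
import csv

# Hypothetical line: the first field is quoted and contains a comma.
line = '"equação de Torricelli, parte 1",PVR 1\r\n'

# Manual approach: strip quotes, then split on every comma.
naive = line.replace('"', "").strip().split(",")

# csv.reader respects the quoting and keeps the field whole.
parsed = next(csv.reader([line]))

print(naive)
print(parsed)
```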
     
    Emmanuel, Dec 16, 2009
    #5
  6. On Tue, 15 Dec 2009 19:12:01 -0300, Emmanuel <> wrote:

    > Then my problem is different!
    >
    > In fact I'm reading a csv file saved from OpenOffice Calc (oocalc) using
    > UTF-8 encoding. I get a list of lists (let's call it tab) with the csv
    > data.
    > If I do:
    >
    > print tab[2][4]
    > In ipython, I get:
    > equação de Toricelli. Tarefa exercícios PVR 1 e 2 ; PVP 1
    >
    > If I only do:
    > tab[2][4]
    >
    > In ipython, I get:
    > 'equa\xc3\xa7\xc3\xa3o de Toricelli. Tarefa exerc\xc3\xadcios PVR 1 e
    > 2 ; PVP 1'
    >
    > Does that mean that my problem is not the one I think it is?


    Yes. You have a real problem, but not this one. When you say `print
    something`, you get a nice view of `something`, basically the result of
    doing `str(something)`. When you say `something` alone in the interpreter,
    you get a more formal representation, the result of calling
    `repr(something)`:

    py> x = "ecuação"
    py> print x
    ecuação
    py> x
    'ecua\x87\xc6o'
    py> print repr(x)
    'ecua\x87\xc6o'

    The '' around the text and the \xNN notation allow for an unambiguous
    representation. Two strings may look the same but be different, and
    repr shows that.
    ('ecua\x87\xc6o' is encoded in cp850, a Windows console code page; you
    should see 'equa\xc3\xa7\xc3\xa3o' in utf-8)
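
A Python 3 sketch (an assumption; the session above is Python 2) of the same distinction. It prints the `str` and `repr` views of a text string, then the byte sequences that three codecs produce for the same word, which are the three escape patterns that appear in this thread.

```python
word = "equação"

# str() is what print shows; repr() is what the bare interpreter prompt shows.
print(str(word))
print(repr(word))

# The same text maps to different bytes depending on the codec.
encoded = {codec: word.encode(codec) for codec in ("utf-8", "windows-1252", "cp850")}
for codec, raw in encoded.items():
    print(codec, repr(raw))
```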

    > My real problem is when I use that kind of UTF-8 encoded (?) string with
    > selenium.
    > If I just replace the following line:
    > self.sel.type("q", "equação")
    >
    > with:
    > self.sel.type("q", u"equação")
    >
    >
    > It works fine!


    Yes: you should work with unicode most of the time. The "recipe" for
    having as few unicode problems as possible says:

    - convert the input data (read from external sources, like a file) from
    bytes to unicode, using the (known) encoding of those bytes

    - handle unicode internally everywhere in your program

    - and convert from unicode to bytes as late as possible, when writing
    output (to screen, other files, etc) using the encoding expected by those
    external files.
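
The three steps above can be sketched as a single function. This is a minimal Python 3 illustration; the function name `shout`, the upper-casing step, and the chosen encodings are all hypothetical stand-ins for whatever a real program does.

```python
def shout(raw: bytes, in_encoding: str, out_encoding: str) -> bytes:
    """Decode on input, work on text, encode on output."""
    text = raw.decode(in_encoding)      # 1. bytes -> text, using the known input encoding
    text = text.upper()                 # 2. all processing happens on text
    return text.encode(out_encoding)    # 3. text -> bytes, as late as possible


# Input arrives as Windows-1252 bytes; output is requested as UTF-8 bytes.
result = shout(b"equa\xe7\xe3o", "windows-1252", "utf-8")
print(result.decode("utf-8"))
```

Keeping both conversions at the edges means the middle of the program never has to know which encoding the data came from.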

    See the Unicode How To: http://docs.python.org/howto/unicode.html

    > The problem is that csv.reader gives me "equação" and not
    > u"equação".


    The csv module cannot handle unicode text directly, but see the last
    example in the csv documentation for a simple workaround:
    http://docs.python.org/library/csv.html
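
The documentation example is not reproduced here. As a rough Python 3 sketch of the general idea (decode at the boundary, then let csv work on text), the hypothetical helper `decoded_csv_rows` below wraps any iterable of byte lines. In Python 3 the csv module works on text already, so the only job left is the decode step.

```python
import csv
import io


def decoded_csv_rows(byte_lines, encoding="utf-8", **reader_kwargs):
    """Yield csv rows as lists of text, given an iterable of byte lines.

    Line-by-line decoding assumes an encoding in which the newline byte is
    unambiguous (UTF-8 or a single-byte code page), not UTF-16.
    """
    text_lines = (line.decode(encoding) for line in byte_lines)
    for row in csv.reader(text_lines, **reader_kwargs):
        yield row


# Hypothetical UTF-8 data standing in for a file opened in binary mode.
raw = io.BytesIO("equação,1\r\nexercícios,2\r\n".encode("utf-8"))
rows = list(decoded_csv_rows(raw, delimiter=","))
print(rows)
```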

    --
    Gabriel Genellina
     
    Gabriel Genellina, Dec 16, 2009
    #6
