csv.reader has trouble with comma inside quotes inside brackets

Discussion in 'Python' started by Bret, Jun 9, 2009.

  1. Bret

    Bret Guest

    i have a csv file like so:
    row1,field1,[field2][text in field2 "quote, quote"],field3,field
    row2,field1,[field2]text in field2 "quote, quote",field3,field

    using csv.reader to read the file, the first row is broken into two
    fields:
    [field2][text in field2 "quote
    and
    quote"

    while the second row is read correctly with:
    [field2]text in field2 "quote, quote"
    being one field.

    any ideas how to make csv.reader work correctly for the first case?
    the problem is the comma inside the quote inside the brackets, ie:
    [","]
     
    Bret, Jun 9, 2009
    #1
    1. Advertising

  2. Bret

    John Machin Guest

    Bret <bretrouse <at> gmail.com> writes:

    >
    > i have a csv file like so:
    > row1,field1,[field2][text in field2 "quote, quote"],field3,field
    > row2,field1,[field2]text in field2 "quote, quote",field3,field
    >
    > using csv.reader to read the file, the first row is broken into two
    > fields:
    > [field2][text in field2 "quote
    > and
    > quote"
    >
    > while the second row is read correctly with:
    > [field2]text in field2 "quote, quote"
    > being one field.
    >
    > any ideas how to make csv.reader work correctly for the first case?
    > the problem is the comma inside the quote inside the brackets, ie:
    > [","]


    I can't reproduce the behaviour that you describe based on these reasonable
    assumptions: Python 2.6, delimiter=',', quotechar='"':

    C:\junk\weird_csv>\python26\python
    Python 2.6.2 (r262:71605, Apr 14 2009, 22:40:02) [MSC v.1500 32 bit (Intel)] on
    win32
    Type "help", "copyright", "credits" or "license" for more information.
    |>>> import pprint, csv
    |>>> open('weird.csv', 'rb').read()
    'row1,field1,[field2][text in field2 "quote, quote"],field3,field\r\nrow2,field1
    ,[field2]text in field2 "quote, quote",field3,field\r\n'
    >>> pprint.pprint(list(csv.reader(open('weird.csv', 'rb'))))

    [['row1',
    'field1',
    '[field2][text in field2 "quote',
    ' quote"]',
    'field3',
    'field'],
    ['row2',
    'field1',
    '[field2]text in field2 "quote',
    ' quote"',
    'field3',
    'field']]
    |>>>

    As you can see, it treats both lines in the same way. Opening the file with each
    of Microsoft Excel 2003, OpenOffice.org Calc 3, and Gnumeric has exactly the
    same effect.

    The problem is that your data has not been prepared using the generally accepted
    rules for quoting in CSV files:
    [pseudocode]
    if field contains any of quotechar, delimiter, newline, maybe others:
    field = (quotechar
    + field.replace(quotechar, quotechar + quotechar)
    + quotechar)
    which would change your first line from
    row1,field1,[field2][text in field2 "quote, quote"],field3,field
    to
    row1,field1,"[field2][text in field2 ""quote, quote""]",field3,field

    There's no option in the csv module to get around this, AFAICT. You'd have to
    roll your own, something along these lines (assumes no embedded newlines etc):

    8<--- parse_unquoted_csv.py
    def parse_unquoted_csv(line, delimiter=',', quotechar='"'):
    line = line.strip()
    inside_quotes = False
    fields = []
    field = ''
    for c in line:
    if inside_quotes:
    if c == quotechar:
    inside_quotes = False
    field += c
    else:
    if c == delimiter:
    fields.append(field)
    field = ''
    else:
    if c == quotechar:
    inside_quotes = True
    field += c
    if inside_quotes:
    print repr(line)
    print fields
    print repr(field)
    raise Exception("Quotes not balanced")
    fields.append(field)
    return fields

    if __name__ == "__main__":
    tests = [
    'row1,field1,[field2][text in field2 "quote, quote"],field3,field',
    'row2,field1,[field2]text in field2 "quote, quote",field3,field',
    'a,b,c',
    'a,b,',
    '',
    'Look,here"s,a,typo',
    ]
    for test in tests:
    print repr(test)
    print parse_unquoted_csv(test)
    print
    8<---

    HTH,
    John
     
    John Machin, Jun 9, 2009
    #2
    1. Advertising

  3. Bret

    Bret Guest

    Thanks John,

    I didn't realize that the quotes were supposed to surround the entire
    field. I ended up making a quick script to replace comma's outside
    quotes with tabs. I was just trying to clean this crazy "csv" file to
    import into msyql.

    thanks again,

    bret
     
    Bret, Jun 9, 2009
    #3
  4. Bret

    Terry Reedy Guest

    Bret wrote:
    > i have a csv file like so:
    > row1,field1,[field2][text in field2 "quote, quote"],field3,field
    > row2,field1,[field2]text in field2 "quote, quote",field3,field
    >
    > using csv.reader to read the file, the first row is broken into two
    > fields:
    > [field2][text in field2 "quote
    > and
    > quote"
    >
    > while the second row is read correctly with:
    > [field2]text in field2 "quote, quote"
    > being one field.
    >
    > any ideas how to make csv.reader work correctly for the first case?
    > the problem is the comma inside the quote inside the brackets, ie:
    > [","]


    When posting, give version, minimum code that has problem, and actual
    output. Cut and past latter two. Reports are less credible otherwise.

    Using 3.1rc1

    txt = [
    '''row1,field1,[field2][text in field2 "quote, quote"],field3,field''',
    '''row2,field1,[field2] text in field2 "quote, quote", field3,field''',
    '''row2,field1, field2 text in field2 "quote, quote", field3,field''',
    ]
    import csv
    for row in csv.reader(txt): print(len(row),row)

    produces

    6 ['row1', 'field1', '[field2][text in field2 "quote', ' quote"]',
    field3', 'field']
    6 ['row2', 'field1', '[field2] text in field2 "quote', ' quote"', '
    field3', 'field']
    6 ['row2', 'field1', ' field2 text in field2 "quote', ' quote"', '
    field3', 'field']

    In 3.1 at least, the presence or absence of brackets is irrelevant, as I
    expected it to be. For double quotes to protect the comma delimiter,
    the *entire field* must be quoted, not just part of it.

    If you want to escape the delimiter without quoting entire fields, use
    an escape char and change the dialect. For example

    txt = [
    '''row1,field1,[field2][text in field2 "quote`, quote"],field3,field''',
    '''row2,field1,[field2] text in field2 "quote`, quote", field3,field''',
    '''row2,field1, field2 text in field2 "quote`, quote", field3,field''',
    ]
    import csv
    for row in csv.reader(txt, quoting=csv.QUOTE_NONE, escapechar = '`'):
    print(len(row),row)

    produces what you desire

    5 ['row1', 'field1', '[field2][text in field2 "quote, quote"]',
    'field3', 'field']
    5 ['row2', 'field1', '[field2] text in field2 "quote, quote"', '
    field3', 'field']
    5 ['row2', 'field1', ' field2 text in field2 "quote, quote"', '
    field3', 'field']


    Terry Jan Reedy
     
    Terry Reedy, Jun 9, 2009
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. bartek
    Replies:
    1
    Views:
    7,948
    Unforgiven
    May 26, 2004
  2. Guilherme Grillo

    reader inside a reader

    Guilherme Grillo, Nov 7, 2007, in forum: ASP .Net
    Replies:
    5
    Views:
    528
    sloan
    Nov 7, 2007
  3. sso
    Replies:
    20
    Views:
    2,673
    Martin Gregorie
    Apr 26, 2009
  4. Tim
    Replies:
    1
    Views:
    321
    Peter Otten
    Jul 5, 2010
  5. Peter Stacy
    Replies:
    1
    Views:
    106
    John W. Krahn
    Nov 8, 2009
Loading...

Share This Page