CSV module: incorrectly parsed file.

Discussion in 'Python' started by Christopher Barrington-Leigh, Feb 18, 2008.

  1. Here is a file "test.csv"
    number,name,description,value
    1,"wer","tape 2"",5
    1,vvv,"hoohaa",2

    I want to convert it to tab-separated without those silly quotes. Note
    in the second line that a field is 'tape 2"' , ie two inches: there is
    a double quote in the string.

    When I use csv module to read this:


    import sys
    outf=open(sys.argv[1]+'.tsv','wt')
    import csv
    reader=csv.reader(open(sys.argv[1], "rb"))
    for row in reader:
    outf.write('\t'.join([rr.strip() for rr in row]) +'\n')


    it mangles it, messing up the double double-quote.
    Can anyone help me? How do I use CSV to get it right?
    Tjhanks!
    c
    Christopher Barrington-Leigh, Feb 18, 2008
    #1
    1. Advertising

  2. >Here is a file "test.csv"
    >number,name,description,value
    >1,"wer","tape 2"",5
    >1,vvv,"hoohaa",2
    >
    >I want to convert it to tab-separated without those silly quotes. Note
    >in the second line that a field is 'tape 2"' , ie two inches: there is
    >a double quote in the string.


    The input format is ambiguous - how is the parser to distinguish between
    a double-quote in the field, and the double-quote that delimits the
    field? Excel would have written that field as "tape 2""" (it doubles
    double-quotes that appear within a field).

    You can turn off the double-double-quote handling by passing
    "doublequote=False" to the parser, but the results still might not be
    what you want (because the format is ambiguous).


    --
    Andrew McNamara, Senior Developer, Object Craft
    http://www.object-craft.com.au/
    Andrew McNamara, Feb 18, 2008
    #2
    1. Advertising

  3. Christopher Barrington-Leigh

    Paul McGuire Guest

    On Feb 17, 8:09 pm, Christopher Barrington-Leigh
    <> wrote:
    > Here is a file "test.csv"
    > number,name,description,value
    > 1,"wer","tape 2"",5
    > 1,vvv,"hoohaa",2
    >
    > I want to convert it to tab-separated without those silly quotes. Note
    > in the second line that a field is 'tape 2"' , ie two inches: there is
    > a double quote in the string.
    >


    What is needed to disambiguate this data is to only accept closing
    quotes if they are followed by a comma or the end of the line. In
    pyparsing, you can define your own quoted string format. Here is one
    solution using pyparsing. At the end, you can extract the data by
    field name, and print it out however you choose:

    data = """\
    number,name,description,value
    1,"wer","tape 2"",5
    1,vvv,"hoohaa",2"""


    from pyparsing import *

    # very special definition of a quoted string, that ends with a " only
    if
    # followed by a , or the end of line
    quotedString = ('"' +
    ZeroOrMore(CharsNotIn('"')|('"' + ~FollowedBy(','|lineEnd))) +
    '"')
    quotedString.setParseAction(keepOriginalText, removeQuotes)
    integer = Word(nums).setParseAction(lambda toks:int(toks[0]))
    value = integer | quotedString | Word(printables.replace(",",""))

    # first pass, just parse the comma-separated values
    for line in data.splitlines():
    print delimitedList(value).parseString(line)
    print

    # now second pass, assign field names using names from first line
    names = data.splitlines()[0].split(',')
    def setValueNames(tokens):
    for k,v in zip(names,tokens):
    tokens[k] = v
    lineDef = delimitedList(value).setParseAction(setValueNames)

    # parse each line, and extract data by field name
    for line in data.splitlines()[1:]:
    results = lineDef.parseString(line)
    print "Desc:", results.description
    print results.dump()


    Prints:
    ['number', 'name', 'description', 'value']
    [1, 'wer', 'tape 2"', 5]
    [1, 'vvv', 'hoohaa', 2]

    Desc: tape 2"
    [1, 'wer', 'tape 2"', 5]
    - description: tape 2"
    - name: wer
    - number: 1
    - value : 5
    Desc: hoohaa
    [1, 'vvv', 'hoohaa', 2]
    - description: hoohaa
    - name: vvv
    - number: 1
    - value : 2

    -- Paul
    Paul McGuire, Feb 18, 2008
    #3
  4. Christopher Barrington-Leigh

    7stud Guest

    On Feb 17, 7:09 pm, Christopher Barrington-Leigh
    <> wrote:
    > Here is a file "test.csv"
    > number,name,description,value
    > 1,"wer","tape 2"",5
    > 1,vvv,"hoohaa",2
    >
    > I want to convert it to tab-separated without those silly quotes. Note
    > in the second line that a field is 'tape 2"' , ie two inches: there is
    > a double quote in the string.
    >
    > When I use csv module to read this:
    >
    > import sys
    > outf=open(sys.argv[1]+'.tsv','wt')
    > import csv
    > reader=csv.reader(open(sys.argv[1], "rb"))
    > for row in reader:
    >     outf.write('\t'.join([rr.strip() for rr in row]) +'\n')
    >
    > it mangles it, messing up the double double-quote.
    > Can anyone help me? How do I use CSV to get it right?
    > Tjhanks!
    > c



    Try this:

    infile = open('data.txt')
    outfile = open('outfile.txt', 'w')

    for line in infile:
    pieces = line.strip().split(',')

    data = []
    for piece in pieces:
    if piece[0] == '"':
    data.append(piece[1:-2])
    else:
    data.append(piece)

    out_line = '%s\n' % '\t'.join(data)
    outfile.write(out_line)
    7stud, Feb 18, 2008
    #4
  5. Christopher Barrington-Leigh

    7stud Guest

    On Feb 17, 9:11 pm, 7stud <> wrote:
    > On Feb 17, 7:09 pm, Christopher Barrington-Leigh
    >
    >
    >
    > <> wrote:
    > > Here is a file "test.csv"
    > > number,name,description,value
    > > 1,"wer","tape 2"",5
    > > 1,vvv,"hoohaa",2

    >
    > > I want to convert it to tab-separated without those silly quotes. Note
    > > in the second line that a field is 'tape 2"' , ie two inches: there is
    > > a double quote in the string.

    >
    > > When I use csv module to read this:

    >
    > > import sys
    > > outf=open(sys.argv[1]+'.tsv','wt')
    > > import csv
    > > reader=csv.reader(open(sys.argv[1], "rb"))
    > > for row in reader:
    > >     outf.write('\t'.join([rr.strip() for rr in row]) +'\n')

    >
    > > it mangles it, messing up the double double-quote.
    > > Can anyone help me? How do I use CSV to get it right?
    > > Tjhanks!
    > > c

    >
    > Try this:
    >
    > infile = open('data.txt')
    > outfile = open('outfile.txt', 'w')
    >
    > for line in infile:
    >     pieces = line.strip().split(',')
    >
    >     data = []
    >     for piece in pieces:
    >         if piece[0] == '"':
    >             data.append(piece[1:-2])
    >         else:
    >             data.append(piece)
    >
    >     out_line = '%s\n' % '\t'.join(data)
    >     outfile.write(out_line)


    Whoops. The line:

    data.append(piece[1:-2])

    should be:

    data.append(piece[1:-1])
    7stud, Feb 18, 2008
    #5
  6. Christopher Barrington-Leigh

    Steve Holden Guest

    7stud wrote:
    > On Feb 17, 9:11 pm, 7stud <> wrote:
    >> On Feb 17, 7:09 pm, Christopher Barrington-Leigh
    >>
    >>
    >>
    >> <> wrote:
    >>> Here is a file "test.csv"
    >>> number,name,description,value
    >>> 1,"wer","tape 2"",5
    >>> 1,vvv,"hoohaa",2
    >>> I want to convert it to tab-separated without those silly quotes. Note
    >>> in the second line that a field is 'tape 2"' , ie two inches: there is
    >>> a double quote in the string.
    >>> When I use csv module to read this:
    >>> import sys
    >>> outf=open(sys.argv[1]+'.tsv','wt')
    >>> import csv
    >>> reader=csv.reader(open(sys.argv[1], "rb"))
    >>> for row in reader:
    >>> outf.write('\t'.join([rr.strip() for rr in row]) +'\n')
    >>> it mangles it, messing up the double double-quote.
    >>> Can anyone help me? How do I use CSV to get it right?
    >>> Tjhanks!
    >>> c

    >> Try this:
    >>
    >> infile = open('data.txt')
    >> outfile = open('outfile.txt', 'w')
    >>
    >> for line in infile:
    >> pieces = line.strip().split(',')
    >>
    >> data = []
    >> for piece in pieces:
    >> if piece[0] == '"':
    >> data.append(piece[1:-2])
    >> else:
    >> data.append(piece)
    >>
    >> out_line = '%s\n' % '\t'.join(data)
    >> outfile.write(out_line)

    >
    > Whoops. The line:
    >
    > data.append(piece[1:-2])
    >
    > should be:
    >
    > data.append(piece[1:-1])
    >

    Even when you have done all this you will still have problems. As Andrew
    pointed out the form is ambiguous, and you'd just better hope none of
    your data items look like

    Nails 2", soldiers for the use of

    because then you will be completely screwed. So there's a need for a
    certain amount of visual scrutiny of the data: I would definitely write
    a validation program first that tries to read the data and catches any
    exceptions like unmatched quotes or the wrong number of items in a line.
    If there aren't too many (and there usually aren't) just edit them out
    of your input data by hand.

    If this is to be a regular task then you'll have to program to recognize
    and correct the common error cases.

    regards
    Steve

    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Steve Holden, Feb 18, 2008
    #6
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Katrin Tomanek

    Java and huge XML file to be parsed

    Katrin Tomanek, Jun 17, 2004, in forum: Java
    Replies:
    47
    Views:
    1,700
    Dale King
    Apr 15, 2006
  2. Ben
    Replies:
    4
    Views:
    258
  3. Tintin92
    Replies:
    1
    Views:
    1,692
    Andrew Thompson
    Feb 14, 2007
  4. jliu66
    Replies:
    0
    Views:
    504
    jliu66
    Oct 19, 2007
  5. Li Chen
    Replies:
    18
    Views:
    653
    Azmi Farih
    Mar 23, 2010
Loading...

Share This Page