Unexpected behaviour of csv module

A

Andrew McLean

I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in
the final field value returned, even though it is outside of the quotes.


I've produced a test program (see below) that demonstrates this. There
is a workaround, which is to not pass the csv reader the file iterator,
but rather a generator that returns lines from the file with the
trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the
field separator. They are also included in the preceding field value,
even if they are outside the quotations. My workaround wouldn't help here.

Anyway is this a bug or a feature? If it is a feature then I'm curious
as to why it is considered desirable behaviour.

- Andrew



import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "wb")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.close()

# Function to test a reader
def read_and_print(reader):
for line in reader:
print ",".join(['"%s"' % field for field in line])

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

# Now the workaround: a generator to strip the strings before the reader
decodes them
def stripped(input):
for line in input:
yield line.strip()
reader = csv.reader(stripped(open("test_data.csv", "rb")))
read_and_print(reader)

# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")
read_and_print(reader)
 
J

John Machin

Andrew said:
I have a bunch of csv files that have the following characteristics:

- field delimiter is a comma
- all fields quoted with double quotes
- lines terminated by a *space* followed by a newline

What surprised me was that the csv reader included the trailing space in
the final field value returned, even though it is outside of the quotes.


I've produced a test program (see below) that demonstrates this. There
is a workaround, which is to not pass the csv reader the file iterator,
but rather a generator that returns lines from the file with the
trailing space stripped.

Interestingly, the same behaviour is seen if there are spaces before the
field separator. They are also included in the preceding field value,
even if they are outside the quotations. My workaround wouldn't help here.

A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]
Anyway is this a bug or a feature? If it is a feature then I'm curious
as to why it is considered desirable behaviour.

IMHO, a bug. In that state, it should be expecting another quotechar, a
delimiter, or a lineterminator. A case could be made for either (a)
ignore space characters (b) raise an exception (c) a or b depending on
an arg ..., ignore_trailing_space=False.

But it gets even more bizarre; see output from revised test script:

DOS_prompt>cat amclean2.py
import csv
filename = "test_data.csv"

# Generate a test file - note the spaces before the newlines
fout = open(filename, "w")
fout.write('"Field1","Field2","Field3" \n')
fout.write('"a","b","c" \n')
fout.write('"d" ,"e","f" \n')
fout.write('"g"xxx,"h" yyy,"i"zzz \n')
fout.write('Fred "Supercoder" Nerk,p,q\n')
fout.write('Fred "Supercoder\' Nerk,p,q\n')
fout.write('Fred \'Supercoder" Nerk,p,q\n')
fout.write('"Fred "Supercoder" Nerk",p,q\n')
fout.write('"Fred "Supercoder\' Nerk",p,q\n')
fout.write('"Fred \'Supercoder" Nerk",p,q\n')
fout.write('"Emoh Ruo", 123 Smith St, Sometown,p,q\n')
fout.write('""Emoh Ruo", 123 Smith St, Sometown","p","q"\n')
fout.close()

# Function to test a reader
def read_and_print(reader):
for line in reader:
# print ",".join(['"%s"' % field for field in line])
# sheesh
print repr(line)

# Read the test file - and print the output
reader = csv.reader(open("test_data.csv", "rb"))
read_and_print(reader)

DOS_prompt>\python25\python amclean2.py
['Field1', 'Field2', 'Field3 ']
['a', 'b', 'c ']
['d ', 'e', 'f ']
['gxxx', 'h yyy', 'izzz ']
['Fred "Supercoder" Nerk', 'p', 'q']
['Fred "Supercoder\' Nerk', 'p', 'q']
['Fred \'Supercoder" Nerk', 'p', 'q']
['Fred Supercoder" Nerk"', 'p', 'q']
['Fred Supercoder\' Nerk"', 'p', 'q']
['Fred \'Supercoder Nerk"', 'p', 'q']
['Emoh Ruo', ' 123 Smith St', ' Sometown', 'p', 'q']
['Emoh Ruo"', ' 123 Smith St', ' Sometown"', 'p', 'q']

Input like the 4th line (and subsequent lines) in the test file can not
have been produced by code which was following the usual algorithm for
quoting CSV fields. Either it is *concatenating* properly-quoted
segments (unlikely) or it is not doing CSV quoting at all or it is
blindly wrapping quotes around the field without doubling internal
quotes.

IMHO such problems should not be silently ignored.
# Try using lineterminator instead - it doesn't work
reader = csv.reader(open("test_data.csv", "rb"), lineterminator=" \r\n")

lineterminator is silently ignored by the reader.

Cheers,
John
 
A

Andrew McLean

John said:
A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]

The problem with removing the spaces after they are received from the
csv reader is if you want to use DictReader. I like to use DictReader,
without passing it the field list. The module then reads the field list
from the first line, and in this situation you don't get an opportunity
to strip the spaces from that.
 
J

John Machin

Andrew said:
John said:
A better workaround IMHO is to strip each *field* after it is received
from the csv reader. In fact, it is very rare that leading or trailing
space in CSV fields is of any significance at all. Multiple spaces
ditto. Just do this all the time:

row = [' '.join(x.split()) for x in row]

The problem with removing the spaces after they are received from the
csv reader is if you want to use DictReader. I like to use DictReader,
without passing it the field list. The module then reads the field list
from the first line, and in this situation you don't get an opportunity
to strip the spaces from that.

You can fix that. The beauty of open source is that you can grab it
(Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
suits your needs. Go fer it!

Cheers,
John
 
S

skip

One could argue that your CSV file is broken. Of course, since CSV is a
poorly specified format, that's a pretty weak statement. I don't remember
just what your original problem was, but it concerned itself with white
space as I recall. Have you tried setting the skipinitialspace parameter in
your call to create a reader object?

Skip
 
J

John Machin

One could argue that your CSV file is broken.

Hi Skip,

His CSV file is mildly broken. The examples that I gave are even more
broken, and are typical of real world files created by clueless
developers from databases which contain quotes and commas in the data
(e.g. addresses). The brokenness is not the point at issue. The point
is that the csv module is weakly silent about the brokenness and in
some cases munges the data even further.
Of course, since CSV is a
poorly specified format, that's a pretty weak statement.

It would help if the csv module docs did specify what format it
expects/allows on reading, and what it does on writing. How to quote a
field properly isn't all that mindbogglingly difficult (leaving out
options like escapechar and more-than-minimal quoting):

qc = quotechar
if qc in field:
out = qc + field.replace(qc, qc+qc) + qc
elif delimiter in field or '\n' in field or '\r' in field:
out = qc + field + qc
else:
out = field

Notice how if there are any special characters in the input, the output
has a quotechar at each end. If not, it's broken, and detectably
broken:
abc"def
___^ unexpected quote inside unquoted field
"abc"def
_____^ after quote, expected quote, delimiter, or end-of-line

I don't remember
just what your original problem was, but it concerned itself with white
space as I recall. Have you tried setting the skipinitialspace parameter in
your call to create a reader object?

The problem has nothing to do with *initial* spaces; the OP's problem
cases involved *trailing* spaces. And that's only a subset of the real
problem: casual attitude towards fields that contain quotes but don't
start and/or end with quotes i.e. they have *not* been created by
applying the usual quoting algorithm to raw data.

HTH,
John
 
A

Andrew McLean

John said:
You can fix that. The beauty of open source is that you can grab it
(Windows: c:\python2?\lib\csv.py (typically)) and hack it about till it
suits your needs. Go fer it!

Unfortunately the bits I should change are in _csv.c and, as I'm not
very proficient at C, that wouldn't be a good idea. Anyway, for the
specific brokenness of my CSV file, the simple workaround from my
original post is fine.
 
J

John Machin

Andrew said:
Unfortunately the bits I should change are in _csv.c

We must be talking about different things. IIRC we were talking about
stripping spaces off each field, in your own code. You said that
DictReader was a problem. This can be fixed by changing each of the 3
occurrences of
row = self.reader.next()
in the DictReader.next method to
row = [x.strip() for x in self.reader.next()]
or, less kludgedly, by defining a function or method to do that.

Cheers,
John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,077
Latest member
SangMoor21

Latest Threads

Top