throwing exceptions from csv.DictReader or even csv.reader


Tim

Hullo
CSV is a very common format for publishing data as a form of primitive
integration. It's an annoyingly brittle approach, so I'd like to
ensure that I capture errors as soon as possible, so that I can get
the upstream processes fixed, or at worst put in some correction
mechanisms and avoid getting polluted data into my analyses.

A symptom of several types of errors is that the number of fields
being interpreted varies over a file (e.g. from wrongly embedded quote
strings or mishandled embedded newlines). My preferred approach would
be to get DictReader to throw an exception when encountering such
oddities, but at the moment it seems to try to patch over the error
and fill in the blanks for short lines, or ignore long lines. I know
that I can use the restval parameter and then check for what's been
parsed when I get my results back, but this seems brittle as whatever
I use for restval could legitimately be in the data.
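
For illustration, here is a minimal sketch of the behaviour described above. The sample data and the names SAMPLE, MISSING and EXTRA are made up for the example. It also shows one possible way around the restval worry: a unique object() used as the fill value cannot appear in the parsed data, because the parsed values are strings.

```python
import csv
import io

# Hypothetical sample: three header fields, then one good row,
# one short row and one long row.
SAMPLE = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Short row: missing fields are filled with restval (None by default).
short_row = rows[1]
# Long row: surplus values are collected in a list stored under
# restkey (None by default), so they are kept but easy to overlook.
long_row = rows[2]

# Unique sentinel objects cannot collide with values read from the file.
MISSING = object()
EXTRA = object()
checked = list(
    csv.DictReader(io.StringIO(SAMPLE), restval=MISSING, restkey=EXTRA)
)

# Indexes (0-based, header excluded) of rows with the wrong field count.
bad_rows = [
    i for i, row in enumerate(checked)
    if EXTRA in row or any(value is MISSING for value in row.values())
]
```

This still means checking every row after the fact, which is the step the question is trying to avoid.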

Is there any way to get csv.DictReader to throw an exception on such
simple line errors, or am I going to have to use csv.reader and
explicitly check for the number of fields read in on each line?

cheers

Tim
 
Peter Otten

I think you have to use csv.reader. Untested:

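import csv

def DictReader(f, fieldnames=None, *args, **kw):
    reader = csv.reader(f, *args, **kw)
    if fieldnames is None:
        # Take the header row as the field names.
        fieldnames = next(reader)
    for row in reader:
        if row:  # skip blank lines
            if len(fieldnames) != len(row):
                # reader.line_num counts the source lines read so far
                raise ValueError("wrong number of fields on line %d" % reader.line_num)
            yield dict(zip(fieldnames, row))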
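
A usage sketch of the same idea, restated under the hypothetical name strict_dict_reader so that the block runs on its own. The input strings are made up, and the error message wording is an assumption, not something the csv module produces.

```python
import csv
import io

def strict_dict_reader(f, fieldnames=None, *args, **kw):
    """Yield one dict per row; raise ValueError on a field-count mismatch."""
    reader = csv.reader(f, *args, **kw)
    if fieldnames is None:
        fieldnames = next(reader)  # header row supplies the field names
    for row in reader:
        if not row:
            continue  # csv.reader returns [] for a blank line
        if len(row) != len(fieldnames):
            raise ValueError(
                "line %d: expected %d fields, got %d"
                % (reader.line_num, len(fieldnames), len(row))
            )
        yield dict(zip(fieldnames, row))

# Well-formed input (including a blank line) parses as usual.
good = list(strict_dict_reader(io.StringIO("a,b\n1,2\n\n3,4\n")))

# A short row stops iteration with an error naming the source line.
try:
    list(strict_dict_reader(io.StringIO("a,b\n1,2\n3\n")))
except ValueError as exc:
    error = str(exc)
```

Because this is a generator, the exception is raised only when the bad row is reached, so rows before it will already have been yielded to the caller.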

Peter
 
