Using csv.DictReader with \r\n in the middle of fields

pstatham · Oct 13, 2010

Hello everyone!

Hopefully this will interest some of you. I have a CSV file (can be
downloaded from http://www.paulstathamphotography.co.uk/45.txt) which
has five fields separated by ~ delimiters. To read it I've been
using a csv.DictReader, which works in 99% of cases. Occasionally,
however, the description field has errant \r\n characters in the middle
of the record. This causes the reader to assume it's a new record and
try to read it.

Here's the code I had

import csv

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]
delim = '~'

lineReader = csv.DictReader(open('45.txt', 'rbU'),
                            delimiter=delim, fieldnames=fields)

def FormatDate(date):
    return date[6:10] + "-" + date[3:5] + "-" + date[0:2]

channelPrograms = []

for row in lineReader:
    row["DATE"] = FormatDate(row["DATE"])
    channelPrograms.append(row)

When run, this gave me an error because it was trying to pass a
NoneType to the FormatDate function, which obviously couldn't handle it
(the broken lines have fewer than five values, so DATE is left as None).
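
To make the failure concrete, here is a minimal sketch in Python 3 (not the Python 2 of the original post). The sample record is made up and held in memory; it is not taken from 45.txt. It shows how one logical record with a stray \r\n comes back as two rows, with the missing fields filled in as None.

```python
import csv
import io

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]

# Made-up sample: the description is split by a stray \r\n, so one
# logical record arrives as two physical lines.
sample = "Title~Sub~1~First half\r\nsecond half~13/10/2010\r\n"

reader = csv.DictReader(io.StringIO(sample, newline=""),
                        fieldnames=fields, delimiter="~")
rows = list(reader)

for row in rows:
    # Fields with no value are filled with restval, which defaults to None.
    print(row["PROGTITLE"], "|", row["DESCRIPTION"], "|", row["DATE"])
```

The first row has no DATE value and the second row has the tail of the description in PROGTITLE, which is why slicing row["DATE"] fails.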

I'd like to find a way to read this record correctly despite the
\r\n's in the middle of the description. The problem is that I can't
change how the reader decides where a record ends.

For the moment I've had to resort to extending csv.DictReader and
overriding the next() method to record the number of fields and the
number of values. If they're not equal, I don't add those lines to my
list of records.

import csv

class ChanDictReader(csv.DictReader):
    def __init__(self, f, fieldnames=None, restkey=None, restval=None,
                 dialect="excel", *args, **kwds):
        csv.DictReader.__init__(self, f, fieldnames, restkey, restval,
                                dialect, *args, **kwds)
        self.lf = 0  # number of field names
        self.lr = 0  # number of values in the last row read

    def next(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames
        row = self.reader.next()
        self.line_num = self.reader.line_num

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == []:
            row = self.reader.next()
        d = dict(zip(self.fieldnames, row))
        self.lf = len(self.fieldnames)
        self.lr = len(row)
        if self.lf < self.lr:
            d[self.restkey] = row[self.lf:]
        elif self.lf > self.lr:
            for key in self.fieldnames[self.lr:]:
                d[key] = self.restval
        return d

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]
delim = '~'

lineReader = ChanDictReader(open('45.txt', 'rbU'),
                            delimiter=delim, fieldnames=fields)

def FormatDate(date):
    return date[6:10] + "-" + date[3:5] + "-" + date[0:2]

channelPrograms = []

for row in lineReader:
    print ("Number of fields: " + str(lineReader.lf) +
           " Number of values: " + str(lineReader.lr))
    # Keep only rows that have a value for every field.
    if lineReader.lf == lineReader.lr:
        row["DATE"] = FormatDate(row["DATE"])
        channelPrograms.append(row)

Anyone have any ideas?

Paul

Neil Cerutti · Oct 13, 2010

Here's an alternative idea. Working with csv module for this job
is too difficult for me.

import re

record_re = "(?P<PROGTITLE>.*?)~(?P<SUBTITLE>.*?)~(?P<EPISODE>.*?)~(?P<DESCRIPTION>.*?)~(?P<DATE>.*?)\n(.*)"

def parse_file(fname):
    with open(fname) as f:
        data = f.read()
    m = re.match(record_re, data, flags=re.M | re.S)
    while m:
        yield m.groupdict()
        m = re.match(record_re, m.group(6), flags=re.M | re.S)

for record in parse_file('45.txt'):
    print(record)
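
A possible variant of the same idea, sketched in Python 3: compile the pattern once and walk the text with re.finditer instead of re-matching the remainder each time. This is only a sketch under assumptions not stated in the thread: no field contains a ~, the date never contains a line break, and every record ends with \n or \r\n. The names RECORD_RE and parse_text, and the sample data, are made up for illustration.

```python
import re

# Each field is "anything except ~"; only the date also excludes line
# breaks, so a description may span several physical lines.
RECORD_RE = re.compile(
    r"(?P<PROGTITLE>[^~]*)~(?P<SUBTITLE>[^~]*)~(?P<EPISODE>[^~]*)~"
    r"(?P<DESCRIPTION>[^~]*)~(?P<DATE>[^~\r\n]*)\r?\n"
)

def parse_text(data):
    """Yield one dict per record found in the string data."""
    for m in RECORD_RE.finditer(data):
        yield m.groupdict()

# Made-up sample with a stray \r\n inside the first description.
sample = ("Title~Sub~1~First half\r\nsecond half~13/10/2010\r\n"
          "T2~S2~2~Plain~14/10/2010\r\n")

records = list(parse_text(sample))
for record in records:
    print(record)
```

Because every field except the date may contain line breaks here, a record with a missing delimiter would swallow part of the next one, so the output is worth checking against the source.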

Dennis Lee Bieber · Oct 13, 2010

How is the data file being generated? Could the generation procedure
be modified?

While I've not tested it, my understanding of the documentation
indicates that the reader /can/ handle multi-line fields IF QUOTED...
(you may still have to strip the terminator out of the description data
after it has been loaded).

That is:

Some Title~Subtitle~Episode~"A description with<cr><lf>
an embedded new line terminator"~Date

should be properly parsed.
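
A minimal Python 3 sketch of that quoted case, using in-memory data with a made-up date rather than the real file. It assumes the file could be produced with the description wrapped in double quotes, which is not the case for 45.txt as posted.

```python
import csv
import io

fields = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]

# The description is quoted, so the embedded \r\n stays inside the field.
sample = ('Some Title~Subtitle~Episode~"A description with\r\n'
          'an embedded new line terminator"~13/10/2010\r\n')

# newline="" leaves line endings untranslated, as the csv docs recommend.
reader = csv.DictReader(io.StringIO(sample, newline=""),
                        fieldnames=fields, delimiter="~")
rows = list(reader)

# Collapse the embedded terminator (and any other whitespace runs)
# into single spaces once the record has been loaded.
cleaned = " ".join(rows[0]["DESCRIPTION"].split())
print(cleaned)
```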

Tim Chase · Oct 13, 2010

I believe this was fixed in 2.5. The following worked in 2.5, but
2.4 rejected it:

# saved as testr.py
from cStringIO import StringIO
from csv import DictReader

data = StringIO(
    'one,"two two",three\n'
    '"1a\r1b","2a\n2b","3a\r\n3b"\n'
    '"1x\r1y","2x\n2y","3x\r\n3y"\n'
    )

data.reset()
dr = DictReader(data)
for row in dr:
    for k,v in row.iteritems():
        print '%r ==> %r' % (k,v)

tim@rubbish:~/tmp$ python2.5 testr.py
'two two' ==> '2a\n2b'
'three' ==> '3a\r\n3b'
'one' ==> '1a\r1b'
'two two' ==> '2x\n2y'
'three' ==> '3x\r\n3y'
'one' ==> '1x\r1y'
tim@rubbish:~/tmp$ python2.4 testr.py
Traceback (most recent call last):
  File "testr.py", line 12, in ?
    for row in dr:
  File "/usr/lib/python2.4/csv.py", line 109, in next
    row = self.reader.next()
_csv.Error: newline inside string

-tkc

pstatham · Oct 14, 2010

Thanks guys, I can't alter the source data.

I wouldn't have considered regex, but it's a good idea as I can then
define my own record structure instead of the reader dictating to me
what a record is.
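
Another way to define the record structure yourself, sketched in Python 3: glue physical lines together until a record has the four delimiters that five fields need, then hand the repaired lines to csv.DictReader. The helper name logical_records and the sample data are made up for illustration. The sketch assumes that no field contains a ~ and that a stray line break never falls inside the final DATE field.

```python
import csv
import io

FIELDS = ["PROGTITLE", "SUBTITLE", "EPISODE", "DESCRIPTION", "DATE"]
DELIM = "~"

def logical_records(lines, delim=DELIM, nfields=len(FIELDS)):
    """Join physical lines until each record has nfields - 1 delimiters."""
    buf = ""
    for line in lines:
        buf += line.rstrip("\r\n")
        if buf.count(delim) >= nfields - 1:
            yield buf
            buf = ""
        elif buf:
            buf += " "  # replace the stray line break with a space
    if buf.strip():
        yield buf.rstrip()  # incomplete trailing record, passed through

# Made-up sample with a stray \r\n inside the first description.
sample = io.StringIO(
    "Title~Sub~1~First half\r\nsecond half~13/10/2010\r\n"
    "T2~S2~2~Plain~14/10/2010\r\n",
    newline="",
)

# csv.DictReader accepts any iterable of strings, not only file objects.
programs = list(csv.DictReader(logical_records(sample),
                               fieldnames=FIELDS, delimiter=DELIM))
for row in programs:
    print(row)
```

For the real file, the StringIO object would be replaced by something like open('45.txt', newline=''). Rows that still come back with missing values can be skipped as in the original workaround.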
