Python's CSV reader

Stephan

I'm fairly new to python and am working on parsing some delimited text
files. I noticed that there's a nice CSV reading/writing module
included in the libraries.

My data files, however, are odd in that they are composed of lines with
alternating formats. (Essentially the rows are a header record and a
corresponding detail record on the next line. Each line type has a
different number of fields.)

Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Thanks for your insight,
Stephan
 
Christopher Subich

Stephan said:
Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Well, readlines/split really isn't bad, so long as the file fits
comfortably in memory:

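import csv

fi = open(filename)         # filename: path to your data file
lines = fi.readlines()
evens = iter(lines[0::2])   # lines 0, 2, 4, ... (the header records)
odds = iter(lines[1::2])    # lines 1, 3, 5, ... (the detail records)
csv1 = csv.reader(evens)
csv2 = csv.reader(odds)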

The trick is that the "csvfile" in the CSV object doesn't have to be a
real file, it just has to be an iterator that returns strings. If the
file's too big to fit in memory, you could piece together a pair of
iterators that execute read() on the file appropriately.
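
For a file too large for readlines(), one possible way to build that pair of iterators is sketched below in Python 3. Note that it swaps in itertools.tee and itertools.islice instead of explicit read() calls, so it is a variation on the idea rather than exactly what the post describes. split_alternating and the sample text are hypothetical, and the sketch assumes the two readers are consumed in lockstep so that tee only has to buffer about one line.

```python
import csv
import io
import itertools

def split_alternating(lines, **fmtparams):
    """Return (header_reader, detail_reader) over alternating lines.

    `lines` is any iterable of strings, e.g. an open text file.
    """
    # tee() gives two independent views of the same underlying iterator.
    first, second = itertools.tee(lines)
    evens = itertools.islice(first, 0, None, 2)   # lines 0, 2, 4, ...
    odds = itertools.islice(second, 1, None, 2)   # lines 1, 3, 5, ...
    return csv.reader(evens, **fmtparams), csv.reader(odds, **fmtparams)

# Hypothetical sample input standing in for a large file.
sample = io.StringIO("h1,h2\nd1,d2,d3\nh3,h4\nd4,d5,d6\n")
headers, details = split_alternating(sample)
for header, detail in zip(headers, details):
    print(header, detail)
```

If one reader is drained completely before the other is touched, tee has to hold every skipped line in memory, which defeats the purpose.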
 
Peter Otten

Stephan said:
Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Yes, it can:

import csv
import sys

reader = csv.reader(sys.stdin)

while True:
    try:
        names = reader.next()
        values = reader.next()
    except StopIteration:
        break
    print dict(zip(names, values))

Python offers an elegant way to do the same using the zip() or
itertools.izip() function:

import csv
import sys
from itertools import izip

reader = csv.reader(sys.stdin)

for names, values in izip(reader, reader):
    print dict(izip(names, values))
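
Why this pairs consecutive rows: both arguments are the same iterator, so each tuple is filled by two successive next() calls on it. A small sketch in Python 3, where the built-in zip() is already lazy and stands in for izip; pairwise_rows and the sample text are made up for illustration:

```python
import csv
import io

def pairwise_rows(lines):
    """Group parsed rows two at a time: (row 0, row 1), (row 2, row 3), ..."""
    reader = csv.reader(lines)
    # The same reader object is passed twice, so zip() alternates
    # between "first of the pair" and "second of the pair".
    return list(zip(reader, reader))

# Hypothetical input: a names line followed by a values line, twice.
sample = io.StringIO("name,age\nAnn,30\ncity\nOslo\n")
print(pairwise_rows(sample))
```

With an odd number of lines, the final unpaired row is dropped without any error.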

Now let's add some minimal error checking, and we are done:

import csv
import sys
from itertools import izip, chain

def check_orphan():
    raise Exception("Unexpected end of input")
    yield None

reader = csv.reader(sys.stdin)
for names, values in izip(reader, chain(reader, check_orphan())):
    if len(names) != len(values):
        if len(names) > len(values):
            raise Exception("More names than values")
        else:
            raise Exception("More values than names")
    print dict(izip(names, values))
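
For readers on Python 3, a rough equivalent of the error-checking version might look like the sketch below. It reads from any iterable of lines instead of sys.stdin, raises ValueError rather than a bare Exception, and wraps the loop in a generator; checked_records is a hypothetical name and all three of those choices are assumptions, not part of the original post.

```python
import csv
import io
from itertools import chain

def check_orphan():
    # Only reached when a names line has no values line after it.
    raise ValueError("Unexpected end of input")
    yield  # unreachable, but makes this function a generator

def checked_records(lines):
    """Yield one dict per names/values pair of lines."""
    reader = csv.reader(lines)
    for names, values in zip(reader, chain(reader, check_orphan())):
        if len(names) > len(values):
            raise ValueError("More names than values")
        if len(names) < len(values):
            raise ValueError("More values than names")
        yield dict(zip(names, values))

# Hypothetical input for illustration.
for record in checked_records(io.StringIO("a,b\n1,2\n")):
    print(record)
```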

Peter
 
Andrew McLean

Stephan said:
I'm fairly new to python and am working on parsing some delimited text
files. I noticed that there's a nice CSV reading/writing module
included in the libraries.

My data files, however, are odd in that they are composed of lines with
alternating formats. (Essentially the rows are a header record and a
corresponding detail record on the next line. Each line type has a
different number of fields.)

Can the CSV module be coerced to read two line formats at once or am I
better off using read and split?

Thanks for your insight,
Stephan

The csv module should be suitable. The reader just takes each line,
parses it, then returns a list of strings. It doesn't matter if
different lines have different numbers of fields.

To get an idea of what I mean, try something like the following
(untested):

import csv

reader = csv.reader(open(filename))

while True:

    # Read next "header" line; if there isn't one then exit the loop.
    # (reader.next() signals end of input by raising StopIteration.)
    try:
        header = reader.next()
    except StopIteration:
        break

    # Assume that there is a "detail" line if the preceding
    # "header" line exists
    detail = reader.next()

    # Print the parsed data
    print '-' * 40
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)

You could wrap this up into a class which returns (header, detail) pairs
and does better error handling, but the above code should illustrate the
basics.
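
One way such a wrapper class might look, as a Python 3 sketch. PairReader is a hypothetical name, and raising ValueError for a header with no detail line is just one possible error-handling policy.

```python
import csv
import io

class PairReader:
    """Iterate over (header, detail) pairs of parsed rows."""

    def __init__(self, lines, **fmtparams):
        # fmtparams (delimiter, quotechar, ...) are passed on to csv.reader.
        self._reader = csv.reader(lines, **fmtparams)

    def __iter__(self):
        return self

    def __next__(self):
        # A missing header simply ends the iteration (StopIteration).
        header = next(self._reader)
        try:
            detail = next(self._reader)
        except StopIteration:
            raise ValueError(
                "line %d: header without a detail line" % self._reader.line_num
            ) from None
        return header, detail

# Hypothetical input for illustration.
for header, detail in PairReader(io.StringIO("h1,h2\nd1,d2,d3\n")):
    print(header, detail)
```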
 
Stephan

Thank you all for these interesting examples and methods!

Supposing I want to use DictReader to bring in the CSV lines and tie
them to field names (again, with alternating lines having different
fields), should I use two DictReaders as in Christopher's example
or is there a better way?
 
Peter Otten

Stephan said:
Thank you all for these interesting examples and methods!

You're welcome.

Stephan said:
Supposing I want to use DictReader to bring in the CSV lines and tie
them to field names (again, with alternating lines having different
fields), should I use two DictReaders as in Christopher's example
or is there a better way?

For a clean design you would need not just two DictReader instances, but one
DictReader for every two lines.
However, with the current DictReader implementation, the following works,
too:

import csv
import sys

reader = csv.DictReader(sys.stdin)

for record in reader:
    print record
    reader.fieldnames = None

Peter
 
Andrew McLean

Stephan said:
Thank you all for these interesting examples and methods!

You are welcome. One point: I think there have been at least two
different interpretations of precisely what your task is.

I had assumed that all the different "header" lines contained data for
the same fields in the same order, and similarly that all the "detail"
lines contained data for the same fields in the same order.

However, I think Peter has answered on the basis that you have records
consisting of pairs of lines, the first line being a header containing
field names specific to that record with the second line containing the
corresponding data.

It would help if you let us know which (if any) was correct.
 
Stephan

Andrew said:
You are welcome. One point: I think there have been at least two
different interpretations of precisely what your task is.

I had assumed that all the different "header" lines contained data for
the same fields in the same order, and similarly that all the "detail"
lines contained data for the same fields in the same order.

Indeed, you are correct. Peter's version is interesting in its own
right, but not precisely what I had in mind. However, from his example
I saw what I was missing: I didn't realize that you could reassign the
DictReader field names on the fly. Here is a rudimentary example of my
working code and the data it can parse.

-------------------------------------
John|Smith
Beef|Potatos|Dinner Roll|Ice Cream
Susan|Jones
Chicken|Peas|Biscuits|Cake
Roger|Miller
Pork|Salad|Muffin|Cookies
-------------------------------------

import csv

HeaderFields = ["First Name", "Last Name"]
DetailFields = ["Entree", "Side Dish", "Starch", "Desert"]

reader = csv.DictReader(open("testdata.txt"), [], delimiter="|")

while True:
    try:
        # Read next "header" line (if there isn't one then exit the
        # loop)
        reader.fieldnames = HeaderFields
        header = reader.next()

        # Read the next "detail" line
        reader.fieldnames = DetailFields
        detail = reader.next()

        # Print the parsed data
        print '-' * 40
        print "Header (%d fields): %s" % (len(header), header)
        print "Detail (%d fields): %s" % (len(detail), detail)

    except StopIteration:
        break
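
The same approach of switching field names between lines can be sketched in Python 3 as follows. It assumes, as the code above does, that DictReader's fieldnames attribute may be reassigned between rows. read_orders is a hypothetical name, the field names are copied from the post as written, and the error raised for a header without a detail line is an addition.

```python
import csv
import io

HEADER_FIELDS = ["First Name", "Last Name"]
DETAIL_FIELDS = ["Entree", "Side Dish", "Starch", "Desert"]  # spelling as in the post

def read_orders(lines):
    """Yield (header, detail) dicts from alternating '|'-delimited lines."""
    reader = csv.DictReader(lines, fieldnames=HEADER_FIELDS, delimiter="|")
    while True:
        reader.fieldnames = HEADER_FIELDS
        try:
            header = next(reader)
        except StopIteration:
            return  # no more header lines
        reader.fieldnames = DETAIL_FIELDS
        try:
            detail = next(reader)
        except StopIteration:
            raise ValueError("header line without a detail line") from None
        yield header, detail

# First two lines of the sample data from the post.
sample = io.StringIO(
    "John|Smith\n"
    "Beef|Potatos|Dinner Roll|Ice Cream\n"
)
for header, detail in read_orders(sample):
    print(header)
    print(detail)
```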

Regards,
-Stephan
 
Peter Otten

Stephan said:
Here is a rudimentary example of my working code and the data it can
parse.

-------------------------------------
John|Smith
Beef|Potatos|Dinner Roll|Ice Cream
Susan|Jones
Chicken|Peas|Biscuits|Cake
Roger|Miller
Pork|Salad|Muffin|Cookies
-------------------------------------

That sample data would have been valuable information in your original post.
Here's what becomes of your code if you apply the "zip trick" from my first
post (yes, I am sometimes stubborn):

import itertools
import csv

HeaderFields = ["First Name", "Last Name"]
DetailFields = ["Entree", "Side Dish", "Starch", "Desert"]

instream = open("testdata.txt")

heads = csv.DictReader(instream, HeaderFields, delimiter="|")
details = csv.DictReader(instream, DetailFields, delimiter="|")

for header, detail in itertools.izip(heads, details):
    print "Header (%d fields): %s" % (len(header), header)
    print "Detail (%d fields): %s" % (len(detail), detail)
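
A Python 3 sketch of the same two-reader idea, for comparison. It only works because both DictReader objects pull lines from one shared stream that is its own iterator (an open file or io.StringIO); a plain list of lines would give each reader a separate iterator starting at the first line. read_pairs is a hypothetical name, and the built-in zip() replaces izip.

```python
import csv
import io

HEADER_FIELDS = ["First Name", "Last Name"]
DETAIL_FIELDS = ["Entree", "Side Dish", "Starch", "Desert"]  # spelling as in the post

def read_pairs(stream):
    """Return a list of (header, detail) dicts.

    `stream` must be its own iterator, such as an open text file,
    so that the two readers take turns consuming lines from it.
    """
    heads = csv.DictReader(stream, HEADER_FIELDS, delimiter="|")
    details = csv.DictReader(stream, DETAIL_FIELDS, delimiter="|")
    return list(zip(heads, details))

# First four lines of the sample data from the post.
sample = io.StringIO(
    "John|Smith\n"
    "Beef|Potatos|Dinner Roll|Ice Cream\n"
    "Susan|Jones\n"
    "Chicken|Peas|Biscuits|Cake\n"
)
for header, detail in read_pairs(sample):
    print(header, detail)
```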

Peter
 
