CSV performance


psaffrey

I'm using the CSV library to process a large amount of data - 28
files, each of 130MB. Just reading in the data from one file and
filing it into very simple data structures (numpy arrays and a
cstringio) takes around 10 seconds. If I just slurp one file into a
string, it only takes about a second, so I/O is not the bottleneck. Is
it really taking 9 seconds just to split the lines and set the
variables?

Is there some way I can improve the CSV performance? Is there a way I
can slurp the file into memory and read it like a file from there?
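
Something like this is what I have in mind (just a sketch, using cStringIO to
wrap the slurped data so csv.reader can treat it as a file):

import csv
import cStringIO

# slurp the whole file, then hand csv.reader an in-memory file object
data = open("largefile.txt", "rb").read()
reader = csv.reader(cStringIO.StringIO(data), delimiter="\t")
for row in reader:
    pass    # filing into the data structures would go here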

Peter
 

John Machin

I'm using the CSV library to process a large amount of data - 28
files, each of 130MB. Just reading in the data from one file and
filing it into very simple data structures (numpy arrays and a
cstringio) takes around 10 seconds. If I just slurp one file into a
string, it only takes about a second, so I/O is not the bottleneck. Is
it really taking 9 seconds just to split the lines and set the
variables?

I'll assume that that's a rhetorical question. Why are you assuming
that it's a problem with the csv module and not with the "filing it
into very simple data structures"? How long does it take just to read
the CSV file, i.e. without setting any of the variables? Have you run
your timing tests multiple times and discarded the first one or two
results?

Is there some way I can improve the CSV performance?

I doubt it.

Is there a way I
can slurp the file into memory and read it like a file from there?

Of course. However, why do you think that the double handling will be
faster? Do you have 130MB of real memory free for the file image?

In order to get some meaningful advice, you will need to tell us:
version of Python, version of numpy, OS, amount of memory on the
machine, what CPU; and supply: a sample of a few lines of a typical CSV
file, plus your code.

Cheers,
John
 

Peter Otten

I'm using the CSV library to process a large amount of data - 28
files, each of 130MB. Just reading in the data from one file and
filing it into very simple data structures (numpy arrays and a
cstringio) takes around 10 seconds. If I just slurp one file into a
string, it only takes about a second, so I/O is not the bottleneck. Is
it really taking 9 seconds just to split the lines and set the
variables?

Is there some way I can improve the CSV performance?

My ideas:

(1) Disable cyclic garbage collection while you read the file into your data
structure:

import gc

gc.disable()
# create many small objects that you want to keep
gc.enable()


(2) If your data contains only numerical data without quotes use

numpy.fromfile()
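
For example (a sketch; in sep mode fromfile reads whitespace-separated numbers
into a flat array, and the file name and three-column reshape are just
placeholders):

import numpy

# text mode: reads whitespace-separated numbers into a flat 1-D array
data = numpy.fromfile("numbers.txt", sep=" ")
data = data.reshape(-1, 3)   # assuming three columns per line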

Peter
 

Tim Chase

I'm using the CSV library to process a large amount of data -
28 files, each of 130MB. Just reading in the data from one
file and filing it into very simple data structures (numpy
arrays and a cstringio) takes around 10 seconds. If I just
slurp one file into a string, it only takes about a second, so
I/O is not the bottleneck. Is it really taking 9 seconds just
to split the lines and set the variables?

You've omitted one important test: spinning through the file
with csv-parsing, but not doing any "filing it into very simple
data structures". Without that metric, there's no way to know
whether the csv module is at fault, or if you're doing something
malperformant with the data-structures.

-tkc
 

psaffrey

Thanks for your replies. Many apologies for not including the right
information first time around. More information is below.

I have tried running it just on the csv read:

import time
import csv

afile = "largefile.txt"

t0 = time.clock()

print "working at file", afile
reader = csv.reader(open(afile, "r"), delimiter="\t")
for row in reader:
    x, y, z = row


t1 = time.clock()

print "finished: %f.2" % (t1 - t0)


$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.860000.2


A tiny bit of background on the final application: this is biological
data from an affymetrix platform. The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1 3754914 1.19828
chr1 3754950 1.56557
chr1 3754982 1.52371

In the "simple data structures" cod below, I do some jiggery pokery
with the chromosome names to save me storing the same string millions
of times.


import csv
import cStringIO
import numpy
import time

afile = "largefile.txt"

chrommap = {'chrY': 'y', 'chrX': 'x', 'chr13': 'c',
'chr12': 'b', 'chr11': 'a', 'chr10': '0',
'chr17': 'g', 'chr16': 'f', 'chr15': 'e',
'chr14': 'd', 'chr19': 'i', 'chr18': 'h',
'chrM': 'm', 'chr22': 'l', 'chr20': 'j',
'chr21': 'k', 'chr7': '7', 'chr6': '6',
'chr5': '5', 'chr4': '4', 'chr3': '3',
'chr2': '2', 'chr1': '1', 'chr9': '9', 'chr8': '8'}


def getFileLength(fh):
    wholefile = fh.read()
    numlines = wholefile.count("\n")
    fh.seek(0)
    return numlines

count = 0
print "reading affy file", afile
fh = open(afile)
n = getFileLength(fh)
chromio = cStringIO.StringIO()
coords = numpy.zeros(n, dtype=int)
points = numpy.zeros(n)

t0 = time.clock()
reader = csv.reader(fh, delimiter="\t")
for row in reader:
    if not row:
        continue
    chrom, coord, point = row
    mappedc = chrommap[chrom]
    chromio.write(mappedc)
    coords[count] = coord
    points[count] = point
    count += 1
t1 = time.clock()

print "finished: %f.2" % (t1 - t0)


$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.540000.2


Thanks again (tugs forelock),

Peter
 

grocery_stocker

My ideas:

(1) Disable cyclic garbage collection while you read the file into your data
structure:

import gc

gc.disable()
# create many small objects that you want to keep
gc.enable()

(2) If your data contains only numerical data without quotes use

numpy.fromfile()

How would disabling the cyclic garbage collection make it go faster in
this case?
 

Peter Otten

grocery_stocker said:
How would disabling the cyclic garbage collection make it go faster in
this case?

When Python creates many objects and doesn't release any, it is assumed that
they are being kept alive by cyclic references. When you know that you
actually want to keep all those objects you can temporarily disable garbage
collection. E.g.:

$ cat gcdemo.py
import time
import sys
import gc


def main(float=float):
    if "-d" in sys.argv:
        gc.disable()
        status = "disabled"
    else:
        status = "enabled"
    all = []
    append = all.append
    start = time.time()
    floats = ["1.234"] * 10
    assert len(set(map(id, map(float, floats)))) == len(floats)
    for _ in xrange(10**6):
        append(map(float, floats))
    print time.time() - start, "(garbage collection %s)" % status


main()

$ python gcdemo.py -d
11.6144971848 (garbage collection disabled)
$ python gcdemo.py
15.5317759514 (garbage collection enabled)

Of course I don't know whether this is actually a problem for the OP's code.

Peter
 

Tim Chase

I have tried running it just on the csv read:
....
print "finished: %f.2" % (t1 - t0)

I presume you wanted "%.2f" here. :)
$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.860000.2

So just the CSV processing of the file takes just shy of 4
seconds and you said that just the pure file-read took about a
second, so that leaves about 3 seconds for CSV processing (or
about 1/3 of the total runtime). In your code example in your
2nd post (with the timing in it), it looks like it took 15+
seconds, meaning the csv code is a mere 1/5 of the runtime. I
also notice that you're reading the file once to find the length,
and reading again to process it.

The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1 3754914 1.19828
chr1 3754950 1.56557
chr1 3754982 1.52371

Depending on the simplicity of the file-format (assuming nothing
like spaces/tabs in the chromosome name, which your dictionary
seems to indicate is the case), it may be faster to use .split()
to do the work:

for line in file(afile):
    a, b, c = line.rstrip('\n\r').split()

The csv module does a lot of smart stuff that it looks like you
may not need.

However, you're still only cutting from that 3-second subset of
your total time. Focusing on the "filing it into very simple
data structures" will likely net you greater improvements. I
don't have much experience with numpy, so I can't offer much to
help. However, rather than reading the file twice, you might try
a general heuristic: assuming lines are at least N characters
(they look like they're each 20 chars + a newline), "filesize/N"
over-estimates the line count and so gives an adequately sized
array (a rough sketch follows below). Using stat() on a file to
get its size will be a heckuva lot faster than reading the whole
file. I also don't know the performance of cStringIO.StringIO()
with lots of appending.
However, since each write is just a character, you might do well
to use the array module (unless numpy also has char-arrays) to
preallocate n chars just like you do with your ints and floats:

chromio[count] = chrommap[chrom]
coords[count] = coord
points[count] = point
count += 1
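
The size-estimate part of that could look something like this (just a sketch:
the 15-bytes-per-line figure is a guessed lower bound from the sample lines
above, so it over-allocates and you would trim to count afterwards):

import os
import numpy
from array import array

afile = "largefile.txt"

# guess the number of lines from the file size; assuming each line is
# at least 15 bytes, this over-allocates rather than under-allocates
est = os.stat(afile).st_size // 15 + 1

coords = numpy.zeros(est, dtype=int)
points = numpy.zeros(est)
chromio = array('c', '\0' * est)   # preallocated one-char-per-row storage

# ... fill coords/points/chromio and count as before, then trim:
# coords = coords[:count]
# points = points[:count]
# chromstring = chromio[:count].tostring()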

Just a few ideas to try.

-tkc
 

Peter Otten

Thanks for your replies. Many apologies for not including the right
information first time around. More information is below.

I have tried running it just on the csv read:
$ ./largefilespeedtest.py
working at file largefile.txt
finished: 3.860000.2


A tiny bit of background on the final application: this is biological
data from an affymetrix platform. The csv files are a chromosome name,
a coordinate and a data point, like this:

chr1 3754914 1.19828
chr1 3754950 1.56557
chr1 3754982 1.52371

In the "simple data structures" cod below, I do some jiggery pokery
with the chromosome names to save me storing the same string millions
of times.
$ ./affyspeedtest.py
reading affy file largefile.txt
finished: 15.540000.2

It looks like most of the time is not spent in the csv.reader().
Here's an alternative way to read your data:

rows = fh.read().split()
coords = numpy.array(map(int, rows[1::3]), dtype=int)
points = numpy.array(map(float, rows[2::3]), dtype=float)
chromio.writelines(map(chrommap.__getitem__, rows[::3]))

Do things improve if you simplify your code like that?

Peter
 

dean

I'm using the CSV library to process a large amount of data - 28
files, each of 130MB. Just reading in the data from one file and
filing it into very simple data structures (numpy arrays and a
cstringio) takes around 10 seconds. If I just slurp one file into a
string, it only takes about a second, so I/O is not the bottleneck. Is
it really taking 9 seconds just to split the lines and set the
variables?

I assume you're reading a 130 MB text file in 1 second only after the OS
has already cached it, so you're not really measuring disk I/O at all.

Parsing a 130 MB text file will take considerable time no matter what.
Perhaps you should consider using a database instead of CSV.
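
If you go that route, a rough sketch with the standard-library sqlite3 module
might look like this (table and file names made up):

import csv
import sqlite3

# bulk-load the tab-separated file into SQLite once, then query it
# later instead of re-parsing the text on every run
conn = sqlite3.connect("probes.db")
conn.execute("CREATE TABLE probes (chrom TEXT, coord INTEGER, point REAL)")
reader = csv.reader(open("largefile.txt", "rb"), delimiter="\t")
conn.executemany("INSERT INTO probes VALUES (?, ?, ?)",
                 (row for row in reader if row))
conn.commit()
conn.close()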
 

Lawrence D'Oliveiro

gc.disable()
# create many small objects that you want to keep
gc.enable()

Every time I see something like this, I feel the urge to save the previous
state and restore it afterwards:

save_enabled = gc.isenabled()
gc.disable()
# create many small objects that you want to keep
if save_enabled :
    gc.enable()
#end if

Maybe that's just me. :)
 

Peter Otten

Lawrence said:
Every time I see something like this, I feel the urge to save the previous
state and restore it afterwards:

save_enabled = gc.isenabled()
gc.disable()
# create many small objects that you want to keep
if save_enabled :
gc.enable()
#end if

Maybe that's just me. :)

There's probably someone out there who does nested GC states on a daily
basis ;)

When I see the sequence

save state
change state
do something
restore state

I feel compelled to throw in a try ... finally

save state
try:
    change state
    do something
finally:
    restore state

which in turn leads to

import gc

from contextlib import contextmanager

@contextmanager
def gcdisabled():
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()

if __name__ == "__main__":
    try:
        with gcdisabled():
            assert not gc.isenabled()
            try:
                with gcdisabled():
                    assert not gc.isenabled()
                    1/0
            finally:
                assert not gc.isenabled()
    except ZeroDivisionError:
        pass
    assert gc.isenabled()

So far, so good. But is it thread-safe?

I think you are beginning to see why the original suggestion was my best
option...

Peter
 

Lawrence D'Oliveiro

When I see the sequence

save state
change state
do something
restore state

I feel compelled to throw in a try ... finally

Yeah, but I try to avoid using exceptions to that extent. :)
 

psaffrey

rows = fh.read().split()
coords = numpy.array(map(int, rows[1::3]), dtype=int)
points = numpy.array(map(float, rows[2::3]), dtype=float)
chromio.writelines(map(chrommap.__getitem__, rows[::3]))

My original version is about 15 seconds. This version is about 9. The
chunks version posted by Scott is about 11 seconds with a chunk size
of 16384.

When integrated into the overall code, reading all 28 files, it
improves the performance by about 30%.

Many thanks to everybody for their help,

Peter
 

Jorgen Grahn

I assume you're reading a 130 MB text file in 1 second only after OS
already cashed it, so you're not really measuring disk I/O at all.

Parsing a 130 MB text file will take considerable time no matter what.
Perhaps you should consider using a database instead of CSV.

Why would that be faster? (Assuming all data is actually read from the
database into data structures in the program, as in the text file
case.)

I am asking because people who like databases tend to overestimate the
time it takes to parse text. (And I guess people like me who prefer
text files tend to underestimate the usefulness of databases.)

/Jorgen
 

Lawrence D'Oliveiro

Jorgen Grahn said:
I am asking because people who like databases tend to overestimate the
time it takes to parse text.

And those of us who regularly load databases from text files, or unload them
in the opposite direction, have a good idea of EXACTLY how long it takes to
parse text. :)
 
