Processing large CSV files - how to maximise throughput?

Victor Hooi

Hi,

We have a directory of large CSV files that we'd like to process in Python.

We process each input CSV, then generate a corresponding output CSV file.

input CSV -> munging text, lookups etc. -> output CSV

My question is, what's the most Pythonic way of handling this? (Which I'm assuming is something along the lines of the below?)

For the reading, I'd do something like this:

import csv

with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
    csv_reader = csv.DictReader(infile)
    # DictWriter needs the output column names up front; reusing the
    # input's columns here, assuming the output has the same layout
    csv_writer = csv.DictWriter(outfile, fieldnames=csv_reader.fieldnames)
    csv_writer.writeheader()
    for row in csv_reader:
        # Do some processing for that row...
        result = process_line(row)
        # Write the processed row to the output file
        csv_writer.writerow(result)

So for the reading, it'll iterate over the lines one by one, and won't read the whole file into memory, which is good.

For the writing - my understanding is that it writes a line to the file object each loop iteration, however, this will only get flushed to disk every now and then, based on my system default buffer size, right?

So if the output file is going to get large, there isn't anything I need to take into account for conserving memory?

Also, if I'm trying to maximise throughput of the above, is there anything I could try? The processing in process_line is quite light - just a bunch of string splits and regexes.

If I have multiple large CSV files to deal with, and I'm on a multi-core machine, is there anything else I can do to boost throughput?

Cheers,
Victor
 
Dave Angel

> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?

No, the system will flush so often that you'll never use much memory.

> Also, if I'm trying to maximise throughput of the above, is there anything
> I could try? The processing in process_line is quite light - just a bunch
> of string splits and regexes.

If you want help optimizing process_line(), you'd have to show us the
source. For the regex, you can precompile it and not have to build it
each time. Or just write the equivalent Python code, which many times
is faster than a regex.
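
For example, an untested sketch of both ideas (the pattern and helper names below are invented, since we haven't seen process_line):

import re

# Compile the pattern once, at module level, instead of passing a string
# pattern to re.match()/re.search() on every line.
PART_RE = re.compile(r'^(\w+)-(\d+)$')   # made-up pattern

def split_code(value):
    match = PART_RE.match(value)
    if match:
        return match.group(1), match.group(2)
    return value, None

# For simple cases, plain string methods are often faster still:
def split_code_plain(value):
    prefix, sep, number = value.partition('-')
    return (prefix, number) if sep else (value, None)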

> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?

Start multiple processes. For what you're doing, there's probably no
point in multithreading.

And as always, in performance tuning, you never know till you measure.
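
A rough, untested sketch of the multi-process version, using multiprocessing.Pool from the standard library (the file naming scheme, the input glob, and the process_line body are placeholders):

import csv
import glob
import multiprocessing

def process_line(row):                  # stand-in for the real munging
    return row

def process_file(in_path):
    out_path = in_path[:-4] + '.out.csv'        # made-up naming scheme
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        reader = csv.DictReader(infile)
        writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            writer.writerow(process_line(row))
    return out_path

if __name__ == '__main__':
    files = glob.glob('incoming/*.csv')         # made-up input location
    pool = multiprocessing.Pool()               # one worker per core by default
    try:
        for done in pool.imap_unordered(process_file, files):
            print('finished %s' % done)
    finally:
        pool.close()
        pool.join()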
 
Steven D'Aprano

> We have a directory of large CSV files that we'd like to process in
> Python.
>
> We process each input CSV, then generate a corresponding output CSV
> file.
>
> input CSV -> munging text, lookups etc. -> output CSV
>
> My question is, what's the most Pythonic way of handling this?


Start with the simplest thing that could work:

for infile, outfile in zip(input_list, output_list):
    for line in infile:
        munge line
        write to outfile

sort of thing. If and only if it isn't fast enough, then try to speed it up.
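
Filled in for a whole directory, that might look something like this (the directory names and the munge stand-in are invented):

import glob
import os

def munge(line):                     # stand-in for the real per-line work
    return line

input_list = sorted(glob.glob(os.path.join('incoming', '*.csv')))
output_list = [os.path.join('outgoing', os.path.basename(p)) for p in input_list]

for in_path, out_path in zip(input_list, output_list):
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in infile:
            outfile.write(munge(line))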

> For the reading, I'd do something like this:
>
> import csv
>
> with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
>     csv_reader = csv.DictReader(infile)
>     csv_writer = csv.DictWriter(outfile, fieldnames=csv_reader.fieldnames)
>     csv_writer.writeheader()
>     for row in csv_reader:
>         result = process_line(row)
>         csv_writer.writerow(result)

Looks good to me!


> So for the reading, it'll iterate over the lines one by one, and won't
> read the whole file into memory, which is good.
>
> For the writing - my understanding is that it writes a line to the file
> object each loop iteration, however, this will only get flushed to disk
> every now and then, based on my system default buffer size, right?

Have you read the csv_writer documentation?

http://docs.python.org/2/library/csv.html#writer-objects

Unfortunately it is pretty light documentation, but it seems that
writer.writerow *probably* just calls write on the underlying file object
immediately.

But there's really no way to tell when the data hits the disk platter:
the Python file object could be doing caching, the OS could be doing
caching, the file system could be doing caching, and even the disk itself
could be doing caching. Really, the only way to be sure that the data has
hit the disk platter is to call os.sync(), and even then some hard drives
lie and report that they're synced when in fact the data is still in
volatile cache. Bad hard drive, no biscuit.

But, really, do you need to care? Better to buy better hard drives (e.g.
server grade), or use software RAID with two different brands (so their
failure characteristics will be different), or just be prepared to re-
process a batch of files if the power goes out mid-run. (You do have a
UPS, don't you?)

Anyway, I wouldn't bother about calling os.sync directly, the OS will
sync when it needs to. But if you need it, it's there. Or you can call
flush() on the output file, which is a bit less invasive than calling
os.sync. But really I wouldn't bother. Let the OS handle it.
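
If you ever do want an explicit checkpoint for one file, the usual pairing is flush() followed by os.fsync() on its file descriptor -- a minimal sketch:

import os

def checkpoint(fileobj):
    # Push data through Python's buffer, then ask the OS to flush its own
    # cache for this file towards the drive.
    fileobj.flush()
    os.fsync(fileobj.fileno())

But as above, for this job I wouldn't bother.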

> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?

I shouldn't think so.

> Also, if I'm trying to maximise throughput of the above, is there
> anything I could try? The processing in process_line is quite light -
> just a bunch of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?

Since this is likely to be I/O-bound, you could use threads. Each thread
is responsible for reading the file, processing it, then writing it back
again. Have you used threads before? If not, start here:

http://www.ibm.com/developerworks/aix/library/au-threadingpython/
http://pymotw.com/2/threading/


If the amount of processing required becomes heavier, and the task
becomes CPU-bound, you can either:

- move to an implementation of Python without the GIL, like
IronPython or Jython;

- or use multiprocessing.
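
A rough sketch of the thread-per-file idea using concurrent.futures (in the standard library from Python 3.2, or via the 'futures' backport on 2.x); process_file here is a stand-in for the per-file read/munge/write work, and if the job turns out to be CPU-bound, swapping ThreadPoolExecutor for ProcessPoolExecutor is the only change needed:

import glob
from concurrent.futures import ThreadPoolExecutor   # or ProcessPoolExecutor

def process_file(path):              # stand-in for the real per-file work
    return path

paths = glob.glob('incoming/*.csv')  # made-up input location

with ThreadPoolExecutor(max_workers=4) as pool:
    for result in pool.map(process_file, paths):
        print('finished %s' % result)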
 
Steven D'Aprano

> Start multiple processes. For what you're doing, there's probably no
> point in multithreading.

Since the bottleneck will probably be I/O, reading and writing data from
files, I expect threading actually may help.
 
Dave Angel

> Since the bottleneck will probably be I/O, reading and writing data from
> files, I expect threading actually may help.

We approach the tradeoff from opposite sides. I would use
multiprocessing to utilize multiple cores unless the communication costs
(between the processes) would get too high.

They won't in this case.

But I would concur -- probably they'll both give about the same speedup.
I just detest the pain that multithreading can bring, and tend to avoid
it if at all possible.
 
Chris Angelico

> But I would concur -- probably they'll both give about the same speedup.
> I just detest the pain that multithreading can bring, and tend to avoid
> it if at all possible.

I don't have a history of major pain from threading. Is this a Python
thing, or have I just been really really fortunate (growing up on OS/2
rather than Windows has definitely been, for me, a major boon)?
Generally, I find threads to be convenient, though of course not
always useful (especially in serialized languages).

ChrisA
 
Stefan Behnel

Chris Angelico, 25.10.2013 08:13:
> I don't have a history of major pain from threading. Is this a Python
> thing, or have I just been really really fortunate

Likely the latter. Threads are ok if what they do is essentially what you
could easily use multiple processes for as well, i.e. process independent
data, maybe from/to independent files etc., using dedicated channels for
communication.

As soon as you need them to share any state, however, it's really easy to
get it wrong and to run into concurrency issues that are difficult to
reproduce and debug.

Basically, with multiple processes, you start with independent systems and
add connections specifically where needed, whereas with threads, you start
with completely shared state and then prune away interdependencies and
concurrency until it seems to work safely. That approach makes it
essentially impossible to prove that threading is safe in a given setup,
except for the really trivial cases.

Stefan
 
Chris Angelico

> Basically, with multiple processes, you start with independent systems and
> add connections specifically where needed, whereas with threads, you start
> with completely shared state and then prune away interdependencies and
> concurrency until it seems to work safely. That approach makes it
> essentially impossible to prove that threading is safe in a given setup,
> except for the really trivial cases.

Not strictly true. With multiple threads, you start with completely
shared global state and completely independent local state (in
assembly language, shared data segment and separate stack). If you
treat your globals as either read-only or carefully controlled, then
it makes little difference whether you're forking processes or
spinning off threads, except that with threads you don't need special
data structures (IPC-based ones) for the global state.

For me, threading largely grew out of the same sorts of concerns as
recursion - as long as all your internal state is in locals, nothing
can hurt you. Of course, it's still far easier to shoot yourself in
the foot with threads than with processes, but for the tasks I've used
them for, I've never found footholes; that may, however, be inherent
to the simplicity of the two main jobs I used threads for: socket
handling (where nearly everything's I/O bound) and worker threads spun
off to let the GUI remain responsive (posting a message back to the
main thread when there's a result).

ChrisA
 
Dave Angel

> I don't have a history of major pain from threading. Is this a Python
> thing,

No, all my pain has been in C++. But nearly all my Python code has
been written solo, so I can adopt strict rules.

The problem comes that with many people's sticky fingers in the code
pie, things that seem innocuous are broken once they can happen from
multiple threads. And C++ does not give you any way to tell at a glance
where you're asking for trouble.

I've also been involved in projects that existed and were broken long
before I came on the scene. Finding that something is broken when it
wasn't caused by any recent change is painful. And in large, old C++
projects, memory management is convoluted. Even when not broken,
sometimes performance takes big hits because multiple threads are
banging at the same cache line. (using thread affinity to make sure
each thread uses a consistent core) So you have a dozen memory
allocation strategies, each with different rules and restrictions.

> or have I just been really really fortunate (growing up on OS/2
> rather than Windows has definitely been, for me, a major boon)?
> Generally, I find threads to be convenient, though of course not
> always useful (especially in serialized languages).


All my heavy duty multi-task/multi-thread stuff has been on Linux.

And Python floats above most of the issues I'm trying to describe.
E.g. you don't have to tune the memory management because it's out of
your control. Still my bias persists.
 
Chris Angelico

> No, all my pain has been in C++. But nearly all my Python code has
> been written solo, so I can adopt strict rules.
>
> The problem comes that with many people's sticky fingers in the code
> pie, things that seem innocuous are broken once they can happen from
> multiple threads. And C++ does not give you any way to tell at a glance
> where you're asking for trouble.

Yeah, that is a big issue. And even the simple rule I mentioned
earlier (keep everything on the stack) isn't sufficient when you use
library functions that aren't thread-safe... even slabs of the
standard library aren't, and I'm not just talking about obvious ones
like strtok().

> I've also been involved in projects that existed and were broken long
> before I came on the scene. Finding that something is broken when it
> wasn't caused by any recent change is painful. And in large, old C++
> projects, memory management is convoluted. Even when not broken,
> sometimes performance takes big hits because multiple threads are
> banging at the same cache line. (using thread affinity to make sure
> each thread uses a consistent core) So you have a dozen memory
> allocation strategies, each with different rules and restrictions.

Absolutely. Something might have just-happened to work, even for
years. I came across something that would almost be brown-paper-bag
quality, in my own production code at work, today; the TCP socket
protocol between two processes wasn't properly looking for the newline
that marks the end of the message, and would have bugged out badly if
ever two messages had been combined into one socket-read call. With
everything working perfectly, like on a low-latency LAN where all our
testing happens, it was all safe, but if anything lagged out, it would
have meant messages got lost. Yet it's been working for years without
a noticed glitch.

> All my heavy duty multi-task/multi-thread stuff has been on Linux.

OS/2 is very Unix-like in many areas, but it doesn't have a convenient
fork(), so processes are rather more fiddly to spawn than threads are.
Actually, sliding from OS/2 to Linux has largely led to me sliding
from threads to processes - forking a process under Linux is as easy
as spinning off a thread under OS/2, and as easy to pass state to
(though (obviously) harder to pass state back from).

> And Python floats above most of the issues I'm trying to describe.
> E.g. you don't have to tune the memory management because it's out of
> your control. Still my bias persists.

Hrmmmmm.... yes, I can imagine running into difficulties with a
non-thread-safe malloc/free implementation... I think that would rate
about a 4 on the XKCD 883 scale.

ChrisA
 
Dennis Lee Bieber

> For the reading, I'd do something like this:
>
> import csv
>
> with open('input.csv', 'r') as infile, open('output.csv', 'w') as outfile:
>     csv_reader = csv.DictReader(infile)
>     csv_writer = csv.DictWriter(outfile, fieldnames=csv_reader.fieldnames)
>     csv_writer.writeheader()
>     for row in csv_reader:
>         result = process_line(row)
>         csv_writer.writerow(result)
>
> So for the reading, it'll iterate over the lines one by one, and won't
> read the whole file into memory, which is good.

My first comment would be: Do you really need the overhead of using
/dictionary/ reader/writer mode? Surely this processing isn't taking in
ad-hoc CSV files, is it? If all the inputs look the same, using the regular
CSV reader may be faster (same for the writer), and your processing would
use direct positional indexing rather than dictionary hashing/lookups.
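
For example, if the column layout is fixed and known, something like this (the column position and munging are made up):

import csv

with open('input.csv') as infile, open('output.csv', 'w') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(next(reader))            # copy the header row across
    for row in reader:
        # direct positional access instead of dictionary lookups
        row[2] = row[2].strip().upper()      # made-up munging of the third column
        writer.writerow(row)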

> So if the output file is going to get large, there isn't anything I need
> to take into account for conserving memory?
>
> Also, if I'm trying to maximise throughput of the above, is there anything
> I could try? The processing in process_line is quite light - just a bunch
> of string splits and regexes.
>
> If I have multiple large CSV files to deal with, and I'm on a multi-core
> machine, is there anything else I can do to boost throughput?

You are likely I/O bound, not CPU bound -- so multi-core doesn't really
affect things (neither would the GIL).

You could maybe try using three Threads and a pair of Queues: reader,
process, writer. Limit the Queues to maybe 50-100 entries (tweak to liking
-- I run a task where using 100 entries results in a long "load" phase
before processing begins).

Reader thread just reads entries from the input file and adds them to
the input queue. Processing thread takes entries from the input queue,
mashes them, and puts them onto the output queue. Writer thread, obviously,
takes items from the output queue and writes them to the file.

Ideally, what should happen is, after a few seconds of the reader
monopolizing the CPU, you will have a backlog of records in the input
queue, and will initially block on "queue full", letting the processing
thread crunch some entries -- at some quantum the reader will get control,
find the queue is not full, and initiate the next read and blocks for the
I/O to complete; the processing thread gets control again and continues
number crunching. Hopefully, before the queue goes empty the reader will
complete the I/O and add the next entry to the queue. Same for output
queue.

The main idea is that you don't block crunching while waiting for the
next record to be read or written. Crunching only blocks if the input runs
empty or the output fills up -- conditions in which the reader or writer
then gets control.
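
Roughly, something like this untested sketch (file names, field names, the queue size, and process_line are all placeholders):

import csv
import threading
try:
    import queue              # Python 3
except ImportError:
    import Queue as queue     # Python 2

SENTINEL = object()           # tells the next stage there is no more work

def process_line(row):        # stand-in for the real munging
    return row

def reader(path, out_q):
    with open(path) as f:
        for row in csv.DictReader(f):
            out_q.put(row)    # blocks while out_q is full
    out_q.put(SENTINEL)

def processor(in_q, out_q):
    while True:
        row = in_q.get()      # blocks while in_q is empty
        if row is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(process_line(row))

def writer(path, fieldnames, in_q):
    with open(path, 'w') as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        while True:
            row = in_q.get()
            if row is SENTINEL:
                break
            w.writerow(row)

raw_q = queue.Queue(maxsize=100)      # bounded, as suggested above
done_q = queue.Queue(maxsize=100)
threads = [
    threading.Thread(target=reader, args=('input.csv', raw_q)),
    threading.Thread(target=processor, args=(raw_q, done_q)),
    threading.Thread(target=writer, args=('output.csv', ['col1', 'col2'], done_q)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()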
 
Roy Smith

Dennis Lee Bieber said:
> Memory is cheap -- I/O is slow. <G> Just how massive are these CSV
> files?

Actually, these days, the economics of hardware are more like, "CPU is
cheap, memory is expensive".

I suppose it all depends on what kinds of problems you're solving, but
my experience is I'm much more likely to run out of memory on big
problems than I am to peg the CPU. Also, pegging the CPU leads to
well-behaved performance degradation. Running out of memory leads to
falling off a performance cliff as you start to page.

And, with the advent of large-scale SSD (you can get 1.6 TB SSD in 2.5
inch form-factor!), I/O is as fast as you're willing to pay for :)
 
