Is reading lots of files with threads faster?

Chris Richards

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Thanks
Chris
 
Tim Pease

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Better to do it sequentially, since (1) Ruby's threads are green
threads, so they don't run in parallel anyway, (2) the disk IO is going
to be the biggest bottleneck, and (3) you'll most likely run out of
file descriptors.
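Something along these lines is all it takes (a rough sketch; the glob
pattern and parse_line are placeholders for your actual files and
parser):

  Dir.glob("data/*.log").each do |path|
    File.open(path) do |f|        # one file open at a time, so
      f.each_line do |line|       # descriptors never pile up
        parse_line(line)          # placeholder for your parsing logic
      end
    end
  end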

Blessings,
TwP
 
Phrogz

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

I suspect it depends on how long the parsing of data takes.

If it's fast, trying to read 50 files simultaneously will likely (I'm
guessing) cause disk thrashing that will slow you down.

If processing each file takes much longer than reading it from disk,
and you have multiple CPUs, and can use native threads, and can
schedule the read of one file to begin after another ends... then you
can probably speed things up.
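If you wanted to try that overlap, here's a guess at what it might look
like (every name in it is invented, and it only pays off where threads
can genuinely overlap the IO with the parsing):

  paths = Dir.glob("data/*.log")
  prefetch = Thread.new { File.read(paths.first) } unless paths.empty?

  paths.each_with_index do |_, i|
    data = prefetch.value                            # bytes of file i
    nxt  = paths[i + 1]
    prefetch = Thread.new { File.read(nxt) } if nxt  # start reading file i+1
    data.each_line { |line| parse_line(line) }       # parse while it loads
  end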

I made all those answers up, but I'm guessing they're correct :)
 
Joel VanderWerf

Chris said:
I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Is it possible that in the future you will need to do this with sockets
in place of files?
 
MenTaLguY

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.
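For illustration only, a sketch of that bounded-thread idea on a
native-threaded runtime (the pool size of 4, the glob, and parse_line
are all assumptions; measure before trusting any of it):

  require 'thread'

  queue = Queue.new
  Dir.glob("data/*.log").each { |path| queue << path }

  workers = Array.new(4) do                 # just enough threads, no more
    Thread.new do
      loop do
        path = queue.pop(true) rescue break # stop when the queue drains
        File.foreach(path) { |line| parse_line(line) }
      end
    end
  end
  workers.each(&:join)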

-mental
 
Phlip

Chris said:
I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Fifty files of sub-megabyte size are piffling on a modern CPU. Between your
code and the hard drive surface are several layers of buffers, most supported by
dedicated hardware. They are all geared to sequential reads. For example, if you
read 1k from a file, and if the read-write head is still flying over that file
when it reaches the end of that 1k, it will continue scooping up file data. This
goes into the drive's memory cache, so the next request for 1k will return from
the memory cache. You generally cannot go wrong by reading files sequentially.

Almost all these memory caches (on the drive, in your memory, on your bus, and
inside your CPU but outside your actual ALU) use dedicated hardware to operate
asynchronously. The only thing better than a simulated thread is a real thread
in alternate hardware. You already have that in these caches.

Now, do you need to cross-reference these files, and alternate reads and writes
between distant points among them? That will cause thrashing - and if you must
synchronize these threads with semaphores then you will probably increase the
thrashing, unless you are a computer scientist who can determine the exact
algorithm required to keep every thread well-fed, without thread starvation.

Conclusion: Open each one, in order, process it sequentially, and close it. Then
profile your program, paying attention to user time, CPU time, and IO time. If
the IO time is very high, you are spending too much time waiting. If this
happens, you might consider breaking everything into threads, then sending all
the files simultaneously to your filesystem driver. It may have a function that
lets you batch up a whole bunch of file commands and simultaneously execute
them. This allows the hard drive to optimize its read operations, and multiplex
all the results together.
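Ruby's standard Benchmark module will show you that split (parse_line
stands in for whatever your parser does); if the real time dwarfs
user + system, you spent most of your life waiting on the disk:

  require 'benchmark'

  stats = Benchmark.measure do
    Dir.glob("data/*.log").each do |path|
      File.foreach(path) { |line| parse_line(line) }
    end
  end
  puts stats    # prints user, system, total, and real (wall-clock) time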

Don't do any of this unless you have a working program, _and_ you think it's
slow, _AND_ your customers think it's slow. Premature optimization is the root
of all evil.
 
John Carter

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Prefer processes to threads on Unix.

Depends on whether you have multiple cores.

Depends on what the file devices are. I have one small app where the
fds are sockets to machines that may or may not have a certain other
application up. (The app finds out.)

I spin up one thread per machine and open all the connections in
parallel. The time to completion is the time for a single connect
failure, which is about N times faster than testing each connection in
series.
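Roughly like this (the hosts, port, and timeout here are invented for
the example):

  require 'socket'
  require 'timeout'

  hosts = %w[10.0.0.1 10.0.0.2 10.0.0.3]

  threads = hosts.map do |host|
    Thread.new do
      begin
        Timeout.timeout(5) { TCPSocket.new(host, 4000).close }
        [host, :up]
      rescue StandardError, Timeout::Error
        [host, :down]
      end
    end
  end
  threads.map(&:value).each { |host, state| puts "#{host}: #{state}" }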

Depends also on data locality. Cache is many times faster than
RAM. If you can live in cache, you go much faster. If multiple threads
mean you spend less time in cache, you go much slower.


John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : (e-mail address removed)
New Zealand
 
Robert Klemme

MenTaLguY said:
There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

... unless all the files reside on different IO devices, in which case
parallel reading *can* be faster than sequential reading. If they are on
the same filesystem I'd certainly prefer to read them sequentially.
There might be a slight performance gain from decoupling reading,
parsing (and probably output) into different threads. But that mostly
depends on IO speed and processing complexity, and the slowest part
determines throughput - no matter what.
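A minimal sketch of that decoupling, assuming a parse_line routine and
an arbitrary queue bound (the SizedQueue keeps the reader from racing
too far ahead of the parser):

  require 'thread'

  queue = SizedQueue.new(1024)

  reader = Thread.new do
    Dir.glob("data/*.log").each do |path|
      File.foreach(path) { |line| queue << line }
    end
    queue << :done                    # sentinel: nothing left to read
  end

  while (line = queue.pop) != :done
    parse_line(line)                  # the slower side sets the pace
  end
  reader.join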
If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.

Good points.

Cheers

robert
 
James Tucker

Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see over there how little IO was a
bottleneck (a point which has been repeated a number of times here).

Whilst the test environment is probably drastically different from
your own, it might be worth looking at how some of those solutions
solved the problem, and also give you some good reading on the topic.
 
Francis Cianfrocca

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling showed
relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance, but
also in caching strategies and in the way they schedule the physical seeks.
Multispindle systems change the behavior yet again. You can develop on one
machine hoping to get some level of performance improvement, and find a
totally different behavior when you go to production.

Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see over there how little IO was a
bottleneck (a point which has been repeated a number of times here).

Whilst the test environment is probably drastically different from
your own, it might be worth looking at how some of those solutions
solved the problem, and also give you some good reading on the topic.
 
ara howard

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling
showed relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance,
but also in caching strategies and in the way they schedule the
physical seeks. Multispindle systems change the behavior yet again.
You can develop on one machine hoping to get some level of performance
improvement, and find a totally different behavior when you go to
production.

good advice. i've had quite a bit of experience optimizing large
scale processing (really large) and seen that there is always an
optimal io/cpu usage pattern (two processes per cpu in dual-cpu
machines with dual disk controllers, etc) but also that it is *always*
specific to the exact hardware setup. i agree that it's mostly
impossible to try to come up with a generic solution.

cheers.

a @ http://codeforpeople.com/
 
Chris Richards

Wow... just tried JRuby 1.1 on my script that opens a thousand files and
processes them.

Ruby: 11 seconds
JRuby, 1st run: 3.3 seconds
JRuby, 2nd run: 1.1 seconds

Very nice, darlin'!
 
