Is reading lots of files with threads faster?

Chris Richards

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Thanks
Chris
 
Tim Pease

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Better to do it sequentially, since (1) Ruby's threads are green
threads, so they don't run in parallel anyway, (2) the disk IO is going
to be the biggest bottleneck, and (3) you'll most likely run out of
file descriptors.
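Something along these lines is all it takes (a rough sketch; the glob
pattern and parse_line are placeholders for your actual files and
parser):

  Dir.glob("data/*.log").each do |path|
    File.open(path) do |f|        # one file open at a time, so
      f.each_line do |line|       # descriptors never pile up
        parse_line(line)          # placeholder for your parsing logic
      end
    end
  end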

Blessings,
TwP
 
Phrogz

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

I suspect it depends on how long the parsing of data takes.

If it's fast, trying to read 50 files simultaneously will likely (I'm
guessing) cause disk thrashing that will slow you down.

If processing each file takes much longer than reading it from disk,
and you have multiple CPUs, and can use native threads, and can
schedule the read of one file to begin after another ends... then you
can probably speed things up.
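If you wanted to try that overlap, here's a guess at what it might look
like (every name in it is invented, and it only pays off where threads
can genuinely overlap the IO with the parsing):

  paths = Dir.glob("data/*.log")
  prefetch = Thread.new { File.read(paths.first) } unless paths.empty?

  paths.each_with_index do |_, i|
    data = prefetch.value                            # bytes of file i
    nxt  = paths[i + 1]
    prefetch = Thread.new { File.read(nxt) } if nxt  # start reading file i+1
    data.each_line { |line| parse_line(line) }       # parse while it loads
  end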

I made all those answers up, but I'm guessing they're correct :)
 
Joel VanderWerf

Chris said:
I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Is it possible that in the future you will need to do this with sockets
in place of files?
 
MenTaLguY

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.
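For illustration only, a sketch of that bounded-thread idea on a
native-threaded runtime (the pool size of 4, the glob, and parse_line
are all assumptions; measure before trusting any of it):

  require 'thread'

  queue = Queue.new
  Dir.glob("data/*.log").each { |path| queue << path }

  workers = Array.new(4) do                 # just enough threads, no more
    Thread.new do
      loop do
        path = queue.pop(true) rescue break # stop when the queue drains
        File.foreach(path) { |line| parse_line(line) }
      end
    end
  end
  workers.each(&:join)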

-mental
 
Phlip

Chris said:
I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Fifty files of sub-megabyte size are piffling on a modern CPU. Between your
code and the hard drive surface are several layers of buffers, most supported by
dedicated hardware. They are all geared to sequential reads. For example, if you
read 1k from a file, and if the read-write head is still flying over that file
when it reaches the end of that 1k, it will continue scooping up file data. This
goes into the drive's memory cache, so the next request for 1k will return from
the memory cache. You generally cannot go wrong by reading files sequentially.

Almost all these memory caches (on the drive, in your memory, on your bus, and
inside your CPU but outside your actual ALU) use dedicated hardware to operate
asynchronously. The only thing better than a simulated thread is a real thread
in alternate hardware. You already have that in these caches.

Now, do you need to cross-reference these files, and alternate reads and writes
between distant points among them? That will cause thrashing - and if you must
synchronize these threads with semaphores then you will probably increase the
thrashing, unless you are a computer scientist who can determine the exact
algorithm required to keep every thread well-fed, without thread starvation.

Conclusion: Open each one, in order, process it sequentially, and close it. Then
profile your program, paying attention to user time, CPU time, and IO time. If
the IO time is very high, you are spending too much time waiting. If this
happens, you might consider breaking everything into threads, then sending all
the files simultaneously to your filesystem driver. It may have a function that
lets you batch up a whole bunch of file commands and simultaneously execute
them. This allows the hard drive to optimize its read operations, and multiplex
all the results together.
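Ruby's standard Benchmark module will show you that split (parse_line
stands in for whatever your parser does); if the real time dwarfs
user + system, you spent most of your life waiting on the disk:

  require 'benchmark'

  stats = Benchmark.measure do
    Dir.glob("data/*.log").each do |path|
      File.foreach(path) { |line| parse_line(line) }
    end
  end
  puts stats    # prints user, system, total, and real (wall-clock) time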

Don't do any of this unless you have a working program, _and_ you think it's
slow, _AND_ your customers think it's slow. Premature optimization is the root
of all evil.
 
John Carter

I'm required to open 50+ files and parse the data in them. Would using
multiple threads give me the best performance, or is it best just to do
it sequentially?

Prefer processes to threads on Unix.

Depends on whether you have multiple cores.

Depends on what the file devices are. I have one small app where the
fds are sockets to machines that may or may not have a certain other
application up. (The app finds out.)

I spin up one thread per machine and open all the connections in
parallel. The time to completion is the time for a single connect
failure, which is about N times faster than testing each connection in
series.
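Roughly like this (the hosts, port, and timeout here are invented for
the example):

  require 'socket'
  require 'timeout'

  hosts = %w[10.0.0.1 10.0.0.2 10.0.0.3]

  threads = hosts.map do |host|
    Thread.new do
      begin
        Timeout.timeout(5) { TCPSocket.new(host, 4000).close }
        [host, :up]
      rescue StandardError, Timeout::Error
        [host, :down]
      end
    end
  end
  threads.map(&:value).each { |host, state| puts "#{host}: #{state}" }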

Depends also on data locality. Cache is many times faster than
RAM. If you can live in cache, you go much faster. If multiple threads
mean you spend less time in cache, you go much slower.


John Carter Phone : (64)(3) 358 6639
Tait Electronics Fax : (64)(3) 359 4632
PO Box 1645 Christchurch Email : (e-mail address removed)
New Zealand
 
Robert Klemme

MenTaLguY said:
There's the same amount of IO bandwidth to go around no matter how many
threads you throw at the problem (and in practice if you add more threads you
start wasting bandwidth due to seeking and other overhead). Given that,
it's almost always best to do things sequentially.

... unless all the files reside on different IO devices, in which case
parallel reading *can* be faster than sequential reading. If they are on
the same filesystem I'd certainly prefer to read them sequentially.
There might be a slight performance gain from decoupling reading,
parsing (and probably output) into different threads. But that mostly
depends on IO speed and processing complexity, and the slowest part
determines throughput - no matter what.
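A minimal sketch of that decoupling, assuming a parse_line routine and
an arbitrary queue bound (the SizedQueue keeps the reader from racing
too far ahead of the parser):

  require 'thread'

  queue = SizedQueue.new(1024)

  reader = Thread.new do
    Dir.glob("data/*.log").each do |path|
      File.foreach(path) { |line| queue << line }
    end
    queue << :done                    # sentinel: nothing left to read
  end

  while (line = queue.pop) != :done
    parse_line(line)                  # the slower side sets the pace
  end
  reader.join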
If you are using a native-threaded runtime (e.g. JRuby), and you can prove
that you aren't consuming most of the available IO bandwidth yet (e.g. because
parsing is taking longer than the IO), then _maybe_ consider using multiple
threads, but then you need to be careful to only use enough to consume the
available IO bandwidth and no more. If you want to use your IO bandwidth most
effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.

Good points.

Cheers

robert
 
James Tucker

Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see over there how little IO was a
bottleneck (a point which has been repeated a number of times here).

Whilst the test environment is probably drastically different from
your own, it might be worth looking at how some of those solutions
solved the problem, and also give you some good reading on the topic.
 
Francis Cianfrocca

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling showed
relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance, but
also in caching strategies and in the way they schedule the physical seeks.
Multispindle systems change the behavior yet again. You can develop on one
machine hoping to get some level of performance improvement, and find a
totally different behavior when you go to production.

Take a look at the Wide Finder implementations on Tim Bray's blog.

It's quite interesting to see over there how little IO was a
bottleneck (a point which has been repeated a number of times here).

Whilst the test environment is probably drastically different from
your own, it might be worth looking at how some of those solutions
solved the problem, and also give you some good reading on the topic.
 
ara howard

I basically gave up on optimizing hard-disk I/O long ago. (In
Ruby/EventMachine, I started adding an event-driven interface for disk
files, and will probably complete it someday, but initial profiling
showed relatively little benefit.)

A big part of the problem is that different machines have different
controller hardware, with a wide variance not only in raw performance,
but also in caching strategies and in the way they schedule the
physical seeks. Multispindle systems change the behavior yet again.
You can develop on one machine hoping to get some level of performance
improvement, and find a totally different behavior when you go to
production.

good advice. i've had quite a bit of experience optimizing large
scale processing (really large) and seen that there is always an
optimal io/cpu usage pattern (two processes per cpu in dual-cpu
machines with dual disk controllers, etc) but also that it is *always*
specific to the exact hardware setup. i agree that it's mostly
impossible to try to come up with a generic solution.

cheers.

a @ http://codeforpeople.com/
 
Chris Richards

Wow... just tried JRuby 1.1 on my script that opens a thousand files and
processes them.

Ruby: 11 seconds
JRuby, 1st run: 3.3 seconds
JRuby, 2nd run: 1.1 seconds

Very nice, darlin'!
 
