Threading model for reading 1,000 files quickly?

leegee

I have a directory with many sub-directories, each with many thousands of files.

I wish to process each file, which takes one or two seconds.

I wish to simultaneously process as many files as possible.

I'm not clear on the best approach to this.

Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute?

TIA
Lee
 
Eric Sosman

I have a directory with many sub-directories, each with many thousands of files.

I wish to process each file, which takes one or two seconds.

"Many" sub-directories (let's say a hundred) times "many
thousands" of files each (let's say ten thousand) times "one or
two" seconds per file (let's say 1.5) -- Okay, that's about two
and a half weeks if you process them one at a time.

If two and a half weeks is an acceptable amount of time, you
should probably turn your attention to things like checkpointing
of partial results, so that a power failure or crash on Day Nine
doesn't force you to restart from the very beginning.
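
One simple way to checkpoint, sketched below with a placeholder process()
method and an arbitrary log file name: append each file's path to a log
after it has been processed, and skip anything already logged when the job
restarts.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.*;
    import java.util.*;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class CheckpointedRun {

        // Placeholder for the one-or-two-second per-file work.
        static void process(Path file) { /* ... */ }

        public static void main(String[] args) throws IOException {
            Path log = Paths.get("processed.log");   // checkpoint file; name is arbitrary
            Set<String> done = new HashSet<>();
            if (Files.exists(log)) {
                done.addAll(Files.readAllLines(log, StandardCharsets.UTF_8));
            }

            List<Path> files;
            try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
                files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            }

            for (Path file : files) {
                if (done.contains(file.toString())) {
                    continue;                        // already handled before the crash
                }
                process(file);
                // Record completion only after the work has succeeded.
                Files.write(log, (file + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            }
        }
    }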

I wish to simultaneously process as many files as possible.

"As many as possible" -- With what kind of equipment? What's
"possible" for an Amazon data center might be beyond the reach
of an Amazon Kindle ...

I'm not clear on the best approach to this.

Should I use a thread pool? Is Java capable of maximising the hardware resources to determine how many threads it can simultaneously execute?

You should use a thousand machines, each doing a thousandth
of the overall job. They'll finish in about twenty-five minutes.

(If you want more definite answers, you'll need to provide a
more definite problem statement.)
 
markspace

Your problem here is not threading, but disk IO. Specifically disk seeks. If
you are using a rotating disk (as opposed to an SSD), and all the files are on
the same spindle, then using > 1 thread will just slow things down as the
different threads "fight" to position the disk heads over "their" files.


I gotta disagree also. It's actually a well-known trick to "flood" a disk
controller with requests, so that the disk controller can organize and
optimize the head read/write access. Especially if the files are small,
it's possible the disk controller might be able to read multiple files
in multiple blocks in a single revolution of the disk platter. You can't
do that if you only open one file; the disk controller would only read
the one file in that case. Multiple threads are probably the easiest
way to do this.

Yes, if performance is critical, you might investigate a single thread
and non-blocking IO. Without the overhead of multiple threads, that might
actually be faster. But since IO is still the limiting factor, I think
the speed increase would be small.

It also might be worth it for the OP to test their own performance.
There are lots of ways here that we could be wrong, given unexpected
system features. Executors.newSingleThreadExecutor() vs.
Executors.newCachedThreadPool() should make this easy to test.
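
Something along these lines, assuming Java 8+ and with processFile()
standing in for the real work (the directory path and pool choices are
just illustrative):

    import java.nio.file.*;
    import java.util.List;
    import java.util.concurrent.*;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class PoolComparison {

        // Placeholder for the real per-file work.
        static void processFile(Path file) { /* ... */ }

        // Submits every file to the given pool and reports the wall-clock time.
        static long timed(ExecutorService pool, List<Path> files) throws InterruptedException {
            long start = System.nanoTime();
            for (Path f : files) {
                pool.submit(() -> processFile(f));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }

        public static void main(String[] args) throws Exception {
            List<Path> files;
            try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
                files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
            }
            System.out.println("single thread: " + timed(Executors.newSingleThreadExecutor(), files) + " ms");
            System.out.println("cached pool:   " + timed(Executors.newCachedThreadPool(), files) + " ms");
        }
    }

Whichever variant runs second will benefit from the OS file cache, so for a
fair comparison run each against a cold cache (or a separate copy of the data).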
 
Kevin McMurtrie

I have a directory with many sub-directories, each with many thousands of
files.

I wish to process each file, which takes one or two seconds.

I wish to simultaneously process as many files as possible.

I'm not clear on the best approach to this.

Should I use a thread pool? Is Java capable of maximising the hardware
resources to determine how many threads it can simultaneously execute?

TIA
Lee

Your main thread creates several worker threads, recurses the directory
structure, creates a task to process each file, and puts each task in a
shared BlockingQueue having a bounded size. When all files have been
queued, a special END marker is placed on the end of the BlockingQueue
and join() is called for each worker thread.

Each worker thread pulls one task from the shared BlockingQueue,
executes it, and loops to pull another task. When the END marker is
seen by a thread, it is placed back into the BlockingQueue and the
thread exits.
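
A bare-bones sketch of that scheme; the queue capacity, the thread count,
and the process() method are all placeholders to be tuned:

    import java.io.IOException;
    import java.nio.file.*;
    import java.nio.file.attribute.BasicFileAttributes;
    import java.util.concurrent.*;

    public class QueueDrivenProcessor {

        // Unique END marker (compared by reference, never processed).
        private static final Path END = Paths.get("");

        // Placeholder for the per-file work.
        static void process(Path file) { /* ... */ }

        public static void main(String[] args) throws Exception {
            final BlockingQueue<Path> queue = new ArrayBlockingQueue<>(100); // bounded
            int nThreads = 8;                                                // tune for your hardware

            Thread[] workers = new Thread[nThreads];
            for (int i = 0; i < nThreads; i++) {
                workers[i] = new Thread(() -> {
                    try {
                        while (true) {
                            Path file = queue.take();
                            if (file == END) {    // saw the END marker:
                                queue.put(END);   // put it back for the other workers
                                return;           // and exit
                            }
                            process(file);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                workers[i].start();
            }

            // Main thread recurses the directory tree and queues each file;
            // put() blocks whenever the bounded queue is full.
            Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        queue.put(file);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    return FileVisitResult.CONTINUE;
                }
            });

            queue.put(END);          // signal that no more files are coming
            for (Thread w : workers) {
                w.join();            // wait until the queue has been drained
            }
        }
    }

Any exception thrown by process() should be caught inside the worker loop,
otherwise one bad file kills its worker thread.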



An alternative is to use a customized ThreadPoolExecutor of a fixed
size. Construct it with a bounded BlockingQueue and a rejection policy
of ThreadPoolExecutor.CallerRunsPolicy. ThreadPoolExecutor is loaded
with quirky features (bugs) that can be a hassle, so it's not my first
choice for very specific tasks.
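
For reference, that configuration might look something like this (the queue
capacity and the process() method are again placeholders; the pool size uses
the processor count mentioned below):

    import java.nio.file.*;
    import java.util.concurrent.*;
    import java.util.stream.Stream;

    public class BoundedPoolProcessor {

        // Placeholder for the per-file work.
        static void process(Path file) { /* ... */ }

        public static void main(String[] args) throws Exception {
            int nThreads = Runtime.getRuntime().availableProcessors();

            // Fixed-size pool with a bounded queue. CallerRunsPolicy means that when
            // the queue is full, the submitting (main) thread runs the task itself,
            // which throttles the directory walk to the pace of the workers.
            ExecutorService pool = new ThreadPoolExecutor(
                    nThreads, nThreads, 0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(100),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            try (Stream<Path> paths = Files.walk(Paths.get(args[0]))) {
                paths.filter(Files::isRegularFile)
                     .forEach(f -> pool.submit(() -> process(f)));
            }

            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }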

Runtime.getRuntime().availableProcessors() can help you guess how much
concurrent CPU time is available. The actual concurrency is a complex
factor, so a simpler solution is to create as many threads as you can
without running out of memory or thrashing the filesystem.
 
Robert Klemme

I must admit that I had forgotten that aspect of the situation.

To me it seems there are a lot more "forgotten aspects"...

Consider: would you choose the time when you've got a big disk operation
running (copying a huge number of files, say) to kick off a virus scan on the
same spindle? I most certainly would not; perhaps your experience has been
different.

File copying is only IO with negligible CPU, and virus scanning only looks
at portions of files. We do not know whether that scenario even
remotely resembles the problem the OP is trying to tackle.

The problem is that the analysis in terms of scattered disk blocks is
unrealistic. If the blocks of each file are actually randomised across the
disk, then the analysis works. But in that case a simple defrag seems to make
more sense to me.

Not all file systems support online or offline defragmentation, and we do
not even know the file system yet. Heck, the files may actually reside on
some type of network share or on a RAID storage with its own caching
and read strategies. Also, since defragmentation usually works on a
whole file system, the cost overhead might not pay off at all. Btw.,
another fact we do not know yet (as far as I can see) is whether this is
a one-off thing or whether the processing should be done repeatedly (in
the case of a one-off the whole discussion is superfluous, as it costs
more time than the overhead of a sub-optimal IO and threading strategy
would). It may also make sense to know how the files get there (maybe
it's even more efficient to fetch the files in Java with an HTTP client
from where they are taken and process them while downloading, i.e.
without ever writing them to disk).

If, on the other hand, the blocks in most files are
mostly contiguous, and each thread is processing those blocks mostly
sequentially, then running even two threads will turn the fast sequential
access pattern

B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m

into something more like:

B+0, C+0, B+1, C+1, ... B+n, C+m

which is a disaster.

We cannot know. First of all, we do not know the size of the files, do we?
So files might actually take up just one block. Then, the operating
system might actually be prefetching blocks of individual files when it
detects the access pattern (reading in one go from head to tail) to fill
the cache even before blocks are accessed.

Oh, and btw., we do not even know the read pattern, do we? Are files
read from beginning to end? Are they accessed more in a random access
fashion? And we do not know the nature of the processing either. At
the moment we just know that it takes one to two seconds (on what
hardware and OS?) - but we do not know whether that is because of CPU
load or IO load etc.

Of course, my analysis also depends on assumptions about the actual files and
their layout, but I don't think the assumptions are unreasonable. In fact, in
the absence of more specific data, I'd call 'em good ;-)

That's a bold statement. You call an analysis "good" which just fills
in unmentioned assumptions for missing facts - a lot of missing facts.

Cheers

robert
 
