Threading model for reading 1,000 files quickly?

Discussion in 'Java' started by leegee@gmail.com, Oct 1, 2012.

  1. Guest

    I have a directory with many sub-directories, each with many thousands of files.

    I wish to process each file, which takes one or two seconds.

    I wish to simultaneously process as many files as possible.

    I'm not clear on the best approach to this.

    Should I use a thread pool? Is Java capable of determining from the hardware how many threads it can execute simultaneously?

    TIA
    Lee
    leegee@gmail.com, Oct 1, 2012
    #1

  2. Eric Sosman Guest

    On 10/1/2012 3:11 AM, leegee@gmail.com wrote:
    > I have a directory with many sub-directories, each with many thousands of files.
    >
    > I wish to process each file, which takes one or two seconds.


    "Many" sub-directories (let's say a hundred) times "many
    thousands" of files each (let's say ten thousand) times "one or
    two" seconds per file (let's say 1.5) -- Okay, that's about two
    and a half weeks if you process them one at a time.
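
    Spelling out the arithmetic under those guesses: 100 directories x 10,000
    files x 1.5 seconds = 1,500,000 seconds, or roughly seventeen days.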

    If two and a half weeks is an acceptable amount of time, you
    should probably turn your attention to things like checkpointing
    of partial results, so that a power failure or crash on Day Nine
    doesn't force you to restart from the very beginning.

    > I wish to simultaneously process as many files as possible.


    "As many as possible" -- With what kind of equipment? What's
    "possible" for an Amazon data center might be beyond the reach
    of an Amazon Kindle ...

    > I'm not clear on the best approach to this.
    >
    > Should I use a thread pool? Is Java capable of determining from the hardware how many threads it can execute simultaneously?


    You should use a thousand machines, each doing a thousandth
    of the overall job. They'll finish in about twenty minutes.

    (If you want more definite answers, you'll need to provide a
    more definite problem statement.)

    --
    Eric Sosman
    Eric Sosman, Oct 1, 2012
    #2

  3. markspace Guest

    On 10/1/2012 1:43 AM, Chris Uppal wrote:

    >
    > Your problem here is not threading, but disk IO. Specifically disk seeks. If
    > you are using a rotating disk (as opposed to an SSD), and all the files are on
    > the same spindle, then using > 1 thread will just slow things down as the
    > different threads "fight" to position the disk heads over "their" files.
    >



    I gotta disagree also. It's actually a well-known trick to "flood" a disk
    controller with requests so that the controller can reorder and optimize
    head read/write access. Especially if the files are small, the disk
    controller might be able to read multiple files from multiple blocks in a
    single revolution of the platter. You can't do that if you only open one
    file; the controller would only read that one file. Multiple threads are
    probably the easiest way to achieve this.

    Yes, if performance is critical, you might investigate a single thread
    and non-blocking IO. Minus the overhead of multiple threads, that might
    actually be faster. But since IO is still the limiting factor, I think
    the speed increase would be small.

    It might also be worth it for the OP to test their own performance.
    There are lots of ways we could be wrong here, given unexpected
    system features. Executors.newSingleThreadExecutor() vs.
    Executors.newCachedThreadPool() should make this easy to test.
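
    Something along these lines would do for a rough comparison. It is only a
    sketch, untested; processFile() stands in for the OP's real one-to-two-second
    job, and collectFiles() is a made-up helper that walks the tree:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ExecutorComparison {

        // Stand-in for the OP's real per-file work (said to take 1-2 seconds).
        static void processFile(File f) {
            // ... read and process the file ...
        }

        // Submit every file to the given executor and report the elapsed time.
        static long timeWith(ExecutorService pool, List<File> files) throws Exception {
            long start = System.nanoTime();
            List<Future<?>> results = new ArrayList<Future<?>>();
            for (final File f : files) {
                results.add(pool.submit(new Runnable() {
                    public void run() { processFile(f); }
                }));
            }
            for (Future<?> r : results) {
                r.get();                          // wait for every task to finish
            }
            pool.shutdown();
            return (System.nanoTime() - start) / 1000000L;   // milliseconds
        }

        // Made-up helper: gather all regular files under a directory.
        static List<File> collectFiles(File dir) {
            List<File> out = new ArrayList<File>();
            File[] children = dir.listFiles();
            if (children == null) return out;
            for (File c : children) {
                if (c.isDirectory()) out.addAll(collectFiles(c));
                else out.add(c);
            }
            return out;
        }

        public static void main(String[] args) throws Exception {
            // For a realistic comparison, run this on a representative subset.
            List<File> files = collectFiles(new File(args[0]));
            System.out.println("single thread: "
                    + timeWith(Executors.newSingleThreadExecutor(), files) + " ms");
            System.out.println("cached pool:   "
                    + timeWith(Executors.newCachedThreadPool(), files) + " ms");
        }
    }

    One thing to watch: a cached pool creates a new thread whenever no idle
    thread is available, so with thousands of slow tasks it will spawn a very
    large number of threads. Keep an eye on the thread count while testing.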
    markspace, Oct 1, 2012
    #3
  4. In article <>, leegee@gmail.com wrote:

    > I have a directory with many sub-directories, each with many thousands of
    > files.
    >
    > I wish to process each file, which takes one or two seconds.
    >
    > I wish to simultaneously process as many files as possible.
    >
    > I'm not clear on the best approach to this.
    >
    > Should I use a thread pool? Is Java capable of determining from the hardware
    > how many threads it can execute simultaneously?
    >
    > TIA
    > Lee


    Your main thread creates several worker threads, recurses the directory
    structure, creates a task to process each file, and puts each task in a
    shared BlockingQueue having a bounded size. When all files have been
    queued, a special END marker is placed on the end of the BlockingQueue
    and join() is called for each worker thread.

    Each worker thread pulls one task from the shared BlockingQueue,
    executes it, and loops to pull another task. When the END marker is
    seen by a thread, it is placed back into the BlockingQueue and the
    thread exits.
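
    A minimal sketch of that arrangement, just to make it concrete. The class
    and method names are mine, process() stands in for the real work, and the
    queue bound and worker count are arbitrary:

    import java.io.File;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class FileCruncher {

        // Sentinel marking the end of the work ("poison pill").
        private static final File END = new File("");

        // Bounded queue: the producer blocks when the workers fall behind.
        private static final BlockingQueue<File> QUEUE =
                new ArrayBlockingQueue<File>(100);

        public static void main(String[] args) throws InterruptedException {
            int workerCount = 4;                    // "several" workers; tune as needed
            Thread[] workers = new Thread[workerCount];
            for (int i = 0; i < workerCount; i++) {
                workers[i] = new Thread(new Worker());
                workers[i].start();
            }

            enqueue(new File(args[0]));             // recurse and queue every file
            QUEUE.put(END);                         // tell the workers we're done

            for (Thread t : workers) {
                t.join();
            }
        }

        // Producer: walk the directory tree, blocking when the queue is full.
        private static void enqueue(File dir) throws InterruptedException {
            File[] children = dir.listFiles();
            if (children == null) return;
            for (File f : children) {
                if (f.isDirectory()) enqueue(f);
                else QUEUE.put(f);
            }
        }

        private static class Worker implements Runnable {
            public void run() {
                try {
                    while (true) {
                        File f = QUEUE.take();
                        if (f == END) {
                            QUEUE.put(END);         // pass the marker on to the next worker
                            return;
                        }
                        process(f);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        // Stand-in for the OP's one-to-two-second processing.
        private static void process(File f) {
            // ...
        }
    }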



    An alternative is to use a customized, fixed-size ThreadPoolExecutor.
    Construct it with a bounded BlockingQueue and a rejection policy of
    ThreadPoolExecutor.CallerRunsPolicy. ThreadPoolExecutor is loaded
    with quirky features (bugs) that can be a hassle, so it's not my first
    choice for very specific tasks.

    Runtime.getRuntime().availableProcessors() can help you guess how much
    concurrent CPU time is available. The concurrency you actually achieve
    depends on many factors, so a simpler approach is to create as many threads
    as you can without running out of memory or thrashing the filesystem.
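
    For what it's worth, a rough sketch of that alternative. The queue bound
    of 100 is arbitrary, submitAll() is just a made-up directory walker, and
    the pool size follows availableProcessors() as a starting guess:

    import java.io.File;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class PoolDemo {
        public static void main(String[] args) throws InterruptedException {
            // One thread per available core is a starting guess, not a
            // guarantee of actual concurrency.
            int nThreads = Runtime.getRuntime().availableProcessors();

            // Bounded queue + CallerRunsPolicy: when the queue fills up, the
            // rejected task runs in the caller's thread, which throttles the
            // directory walker instead of queueing tasks without bound.
            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    nThreads, nThreads,
                    0L, TimeUnit.MILLISECONDS,
                    new ArrayBlockingQueue<Runnable>(100),
                    new ThreadPoolExecutor.CallerRunsPolicy());

            submitAll(pool, new File(args[0]));

            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        }

        // Recurse the tree, submitting one task per file.
        static void submitAll(ThreadPoolExecutor pool, File dir) {
            File[] children = dir.listFiles();
            if (children == null) return;
            for (final File f : children) {
                if (f.isDirectory()) {
                    submitAll(pool, f);
                } else {
                    pool.execute(new Runnable() {
                        public void run() { /* process f: the OP's 1-2 s of work */ }
                    });
                }
            }
        }
    }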
    --
    I will not see posts from Google because I must filter them as spam
    Kevin McMurtrie, Oct 2, 2012
    #4
  5. On 03.10.2012 09:24, Chris Uppal wrote:

    > I must admit that I had forgotten that aspect of the situation.


    To me it seems there are a lot more "forgotten aspects"...

    > Consider: would you choose the time when you've got a big disk operation
    > running (copying a huge number of files say) to kick off a virus scan on the
    > same spindle ? I most certainly would not, perhaps your experience has been
    > different.


    File copying is pure IO with negligible CPU, while virus scanning only looks
    at portions of files. We do not know whether that scenario even remotely
    resembles the problem the OP is trying to tackle.

    > The problem is that the analysis in terms of scattered disk blocks is
    > unrealistic. If the blocks of each file are actually randomised across the
    > disk, then the analysis works. But in that case a simple defrag seems to make
    > more sense to me.


    Not all file systems support online or offline defragmentation, and we do
    not even know the file system yet. Heck, the files may actually reside on
    some type of network share, or on a RAID storage with its own caching
    and read strategies. Also, since defragmentation usually works on a
    whole file system, the overhead might not pay off at all. By the way,
    another fact we do not know yet (as far as I can see) is whether this is
    a one-off thing or whether the processing should be done repeatedly (if it
    is one-off, the whole discussion is superfluous, as it costs more time than
    a sub-optimal IO and threading strategy would). It may also make sense to
    know how the files get there (maybe it is even more efficient to fetch the
    files in Java with an HTTP client from wherever they come from and process
    them while downloading, i.e. without ever writing them to disk).

    > If, on the other hand, the block/s/ in most files are
    > mostly contiguous, and each thread is processing those blocks mostly
    > sequentially, then running even two threads will turn the fast sequential
    > access pattern
    >
    > B+0, B+1, B+2, ... B+n, C+0, C+1, C+2, ... C+m
    >
    > into something more like:
    >
    > B+0, C+0, B+1, C+1, ... B+n, C+m
    >
    > which is a disaster.


    We cannot know. First of all, we do not know the size of the files, do we?
    Small files might actually take up just one block each. Then, the operating
    system might be prefetching blocks of individual files when it detects a
    sequential access pattern (reading in one go from head to tail), filling
    the cache even before those blocks are requested.

    Oh, and by the way, we do not even know the read pattern, do we? Are files
    read from beginning to end? Are they accessed in a more random-access
    fashion? And we do not know the nature of the processing either. At
    the moment we just know that it takes one to two seconds (on what
    hardware and OS?) - but we do not know whether that is because of CPU
    load or IO load, etc.

    > Of course, my analysis also depends on assumptions about the actual files and
    > their layout, but I don't think the assumptions are unreasonable. In fact, in
    > the absence of more specific data, I'd call 'em good ;-)


    That's a bold statement. You call an analysis "good" which just fills
    in unmentioned assumptions for missing facts - a lot of missing facts.

    Cheers

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Oct 3, 2012
    #5
