Is reading lots of files with threads faster?

Discussion in 'Ruby' started by Chris Richards, Feb 7, 2008.

  1. I'm required to open 50+ files and parse the data in them. Would using
    multiple threads give me the best performance, or is it best just to do
    it sequentially?

    Thanks
    Chris
    --
    Posted via http://www.ruby-forum.com/.
    Chris Richards, Feb 7, 2008
    #1

  2. Chris Richards

    Tim Pease Guest

    On Feb 7, 2008, at 1:21 PM, Chris Richards wrote:

    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just
    > to do
    > it sequentially?
    >


    Better to do it sequentially, since (1) Ruby (MRI) schedules all its
    green threads on a single native thread, so you get no true parallelism
    anyway, (2) disk IO is going to be the biggest bottleneck, and (3)
    opening them all at once risks running out of file descriptors.
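
    A minimal sequential version might look something like this (a
    sketch; parse_line is a hypothetical stand-in for whatever your
    parser does):

        # Each file handle is closed before the next one is opened, so
        # descriptors never accumulate and the disk sees sequential reads.
        Dir.glob("data/*.txt").each do |path|
          File.foreach(path) do |line|
            parse_line(line)  # hypothetical: your parsing logic
          end
        end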

    Blessings,
    TwP
    Tim Pease, Feb 7, 2008
    #2

  3. Chris Richards

    Phrogz Guest

    On Feb 7, 1:21 pm, Chris Richards <> wrote:
    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just to do
    > it sequentially?


    I suspect it depends on how long the parsing of data takes.

    If it's fast, trying to read 50 files simultaneously will likely (I'm
    guessing) cause disk thrashing that will slow you down.

    If processing each file takes much longer than reading it from disk,
    and you have multiple CPUs, can use native threads, and can schedule
    the read of one file to begin after another ends... then you can
    probably speed things up.
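
    That arrangement might look something like the following (a sketch,
    assuming a native-threaded runtime such as JRuby; parse is a
    hypothetical stand-in for your parser):

        require 'thread'

        queue = Queue.new

        # A single reader thread, so only one file is read from disk at
        # a time; the next read starts while earlier files are parsed.
        reader = Thread.new do
          Dir.glob("data/*.txt").each { |path| queue << File.read(path) }
          queue << nil  # sentinel: no more work
        end

        # Parser threads do the CPU-heavy work in parallel.
        workers = Array.new(4) do
          Thread.new do
            while (contents = queue.pop)
              parse(contents)  # hypothetical: your parsing logic
            end
            queue << nil  # hand the sentinel on to the next worker
          end
        end

        (workers + [reader]).each(&:join)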

    I made all those answers up, but I'm guessing they're correct :)
    Phrogz, Feb 7, 2008
    #3
  4. Chris Richards wrote:
    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just to do
    > it sequentially?


    Is it possible that in the future you will need to do this with sockets
    in place of files?

    --
    vjoel : Joel VanderWerf : path berkeley edu : 510 665 3407
    Joel VanderWerf, Feb 7, 2008
    #4
  5. Chris Richards

    MenTaLguY Guest

    On Fri, 8 Feb 2008 05:21:51 +0900, Chris Richards <> wrote:
    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just to do
    > it sequentially?


    There's the same amount of IO bandwidth to go around no matter how many
    threads you throw at the problem (and in practice if you add more threads you
    start wasting bandwidth due to seeking and other overhead). Given that,
    it's almost always best to do things sequentially.

    If you are using a native-threaded runtime (e.g. JRuby), and you can prove
    that you aren't consuming most of the available IO bandwidth yet (e.g. because
    parsing is taking longer than the IO), then _maybe_ consider using multiple
    threads, but then you need to be careful to only use enough to consume the
    available IO bandwidth and no more. If you want to use your IO bandwidth most
    effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.
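
    A bounded pool along those lines might look like this (a sketch; the
    thread count is the knob to tune against your measured IO bandwidth,
    and parse is hypothetical):

        require 'thread'

        queue = Queue.new
        Dir.glob("data/*.txt").each { |path| queue << path }

        THREADS = 2  # just enough to keep the disk busy, no more

        workers = Array.new(THREADS) do
          Thread.new do
            loop do
              path = queue.pop(true) rescue break  # non-blocking pop
              parse(File.read(path))               # hypothetical parser
            end
          end
        end

        workers.each(&:join)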

    -mental
    MenTaLguY, Feb 7, 2008
    #5
  6. Chris Richards

    Phlip Guest

    Chris Richards wrote:
    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just to do
    > it sequentially?


    Fifty files of sub-megabyte size are a trifle on a modern CPU. Between your
    code and the hard drive surface are several layers of buffers, most backed by
    dedicated hardware. They are all geared to sequential reads. For example, if you
    read 1k from a file, and the read-write head is still flying over that file
    when it reaches the end of that 1k, it will continue scooping up file data. This
    goes into the drive's memory cache, so the next request for 1k will return from
    the memory cache. You generally cannot go wrong by reading files sequentially.

    Almost all these memory caches (on the drive, in your memory, on your bus, and
    inside your CPU but outside your actual ALU) use dedicated hardware to operate
    asynchronously. The only thing better than a simulated thread is a real thread
    in alternate hardware. You already have that in these caches.

    Now, do you need to cross-reference these files, and alternate reads and writes
    between distant points among them? That will cause thrashing - and if you must
    synchronize these threads with semaphores then you will probably increase the
    thrashing, unless you are a computer scientist who can determine the exact
    algorithm required to keep every thread well-fed, without thread starvation.

    Conclusion: Open each one, in order, process it sequentially, and close it. Then
    profile your program, paying attention to user time, system time, and wall-clock
    time. If the wall-clock time is much higher than the CPU time, you are spending
    too much time waiting on IO. If this happens, you might consider breaking
    everything into threads, then sending all the files simultaneously to your
    filesystem driver. It may have a facility that lets you batch up a whole bunch
    of file commands and execute them at once, which allows the hard drive to
    optimize its read operations and multiplex all the results together.
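
    One way to get those numbers is Ruby's standard Benchmark module (a
    sketch; process is a hypothetical stand-in for your per-file work):

        require 'benchmark'

        tms = Benchmark.measure do
          Dir.glob("data/*.txt").each { |path| process(File.read(path)) }
        end

        cpu = tms.utime + tms.stime
        puts "user: %.2fs  system: %.2fs  wall: %.2fs" % [tms.utime, tms.stime, tms.real]
        # If wall-clock time greatly exceeds CPU time, you are mostly waiting on IO.
        puts "approx. time waiting on IO: %.2fs" % (tms.real - cpu)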

    Don't do any of this unless you have a working program, _and_ you think it's
    slow, _AND_ your customers think it's slow. Premature optimization is the root
    of all evil.

    --
    Phlip
    Phlip, Feb 7, 2008
    #6
  7. Chris Richards

    John Carter Guest

    On Fri, 8 Feb 2008, Chris Richards wrote:

    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just to do
    > it sequentially?


    Prefer processes to threads on unix.

    Depends on whether you have multiple cores.

    Depends on what the file devices are. I have one small app where the
    fd's are sockets to machines that may or may not have a certain other
    application up. (The app finds out)

    I spin one thread per machine, and open all connections in
    parallel. The time to completion is the time for a single connect
    fail, which is about N times faster than testing each connection in
    series.
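
    That pattern might look like this (a sketch; the hosts, port, and
    timeout value are hypothetical):

        require 'socket'
        require 'timeout'

        hosts = %w[box1 box2 box3]
        PORT  = 4242

        threads = hosts.map do |host|
          Thread.new(host) do |h|
            begin
              Timeout.timeout(5) { TCPSocket.new(h, PORT).close }
              [h, :up]
            rescue StandardError, Timeout::Error
              [h, :down]
            end
          end
        end

        # All connects run in parallel, so total time is roughly one
        # connect timeout rather than hosts.size of them.
        threads.map(&:value).each { |h, status| puts "#{h}: #{status}" }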

    Depends also on data locality. Cache is many times faster than
    RAM. If you can live in cache, you go much faster. If multiple threads
    mean you spend less time in cache, you go much slower.


    John Carter Phone : (64)(3) 358 6639
    Tait Electronics Fax : (64)(3) 359 4632
    PO Box 1645 Christchurch Email :
    New Zealand
    John Carter, Feb 7, 2008
    #7
  8. 2008/2/7, MenTaLguY <>:
    > On Fri, 8 Feb 2008 05:21:51 +0900, Chris Richards <> wrote:
    > > I'm required to open 50+ files and parse the data in them. Would using
    > > multiple threads give me the best performance, or is it best just to do
    > > it sequentially?

    >
    > There's the same amount of IO bandwidth to go around no matter how many
    > threads you throw at the problem (and in practice if you add more threads you
    > start wasting bandwidth due to seeking and other overhead). Given that,
    > it's almost always best to do things sequentially.


    ... unless all the files reside on different IO devices, in which case
    parallel reading *can* be faster than sequential reading. If they are on
    the same filesystem I'd certainly prefer to read them sequentially.
    There might be a slight performance gain from decoupling reading,
    parsing (and probably output) into different threads. But that mostly
    depends on IO speed and processing complexity, and the slowest stage
    determines throughput - no matter what.
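
    For the multiple-device case, grouping files by the device they live
    on might look like this (a sketch; File::Stat#dev identifies the
    device, and process is hypothetical):

        paths = Dir.glob("data/**/*.txt")

        # One thread per device; files sharing a device stay sequential
        # within their thread, so each disk only sees sequential reads.
        threads = paths.group_by { |p| File.stat(p).dev }.map do |_dev, files|
          Thread.new { files.each { |f| process(File.read(f)) } }
        end

        threads.each(&:join)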

    > If you are using a native-threaded runtime (e.g. JRuby), and you can prove
    > that you aren't consuming most of the available IO bandwidth yet (e.g. because
    > parsing is taking longer than the IO), then _maybe_ consider using multiple
    > threads, but then you need to be careful to only use enough to consume the
    > available IO bandwidth and no more. If you want to use your IO bandwidth most
    > effectively, asynchronous IO (e.g. with libev, etc.) is often a better idea.


    Good points.

    Cheers

    robert

    --
    use.inject do |as, often| as.you_can - without end
    Robert Klemme, Feb 8, 2008
    #8
  9. Chris Richards

    James Tucker Guest

    Take a look at the Wide Finder implementations on Tim Bray's blog.

    It's quite interesting to see over there how little of a bottleneck
    the IO was. (Which seems to have been repeated a number of times here.)

    Whilst the test environment is probably drastically different from
    your own, it might be worth looking at how some of those solutions
    solved the problem; they also make for some good reading on the topic.

    On 7 Feb 2008, at 20:21, Chris Richards wrote:

    > I'm required to open 50+ files and parse the data in them. Would using
    > multiple threads give me the best performance, or is it best just
    > to do
    > it sequentially?
    >
    > Thanks
    > Chris
    > --
    > Posted via http://www.ruby-forum.com/.
    James Tucker, Feb 9, 2008
    #9
  10. I basically gave up on optimizing hard-disk I/O long ago. (In
    Ruby/EventMachine, I started adding an event-driven interface for disk
    files, and will probably complete it someday, but initial profiling showed
    relatively little benefit.)

    A big part of the problem is that different machines have different
    controller hardware, with a wide variance not only in raw performance, but
    also in caching strategies and in the way they schedule the physical seeks.
    Multispindle systems change the behavior yet again. You can develop on one
    machine hoping to get some level of performance improvement, and find a
    totally different behavior when you go to production.

    On Feb 9, 2008 12:19 PM, James Tucker <> wrote:

    > Take a look at the wide finder implementations on Tim Brays blog.
    >
    > It's quite interesting to see over there how little IO was a
    > bottleneck. (Which seems to have been repeated a number of times here).
    >
    > Whilst the test environment is probably drastically different from
    > your own, it might be worth looking at how some of those solutions
    > solved the problem, and also give you some good reading on the topic.
    >
    > On 7 Feb 2008, at 20:21, Chris Richards wrote:
    >
    > > I'm required to open 50+ files and parse the data in them. Would using
    > > multiple threads give me the best performance, or is it best just
    > > to do
    > > it sequentially?
    > >
    > > Thanks
    > > Chris
    > > --
    > > Posted via http://www.ruby-forum.com/.
    > >

    Francis Cianfrocca, Feb 10, 2008
    #10
  11. Chris Richards

    ara howard Guest

    On Feb 10, 2008, at 5:19 AM, Francis Cianfrocca wrote:

    > I basically gave up on optimizing hard-disk I/O long ago. (In
    > Ruby/EventMachine, I started adding an event-driven interface for disk
    > files, and will probably complete it someday, but initial profiling
    > showed
    > relatively little benefit.)
    >
    > A big part of the problem is that different machines have different
    > controller hardware, with a wide variance not only in raw
    > performance, but
    > also in caching strategies and in the way they schedule the physical
    > seeks.
    > Multispindle systems change the behavior yet again. You can develop
    > on one
    > machine hoping to get some level of performance improvement, and
    > find a
    > totally different behavior when you go to production.


    good advice. i've had quite a bit of experience optimizing large
    scale processing (really large) and seen that there is always an
    optimal io/cpu usage pattern (two processes per cpu on dual-cpu
    machines with dual disk controllers, etc) but also that it is *always*
    specific to the exact hardware setup. i agree that it's mostly
    impossible to come up with a generic solution.

    cheers.

    a @ http://codeforpeople.com/
    --
    share your knowledge. it's a way to achieve immortality.
    h.h. the 14th dalai lama
    ara howard, Feb 10, 2008
    #11
  12. wow.... just tried JRuby 1.1 on my script that opens a thousand files and
    processes them.

    Ruby (MRI):      11 seconds
    JRuby, 1st run:  3.3 seconds
    JRuby, 2nd run:  1.1 seconds

    very nice darlin!
    --
    Posted via http://www.ruby-forum.com/.
    Chris Richards, Feb 11, 2008
    #12
