Built for speed - mmap, threads

Discussion in 'C++' started by Michael, Feb 19, 2006.

  1. Michael

    Michael Guest

    I'm writing an application that decodes a file containing binary
    records. Each record is a particular event type. Each record is
    translated into ASCII and then written to a file. Each file contains
    the same events. At the moment each record is processed one after the
    other. It takes about 1m40s to process a large file containing 70,000
    records. Would my application benefit from multiple threads and mmap?

    If so what is the best way to manage the multiple output files? For
    example there are 20 event types. When parsing the file I identify the
    event type and build 20 lists. Then have 20 threads, each working on
    one event file.

    How do I extract this into classes?
     
    Michael, Feb 19, 2006
    #1

  2. Ian Collins

    Ian Collins Guest

    Well that all depends on how many cores you have to run them on, and is
    a bit OT here. You'll have better luck on comp.programming.threads, or
    one specific to your platform.
    Odds are that'll slow you down due to context switches, assuming you
    have fewer than 20 cores.
    Again, try comp.programming.threads, maybe with an example of how you
    think it could be done.
     
    Ian Collins, Feb 19, 2006
    #2

  3. Michael

    Michael Guest

    OK, thanks, will try threads. The target is 8 Sun SPARC IV dual-core CPUs.
     
    Michael, Feb 19, 2006
    #3
  4. : I'm writing an application that decodes a file containing binary
    : records. Each record is a particular event type. Each record is
    : translated into ASCII and then written to a file. Each file contains
    : the same events. At the moment each record is processed one after the
    : other. It takes about 1m40s to process a large file containing 70,000
    : records. Would my application benefit from multiple threads and mmap?

    You don't say how much processing is being performed on the events,
    or what the actual size of the file and of each record is.

    Using memory-mapping will typically help a lot if the performance
    is I/O bound. It also often simplifies the reading/processing of
    the data. So it is something I often do up front.
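
    For illustration, a minimal read-only mmap sketch (POSIX; the file
    name is just a placeholder and error handling is kept to a bare
    minimum):

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        const char* path = "records.bin";   /* hypothetical input file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; the kernel pages the data in on demand. */
        void* base = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        const char* bytes = static_cast<const char*>(base);
        (void)bytes;  /* parse records directly from 'bytes', no read() calls */

        munmap(base, st.st_size);
        close(fd);
        return 0;
    }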

    However, 100s for 70k records seems relatively long, so I would assume
    that you are doing quite a lot of processing. It is likely that this
    processing itself (its algorithms) could be improved quite a bit.
    You should use a profiler and find out what is being the most
    time-consuming -- you might find an obvious culprit.

    : If so what is the best way to manage the multiple output files? For
    : example there are 20 event types. When parsing the file I identify the
    : event type and build 20 lists. Then have 20 threads, each working on
    : one event file.

    Regarding the output: it might be good to prepare the output
    in a memory buffer, and to write/flush it in large chunks.
    But this all depends on your current memory usage, etc.
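
    As a rough sketch of that idea (the class and the buffer size are
    invented for illustration, not taken from your code):

    #include <cstddef>
    #include <fstream>
    #include <string>

    class BufferedWriter {
    public:
        explicit BufferedWriter(const char* path,
                                std::size_t flushAt = 1u << 20)  /* ~1 MB */
            : out_(path), flushAt_(flushAt) { buf_.reserve(flushAt_); }
        ~BufferedWriter() { flush(); }

        void append(const std::string& text) {
            buf_ += text;
            if (buf_.size() >= flushAt_) flush();
        }

        void flush() {
            if (!buf_.empty()) {
                out_.write(buf_.data(), buf_.size());  /* one large write */
                buf_.clear();
            }
        }

    private:
        std::ofstream out_;
        std::string   buf_;
        std::size_t   flushAt_;
    };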

    Using multiple threads will not automatically improve performance,
    unless you carefully craft your design based on a thorough analysis.
    Just creating one thread for each output file typically won't help.

    : How do I extract this into classes?

    What do you think you need classes for?


    By the way, your question has nothing to do with the C++ language,
    and therefore doesn't belong in this NG.
    Try a platform-specific forum?


    hth -Ivan
     
    Ivan Vecerina, Feb 19, 2006
    #4
  5. Greg

    Greg Guest

    The answer is a definite maybe. The threads question is highly hardware
    dependent. Multiple threads are most effective on machines with
    multiple processors. Otherwise, simply increasing the number of threads
    does not increase a machine's processing power to a like degree. In fact,
    because switching between threads entails some overhead, it is just as
    possible to wind up with too many threads as too few when guessing
    blindly at the optimal number.

    Since you have not provided a detailed description about the
    application's current memory use and I/O characteristics, it is
    impossible to say whether mmap would help or not. And the first order
    of business in any case has to be to profile the current app and find
    out how it is spending those 100 seconds. If 90% of that time is in
    parsing code, then no, mmap will be unlikely to help. If, on the other
    hand, a large portion of that time is spent in disk I/O operations (as
    is often the case), then yes, a few large read and write operations
    (instead of many little ones) will do more to improve performance than
    almost any other type of optimization. But without knowing the extent
    to which the current application has optimized its behavior, it's
    futile to estimate how much further its performance could be optimized.
    Unless the hardware has a lot of multiprocessing capability, 20 threads
    sound like far too many. But only profiling and testing various
    implementations will be able to find the optimal number of threads for
    this app running on a particular hardware configuration.

    As for the 20 event types, I would not do anything fancy. If the 20
    possible types are fixed, then declaring an array of 20 file handles
    and using an enum as an index into that array to find the
    corresponding file handle should suffice. Just avoid "magic numbers"
    like 20, and define const integral values in their place.
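
    Something along these lines, for instance (the event names and file
    names are made up purely for illustration):

    #include <cstdio>
    #include <fstream>

    /* Hypothetical event types; the real names come from the record format. */
    enum EventType { EV_LOGIN = 0, EV_LOGOUT, EV_TRADE, /* ... */ EV_COUNT = 20 };

    int main()
    {
        std::ofstream out[EV_COUNT];
        for (int i = 0; i < EV_COUNT; ++i) {
            char name[32];
            std::sprintf(name, "event_%02d.txt", i);  /* one file per type */
            out[i].open(name);
        }

        /* While decoding, pick the stream by event type: */
        EventType type = EV_TRADE;   /* ...determined from the record */
        out[type] << "decoded ASCII text for this record\n";
        return 0;
    }
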
    I'm not sure that a program that performs a linear processing task
    benefits a great deal from classes. Classes (and a class hierarchy)
    work best as a dynamic model - often one driven by an ever-changing
    series of events (often generated by the user's interaction with the
    application). A program that opens a file, parses its contents, closes
    the file and declares itself done is really conducting a series of
    predictable sequential operations. And the only reason for wanting to
    use classes here would be for maintainability (because I can't see that
    performance issues would ever mandate implementing classes).

    So the question to ask is whether classes would necessarily make the
    code more maintainable. A well-designed and implemented class model
    should, but a class model designed for its own sake would probably be
    harder to maintain, because a class hierarchy of any kind almost always
    increases the total complexity of a program (in other words, there is
    more code). But because code in a well-designed hierarchy better
    encapsulates its complexity, a programmer is able to work on the
    program's logic in smaller "pieces" (thereby reducing the complexity
    that the programmer has to deal with at any one time).

    Lastly, maintainability is a separate issue from performance, and one
    that should be addressed first. It wouldn't make sense to fine-tune the
    app's performance if its code is going to be thrown out and replaced
    with an object-oriented implementation in the final, shipping version.

    So to recap: first, decide whether a class hierarchy would improve the
    maintainability of the source code to such an extent that it would
    justify the additional work (and then implement it, if the decision is
    affirmative). Second, profile the app to obtain a precise accounting of
    the 100 seconds it spends processing records. Next, use that profile
    information to target bottlenecks: remedy them using standard
    optimization techniques (such as using fewer I/O requests by increasing
    the size of each request, or, if parsing is the bottleneck, using a
    table-driven parser for maximal speed). And lastly, the most important
    point: it's simply never effective to try to speed up a program without
    first learning why it is so slow.

    Greg
     
    Greg, Feb 19, 2006
    #5
  6. Michael

    Michael Guest

    Sorry, I was half thinking about how to write this.

    I know where all the time is being spent as I timed each task as I was
    developing. For each record I am setting a TCL array and then dumping
    to file. I still need to add logic but I am concentrating on raw speed
    at the moment end to end.

    I already have the decoding part which goes through the file and
    creates an index. It is 1 class. It takes about 2 seconds to create the
    index on a 30M file - 70000 records. The index is public so I can
    directly access this index to get an offset to different parts of the
    file. The file is loaded into memory at startup but I will eventually
    mmap it - once I work out how to and if it makes a difference to
    performance.

    I was thinking of creating another class which would be the decode
    thread manager. This would decide how many threads were needed for a
    particular file, create the threads and then balance the load on each
    thread by deciding which records each thread would process. A thread
    would store output data in a buffer which would then be copied and
    flushed to file. Memory isn't a problem; I have 32GB to play around with.
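
    Roughly the shape I have in mind (a sketch only - all names are
    placeholders, error checking is left out, and the real code would use
    Solaris/POSIX threads as shown here):

    #include <pthread.h>
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <vector>

    /* Placeholder for the real per-record decode step. */
    std::string decodeRecord(std::size_t offset)
    {
        return "decoded record\n";
    }

    struct Slice {
        const std::vector<std::size_t>* index;  /* offsets of all records   */
        std::size_t begin, end;                 /* records for this thread  */
        std::string output;                     /* per-thread output buffer */
    };

    extern "C" void* decodeRange(void* arg)
    {
        Slice* s = static_cast<Slice*>(arg);
        for (std::size_t i = s->begin; i < s->end; ++i)
            s->output += decodeRecord((*s->index)[i]);
        return 0;
    }

    void decodeAll(const std::vector<std::size_t>& index, unsigned nThreads)
    {
        std::vector<Slice> slices(nThreads);
        std::vector<pthread_t> threads(nThreads);
        std::size_t chunk = (index.size() + nThreads - 1) / nThreads;

        for (unsigned t = 0; t < nThreads; ++t) {
            slices[t].index = &index;
            slices[t].begin = t * chunk;
            slices[t].end   = std::min((t + 1) * chunk, index.size());
            pthread_create(&threads[t], 0, decodeRange, &slices[t]);
        }
        for (unsigned t = 0; t < nThreads; ++t)
            pthread_join(threads[t], 0);

        /* slices[t].output can now be flushed to the output files
           in large sequential writes. */
    }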
     
    Michael, Feb 19, 2006
    #6
  7. Michael

    Michael Guest

    Thanks Greg,

    I'm using C++ because I haven't used C for ages and don't want to mess
    around with memory management and pointers - core dumps. It's quicker
    for me to write code to store things like configuration in vectors and
    let them deal with cleaning up memory. I only have one 'new/delete' and
    that is to create a large buffer to hold the contents of the file in
    memory - this will eventually disappear once I get mmap working - in
    cygwin/g++. I'm not that bothered about memory overhead of using
    vectors as I've got 32GB to work with.

    I did some rough profiling. Without writing to file, the processing
    (parsing the file, setting internal TCL variables) maximises the CPU
    usage. With writing to disk, the CPU usage goes down to 35% (2 CPU
    Sparc III) and there is I/O wait. So with threads and mmap I'm hoping
    that I will make maximum use of the available hardware.

    Michael
     
    Michael, Feb 19, 2006
    #7
  8. : Sorry, I was half thinking about how to write this.
    :
    : I know where all the time is being spent as I timed each task as I was
    : developing. For each record I am setting a TCL array and then dumping
    : to file. I still need to add logic but I am concentrating on raw speed
    : at the moment end to end.
    I do not know what a TCL array is (TCL/TK, Think Class Library, or??).

    : I already have the decoding part which goes through the file and
    : creates an index. It is 1 class. It takes about 2 seconds to create the
    : index on a 30M file - 70000 records. The index is public so I can
    : directly access this index to get an offset to different parts of the
    : file. The file is loaded into memory at startup but I will eventually
    : mmap it - once I work out how to and if it makes a difference to
    : performance.
    You say you did time measurements, yet you only account for 2 seconds
    out of 100. Using a profiler will highlight the hot spots in your
    program down to a single line. Only this will allow you to identify,
    for example, that you spend too much time in memory allocations,
    or searches, and allow you to optimize your algorithms and data
    structures.

    : I was thinking of creating another class which would be the decode
    : thread manager. This would decide how many threads were needed for a
    : particular file, create the threads and then balance the load on each
    : thread by deciding which records each thread would process. A thread
    : would store output data in a buffer which would then be copied and
    : flush to file. Memory isn't a problem I have 32GB to play around with.
    Good in terms of caching file outputs.
    Keep in mind, though, that memory accesses are nowadays what takes
    the most time in all simple-to-moderately complex processing algos.
    Avoiding reallocations, and using contiguous memory accesses, can
    make a real difference.
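
    For example, reserving capacity up front avoids repeated reallocation
    and copying as output accumulates (the sizes below are illustrative
    only):

    #include <cstddef>
    #include <string>
    #include <vector>

    int main()
    {
        std::string out;
        out.reserve(4u * 1024 * 1024);   /* one big allocation, not many */

        std::vector<std::size_t> offsets;
        offsets.reserve(70000);          /* record count known from the index */

        /* ... append to 'out' and 'offsets' during decoding ... */
        return 0;
    }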

    Again, don't bother using threads until you have analyzed the
    performance profile of your application.


    Ivan
     
    Ivan Vecerina, Feb 19, 2006
    #8
