Built for speed - mmap, threads


Michael

I'm writing an application that decodes a file containing binary
records. Each record is a particular event type. Each record is
translated into ASCII and then written to a file. Each file contains
the same events. At the moment each record is processed one after the
other. It takes about 1m40s to process a large file containing 70,000
records. Would my application benefit from multiple threads and mmap?

If so, what is the best way to manage the multiple output files? For
example, there are 20 event types. When parsing the file I identify the
event type and build 20 lists. Then have 20 threads, each working with
its own event file.

How do I extract this into classes?
 

Ian Collins

Michael said:
: I'm writing an application that decodes a file containing binary
: records. Each record is a particular event type. Each record is
: translated into ASCII and then written to a file. Each file contains
: the same events. At the moment each record is processed one after the
: other. It takes about 1m40s to process a large file containing 70,000
: records. Would my application benefit from multiple threads and mmap?

Well, that all depends on how many cores you have to run them on, and is
a bit OT here. You'll have better luck on comp.programming.threads, or
one specific to your platform.

: If so what is the best way to manage the multiple output files? For
: example there are 20 event types. When parsing the file I identify the
: event type and build 20 lists. Then have 20 threads each working with
: each event file.

Odds are that'll slow you down due to context switches, assuming you
have fewer than 20 cores.

: How do I extract this into classes?

Again, try comp.programming.threads, maybe with an example of how you
think it could be done.
 

Ivan Vecerina

: I'm writing an application that decodes a file containing binary
: records. Each record is a particular event type. Each record is
: translated into ASCII and then written to a file. Each file contains
: the same events. At the moment each record is processed one after the
: other. It takes about 1m40s to process a large file containing 70,000
: records. Would my application benefit from multiple threads and mmap?

You don't say how much processing is being performed on the events,
or what the actual size of the file and of each record is.

Using memory-mapping will typically help a lot if the performance
is i/o bound. It also often simplifies the reading/processing of
the data. So it is something I often do up front.
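Memory-mapping is platform-specific, so this is only a minimal POSIX sketch (assuming a POSIX-like system such as Solaris or Cygwin, both mentioned later in the thread); error handling is kept to a bare minimum:

```cpp
// Sketch: memory-mapping an input file read-only on a POSIX system.
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map a whole file read-only; returns the base pointer, size via out-param.
const char* map_file(const char* path, std::size_t& size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return 0;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return 0; }
    size = static_cast<std::size_t>(st.st_size);
    void* p = mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                 // the mapping stays valid after close()
    return p == MAP_FAILED ? 0 : static_cast<const char*>(p);
}

void unmap_file(const char* p, std::size_t size)
{
    munmap(const_cast<char*>(p), size);
}
```

The whole file then looks like one contiguous array of bytes, which also simplifies the record-parsing code.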

However, 100s for 70k records seems relatively long, so I would assume
that you are doing quite a lot of processing. It is likely that this
processing itself (its algorithms) could be improved quite a bit.
You should use a profiler and find out what is the most
time-consuming part -- you might find an obvious culprit.

: If so what is the best way to manage the multiple output files? For
: example there are 20 event types. When parsing the file I identify the
: event type and build 20 lists. Then have 20 threads each working with
: each event file.

Regarding the output: it might be good to prepare the output
in a memory buffer, and to write/flush it in large chunks.
But this all depends on your current memory usage, etc.
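A minimal sketch of that idea; the 1 MB flush threshold is an arbitrary assumption to be tuned by measurement:

```cpp
// Sketch: accumulate decoded ASCII in a per-file memory buffer and
// flush it to disk in large chunks instead of one write per record.
#include <cstddef>
#include <cstdio>
#include <string>

class BufferedWriter {
public:
    explicit BufferedWriter(const char* path, std::size_t threshold = 1 << 20)
        : file_(std::fopen(path, "wb")), threshold_(threshold)
    { buf_.reserve(threshold_); }

    ~BufferedWriter() { flush(); if (file_) std::fclose(file_); }

    void write(const std::string& record)
    {
        buf_ += record;
        if (buf_.size() >= threshold_) flush();  // one big write, not many small
    }

    void flush()
    {
        if (file_ && !buf_.empty()) {
            std::fwrite(buf_.data(), 1, buf_.size(), file_);
            buf_.clear();
        }
    }
private:
    std::FILE*  file_;
    std::size_t threshold_;
    std::string buf_;
};
```

With one such writer per output file, the number of I/O requests drops from one per record to a handful per file.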

Using multiple threads will not automatically improve performance,
unless you carefully craft your design based on a thorough analysis.
Just creating one thread for each output file typically won't help.

: How do I extract this into classes?

What do you think you need classes for?


By the way, your question has nothing to do with the C++ language,
and therefore doesn't belong in this NG.
Try a platform-specific forum?


hth -Ivan
 

Greg

Michael said:
: I'm writing an application that decodes a file containing binary
: records. Each record is a particular event type. Each record is
: translated into ASCII and then written to a file. Each file contains
: the same events. At the moment each record is processed one after the
: other. It takes about 1m40s to process a large file containing 70,000
: records. Would my application benefit from multiple threads and mmap?

The answer is a definite maybe. The threads question is highly hardware
dependent. Multiple threads are most effective on machines with
multiple processors. Otherwise, simply increasing the number of threads
does not increase a machine's processing power to a like degree. In fact,
because switching between threads entails some overhead, it is just as
possible to wind up with too many threads instead of too few when
guessing blindly for the optimal number.

Since you have not provided a detailed description about the
application's current memory use and I/O characteristics, it is
impossible to say whether mmap would help or not. And the first order
of business in any case has to be to profile the current app and find
out how it is spending those 100 seconds. If 90% of that time is in
parsing code, then no, mmap will be unlikely to help. If, on the other
hand, a large portion of that time is spent in disk I/O operations (as
is often the case), then yes, a few large read and write operations
(instead of many little ones) will do more to improve performance than
almost any other type of optimization. But without knowing the extent
to which the current application has optimized its behavior, it's
futile to estimate how much further its performance could be optimized.

: If so what is the best way to manage the multiple output files? For
: example there are 20 event types. When parsing the file I identify the
: event type and build 20 lists. Then have 20 threads each working with
: each event file.

Unless the hardware has a lot of multiprocessing capability, 20 threads
sounds like far too many. But only profiling and testing various
implementations will find the optimal number of threads for
this app running on a particular hardware configuration.

As for the 20 event types, I would not do anything fancy. If the 20
possible types are fixed, then declaring an array of 20 file handles
and using an enum as an index into that array to find the
corresponding file handle should suffice. Just avoid "magic numbers"
like 20, and define const integral values in their place.
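A sketch of that suggestion; the event names and file-naming scheme are hypothetical placeholders, since the thread never names the real 20 types:

```cpp
// Sketch: an enum of event types used as the index into a fixed array
// of output streams, so there are no magic numbers in the lookup.
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical event names; EV_COUNT doubles as the array size.
enum EventType { EV_LOGIN, EV_LOGOUT, EV_TRADE, EV_COUNT = 20 };

class EventFiles {
public:
    // Opens one output file per event type: prefix_0.txt .. prefix_19.txt.
    explicit EventFiles(const std::string& prefix)
    {
        for (int i = 0; i < EV_COUNT; ++i) {
            std::ostringstream name;
            name << prefix << '_' << i << ".txt";
            files_[i].open(name.str().c_str());
        }
    }
    // The enum indexes straight into the array.
    std::ofstream& file_for(EventType t) { return files_[t]; }
private:
    std::ofstream files_[EV_COUNT];
};
```

A record of type `EV_TRADE` is then written with `out.file_for(EV_TRADE) << text;`, and adding a 21st event type means extending the enum, not hunting for hard-coded 20s.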

: How do I extract this into classes?

I'm not sure that a program that performs a linear processing task
benefits a great deal from classes. Classes (and a class hierarchy)
work best as a dynamic model - often one driven by an ever-changing
series of events (often generated by the user's interaction with the
application). A program that opens a file, parses its contents, closes
the file and declares itself done is really conducting a series of
predictable sequential operations. And the only reason for wanting to
use classes here would be for maintainability (because I can't see that
performance issues would ever mandate implementing classes).

So the question to ask is whether classes would necessarily make the
code more maintainable. A well-designed and implemented class model
should, but a class model designed for its own sake would
probably be harder to maintain, because a class hierarchy of any kind
almost always increases the total complexity of a program (in other
words, there is more code). But because code in a well-designed
hierarchy better encapsulates its complexity, a programmer is able to
work on the program's logic in smaller "pieces" (thereby reducing the
complexity that the programmer has to deal with at any one time).

Lastly, maintainability is a separate issue from performance, and one
that should be addressed first. It wouldn't make sense to fine-tune the
app's performance if its code is going to be thrown out and replaced
with an object-oriented implementation in the final, shipping version.

So to recap: first, decide whether (and then implement, if the decision
is affirmative) a class hierarchy would improve the maintainability of
the source code to such an extent that would justify the additional
work. Second, profile the app to obtain a precise accounting of the 100
seconds it spends processing records. Next, use that profile
information to target bottlenecks: remedy them using standard
optimization techniques (such as using fewer I/O requests by increasing
the size of each request, or, if parsing is the bottleneck, using a
table-driven parser for maximal speed). And lastly, the most important
point: it's
simply never effective to try to speed up a program, without first
learning why it is so slow.

Greg
 

Michael

Sorry, I was half thinking about how to write this.

I know where all the time is being spent, as I timed each task as I was
developing. For each record I am setting a TCL array and then dumping
it to file. I still need to add logic, but I am concentrating on raw
end-to-end speed at the moment.

I already have the decoding part which goes through the file and
creates an index. It is one class. It takes about 2 seconds to create the
index on a 30M file - 70000 records. The index is public so I can
directly access this index to get an offset to different parts of the
file. The file is loaded into memory at startup but I will eventually
mmap it - once I work out how to and if it makes a difference to
performance.
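A sketch of what such an index-building pass might look like; the record layout (a 2-byte type followed by a 4-byte little-endian length) is purely an assumption, since the actual binary format is never described in the thread:

```cpp
// Sketch: one linear walk over the in-memory file image, recording each
// record's byte offset and event type. Assumed (hypothetical) layout:
// [2-byte type][4-byte little-endian payload length][payload...]
#include <cstddef>
#include <vector>

struct RecordRef {
    std::size_t offset;  // byte offset of the record in the file image
    int         type;    // event type parsed from the header
};

std::vector<RecordRef> build_index(const unsigned char* data, std::size_t size)
{
    std::vector<RecordRef> index;
    std::size_t pos = 0;
    while (pos + 6 <= size) {            // 6 = header size
        RecordRef r;
        r.offset = pos;
        r.type   = data[pos] | (data[pos + 1] << 8);
        std::size_t len = data[pos + 2]
                        | (data[pos + 3] << 8)
                        | (static_cast<std::size_t>(data[pos + 4]) << 16)
                        | (static_cast<std::size_t>(data[pos + 5]) << 24);
        index.push_back(r);
        pos += 6 + len;                  // skip over the payload
    }
    return index;
}
```

The index then gives random access to any record, which is what makes it possible to hand disjoint record ranges to different threads later.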

I was thinking of creating another class which would be the decode
thread manager. This would decide how many threads were needed for a
particular file, create the threads and then balance the load on each
thread by deciding which records each thread would process. A thread
would store output data in a buffer which would then be copied and
flushed to file. Memory isn't a problem; I have 32GB to play around with.
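The core of that thread-manager idea might be sketched like this; it uses modern `std::thread` for brevity (an application of this era would have used pthreads), and `decode_record` is a hypothetical stand-in for the real TCL-array decoding:

```cpp
// Sketch: split the record index into contiguous slices, give each
// worker thread its own slice and its own output buffer, and merge the
// buffers only after all workers have joined - no shared writes.
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for decoding one record into ASCII.
std::string decode_record(int record)
{
    return "record " + std::to_string(record) + "\n";
}

std::vector<std::string> decode_all(int num_records, unsigned num_threads)
{
    std::vector<std::string> buffers(num_threads);  // one buffer per thread
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread owns records [first, last) of the index.
            int first = num_records * t / num_threads;
            int last  = num_records * (t + 1) / num_threads;
            for (int r = first; r < last; ++r)
                buffers[t] += decode_record(r);
        });
    }
    for (std::thread& w : workers)
        w.join();
    return buffers;  // caller concatenates and writes them in one pass
}
```

Because each thread touches only its own buffer and its own record range, no locking is needed during decoding; synchronization happens once, at the join.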
 

Michael

Thanks Greg,

I'm using C++ because I haven't used C for ages and don't want to mess
around with memory management and pointers - core dumps. It's quicker
for me to write code to store things like configuration in vectors and
let them deal with cleaning up memory. I only have one 'new/delete' and
that is to create a large buffer to hold the contents of the file in
memory - this will eventually disappear once I get mmap working - in
cygwin/g++. I'm not that bothered about memory overhead of using
vectors as I've got 32GB to work with.

I did some rough profiling. Without writing to file, the processing
(parsing the file, setting internal TCL variables) maximises the CPU
usage. With writing to disk, the CPU usage goes down to 35% (2-CPU
Sparc III) and there is I/O wait. So with threads and mmap I'm hoping
that I will make maximum use of the available hardware.

Michael
 

Ivan Vecerina

: Sorry, I was half thinking about how to write this.
:
: I know where all the time is being spent as I timed each task as I was
: developing. For each record I am setting a TCL array and then dumping
: to file. I still need to add logic but I am concentrating on raw speed
: at the moment end to end.
I do not know what a TCL array is (TCL/TK, Think Class Library, or??).

: I already have the decoding part which goes through the file and
: creates an index. It is one class. It takes about 2 seconds to create the
: index on a 30M file - 70000 records. The index is public so I can
: directly access this index to get an offset to different parts of the
: file. The file is loaded into memory at startup but I will eventually
: mmap it - once I work out how to and if it makes a difference to
: performance.
You say you did time measurements, yet you only account for 2 sec
out of 100. Using a profiler will highlight the hot spots in your
program down to a single line. Only this will allow you to identify,
for example, that you spend too much time in memory allocations,
or searches, and allow you to optimize your algorithms and data
structures.

: I was thinking of creating another class which would be the decode
: thread manager. This would decide how many threads were needed for a
: particular file, create the threads and then balance the load on each
: thread by deciding which records each thread would process. A thread
: would store output data in a buffer which would then be copied and
: flush to file. Memory isn't a problem I have 32GB to play around with.
Good in terms of caching file outputs.
Keep in mind, though, that memory accesses are nowadays what takes
the most time in all simple-to-moderately complex processing algos.
Avoiding reallocations, and using contiguous memory accesses, can
make a real difference.
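A small illustration of the reallocation point: with ~70,000 records, reserving the final size up front means the index vector allocates once instead of repeatedly regrowing and copying its contents.

```cpp
// Sketch: reserve capacity before filling a vector to avoid the
// reallocate-and-copy cycle as it grows.
#include <vector>

std::vector<int> collect(int n)
{
    std::vector<int> v;
    v.reserve(n);               // one allocation instead of repeated regrows
    for (int i = 0; i < n; ++i)
        v.push_back(i);
    return v;
}
```

The same principle applies to the output buffers: reserving their expected size keeps the appends cheap and the memory contiguous.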

Again, don't bother using threads until you have analyzed the
performance profile of your application.


Ivan
 
