Boost.bind and performance penalties?

Rune Allnor

Hi all.

I have this application which is heavy on computations
on data scattered in RAM. The task is to generate
a triangulated surface from a set of scattered (x,y,z)
data. In this application the memory management is
a big issue, while the computations are relatively small
and uncomplicated, so run-time would be expected
to split about 50/50 between memory management and
computations.

I expect to see a 50% cut in run-time by rewriting
the program from vector indexing, presently of the form

std::vector<double> v;
double x;
std::vector<size_t> i;
size_t n;

x = v[i[n]];

to pointer dereferencing, _along_the_lines_of_

std::vector<double> v;
double x;
std::vector<double*> i;
size_t n;

// initialize the elements of i to point to elements of v,
// e.g. i[n] = &v[k];
x = *i[n];

The point is that I can't afford much run-time overhead, or
it will significantly impact the application.

Now, there are several variants of computations that I can
do on the data x. I'd like to use function objects or pointers
to specify how to do the computations, and am thinking of
using boost.bind to declare these function pointers when
calling this library.

What run-time overheads, if any, would be expected when
using function pointers, as opposed to 'hard-wiring' the
computations inside the library?

Rune
 
Edek

> Hi all. [snip]
> The point is that I can't afford much run-time overhead, or
> it will significantly impact the application.
>
> Now, there are several variants of computations that I can
> do on the data x. I'd like to use function objects or pointers
> to specify how to do the computations, and am thinking of
> using boost.bind to declare these function pointers when
> calling this library.
>
> What run-time overheads, if any, would be expected when
> using function pointers, as opposed to 'hard-wiring' the
> computations inside the library?

I guess it depends on how complicated the functions are, in two ways.

One: if they take a long time, the call overhead is _relatively_ small.

Two: how complicated they are, which impacts optimisation possibilities. E.g.
simple multiplication can be SSE'd ('vectorized') if inlined or
deduced by the compiler in some other way. Such optimisations cannot be
done - in most cases - when the call is indirect enough that the
compiler cannot be sure it knows the called function body. This can be
significant.

In general bind() is not light, and even function pointers, or even
worse pointers to members, should not be called in a tight loop, if only
for the call itself, let alone missed optimisations.

These are mostly rules of thumb. If you have a lot of time to spend on
programming, use expression templates, or inline at least. If you
cannot, use boost::bind and just don't worry.
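
By way of illustration, here is a sketch of the difference (ScaleAndShift
and the two function names are made up for the example, not from any
library):

#include <vector>
#include <cstddef>

// Hypothetical 'computation': a handful of trivial math ops.
struct ScaleAndShift {
    double a, b;
    double operator()(double x) const { return a * x + b; }
};

// Template version: the functor body is visible at the call site,
// so the compiler can inline it and vectorize the loop.
template <class Op>
double sum_inline(const std::vector<double>& v, Op op) {
    double s = 0.0;
    for (std::size_t k = 0; k < v.size(); ++k)
        s += op(v[k]);
    return s;
}

// Indirect version: the target is opaque at compile time, so the
// compiler emits a real call per element and loses those optimisations.
double sum_indirect(const std::vector<double>& v, double (*op)(double)) {
    double s = 0.0;
    for (std::size_t k = 0; k < v.size(); ++k)
        s += op(v[k]);
    return s;
}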

HTH
Edek
 
Rune Allnor

>> Hi all. [snip]
>> The point is that I can't afford much run-time overhead, or
>> it will significantly impact the application.
>> Now, there are several variants of computations that I can
>> do on the data x. I'd like to use function objects or pointers
>> to specify how to do the computations, and am thinking of
>> using boost.bind to declare these function pointers when
>> calling this library.
>> What run-time overheads, if any, would be expected when
>> using function pointers, as opposed to 'hard-wiring' the
>> computations inside the library?

> I guess it depends on how complicated the functions are, in two ways.
>
> One: if they take a long time, the call overhead is _relatively_ small.

The functions are small. 8-10 trivial math operations.
> Two: how complicated they are, which impacts optimisation possibilities. E.g.
> simple multiplication can be SSE'd ('vectorized') if inlined or
> deduced by the compiler in some other way. Such optimisations cannot be
> done - in most cases - when the call is indirect enough that the
> compiler cannot be sure it knows the called function body. This can be
> significant.

I suppose this is the issue I am worried about.
> In general bind() is not light, and even function pointers, or even
> worse pointers to members, should not be called in a tight loop, if only
> for the call itself, let alone missed optimisations.

I'll have to think about this. The reason the question came up
in the first place was that I found I could use some of the
same functionality I implemented for the (x,y,z) surface data stuff
when playing with various GUI features like points on the screen:

How can I efficiently hide what type of point I play with, GUI or
measurements, from the data? And how can I access the coordinates
in the point types effectively without either hard-coding the
access functions in the library or paying too large a price on
performance?

Maybe I can accept doing only one call to fetch the data.
Maybe I can come up with a specialization for the performance-critical
stuff. There are several options here.
> These are mostly rules of thumb. If you have a lot of time to spend on
> programming, use expression templates, or inline at least. If you
> cannot, use boost::bind and just don't worry.

Can't afford not to worry. Run-time is a critical factor here.

Thanks.

Rune
 
Edek

> The functions are small. 8-10 trivial math operations.

When I was talking about time, this somehow includes the scattered access
which you mentioned before. Uncached RAM access takes a long time
compared to a couple of simple ops.

Which reminds me: I do not understand why replacing vector offsets with a
vector of pointers should speed things up. A vector offset boils down to
pointer + offset.

You might want to consider a 'manual prefetch' if the vector fits into
cache and a large part of it is used. The rule here is that the CPU has a
hardware prefetcher, which kicks in during sequential access; otherwise,
manual prefetch might help. Not the first thing to try though, and a bit
compiler-specific or assembly. (I am talking about x86.)
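
For illustration, a sketch of such a manual prefetch using GCC/Clang's
__builtin_prefetch (idx here is a made-up index vector describing the
scattered access order; the prefetch distance must be tuned by measuring):

#include <vector>
#include <cstddef>

double sum_scattered(const std::vector<double>& v,
                     const std::vector<std::size_t>& idx)
{
    const std::size_t ahead = 8;   // prefetch distance, tune by measurement
    double s = 0.0;
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (k + ahead < idx.size())
            __builtin_prefetch(&v[idx[k + ahead]]);  // a hint; may be ignored
        s += v[idx[k]];
    }
    return s;
}
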
> I suppose this is the issue I am worried about.

Make a test and measure.
> I'll have to think about this. The reason the question came up
> in the first place was that I found I could use some of the
> same functionality I implemented for the (x,y,z) surface data stuff
> when playing with various GUI features like points on the screen:
>
> How can I efficiently hide what type of point I play with, GUI or
> measurements, from the data? And how can I access the coordinates
> in the point types effectively without either hard-coding the
> access functions in the library or paying too large a price on
> performance?

Inline getters (would that be 'hardcoding'? ;) ), or inline
MyPoint const& getPoint() const, or inherit from MyPoint, or
template<class MyOp> doOpOnPoint(MyOp& op) on some point class, or ...
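
A sketch of the template variant (GuiPoint and for_each_point are
illustrative names, not from any library):

#include <vector>
#include <cstddef>

struct GuiPoint {
    double x_, y_, z_;
    // defined in-class, hence implicitly inline: access compiles
    // down to plain member loads, no call overhead
    double x() const { return x_; }
    double y() const { return y_; }
    double z() const { return z_; }
};

// both the point type and the op are template parameters,
// so nothing here is a 'blackbox' call
template <class Point, class Op>
void for_each_point(const std::vector<Point>& pts, Op& op) {
    for (std::size_t k = 0; k < pts.size(); ++k)
        op(pts[k].x(), pts[k].y(), pts[k].z());
}
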
> Maybe I can accept doing only one call to fetch the data.
> Maybe I can come up with a specialization for the performance-critical
> stuff. There are several options here.

This might be a good idea. Easiest: fetch into a vector, pass the vector to
the function. The function would be working locally on a vector; but
then, fetching is an operation too - it is simple, but takes some time.
It depends; measure a test. If the functions (ops) are separate from the data,
a wrapper can be made applying an op on a vector. I guess the options are
limited by what you actually want to do with the result.
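
A sketch of that gather step (the names are illustrative):

#include <vector>
#include <cstddef>

// one scattered pass to pull the elements into a contiguous scratch
// buffer; afterwards the hot computation runs over sequential memory,
// which the hardware prefetcher handles well
void gather(const std::vector<double>& v,
            const std::vector<std::size_t>& idx,
            std::vector<double>& scratch)
{
    scratch.resize(idx.size());
    for (std::size_t k = 0; k < idx.size(); ++k)
        scratch[k] = v[idx[k]];
}

Whether the extra copy pays off is exactly the kind of thing to measure.
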
> Can't afford not to worry. Run-time is a critical factor here.

Still, measure first, make a simple test, measure again. There are more
things in the CPU than are dreamt of in programming.
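
Even a trivial harness tells a lot; a sketch (std::clock is coarse, so
run enough iterations):

#include <ctime>

// times 'iterations' calls of a zero-argument functor; returns seconds
template <class Fn>
double time_it(Fn f, int iterations) {
    std::clock_t t0 = std::clock();
    for (int k = 0; k < iterations; ++k)
        f();
    return double(std::clock() - t0) / CLOCKS_PER_SEC;
}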
 
Rune Allnor

> When I was talking about time, this somehow includes the scattered access
> which you mentioned before. Uncached RAM access takes a long time
> compared to a couple of simple ops.
>
> Which reminds me: I do not understand why replacing vector offsets with a
> vector of pointers should speed things up. A vector offset boils down to
> pointer + offset.

Not quite:

1) ElementPointer = BasePointer + ElementOffset
2) Element = dereference(ElementPointer)

while the pointer semantics is only half that:

A) Element = dereference(ElementPointer)

In this application the difference matters. The issue is a bit
more involved than what I posted first (there might be 3-5 levels
of indirection through pointers to pointers). Maybe rewriting to
pointer semantics instead of vector semantics saves me 30% run time;
maybe it saves 60%. I don't know until I rewrite.

But it's that order of magnitude we are talking about. I don't want
to blow those savings, or more, on an expensive function indirection.
> You might want to consider a 'manual prefetch' if the vector fits into
> cache and a large part of it is used. The rule here is that the CPU has a
> hardware prefetcher, which kicks in during sequential access; otherwise,
> manual prefetch might help. Not the first thing to try though, and a bit
> compiler-specific or assembly. (I am talking about x86.)

The problem is that the data points are scattered in RAM.
Cache organization is at least as hard as the main task,
which is to organize the data points. Cache organization
just uses a different criterion.

[snip]
> Still, measure first, make a simple test, measure again. There are more
> things in the CPU than are dreamt of in programming.

The performance factors only kick in visibly at large volumes.
No such thing as a 'simple test': either rewrite the whole thing
or nothing at all.

Rune
 
Edek

> Not quite:
>
> 1) ElementPointer = BasePointer + ElementOffset
> 2) Element = dereference(ElementPointer)
>
> while the pointer semantics is only half that:
>
> A) Element = dereference(ElementPointer)

Not quite, if I understand your first post:

1) Pointer = deref(const_pointer + pointerOffset)
2) Element = deref(Pointer)

vs.

1) ElementOffset = deref(const_pointer + elementOffsetOffset)
2) Element = deref(const_pointer + ElementOffset)

Step 1 is sequential I guess, step 2 is scattered. In both cases step 2 is
dependent on step 1. deref(pointer + offset) is a single instruction on most
CPUs if not all, although somehow split into micro-ops - still, a RAM fetch
is way more than an address addition.

And it is not even that simple addressing. Step 2 should be an address plus
a member offset - v[off].x, that is, or deref([const_pointer + ElementOffset] +
const-x-offset).
> In this application the difference matters. The issue is a bit
> more involved than what I posted first (there might be 3-5 levels
> of indirection through pointers to pointers). Maybe rewriting to
> pointer semantics instead of vector semantics saves me 30% run time;
> maybe it saves 60%. I don't know until I rewrite.

I don't know. I usually think in such cases that the compiler will do
such things for you, if you do not prevent it: the generated assembly does
not match line-by-line what is in the .cpp file. It is a simple
transformation. The compiler stops further transformations when faced
with something it does not know for sure. Indirection might be one of
them, true; a 'blackbox' call is for sure another. I still see no
significant difference between returning an offset and returning a
pointer, at least in a loop.

Making everything a template, or just inline, makes no 'blackbox' calls;
fewer indirections should help too.
> But it's that order of magnitude we are talking about. I don't want
> to blow those savings, or more, on an expensive function indirection.

Probably boost::bind is what you need to avoid. std::for_each or any
similar template would be better I guess.
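
For example (Accumulate is a made-up functor; note that std::for_each
returns its functor by value, so the result comes back):

#include <algorithm>
#include <vector>

struct Accumulate {
    double sum;
    Accumulate() : sum(0.0) {}
    void operator()(double x) { sum += x * x; }   // some trivial op
};

// usage:
//   Accumulate r = std::for_each(v.begin(), v.end(), Accumulate());
//   // r.sum holds the result, and the functor body can be inlined
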
> The performance factors only kick in visibly at large volumes.
> No such thing as a 'simple test': either rewrite the whole thing
> or nothing at all.

Maybe. I use simple tests to learn the mechanisms, on large volumes. It
might be easier than rewriting the whole thing and learning later that
it was not the right way to go. Which, by the way, happens during such
changes; performance is too often heuristic-like. What is clearly an
advantage is changes like O(n^2) to O(log n), but we're not talking
about that now.

Edek
 
Rune Allnor

> Not quite, if I understand your first post:

Don't take that post too literally; what *really*
goes on is something like (in vector semantics)

struct element_t { size_t a; size_t b; size_t c; };
std::vector<element_t> v;
size_t n, m;

m = v[v[v[n].a].b].c;

where one cannot predict any relation between n and the members of v[n].

I think this can be more efficient if written along the
lines of something like

struct Element_t { Element_t* A; Element_t* B; size_t C; };
std::vector<Element_t> v;
size_t n, m;

m = v[n].A->B->C;
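
Building the links from the index-based version might look like this (a
sketch; w stands for the old index-based vector, and v must be at its
final size before taking addresses, since reallocation would invalidate
the pointers):

std::vector<Element_t> v(w.size());
for (size_t k = 0; k < w.size(); ++k) {
    v[k].A = &v[w[k].a];   // link follows the old index chain
    v[k].B = &v[w[k].b];
    v[k].C = w[k].c;
}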

[snip]
> I don't know. I usually think in such cases that the compiler will do
> such things for you, if you do not prevent it: the generated assembly does
> not match line-by-line what is in the .cpp file. It is a simple
> transformation. The compiler stops further transformations when faced
> with something it does not know for sure. Indirection might be one of
> them, true; a 'blackbox' call is for sure another. I still see no
> significant difference between returning an offset and returning a
> pointer, at least in a loop.

Again, my first few posts were maybe too simplistic. The example above
may still be too simplistic, but it gives some impression of what is
actually going on.
> Making everything a template, or just inline, makes no 'blackbox' calls;
> fewer indirections should help too.

Can't do that, given the nature of the problem to be solved.
> Probably boost::bind is what you need to avoid. std::for_each or any
> similar template would be better I guess.

It's two separate problems:

1) Getting fast indirections
2) Applying functions without overhead.

Problem 1 is not really essential for problem 2, other than as
a demonstration of what kinds of overheads are damaging.
> Maybe. I use simple tests to learn the mechanisms, on large volumes. It
> might be easier than rewriting the whole thing and learning later that
> it was not the right way to go. Which, by the way, happens during such
> changes; performance is too often heuristic-like. What is clearly an
> advantage is changes like O(n^2) to O(log n), but we're not talking
> about that now.

The present version of the application is a 'naive simple semantics'
version, which has already been shown to be too slow. The question
now is how to implement a fast version that also might be flexible
enough to be useful in other applications, where speed is not as
much of an issue.

Rune
 
