Boost.bind and performance penalties?

Discussion in 'C++' started by Rune Allnor, May 25, 2011.

  1. Rune Allnor

    Rune Allnor Guest

    Hi all.

    I have this application which is heavy on computations
    on data scattered in RAM. The task is to generate
    a triangulated surface from a set of scattered (x,y,z)
    data. In this application the memory management is
    a big issue while computations are relatively small
    and uncomplicated, so that run-time would be expected
    to split about 50/50 between memory management and
    computations.

    I expect to see a 50% slash in run-time by rewriting
    the program from vector indexing presently on the form

    std::vector<double> v;
    double x;
    std::vector<size_t> i;
    size_t n;

    x = v[i[n]];

    to pointer dereferencing, _along_the_lines_of_

    std::vector<double> v;
    double x;
    std::vector<double*> i;
    size_t n;

    // initialize the elements of i to point to elements of v
    x = *i[n];
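
    A compilable sketch of the two access patterns (values and sizes are
    made up for illustration; note the stored pointers stay valid only as
    long as v is not resized):

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> v = {1.5, 2.5, 3.5, 4.5};

        // Index semantics: every access is base + offset, then a load.
        std::vector<std::size_t> idx = {3, 0, 2};
        double x1 = v[idx[1]];

        // Pointer semantics: the offset arithmetic is paid once, up front,
        // so each later access is a single dereference.
        std::vector<double*> ptrs;
        for (std::size_t k : idx)
            ptrs.push_back(&v[k]);   // valid only while v is not resized
        double x2 = *ptrs[1];

        std::cout << x1 << " " << x2 << "\n";
    }
    ```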

    The point is that I can't afford much run-time overhead, or
    it will significantly impact the application.

    Now, there are several variants of computations that I can
    do on the data x. I'd like to use function objects or pointers
    to specify how to do the computations, and am thinking of
    using boost.bind to declare these function pointers when
    calling this library.

    What run-time overheads, if any, would be expected when
    using function pointers, as opposed to 'hard-wiring' the
    computations inside the library?

    Rune
     
    Rune Allnor, May 25, 2011
    #1

  2. Rune Allnor

    Edek Guest

    On 05/25/2011 01:56 PM, Rune Allnor wrote:
    > Hi all.

    [snip]
    > The point is that I can't afford much run-time overhead, or
    > it will significantly impact the application.
    >
    > Now, there are several variants of computations that I can
    > do on the data x. I'd like to use function objects or pointers
    > to specify how to do the computations, and am thinking of
    > using boost.bind to declare these function pointers when
    > calling this library.
    >
    > What run-time overheads, if any, would be expected when
    > using function pointers, as opposed to 'hard-wiring' the
    > computations inside the library?


    I guess it depends on how complicated the functions are, in two ways.

    One: if they take long time, the call overhead is _relatively_ small.

    Two: how complicated they are, impacting optimizing possibilities. E.g.
    simple multiplication can be SSE'd ('vectorized') if inlined or
    deduced by the compiler in some other way. Such optimisations cannot be
    done - in most cases - when the call is indirect enough that the
    compiler cannot be sure it knows the called function body. This can be
    significant.

    In general bind() is not light, and even function pointers, or even
    worse pointers to members, should not be called in a tight loop, if only
    for the call itself, let alone missed optimisations.

    These are mostly rules of thumb. If you have a lot of time to spend on
    programming, use expression templates, or inline at least. If you
    cannot, use boost::bind and just don't worry.
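
    A minimal sketch of the trade-off described above: an indirect call
    through a function pointer versus a template parameter the compiler can
    inline. All names are illustrative:

    ```cpp
    #include <iostream>
    #include <vector>

    // Indirect call: the compiler generally cannot inline through the
    // pointer, so the loop body stays a call per element.
    double apply_ptr(const std::vector<double>& v, double (*op)(double)) {
        double sum = 0.0;
        for (double x : v) sum += op(x);
        return sum;
    }

    // Template: the functor type is known at compile time, so the call
    // can be inlined and the loop possibly vectorized.
    template <class Op>
    double apply_tmpl(const std::vector<double>& v, Op op) {
        double sum = 0.0;
        for (double x : v) sum += op(x);
        return sum;
    }

    double twice(double x) { return 2.0 * x; }

    int main() {
        std::vector<double> v = {1.0, 2.0, 3.0};
        std::cout << apply_ptr(v, twice) << " "
                  << apply_tmpl(v, [](double x) { return 2.0 * x; }) << "\n";
    }
    ```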

    HTH
    Edek
     
    Edek, May 25, 2011
    #2

  3. Rune Allnor

    Rune Allnor Guest

    On May 25, 4:45 pm, Edek <> wrote:
    > On 05/25/2011 01:56 PM, Rune Allnor wrote:
    >
    > > Hi all.

    > [snip]
    > > The point is that I can't afford much run-time overhead, or
    > > it will significantly impact the application.

    >
    > > Now, there are several variants of computations that I can
    > > do on the data x. I'd like to use function objects or pointers
    > > to specify how to do the computations, and am thinking of
    > > using boost.bind to declare these function pointers when
    > > calling this library.

    >
    > > What run-time overheads, if any, would be expected when
    > > using function pointers, as opposed to 'hard-wiring' the
    > > computations inside the library?

    >
    > I guess it depends on how complicated the functions are, in two ways.
    >
    > One: if they take long time, the call overhead is _relatively_ small.


    The functions are small. 8-10 trivial math operations.

    > Two: how complicated they are, impacting optimizing possibilities. E.g.
    > simple multiplication can be SSE'd ('vectorized') if inlined or
    > deduced by the compiler in some other way. Such optimisations cannot be
    > done - in most cases - when the call is indirect enough that the
    > compiler cannot be sure it knows the called function body. This can be
    > significant.


    I suppose this is the issue I am worried about.

    > In general bind() is not light, and even function pointers, or even
    > worse pointers to members, should not be called in a tight loop, if only
    > for the call itself, let alone missed optimisations.


    I'll have to think about this. The reason why the question came up
    in the first place, was that I found that I could use some of the
    same functionality I implemented for the (x,y,z) surface data stuff,
    when playing with various GUI features like points on the screen:

    How can I efficiently hide what type of point I play with, GUI or
    measurements, from the data? And how can I access the coordinates
    in the point types effectively without either hard-coding the
    access functions in the library or pay too large a price on
    performance?

    Maybe I can accept to do only one call to fetch the data.
    Maybe I can come up with a specialization for the performance-
    critical stuff. There are several options here.

    > These are mostly rules of thumb. If you have a lot of time to spend on
    > programming, use expression templates, or inline at least. If you
    > cannot, use boost::bind and just don't worry.


    Can't afford not to worry. Run-time is a critical factor here.

    > HTH
    > Edek


    Thanks.

    Rune
     
    Rune Allnor, May 26, 2011
    #3
  4. Rune Allnor

    Edek Guest

    On 05/26/2011 06:35 AM, Rune Allnor wrote:

    >> One: if they take long time, the call overhead is _relatively_ small.

    >
    > The functions are small. 8-10 trivial math operations.


    When I was talking about time, this somehow includes scattered access
    which you mentioned before. Uncached RAM access takes a long time
    compared to a couple of simple ops.

    Which reminds me: I do not understand why replacing vector offset by
    vector of pointers should speed things up. Vector offset boils down to
    pointer + offset.

    You might want to consider a 'manual prefetch' if the vector fits into
    cache and large part of it is used. The rule here is that CPU has a
    hardware prefetcher, which kicks in during sequential access; otherwise,
    manual prefetch might help. Not the first thing to try though, and a bit
    compiler specific or assembly. (I am talking about x86)

    >
    >> Two: how complicated they are, impacting optimizing possibilities. E.g.
    >> simple multiplication can be SSE'd ('vectorized') if inlined or
    >> deduced by the compiler in some other way. Such optimisations cannot be
    >> done - in most cases - when the call is indirect enough that the
    >> compiler cannot be sure it knows the called function body. This can be
    >> significant.

    >
    > I suppose this is the issue I am worried about.


    Make a test and measure.
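
    A minimal measurement harness along these lines (sizes and the measured
    operation are placeholders; absolute timings will vary by machine):

    ```cpp
    #include <chrono>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        // A volume large enough that memory effects show up.
        std::vector<double> v(1000000);
        std::iota(v.begin(), v.end(), 0.0);

        auto t0 = std::chrono::steady_clock::now();
        double sum = std::accumulate(v.begin(), v.end(), 0.0);
        auto t1 = std::chrono::steady_clock::now();

        std::chrono::duration<double, std::milli> ms = t1 - t0;
        std::cout << "sum=" << static_cast<long long>(sum)
                  << " in " << ms.count() << " ms\n";
    }
    ```

    Printing the result keeps the optimizer from removing the measured loop
    entirely.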

    >> In general bind() is not light, and even function pointers, or even
    >> worse pointers to members, should not be called in a tight loop, if only
    >> for the call itself, let alone missed optimisations.

    >
    > I'll have to think about this. The reason why the question came up
    > in the first place, was that I found that I could use some of the
    > same functionality I implemented for the (x,y,z) surface data stuff,
    > when playing with various GUI features like points on the screen:
    >
    > How can I efficiently hide what type of point I play with, GUI or
    > measurements, from the data? And how can I access the coordinates
    > in the point types effectively without either hard-coding the
    > access functions in the library or pay too large a price on
    > performance?


    Inline getters (would that be 'hardcoding'? ;) ), or inline MyPoint
    const& getPoint () const, or inherit from MyPoint, or template<class
    MyOp> doOpOnPoint(MyOp& op) on some point class, or ...
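
    A sketch of the template-on-point-type idea, with hypothetical
    ScreenPoint/DataPoint types standing in for the GUI and measurement
    points; the inline getters compile away:

    ```cpp
    #include <iostream>

    // Two unrelated point types, as in the GUI vs. measurement case.
    struct ScreenPoint {
        int px, py;
        double x() const { return px; }
        double y() const { return py; }
    };
    struct DataPoint {
        double xv, yv;
        double x() const { return xv; }
        double y() const { return yv; }
    };

    // The algorithm is templated on the point type, so the inline getters
    // cost nothing at run time and no point type is hard-coded.
    template <class Point>
    double dot(const Point& a, const Point& b) {
        return a.x() * b.x() + a.y() * b.y();
    }

    int main() {
        ScreenPoint s1{1, 2}, s2{3, 4};
        DataPoint   d1{1.5, 0.5}, d2{2.0, 2.0};
        std::cout << dot(s1, s2) << " " << dot(d1, d2) << "\n";
    }
    ```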

    >
    > Maybe I can accept to do only one call to fetch the data.
    > Maybe I can come up with a specialization for the performance-
    > critical stuff. There are several options here.


    This might be a good idea. Easiest: fetch into a vector, pass the
    vector to the function. The function would be locally working on the
    vector; but then, fetching is an operation too - it is simple, but
    takes some time. It depends; measure a test. If the functions (ops)
    are separate from the data, a wrapper can be made applying an op on a
    vector. I guess the options are limited by what you actually want to
    do with the result.

    >
    >> These are mostly rules of thumb. If you have a lot of time to spend on
    >> programming, use expression templates, or inline at least. If you
    >> cannot, use boost::bind and just don't worry.

    >
    > Can't afford not to worry. Run-time is a critical factor here.


    Still, measure first, make a simple test, measure again. There are more
    things in the CPU than are dreamt of in programming.

    >
    >> HTH
    >> Edek

    >
    > Thanks.
    >
    > Rune
     
    Edek, May 26, 2011
    #4
  5. Rune Allnor

    Rune Allnor Guest

    On May 26, 11:34 am, Edek <> wrote:
    > On 05/26/2011 06:35 AM, Rune Allnor wrote:
    >
    > >> One: if they take long time, the call overhead is _relatively_ small.

    >
    > > The functions are small. 8-10 trivial math operations.

    >
    > When I was talking about time, this somehow includes scattered access
    > which you mentioned before. Uncached RAM access takes a long time
    > compared to a couple of simple ops.
    >
    > Which reminds me: I do not understand why replacing vector offset by
    > vector of pointers should speed things up. Vector offset boils down to
    > pointer + offset.


    Not quite:

    1) ElementPointer = BasePointer + ElementOffset
    2) Element = dereference(ElementPointer)

    while the pointer semantics is only half that:

    A) Element = dereference(ElementPointer)

    In this application the difference matters. The issue is a bit
    more involved than what I posted first (there might be 3-5 levels
    of indirections through pointers to pointers). Maybe rewriting to
    pointer semantics instead of vector semantics saves me 30% run time;
    maybe it saves 60%. I don't know until I rewrite.

    But it's that order of magnitude we are talking about. I don't want
    to blow those savings or more, on an expensive function indirection.

    > You might want to consider a 'manual prefetch' if the vector fits into
    > cache and large part of it is used. The rule here is that CPU has a
    > hardware prefetcher, which kicks in during sequential access; otherwise,
    > manual prefetch might help. Not the first thing to try though, and a bit
    > compiler specific or assembly. (I am talking about x86)


    The problem is that the data points are scattered in RAM.
    Cache organization is at least as hard as the main task,
    which is to organize the data points. Cache organization
    just uses a different criterion.

    [snip]

    > >> These are mostly rules of thumb. If you have a lot of time to spend on
    > >> programming, use expression templates, or inline at least. If you
    > >> cannot, use boost::bind and just don't worry.

    >
    > > Can't afford not to worry. Run-time is a critical factor here.

    >
    > Still, measure first, make a simple test, measure again. There are more
    > things in the CPU than are dreamt of in programming.


    The performance factors only kick in visibly at large volumes.
    No such thing as a 'simple test': either rewrite the whole thing
    or nothing at all.

    Rune
     
    Rune Allnor, May 26, 2011
    #5
  6. Rune Allnor

    Edek Guest


    >> Which reminds me: I do not understand why replacing vector offset by
    >> vector of pointers should speed things up. Vector offset boils down to
    >> pointer + offset.

    >
    > Not quite:
    >
    > 1) ElementPointer = BasePointer + ElementOffset
    > 2) Element = dereference(ElementPointer)
    >
    > while the pointer semantics is only half that:
    >
    > A) Element = dereference(ElementPointer)


    Not quite, if I understand your first post:

    1) Pointer = deref(const_pointer + pointerOffset)
    2) Element = deref(Pointer)

    vs.

    1) ElementOffset = deref (const_pointer + elementOffsetOffset)
    2) Element = deref (const_pointer + ElementOffset)

    ... 1 is sequential, I guess; 2 is scattered. In both cases 2 is
    dependent on 1. deref(pointer + offset) is a single instruction on most
    CPUs if not all, although somehow split into micro-ops - still, a RAM
    fetch is way more than an address addition.

    And it is not even that simple addressing. 2) should be a & + member
    offset. v[off].x that is, or deref ([const_pointer + ElementOffset] +
    const-x-offset).

    >
    > In this application the difference matters. The issue is a bit
    > > more involved than what I posted first (there might be 3-5 levels
    > of indirections through pointers to pointers). Maybe rewriting to
    > pointer semantics instead of vector semantics saves me 30% run time;
    > maybe it saves 60%. I don't know until I rewrite.


    I don't know. I usually think in such cases that the compiler will do
    such things for you, if you do not prevent it: generated assembly does
    not match line-by-line what is in .cpp file. It is a simple
    transformation. The compiler stops further transformations when faced
    with something it does not know for sure. Indirection might be one of
    them, true; a 'blackbox' call is for sure another. I still see no
    significant difference between returning an offset and returning a
    pointer, at least in a loop.

    Making everything a template, or just inline, makes no 'blackbox' calls;
    fewer indirections should help too.

    >
    > But it's that order of magnitude we are talking about. I don't want
    > to blow those savings or more, on an expensive function indirection.


    Probably boost::bind is what you need to avoid. std::for_each or any
    similar template would be better I guess.
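
    A sketch of the contrast, using std::function to stand in for the
    type-erased callable that a stored boost::bind expression typically
    becomes, versus a plain lambda passed straight to std::for_each:

    ```cpp
    #include <algorithm>
    #include <functional>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> v = {1.0, 2.0, 3.0};

        // Type-erased callable: every element costs an indirect call
        // that the compiler cannot see through.
        double sum1 = 0.0;
        std::function<void(double)> erased = [&sum1](double x) { sum1 += x; };
        std::for_each(v.begin(), v.end(), erased);

        // Concrete functor type as template parameter: for_each is
        // instantiated with it, so the body can be inlined.
        double sum2 = 0.0;
        std::for_each(v.begin(), v.end(), [&sum2](double x) { sum2 += x; });

        std::cout << sum1 << " " << sum2 << "\n";
    }
    ```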

    >>>> These are mostly rules of thumb. If you have a lot of time to spend on
    >>>> programming, use expression templates, or inline at least. If you
    >>>> cannot, use boost::bind and just don't worry.

    >>
    >>> Can't afford not to worry. Run-time is a critical factor here.

    >>
    >> Still, measure first, make a simple test, measure again. There are more
    >> things in the CPU than are dreamt of in programming.

    >
    > The performance factors only kick in visibly at large volumes.
    > No such thing as a 'simple test': either rewrite the whole thing
    > or nothing at all.


    Maybe. I use simple tests to learn the mechanisms, on large volumes. It
    might be easier than rewriting the whole thing and learning later that
    it was not the right way to go. Which by the way happens during such
    changes, performance is too often heuristic-like. What is clearly an
    advantage, is changes like O(n^2) to O(log n), but we're not talking
    about that now.

    Edek
     
    Edek, May 26, 2011
    #6
  7. Rune Allnor

    Rune Allnor Guest

    On May 26, 12:53 pm, Edek <> wrote:
    > >> Which reminds me: I do not understand why replacing vector offset by
    > >> vector of pointers should speed things up. Vector offset boils down to
    > >> pointer + offset.

    >
    > > Not quite:

    >
    > > 1) ElementPointer = BasePointer + ElementOffset
    > > 2) Element = dereference(ElementPointer)

    >
    > > while the pointer semantics is only half that:

    >
    > > A) Element = dereference(ElementPointer)

    >
    > Not quite, if I understand your first post:


    Don't take that post too literally; what *really*
    goes on is something like (in vector semantics)

    struct element_t {size_t a; size_t b; size_t c;};
    std::vector<element_t> v;
    size_t n,m;

    m = v[v[v[n].a].b].c;

    where one cannot predict any relation between n and v[n].x.

    I think this can be more efficient if written along the
    lines of something like

    struct Element_t {Element_t* A; Element_t* B; size_t C;};
    std::vector<Element_t> v;
    size_t n,m;

    m = v[n].A->B->C;
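
    A compilable version of the two layouts, with the links wired up by
    hand so the chase is well defined (all values are arbitrary):

    ```cpp
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Index-based: each hop is base + offset, then a load.
    struct element_t { std::size_t a, b, c; };

    // Pointer-based: each hop is a single dereference.
    struct Element_t { Element_t* A; Element_t* B; std::size_t C; };

    int main() {
        std::vector<element_t> v = {
            {1, 0, 10}, {2, 2, 20}, {0, 1, 30}
        };
        std::size_t n = 0;
        // v[0].a == 1, v[1].b == 2, v[2].c == 30
        std::size_t m = v[v[v[n].a].b].c;

        std::vector<Element_t> w(3);
        w[0] = {&w[1], nullptr, 10};
        w[1] = {&w[2], &w[2], 20};
        w[2] = {&w[0], &w[1], 30};
        // Same chase, one dereference per hop.
        std::size_t m2 = w[n].A->B->C;

        std::cout << m << " " << m2 << "\n";
    }
    ```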

    [snip]
    > > In this application the difference matters. The issue is a bit
    > > more involved than what I posted first (there might be 3-5 levels
    > > of indirections through pointers to pointers). Maybe rewriting to
    > > pointer semantics instead of vector semantics saves me 30% run time;
    > > maybe it saves 60%. I don't know until I rewrite.

    >
    > I don't know. I usually think in such cases that the compiler will do
    > such things for you, if you do not prevent it: generated assembly does
    > not match line-by-line what is in .cpp file. It is a simple
    > transformation. The compiler stops further transformations when faced
    > with something it does not know for sure. Indirection might be one of
    > them, true; a 'blackbox' call is for sure another. I still see no
    > significant difference between returning an offset and returning a
    > pointer, at least in a loop.


    Again, my first few posts were maybe too simplistic. The example above
    may still be too simplistic, but it gives some impression of what is
    actually going on.

    > Making everything a template, or just inline, makes no 'blackbox' calls;
    > fewer indirections should help too.


    Can't do that. The nature of the problem to be solved.

    > > But it's that order of magnitude we are talking about. I don't want
    > > to blow those savings or more, on an expensive function indirection.

    >
    > Probably boost::bind is what you need to avoid. std::for_each or any
    > similar template would be better I guess.


    It's two separate problems:

    1) Getting fast indirections
    2) Applying functions without overhead.

    Problem 1 is not really essential for problem 2, other than as
    a demonstration of what kinds of overheads are damaging.

    > >>>> These are mostly rules of thumb. If you have a lot of time to spend on
    > >>>> programming, use expression templates, or inline at least. If you
    > >>>> cannot, use boost::bind and just don't worry.

    >
    > >>> Can't afford not to worry. Run-time is a critical factor here.

    >
    > >> Still, measure first, make a simple test, measure again. There are more
    > >> things in the CPU than are dreamt of in programming.

    >
    > > The performance factors only kick in visibly at large volumes.
    > > No such thing as a 'simple test': either rewrite the whole thing
    > > or nothing at all.

    >
    > Maybe. I use simple tests to learn the mechanisms, on large volumes. It
    > might be easier than rewriting the whole thing and learning later that
    > it was not the right way to go. Which by the way happens during such
    > changes, performance is too often heuristic-like. What is clearly an
    > advantage, is changes like O(n^2) to O(log n), but we're not talking
    > about that now.


    The present version of the application is a 'naive simple semantics'
    version, which has already been shown to be too slow. The question
    now is how to implement a fast version that also might be flexible
    enough to be useful in other applications, where speed is not as
    much of an issue.

    Rune
     
    Rune Allnor, May 26, 2011
    #7
