Re: C/C++ speed optimization bible/resources/pointers needed!

Discussion in 'C++' started by galathaea, Jul 27, 2007.

  1. galathaea

    galathaea Guest

    In article <>, wrote:

    !! C/C++ speed optimization bible/resources/pointers needed!
    !!
    !! Hi all,
    !!
    !! I am in the middle of programming to solve an engineering problem
    !! where speed is a huge concern. The project involves lots of
    !! numerical integration, and then there are several loops/levels of
    !! optimization on top of the function evaluation engine. As you probably
    !! know, the key to a successful optimization is a fast underlying
    !! objective function evaluator. The faster it is, the more promising the
    !! optimization result (perhaps a global optimum). However, our project
    !! requires many numerical integrations, which prohibits us from making it
    !! super fast. At the heart of the numerical integration is a smart
    !! integrator and a super-fast integrand function evaluator. Even worse,
    !! our function evaluation is in the complex domain. So the key point is how
    !! to arrange our C/C++ code to make it highly efficient in every aspect.
    !! Could anybody give some advice/pointers on how to improve the speed of
    !! a C/C++ program? How to arrange code? How to make it highly efficient
    !! and super fast? What options do I have if I don't have the luxury of
    !! multi-threading, multi-core or distributed computing? But I do have a
    !! P4 at least. Please recommend some good bibles and resources! Thank
    !! you!

    your best friend is a good optimising compiler

    in my experience
    codewarrior was top dog when metrowerks owned it
    but now it looks like intel has the best reputation

    you have to run benchmarks and profile code execution paths
    but just choosing the right compiler has increased code speed 40% in tight loops for me in the past

    ( in fact - always verify with profiling what needs work
    otherwise you will almost certainly waste time "optimising" nonessential areas )

    avoid unnecessary copies

    that may seem obvious
    but that means all nonfundamental types should be passed as a const reference for inputs
    and you may need to use "expression templates" to prevent certain intermediate copies in looped evaluations

    see "c++ templates" by vandevoorde and josuttis for a description of expression templates
    as applied to matrix computations and related problems
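
    a minimal sketch of both ideas ( a toy of my own - not from the book )
    inputs go by const reference
    and a tiny expression object makes r = a + b + c run in one loop with no temporaries

        #include <vector>
        #include <cstddef>

        // lazy sum node: holds references, evaluates element-wise on demand
        template<class L, class R>
        struct VecSum {
            const L& lhs; const R& rhs;
            VecSum(const L& l, const R& r) : lhs(l), rhs(r) {}
            double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
            std::size_t size() const { return lhs.size(); }
        };

        struct Vec {
            std::vector<double> d;
            explicit Vec(std::size_t n) : d(n) {}
            double  operator[](std::size_t i) const { return d[i]; }
            double& operator[](std::size_t i)       { return d[i]; }
            std::size_t size() const { return d.size(); }
            // assigning from any expression runs one fused loop
            template<class E>
            Vec& operator=(const E& e) {
                for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
                return *this;
            }
        };

        // only Vec and VecSum participate, so operator+ stays contained
        inline VecSum<Vec, Vec> operator+(const Vec& a, const Vec& b)
        { return VecSum<Vec, Vec>(a, b); }

        template<class L, class R>
        inline VecSum<VecSum<L, R>, Vec>
        operator+(const VecSum<L, R>& a, const Vec& b)
        { return VecSum<VecSum<L, R>, Vec>(a, b); }

    with plain value semantics a + b + c builds two whole temporary vectors
    here it builds two tiny proxies instead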

    experiment with unrolling loops
    you don't want to unroll too much and have your code walking all over your cache
    but too little wastes unnecessary time messing with the iterator
    there is usually a sweet spot
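
    a sketch of what i mean ( my own example - and measure it
    because the compiler may already do this at high optimisation levels )

        #include <cstddef>

        // 4x unrolled sum with four independent accumulators
        // ( fewer counter updates and the adds don't serialise )
        double sum4(const double* p, std::size_t n)
        {
            double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                s0 += p[i];
                s1 += p[i + 1];
                s2 += p[i + 2];
                s3 += p[i + 3];
            }
            double s = (s0 + s1) + (s2 + s3);
            for (; i < n; ++i)      // leftover elements
                s += p[i];
            return s;
        }

    note the separate accumulators reorder the floating point adds
    so the last bits of the result can differ - check that is acceptable for your integrals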

    in general
    any calculation that can be done at translation time
    saves you runtime

    the compiler can usually do much of this
    but look at other techniques like metaprogramming if profiling shows you are still wasting a lot of time here

    cf. abrahams and gurtovoy's "c++ template metaprogramming"
    or czarnecki and eisenecker's "generative programming" for similar methods
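
    the classic toy example of translation-time work
    ( the compiler folds the whole thing down to a constant )

        // Factorial<10>::value is computed while compiling
        // nothing of this runs at runtime
        template<unsigned N>
        struct Factorial {
            static const unsigned long value = N * Factorial<N - 1>::value;
        };

        template<>
        struct Factorial<0> {
            static const unsigned long value = 1;
        };

        static const unsigned long f10 = Factorial<10>::value;  // 3628800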

    use the bit width of your processor
    if you have (as you mention with p4) a 32-bit processor
    be wary of data structures that use 8-bit chars (if they are that on your architecture)

    i found once that a cryptographic routine that looped over chars for encryption
    began running 10x faster (not just 4!) when i 32-bitised all the operations
    because address space alignment is precious at the hardware level
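
    the shape of that change looked roughly like this
    ( not the original routine - it assumes the buffer is 4-byte aligned
    and a little-endian machine - you must guarantee both yourself )

        #include <cstddef>
        #include <stdint.h>

        // xor a 32-bit key over a buffer word-at-a-time instead of per char
        void xor_words(unsigned char* buf, std::size_t n, uint32_t key32)
        {
            std::size_t i = 0;
            uint32_t* w = reinterpret_cast<uint32_t*>(buf);
            for (; i + 4 <= n; i += 4)      // 4 chars per iteration
                w[i / 4] ^= key32;
            for (; i < n; ++i)              // byte-by-byte tail
                buf[i] ^= static_cast<unsigned char>(key32 >> (8 * (i % 4)));
        }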

    also
    you may benefit from branch prediction hints
    compilers like gcc have hints you can add to if statements (likely/unlikely)
    that can optimise the most used branches

    i have not found that as useful in my work but it is used extensively in the linux kernel
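
    for reference the usual wrapping looks like this ( gcc-specific )

        // tell gcc which way a branch almost always goes
        // ( the same macros the linux kernel defines )
        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        double safe_reciprocal(double x)
        {
            if (unlikely(x == 0.0))     // rare guard case
                return 0.0;
            return 1.0 / x;             // hot path laid out fall-through
        }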

    and ask the right groups for help
    fortran users are less likely to know c++ tricks than c++ users
    ( probably because they still think it's faster despite the optimisations
    that expression templates allow over it! )

    i hope this helps some

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
    galathaea: prankster, fablist, magician, liar
    galathaea, Jul 27, 2007
    #1

  2. Luna Moon

    Luna Moon Guest

    "galathaea" <> wrote in message
    news:...
    > [full quote of galathaea's reply snipped]



    Indeed, it is very helpful! Thank you!

    I am dealing with doubles and floating-point numbers most of the time, since
    I am doing numerical programming. Do you and the other experts on these boards
    have more to say, and more pointers along those lines?

    Thanks!
    Luna Moon, Jul 27, 2007
    #2

  3. Luna Moon wrote:
    > [..]
    > I am dealing with doubles and floating-point numbers most of the time,
    > since I am doing numerical programming. Do you and the other experts on
    > these boards have more to say, and more pointers along those lines?


    First off, please don't cross-post so wildly. Since I don't read the
    'sci.*' hierarchy, I've removed those from my reply. If you think
    the readers of those are going to be interested, post a summary of
    your findings there later.

    That aside... The secret to optimizing is not doing all those small
    things that you're finding all over the place (loop unrolling, etc.)
    The secret is *not to perform unnecessary work*. That's it. The
    rest is all about how you learn what is unnecessary and what you can
    do to avoid performing it.

    I am not sure you can find it in those books you were recommended.
    I am not sure experimenting will do you good. If you don't know what
    loop to unroll, are you going to unroll each separately, then all
    pairs, then all triples, then... (you get the idea) to find which
    combination buys you the best fraction of a percent of improvement?
    Come on! Use the proper tools to find out what parts of your code
    are in fact bottlenecks, and work on those. Once you use the best
    algorithms and cache as much data as you can, you might want to try
    all those loop unrollings or rearranging your data so you get fewer
    page faults when iterating over the array or containers... But
    without the tools you're shooting in the dark.

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, Jul 27, 2007
    #3
  4. galathaea wrote:

    > you have to run benchmarks and profile code execution paths
    > but just choosing the right compiler has increased code
    > speed 40% in tight loops for me in the past


    I've seen a 500% boost just from adding a compiler option (-mfpmath=sse).
    With this option, gcc uses the SSE1/2 units for floating point calculations
    (SISD, not vector code) instead of traditional x87 code.
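
    For example, on a P4-class machine (so SSE2 is present) something like:

        g++ -O2 -msse2 -mfpmath=sse yourfile.cpp

    should do it (file name is a placeholder). -mfpmath=sse needs the SSE
    instruction set enabled via -msse/-msse2, and -march=pentium4 may buy
    a bit more. Measure, as always.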


    Hendrik vdH
    Hendrik van der Heijden, Jul 27, 2007
    #4
  5. Luna Moon

    Luna Moon Guest

    "Victor Bazarov" <> wrote in message
    news:...
    > Luna Moon wrote:
    >> [..]
    >> I am dealing with doubles and floating-point numbers most of the time,
    >> since I am doing numerical programming. Do you and the other experts on
    >> these boards have more to say, and more pointers along those lines?

    >
    > First off, please don't cross-post so wildly. Since I don't read the
    > 'sci.*' hierarchy, I've removed those from my reply. If you think
    > the readers of those are going to be interested, post a summary of
    > your findings there later.
    >
    > That aside... The secret to optimizing is not doing all those small
    > things that you're finding all over the place (loop unrolling, etc.)
    > The secret is *not to perform unnecessary work*. That's it. The
    > rest is all about how you learn what is unnecessary and what you can
    > do to avoid performing it.
    >


    To tell you the truth, at the algorithm level, I've been working on it for 1+
    year. So we've already pushed in everything we can. Now I just want to make
    sure that at the implementation level, I won't get hurt by not paying
    attention to "small" things. From past experience, not paying
    attention to "small" things can easily make the implementation 1000% slower,
    even with a super smart algorithm. So please focus on giving more detailed
    suggestions on implementation. Thank you!

    > I am not sure you can find it in those books you were recommended.
    > I am not sure experimenting will do you good. If you don't know what
    > loop to unroll, are you going to unroll each separately, then all
    > pairs, then all triples, then... (you get the idea) to find which
    > combination buys you the best fraction of a percent of improvement?


    I do need books and pointers, because I don't even know what loop
    unrolling is.

    > Come on! Use the proper tools to find out what parts of your code
    > are in fact bottlenecks, and work on those.


    How? The details or pointers are what I really need.

    > Once you use the best
    > algorithms and cache as much data as you can, you might want to try
    > all those loop unrollings or rearranging your data so you get fewer
    > page faults when iterating over the array or containers... But
    > without the tools you're shooting in the dark.
    >


    Which tools? Please elaborate... Thanks a lot!
    Luna Moon, Jul 27, 2007
    #5
  6. Luna Moon

    Luna Moon Guest

    "Hendrik van der Heijden" <> wrote in message
    news:46a99140$0$31634$-online.net...
    > galathaea wrote:
    >
    >> you have to run benchmarks and profile code execution paths
    >> but just choosing the right compiler has increased code
    >> speed 40% in tight loops for me in the past

    >
    > I've seen a 500% boost just from adding a compiler option (-mfpmath=sse).
    > With this option, gcc uses SSE1/2 units for floating point calculations
    > (SISD, not vector code) instead of traditional x87 code.
    >
    >
    > Hendrik vdH


    Great! This is exactly what I want... Thanks so much, I will try it out!
    Which compiler?

    We have done everything at the algorithm level; now we just want to make
    sure our data structures, caches, and code organization don't do stupid
    things to slow us down, and we try various tricks to squeeze out further speed.

    Any more pointers? Hopefully there are some books/articles/resources
    somewhere on this planet talking about highly efficient C++ code and
    compilers, etc.
    Luna Moon, Jul 27, 2007
    #6
  7. Luna Moon wrote:
    > "Victor Bazarov" <> wrote in message
    > news:...
    >> Luna Moon wrote:
    >>> [..]
    >>> I am dealing with doubles and floating-point numbers most of the time,
    >>> since I am doing numerical programming. Do you and the other experts on
    >>> these boards have more to say, and more pointers along those lines?

    >> First off, please don't cross-post so wildly. Since I don't read the
    >> 'sci.*' hierarchy, I've removed those from my reply. If you think
    >> the readers of those are going to be interested, post a summary of
    >> your findings there later.
    >>
    >> That aside... The secret to optimizing is not doing all those small
    >> things that you're finding all over the place (loop unrolling, etc.)
    >> The secret is *not to perform unnecessary work*. That's it. The
    >> rest is all about how you learn what is unnecessary and what you can
    >> do to avoid performing it.
    >>

    >
    > To tell you the truth, at the algorithm level, I've been working on it for 1+
    > year. So we've already pushed in everything we can. Now I just want to make
    > sure that at the implementation level, I won't get hurt by not paying
    > attention to "small" things. From past experience, not paying
    > attention to "small" things can easily make the implementation 1000% slower,
    > even with a super smart algorithm. So please focus on giving more detailed
    > suggestions on implementation. Thank you!


    Since we don't know your implementation/architecture/environment, it
    would be hard to give you implementation suggestions beyond broad ones:
    - tune your compiler finely (use options that fit your needs)
    - perhaps try different compilers
    - benchmark to see the impact
    - mark out the parts that are bottlenecks and find a solution
    - avoid too frequent heap allocation (see the sketch below)
    ...
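
    A sketch of the heap allocation point (names here are mine, assumed):
    allocate the workspace once, outside the hot loop, and reuse it,
    instead of growing a fresh buffer inside the integrand evaluator.

        #include <vector>
        #include <cstddef>

        void integrate_many(std::size_t calls, std::size_t n)
        {
            std::vector<double> work(n);        // one allocation, up front
            for (std::size_t c = 0; c < calls; ++c) {
                // fill and consume `work` in place; no per-call new/delete
                for (std::size_t i = 0; i < n; ++i)
                    work[i] = static_cast<double>(c + i);
            }
        }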

    In all cases, you'd better validate the correctness and robustness
    of your implementation before any optimisation (good test coverage helps
    a lot). Nitty-gritty differences between float, double or whatever other
    floating-point representation you might use are irrelevant if you can't
    evaluate the performance and the correctness.

    >> I am not sure you can find it in those books you were recommended.
    >> I am not sure experimenting will do you good. If you don't know what
    >> loop to unroll, are you going to unroll each separately, then all
    >> pairs, then all triples, then... (you get the idea) to find which
    >> combination buys you the best fraction of a percent of improvement?

    >
    > I do need books and pointers, because I don't even know what loop
    > unrolling is.


    Loop unrolling is a solution for a problem you don't have. Find the
    problem first. Don't worry, you'll find plenty :)

    >
    >> Come on! Use the proper tools to find out what parts of your code
    >> are in fact bottlenecks, and work on those.

    >
    > How? The details or pointers are what I really need.
    >
    >> Once you use the best
    >> algorithms and cache as much data as you can, you might want to try
    >> all those loop unrollings or rearranging your data so you get fewer
    >> page faults when iterating over the array or containers... But
    >> without the tools you're shooting in the dark.
    >>

    >
    > Which tools? Please elaborate... Thanks a lot!


    You have been working on it for 1+ year and you've never performed
    any benchmark to see whether a lesser implementation would fit the need, or
    how competing algorithms perform? Or to measure the improvement you
    target compared to existing practice? It must be a terrible algorithm.

    The tools depend on your platform. gprof under Linux is widespread.
    Google for "performance analysis tool".

    Michael
    Michael DOUBEZ, Jul 27, 2007
    #7
  8. Luna Moon wrote:
    > "Hendrik van der Heijden" <> wrote in message
    > news:46a99140$0$31634$-online.net...
    >> galathaea wrote:
    >>
    >>> you have to run benchmarks and profile code execution paths
    >>> but just choosing the right compiler has increased code
    >>> speed 40% in tight loops for me in the past

    >>
    >> I've seen a 500% boost just from adding a compiler option (-mfpmath=sse).
    >> With this option, gcc uses SSE1/2 units for floating point
    >> calculations (SISD, not vector code) instead of traditional x87 code.
    >>
    >>
    >> Hendrik vdH

    >
    > Great! This is exactly what I want... Thanks so much, I will try it
    > out! Which compiler?


    'gcc', didn't you read the sentence?

    >
    > We have done everything at the algorithm level,


    That's what I'd heard many times when folks asked me to find if it
    was possible to improve anything. Then I'd find they have been using
    'std::set<SomeType*>' to collect things "already touched" in an array
    when processing it out of sequence. Some things one can find in some
    codebases are unbelievable.
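
    For the record, the cheap alternative is a sketch like this (names
    are mine, assumed): when the elements live in an array, a flat flag
    vector indexed by position beats a node-based set by a wide margin.

        #include <vector>
        #include <cstddef>

        void process_out_of_order(std::vector<double>& a,
                                  const std::vector<std::size_t>& order)
        {
            std::vector<bool> touched(a.size(), false);
            for (std::size_t k = 0; k < order.size(); ++k) {
                std::size_t i = order[k];
                if (touched[i]) continue;   // O(1) check, no tree walk, no allocation
                touched[i] = true;
                a[i] *= 2.0;                // stand-in for the real processing
            }
        }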

    > now we just want to
    > make sure our data structures, caches, and code organization don't do
    > stupid things to slow us down, and we try various tricks to squeeze out
    > further speed.
    > Any more pointers? Hopefully there are some books/articles/resources
    > somewhere on this planet talking about highly efficient C++ code and
    > compilers, etc.


    Books: "Efficient C++" by Bulka & Mayhew, "High Performance Computing"
    by Wadleigh & Crawford, "Effective" series by Meyers. I am not going
    to suggest Knuth's "TAOCP", it may be too late. But when you have time
    after stopping working on your current stuff, do give it a read, at
    least leaf through the Contents section.

    Tools: Intel VTune, AutomatedQA AQtime.

    Web: www.google.com

    Don't expect to find the instruction manual on how to speed up *your*
    code. Do expect to find recommendations and speculations based on
    a whole lot of assumptions. Nobody knows your code better than you,
    and no thought process can replace measuring. So, do the measuring
    first, then do the thinking.

    V
    --
    Please remove capital 'A's when replying by e-mail
    I do not respond to top-posted replies, please don't ask
    Victor Bazarov, Jul 27, 2007
    #8
  9. "Luna Moon" <> writes:

    > "Victor Bazarov" <> wrote in message
    > news:...
    >
    >> Come on! Use the proper tools to find out what parts of your code
    >> are in fact bottlenecks, and work on those.

    >
    > How? The details or pointers are what I really need.


    Use a profiler to measure your code's performance, then follow the 90/10
    rule. A profiler will tell you what relative amount of time is being spent
    in each part of your code. Most often, you'll find that 90% of the total
    execution time is being spent in 10% of the code, which means that focusing
    your optimization efforts on that 10% will give the best results.

    You'll need to consult your tool vendor's documentation to see how to use
    their profiler - for example, "man gprof" if you're using GCC.
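
    For example, the typical gprof cycle with GCC looks like this (file
    names are placeholders):

        g++ -O2 -pg -o myprog myprog.cpp    # build with profiling hooks
        ./myprog                            # run it; this writes gmon.out
        gprof ./myprog gmon.out > report.txt

    The flat profile at the top of the report shows where the time goes;
    the call graph below it shows who calls the expensive functions.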

    sherm--

    --
    Web Hosting by West Virginians, for West Virginians: http://wv-www.net
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Sherm Pendley, Jul 27, 2007
    #9
  10. In article <f8clov$qh$>,
    "Luna Moon" <> wrote:

    > "Hendrik van der Heijden" <> wrote in message
    > news:46a99140$0$31634$-online.net...
    > > [quoted material and Hendrik's -mfpmath=sse tip snipped]

    >
    > Great! This is exactly what I want... Thanks so much, I will try it out!
    > Which compiler?
    >
    > We have done everything at the algorithm level; now we just want to make
    > sure our data structures, caches, and code organization don't do stupid
    > things to slow us down, and we try various tricks to squeeze out further speed.
    >
    > Any more pointers? Hopefully there are some books/articles/resources
    > somewhere on this planet talking about highly efficient C++ code and
    > compilers, etc.


    Have you profiled your current implementation?

    --
    Michael Press
    Michael Press, Jul 28, 2007
    #10
