Fastest way to get minimum of four integer values?

Discussion in 'C++' started by anders.johansen@gmail.com, Apr 1, 2008.

  1. Guest

    Hi,

    I have a tight loop where I basically do this over and over (millions
    of times):

    int cell = <calculation>
    const int above = <calculation>
    if (cell > above) cell = above;
    const int below = <calculation>
    if (cell > below) cell = below;
    const int trans = <calculation>
    if (cell > trans) cell = trans;

    Is there a faster way to do this?

    Sincerely,
    Anders
     
    , Apr 1, 2008
    #1
    1. Advertising

  2. Guest

    On Apr 1, 1:42 pm, wrote:
    > Hi,
    >
    > I have a tight loop where I basically do this over and over (millions
    > of times):
    >
    > int cell = <calculation>
    > const int above = <calculation>
    > if (cell > above) cell = above;
    > const int below = <calculation>
    > if (cell > below) cell = below;
    > const int trans = <calculation>
    > if (cell > trans) cell = trans;
    >
    > Is there a faster way to do this?
    >
    > Sincerely,
    >   Anders


    the if could be replaced like
    cell=(cell > trans ? trans : cell);

    By replacing i cannot say what would be the improvement. There are
    other factors to be considered.
    moreover, you haven't said anything about the <calculations>. you may
    also want to consider running in parallel. all depends on what you are
    trying to achieve, about which you haven't said much.

    Thanks,
    Balaji.
     
    , Apr 1, 2008
    #2
    1. Advertising

  3. Martin Guest

    On Apr 1, 1:42 pm, wrote:
    > Hi,
    >
    > I have a tight loop where I basically do this over and over (millions
    > of times):
    >
    > int cell = <calculation>
    > const int above = <calculation>
    > if (cell > above) cell = above;
    > const int below = <calculation>
    > if (cell > below) cell = below;
    > const int trans = <calculation>
    > if (cell > trans) cell = trans;
    >
    > Is there a faster way to do this?
    >
    > Sincerely,
    >   Anders


    One obvious improvement is moving all <calculations> (or at least some
    of them) out of the loop, if it's possible. If no - IMO there is no
    considerable improvement.
     
    Martin, Apr 1, 2008
    #3
  4. Kai-Uwe Bux Guest

    wrote:

    > Hi,
    >
    > I have a tight loop where I basically do this over and over (millions
    > of times):
    >
    > int cell = <calculation>
    > const int above = <calculation>
    > if (cell > above) cell = above;
    > const int below = <calculation>
    > if (cell > below) cell = below;
    > const int trans = <calculation>
    > if (cell > trans) cell = trans;
    >
    > Is there a faster way to do this?


    Not from an algorithmic point of view. To find the minimum of four
    values will require computing all of them and three additional comparisons.

    On the other hand, there could be bit-fiddling techniques to find the
    minimum without using branch operations. This can lead to better pipelining
    on some processors. See, for instance:

    www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html

    YMMV


    Best

    Kai-Uwe Bux
     
    Kai-Uwe Bux, Apr 1, 2008
    #4
  5. Lionel B Guest

    On Tue, 01 Apr 2008 01:49:40 -0700, kasthurirangan.balaji wrote:

    > On Apr 1, 1:42 pm, wrote:
    >> Hi,
    >>
    >> I have a tight loop where I basically do this over and over (millions
    >> of times):
    >>
    >> int cell = <calculation>
    >> const int above = <calculation>
    >> if (cell > above) cell = above;
    >> const int below = <calculation>
    >> if (cell > below) cell = below;
    >> const int trans = <calculation>
    >> if (cell > trans) cell = trans;
    >>
    >> Is there a faster way to do this?
    >>
    >> Sincerely,
    >>   Anders

    >
    > the if could be replaced like
    > cell=(cell > trans ? trans : cell);


    Pointless. That would be no faster, is not (to my mind) any clearer and
    potentially involves a redundant self-assignment of cell to itself
    (although an optimising compiler would probably not generate code for
    this).

    [...]

    --
    Lionel B
     
    Lionel B, Apr 1, 2008
    #5
  6. In article <fssubs$o9u$>, Lionel B <> wrote:
    >On Tue, 01 Apr 2008 01:49:40 -0700, kasthurirangan.balaji wrote:
    >
    >>
    >> the if could be replaced like
    >> cell=(cell > trans ? trans : cell);

    >
    >Pointless. That would be no faster, is not (to my mind) any clearer and
    >potentially involves a redundant self-assignment of cell to itself
    >(although an optimising compiler would probably not generate code for
    >this).
    >


    Agree, any decent compiler should generate the exact same machine code for:

    cell = ( cell > trans ? trans : cell);

    and

    if( cell > trans )
    cell = trans;
    else
    cell = cell;

    The fact that the first one can be written is C on one line is of no
    relevance to the compiler.

    Yan
     
    Yannick Tremblay, Apr 1, 2008
    #6
  7. Puppet_Sock Guest

    On Apr 1, 4:42 am, wrote:
    > I have a tight loop where I basically do this over and over (millions
    > of times):
    >
    > int cell = <calculation>
    > const int above = <calculation>
    > if (cell > above) cell = above;
    > const int below = <calculation>
    > if (cell > below) cell = below;
    > const int trans = <calculation>
    > if (cell > trans) cell = trans;
    >
    > Is there a faster way to do this?


    Question: In the above psuedo code, the various <calculation>
    entries use a lot of time, yes? That is, the various if-greater
    tests are a tiny portion of the overall run-time, yes?
    So, if the calcs happen every time through this loop, you
    probably will find more opportunity for optimizing in those
    calcs rather than the comparatively small potential in
    the if-greater tests.

    Also: Every time somebody starts asking about "make it faster"
    I have a standard response I trot out. Where's your stop-watch?
    Where's your performance spec? First measure how much time
    various parts of the code take to run. Then look to your spec
    to see if that is acceptable. If you've made the spec, relax!
    And don't sweat the stuff that takes a small part of the time.
    Socks
     
    Puppet_Sock, Apr 1, 2008
    #7
  8. Guest

    First of all, thanks to all who answered.

    I was kinda hoping somebody had a magic bit-twiddeling trick up their
    sleeve for this. Never hurts to ask ;)

    More info follows and responses to some concrete questions.

    On Apr 1, 4:21 pm, Puppet_Sock <> wrote:
    > On Apr 1, 4:42 am, wrote:
    >
    > > I have a tight loop where I basically do this over and over (millions
    > > of times):

    >
    > > int cell = <calculation>
    > > const int above = <calculation>
    > > if (cell > above) cell = above;
    > > const int below = <calculation>
    > > if (cell > below) cell = below;
    > > const int trans = <calculation>
    > > if (cell > trans) cell = trans;

    >
    > > Is there a faster way to do this?

    >
    > Question: In the above psuedo code, the various <calculation>
    > entries use a lot of time, yes? That is, the various if-greater
    > tests are a tiny portion of the overall run-time, yes?
    > So, if the calcs happen every time through this loop, you
    > probably will find more opportunity for optimizing in those
    > calcs rather than the comparatively small potential in
    > the if-greater tests.


    Actually the calculations are mainly table look-ups. It looks like
    this:

    int thismin = 100

    <...>

    <loop starts>

    int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
    distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
    const int above (distmatrix[i-1][2]+1);
    if (cell>above) cell=above;
    const int left (distmatrix[1]+1);
    if (cell>left) cell=left;
    const int trans (
    distmatrix[i-2][0] + 1 +
    (!acceptMatrix[false][source[i-2]][target[1]]) +
    (!acceptMatrix[true][source[i-1]][target[0]])
    );
    if (cell>trans) cell=trans;
    distmatrix[2]=cell;
    if (cell<thismin) thismin=cell;

    <loop ends>

    distmatrix is a 2-D array of int.

    acceptmatrix is a 3D array of bool.

    i and j are unsigned int loop indexes

    source and target are const unsigned char * const

    An additional question is whether the use of named const temporaries
    is detrimental. Should I perhaps instead hand-inline the table
    lookups, and let the compiler find the common expressions and
    optimize, or...

    I don't know how smart the compiler is, but I suspect "not too
    bright". It's Borland C++ Builder version 6, so a 6 years old compiler
    at best.

    > Also: Every time somebody starts asking about "make it faster"
    > I have a standard response I trot out. Where's your stop-watch?
    > Where's your performance spec? First measure how much time
    > various parts of the code take to run. Then look to your spec
    > to see if that is acceptable. If you've made the spec, relax!
    > And don't sweat the stuff that takes a small part of the time.
    > Socks


    The "stop-watch" is outside this algorithm, but currently this code
    fragment dominates execution time. Changing the calculation of "trans"
    from an if/then based structure to the current gobbledygook decreased
    run time by approximately 15%, probably by eliminating a few branches
    and a temporary or two. I compare the output to the original
    implementation for a large dataset, so I am sure that behaviour was
    not changed in any obvious way.I am also pursuing other ways to
    decrease the data set that the algorithm iterates over, but that's an
    entirely different discussion.

    Of course I could just get some hotshot to hand-assy it for me, but
    then I would be locked-in to that particular "magic" implementation.
    Further, I doubt that solution would be as easy to hand-unroll as the
    C/C++ version is.

    Anders
     
    , Apr 1, 2008
    #8
  9. a écrit :
    > First of all, thanks to all who answered.
    >
    > I was kinda hoping somebody had a magic bit-twiddeling trick up their
    > sleeve for this. Never hurts to ask ;)
    >
    > More info follows and responses to some concrete questions.
    > [snip]
    >
    > int thismin = 100
    >
    > <...>
    >
    > <loop starts>
    >
    > int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
    > distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
    > const int above (distmatrix[i-1][2]+1);
    > if (cell>above) cell=above;
    > const int left (distmatrix[1]+1);
    > if (cell>left) cell=left;
    > const int trans (
    > distmatrix[i-2][0] + 1 +
    > (!acceptMatrix[false][source[i-2]][target[1]]) +
    > (!acceptMatrix[true][source[i-1]][target[0]])
    > );


    Funny that you add integers and boolean.
    What value do you expect from a conversion from bool to int ?
    Or is the code incomplete ?


    > if (cell>trans) cell=trans;
    > distmatrix[2]=cell;
    > if (cell<thismin) thismin=cell;
    >
    > <loop ends>
    >
    > distmatrix is a 2-D array of int.
    >
    > acceptmatrix is a 3D array of bool.
    >
    > i and j are unsigned int loop indexes
    >


    I see no j variable in your code.

    > source and target are const unsigned char * const
    >
    > An additional question is whether the use of named const temporaries
    > is detrimental. Should I perhaps instead hand-inline the table
    > lookups, and let the compiler find the common expressions and
    > optimize, or...
    >
    > I don't know how smart the compiler is, but I suspect "not too
    > bright". It's Borland C++ Builder version 6, so a 6 years old compiler
    > at best.


    In this case I suggest that you get a look at the assembly generated by
    the compiler. And depending on the operations that seems consuming, you
    might want to tinker a bit with compiler's directives.

    Here is how I would rewrite the code you gave in order to minimize the
    number of operation but I wouldn't expect to outsmart any modern compiler.

    <loop starts>

    int cell=distmatrix[i-1][1];
    const char source_im1(source[i-1]); //reused later
    const char target_1 (target[1]); //reused later
    if(!acceptMatrix[false][src_im1][target_1])
    { //increment cell only in given case
    ++cell;
    }

    //change here to gain additions - might lose in memory load/unload
    int cell2 (distmatrix[i-1][2]);
    const int left (distmatrix[1]);
    if (cell2>left) cell2=left;

    int trans (distmatrix[i-2][0]);
    if(!acceptMatrix[false][source[i-2]][target[1]])
    {
    ++trans;
    }
    if(!acceptMatrix[true][source_im1][target_1])
    {
    ++trans;
    }
    if (cell2>trans) cell2=trans;
    //addition we add not done before
    ++cell2
    if (cell>cell2) cell=cell2;

    distmatrix[2]=cell;
    if (cell<thismin) thismin=cell;

    <loop ends>

    Michael
     
    Michael DOUBEZ, Apr 2, 2008
    #9
  10. Guest

    Hmmm,

    How about this:

    int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
    distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
    // Newval replaces above. All add one instrs. moved down
    int newval (distmatrix[i-1][2]);
    const int left (distmatrix[1]);
    const int trans (
    distmatrix[i-2][0] +
    (!acceptMatrix[false][source[i-2]][target[1]]) +
    (!acceptMatrix[true][source[i-1]][target[0]])
    );
    if (left < newval) newval = left;
    if (trans < newval) newval = trans;
    // Add one moved here
    if (++newval < cell) cell=newval;
    if (distmatrix[2]=cell < thismin) thismin=cell;

    This saves two "+ 1" instructions, and replaces the remaining one with
    a preincrement. The ++newval could (should?) be moved out of the if
    statement for clearer code.
     
    , Apr 2, 2008
    #10
  11. Daniel Pitts Guest

    wrote:
    > Actually the calculations are mainly table look-ups. It looks like
    > this:
    >
    > int thismin = 100
    >
    > <...>
    >
    > <loop starts>
    >
    > int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
    > distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
    > const int above (distmatrix[i-1][2]+1);
    > if (cell>above) cell=above;
    > const int left (distmatrix[1]+1);
    > if (cell>left) cell=left;
    > const int trans (
    > distmatrix[i-2][0] + 1 +
    > (!acceptMatrix[false][source[i-2]][target[1]]) +
    > (!acceptMatrix[true][source[i-1]][target[0]])
    > );
    > if (cell>trans) cell=trans;
    > distmatrix[2]=cell;
    > if (cell<thismin) thismin=cell;
    >
    > <loop ends>

    Array lookups aren't terribly expensive, but they do have a cost. if
    you can save a reference to distmatrix[i-1] (since its used in a few
    places), that might shave a nanosecond or two. If you have nested
    loops, look for calculations or references you can do outside of the
    inner most loop (or outside of both loops).

    For instance, if "i" is from your outer loop, you could "precalculate"
    distmatrix, distmatrix[i-1], dismatrix[i-2], source[i-1],
    source[i-2], etc...

    Chances are, you'll have more luck finding a better algorithm than
    you'll have shaving significant time off by optimizing this algorithm.
    >
    > distmatrix is a 2-D array of int.
    >
    > acceptmatrix is a 3D array of bool.
    >
    > i and j are unsigned int loop indexes
    >
    > source and target are const unsigned char * const
    >
    > An additional question is whether the use of named const temporaries
    > is detrimental. Should I perhaps instead hand-inline the table
    > lookups, and let the compiler find the common expressions and
    > optimize, or...

    It shouldn't really hurt, they would probably compile to the same thing.
    >
    > I don't know how smart the compiler is, but I suspect "not too
    > bright". It's Borland C++ Builder version 6, so a 6 years old compiler
    > at best.
    >
    >> Also: Every time somebody starts asking about "make it faster"
    >> I have a standard response I trot out. Where's your stop-watch?
    >> Where's your performance spec? First measure how much time
    >> various parts of the code take to run. Then look to your spec
    >> to see if that is acceptable. If you've made the spec, relax!
    >> And don't sweat the stuff that takes a small part of the time.
    >> Socks

    >
    > The "stop-watch" is outside this algorithm, but currently this code
    > fragment dominates execution time. Changing the calculation of "trans"
    > from an if/then based structure to the current gobbledygook decreased
    > run time by approximately 15%, probably by eliminating a few branches
    > and a temporary or two. I compare the output to the original
    > implementation for a large dataset, so I am sure that behaviour was
    > not changed in any obvious way.I am also pursuing other ways to
    > decrease the data set that the algorithm iterates over, but that's an
    > entirely different discussion.
    >
    > Of course I could just get some hotshot to hand-assy it for me, but
    > then I would be locked-in to that particular "magic" implementation.
    > Further, I doubt that solution would be as easy to hand-unroll as the
    > C/C++ version is.
    >
    > Anders



    --
    Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
     
    Daniel Pitts, Apr 6, 2008
    #11
  12. Martin York Guest

    On Apr 6, 2:43 pm, Daniel Pitts
    <> wrote:

    > Array lookups aren't terribly expensive, but they do have a cost. if
    > you can save a reference to distmatrix[i-1] (since its used in a few
    > places), that might shave a nanosecond or two.
    >
    > Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>


    Not sure I buy that. Could be wrong!

    Of course we are getting down to machine specific behavior at this
    point. But accessing a memory location via an array or reference would
    be just as expensive as most memory access instructions (Its been a
    while so things may have changed) allow a major pointer and and offset
    as part of the same instruction.

    Actually the expensive bit is not in the memory retrieval but in
    loading the memory from main memory into the local cache. If your code
    is getting cache misses the processor stall time for this is many
    orders of magnitude greater than any theoretical difference between an
    array verses reference read.

    As a result I think any changes to code for a perceived code
    performance increase must be significant before you consider change to
    the code that makes it less readable (read maintainable) as any
    performance increases on a particular data size may evaporate when
    used on a new data size because of changes in cache traversal/usage.
     
    Martin York, Apr 7, 2008
    #12
  13. ppi Guest

    On Apr 1, 4:42 am, wrote:
    > Hi,
    >
    > I have a tight loop where I basically do this over and over (millions
    > of times):
    >
    > int cell = <calculation>
    > const int above = <calculation>
    > if (cell > above) cell = above;
    > const int below = <calculation>
    > if (cell > below) cell = below;
    > const int trans = <calculation>
    > if (cell > trans) cell = trans;
    >
    > Is there a faster way to do this?
    >
    > Sincerely,
    > Anders


    you may get better results by helping the compiler filling up the
    pipeline:

    int a = cell > above ? above : cell;
    int b = below > trans ? trans : below;
    cell = a < b ? b : a;

    ideally 'a' and can 'b' be computed efficiently: they are both
    independent and cached in registers.
    By doing so you provide the compiler with more information regarding
    the computation, namely that his free to compute a or b whenever it
    wants, in this order or not.
    Keeping on re-assigning the same variable again and again just make
    the program instructions code stuck in the pipeline.
    If you can write something like the previous code, you should notice a
    speed-up.
    This is nice feature off powerpc and AMD/Intel pentium processors.

    You may be interested by:
    - http://www.agner.org/optimize/optimizing_cpp.pdf
    - http://developer.amd.com/TechnicalArticles/Articles/pages/7162004127.aspx
    - http://developer.amd.com/TechnicalArticles/Articles/pages/6212004126.aspx

    Even if the last 2 links are from the amd website, they are still
    relevant for general optimization.

    Unfortunately this means that you do have a good compiler and that the
    code/algorithm you use can actually make this possible.

    Hope it will help,
    -- paulo
     
    ppi, Apr 7, 2008
    #13
  14. James Kanze Guest

    On Apr 7, 2:15 am, Martin York <> wrote:
    > On Apr 6, 2:43 pm, Daniel Pitts


    > <> wrote:
    > > Array lookups aren't terribly expensive, but they do have a cost. if
    > > you can save a reference to distmatrix[i-1] (since its used in a few
    > > places), that might shave a nanosecond or two.


    > > Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>


    > Not sure I buy that. Could be wrong!


    More to the point (and this applies to the last response by ppi
    as well, concerning pipeline stall): the compiler knows this
    better than you, and turning on optimization will normally
    result in the best solution here. Compilers have been caching
    array accesses and common subexpressions for several decades
    now, and all modern compilers know about pipelines, memory
    caches, etc., and take them into account when optimizing. All
    such micro-optimizations do in source code is make it more
    convoluted---more difficult for the human reader to understand,
    and more difficult for the compiler to optimize.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, Apr 8, 2008
    #14
  15. Guest

    Hi,

    Well, I can tell you that manual loop unrolling for one special case
    gave a massive speedup, and it now runs in about 25% of the time it
    used to. I may attempt the approach of helping the compiler bound the
    life of temporaries by using curlies, but initial results indicate
    that this is detrimental to performance. Next I will probably try to
    eliminate named temporaries, on the assumption that the compiler is
    able to identify the references easier withour the aliassing
    introduced by named temps.

    Still, so far I have gotten some nice speedups. They are not related
    to the issue I brought up here, but the solutions are in many cases
    inspired by the discussion, so you have all helped me in various
    ways ;)

    Thank you all!

    Anders
     
    , Apr 9, 2008
    #15
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. William S. Huizinga

    Need elegant way to cast four bytes into a long

    William S. Huizinga, Aug 8, 2003, in forum: Python
    Replies:
    11
    Views:
    522
    Anton Vredegoor
    Aug 11, 2003
  2. Godzilla
    Replies:
    15
    Views:
    550
    jigloo
    Jul 15, 2007
  3. wswilson
    Replies:
    16
    Views:
    467
  4. rlc
    Replies:
    10
    Views:
    1,198
    David Thompson
    Jul 1, 2009
  5. Seebs

    Sixty Four Bit Integer

    Seebs, Sep 15, 2009, in forum: C Programming
    Replies:
    26
    Views:
    740
    Tim Rentsch
    Oct 3, 2009
Loading...

Share This Page