# Fastest way to get minimum of four integer values?

Discussion in 'C++' started by anders.johansen@gmail.com, Apr 1, 2008.

1. ### Guest

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Sincerely,
Anders

, Apr 1, 2008

2. ### Guest

On Apr 1, 1:42 pm, wrote:
> Hi,
>
> I have a tight loop where I basically do this over and over (millions
> of times):
>
> int cell = <calculation>
> const int above = <calculation>
> if (cell > above) cell = above;
> const int below = <calculation>
> if (cell > below) cell = below;
> const int trans = <calculation>
> if (cell > trans) cell = trans;
>
> Is there a faster way to do this?
>
> Sincerely,
>   Anders

the if could be replaced like
cell=(cell > trans ? trans : cell);

By replacing i cannot say what would be the improvement. There are
other factors to be considered.
moreover, you haven't said anything about the <calculations>. you may
also want to consider running in parallel. all depends on what you are
trying to achieve, about which you haven't said much.

Thanks,
Balaji.

, Apr 1, 2008

3. ### MartinGuest

On Apr 1, 1:42 pm, wrote:
> Hi,
>
> I have a tight loop where I basically do this over and over (millions
> of times):
>
> int cell = <calculation>
> const int above = <calculation>
> if (cell > above) cell = above;
> const int below = <calculation>
> if (cell > below) cell = below;
> const int trans = <calculation>
> if (cell > trans) cell = trans;
>
> Is there a faster way to do this?
>
> Sincerely,
>   Anders

One obvious improvement is moving all <calculations> (or at least some
of them) out of the loop, if it's possible. If no - IMO there is no
considerable improvement.

Martin, Apr 1, 2008
4. ### Kai-Uwe BuxGuest

wrote:

> Hi,
>
> I have a tight loop where I basically do this over and over (millions
> of times):
>
> int cell = <calculation>
> const int above = <calculation>
> if (cell > above) cell = above;
> const int below = <calculation>
> if (cell > below) cell = below;
> const int trans = <calculation>
> if (cell > trans) cell = trans;
>
> Is there a faster way to do this?

Not from an algorithmic point of view. To find the minimum of four
values will require computing all of them and three additional comparisons.

On the other hand, there could be bit-fiddling techniques to find the
minimum without using branch operations. This can lead to better pipelining
on some processors. See, for instance:

www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html

YMMV

Best

Kai-Uwe Bux

Kai-Uwe Bux, Apr 1, 2008
5. ### Lionel BGuest

On Tue, 01 Apr 2008 01:49:40 -0700, kasthurirangan.balaji wrote:

> On Apr 1, 1:42Â pm, wrote:
>> Hi,
>>
>> I have a tight loop where I basically do this over and over (millions
>> of times):
>>
>> int cell = <calculation>
>> const int above = <calculation>
>> if (cell > above) cell = above;
>> const int below = <calculation>
>> if (cell > below) cell = below;
>> const int trans = <calculation>
>> if (cell > trans) cell = trans;
>>
>> Is there a faster way to do this?
>>
>> Sincerely,
>> Â  Anders

>
> the if could be replaced like
> cell=(cell > trans ? trans : cell);

Pointless. That would be no faster, is not (to my mind) any clearer and
potentially involves a redundant self-assignment of cell to itself
(although an optimising compiler would probably not generate code for
this).

[...]

--
Lionel B

Lionel B, Apr 1, 2008
6. ### Yannick TremblayGuest

In article <fssubs\$o9u\$>, Lionel B <> wrote:
>On Tue, 01 Apr 2008 01:49:40 -0700, kasthurirangan.balaji wrote:
>
>>
>> the if could be replaced like
>> cell=(cell > trans ? trans : cell);

>
>Pointless. That would be no faster, is not (to my mind) any clearer and
>potentially involves a redundant self-assignment of cell to itself
>(although an optimising compiler would probably not generate code for
>this).
>

Agree, any decent compiler should generate the exact same machine code for:

cell = ( cell > trans ? trans : cell);

and

if( cell > trans )
cell = trans;
else
cell = cell;

The fact that the first one can be written is C on one line is of no
relevance to the compiler.

Yan

Yannick Tremblay, Apr 1, 2008
7. ### Puppet_SockGuest

On Apr 1, 4:42 am, wrote:
> I have a tight loop where I basically do this over and over (millions
> of times):
>
> int cell = <calculation>
> const int above = <calculation>
> if (cell > above) cell = above;
> const int below = <calculation>
> if (cell > below) cell = below;
> const int trans = <calculation>
> if (cell > trans) cell = trans;
>
> Is there a faster way to do this?

Question: In the above psuedo code, the various <calculation>
entries use a lot of time, yes? That is, the various if-greater
tests are a tiny portion of the overall run-time, yes?
So, if the calcs happen every time through this loop, you
probably will find more opportunity for optimizing in those
calcs rather than the comparatively small potential in
the if-greater tests.

I have a standard response I trot out. Where's your stop-watch?
Where's your performance spec? First measure how much time
various parts of the code take to run. Then look to your spec
to see if that is acceptable. If you've made the spec, relax!
And don't sweat the stuff that takes a small part of the time.
Socks

Puppet_Sock, Apr 1, 2008
8. ### Guest

First of all, thanks to all who answered.

I was kinda hoping somebody had a magic bit-twiddeling trick up their
sleeve for this. Never hurts to ask

On Apr 1, 4:21 pm, Puppet_Sock <> wrote:
> On Apr 1, 4:42 am, wrote:
>
> > I have a tight loop where I basically do this over and over (millions
> > of times):

>
> > int cell = <calculation>
> > const int above = <calculation>
> > if (cell > above) cell = above;
> > const int below = <calculation>
> > if (cell > below) cell = below;
> > const int trans = <calculation>
> > if (cell > trans) cell = trans;

>
> > Is there a faster way to do this?

>
> Question: In the above psuedo code, the various <calculation>
> entries use a lot of time, yes? That is, the various if-greater
> tests are a tiny portion of the overall run-time, yes?
> So, if the calcs happen every time through this loop, you
> probably will find more opportunity for optimizing in those
> calcs rather than the comparatively small potential in
> the if-greater tests.

Actually the calculations are mainly table look-ups. It looks like
this:

int thismin = 100

<...>

<loop starts>

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
const int above (distmatrix[i-1][2]+1);
if (cell>above) cell=above;
const int left (distmatrix[1]+1);
if (cell>left) cell=left;
const int trans (
distmatrix[i-2][0] + 1 +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);
if (cell>trans) cell=trans;
distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

distmatrix is a 2-D array of int.

acceptmatrix is a 3D array of bool.

i and j are unsigned int loop indexes

source and target are const unsigned char * const

An additional question is whether the use of named const temporaries
is detrimental. Should I perhaps instead hand-inline the table
lookups, and let the compiler find the common expressions and
optimize, or...

I don't know how smart the compiler is, but I suspect "not too
bright". It's Borland C++ Builder version 6, so a 6 years old compiler
at best.

> I have a standard response I trot out. Where's your stop-watch?
> Where's your performance spec? First measure how much time
> various parts of the code take to run. Then look to your spec
> to see if that is acceptable. If you've made the spec, relax!
> And don't sweat the stuff that takes a small part of the time.
> Socks

The "stop-watch" is outside this algorithm, but currently this code
fragment dominates execution time. Changing the calculation of "trans"
from an if/then based structure to the current gobbledygook decreased
run time by approximately 15%, probably by eliminating a few branches
and a temporary or two. I compare the output to the original
implementation for a large dataset, so I am sure that behaviour was
not changed in any obvious way.I am also pursuing other ways to
decrease the data set that the algorithm iterates over, but that's an
entirely different discussion.

Of course I could just get some hotshot to hand-assy it for me, but
then I would be locked-in to that particular "magic" implementation.
Further, I doubt that solution would be as easy to hand-unroll as the
C/C++ version is.

Anders

, Apr 1, 2008
9. ### Michael DOUBEZGuest

a écrit :
> First of all, thanks to all who answered.
>
> I was kinda hoping somebody had a magic bit-twiddeling trick up their
> sleeve for this. Never hurts to ask
>
> [snip]
>
> int thismin = 100
>
> <...>
>
> <loop starts>
>
> int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
> distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
> const int above (distmatrix[i-1][2]+1);
> if (cell>above) cell=above;
> const int left (distmatrix[1]+1);
> if (cell>left) cell=left;
> const int trans (
> distmatrix[i-2][0] + 1 +
> (!acceptMatrix[false][source[i-2]][target[1]]) +
> (!acceptMatrix[true][source[i-1]][target[0]])
> );

Funny that you add integers and boolean.
What value do you expect from a conversion from bool to int ?
Or is the code incomplete ?

> if (cell>trans) cell=trans;
> distmatrix[2]=cell;
> if (cell<thismin) thismin=cell;
>
> <loop ends>
>
> distmatrix is a 2-D array of int.
>
> acceptmatrix is a 3D array of bool.
>
> i and j are unsigned int loop indexes
>

I see no j variable in your code.

> source and target are const unsigned char * const
>
> An additional question is whether the use of named const temporaries
> is detrimental. Should I perhaps instead hand-inline the table
> lookups, and let the compiler find the common expressions and
> optimize, or...
>
> I don't know how smart the compiler is, but I suspect "not too
> bright". It's Borland C++ Builder version 6, so a 6 years old compiler
> at best.

In this case I suggest that you get a look at the assembly generated by
the compiler. And depending on the operations that seems consuming, you
might want to tinker a bit with compiler's directives.

Here is how I would rewrite the code you gave in order to minimize the
number of operation but I wouldn't expect to outsmart any modern compiler.

<loop starts>

int cell=distmatrix[i-1][1];
const char source_im1(source[i-1]); //reused later
const char target_1 (target[1]); //reused later
if(!acceptMatrix[false][src_im1][target_1])
{ //increment cell only in given case
++cell;
}

int cell2 (distmatrix[i-1][2]);
const int left (distmatrix[1]);
if (cell2>left) cell2=left;

int trans (distmatrix[i-2][0]);
if(!acceptMatrix[false][source[i-2]][target[1]])
{
++trans;
}
if(!acceptMatrix[true][source_im1][target_1])
{
++trans;
}
if (cell2>trans) cell2=trans;
++cell2
if (cell>cell2) cell=cell2;

distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

Michael

Michael DOUBEZ, Apr 2, 2008
10. ### Guest

Hmmm,

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
// Newval replaces above. All add one instrs. moved down
int newval (distmatrix[i-1][2]);
const int left (distmatrix[1]);
const int trans (
distmatrix[i-2][0] +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);
if (left < newval) newval = left;
if (trans < newval) newval = trans;
if (++newval < cell) cell=newval;
if (distmatrix[2]=cell < thismin) thismin=cell;

This saves two "+ 1" instructions, and replaces the remaining one with
a preincrement. The ++newval could (should?) be moved out of the if
statement for clearer code.

, Apr 2, 2008
11. ### Daniel PittsGuest

wrote:
> Actually the calculations are mainly table look-ups. It looks like
> this:
>
> int thismin = 100
>
> <...>
>
> <loop starts>
>
> int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
> distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
> const int above (distmatrix[i-1][2]+1);
> if (cell>above) cell=above;
> const int left (distmatrix[1]+1);
> if (cell>left) cell=left;
> const int trans (
> distmatrix[i-2][0] + 1 +
> (!acceptMatrix[false][source[i-2]][target[1]]) +
> (!acceptMatrix[true][source[i-1]][target[0]])
> );
> if (cell>trans) cell=trans;
> distmatrix[2]=cell;
> if (cell<thismin) thismin=cell;
>
> <loop ends>

Array lookups aren't terribly expensive, but they do have a cost. if
you can save a reference to distmatrix[i-1] (since its used in a few
places), that might shave a nanosecond or two. If you have nested
loops, look for calculations or references you can do outside of the
inner most loop (or outside of both loops).

For instance, if "i" is from your outer loop, you could "precalculate"
distmatrix, distmatrix[i-1], dismatrix[i-2], source[i-1],
source[i-2], etc...

Chances are, you'll have more luck finding a better algorithm than
you'll have shaving significant time off by optimizing this algorithm.
>
> distmatrix is a 2-D array of int.
>
> acceptmatrix is a 3D array of bool.
>
> i and j are unsigned int loop indexes
>
> source and target are const unsigned char * const
>
> An additional question is whether the use of named const temporaries
> is detrimental. Should I perhaps instead hand-inline the table
> lookups, and let the compiler find the common expressions and
> optimize, or...

It shouldn't really hurt, they would probably compile to the same thing.
>
> I don't know how smart the compiler is, but I suspect "not too
> bright". It's Borland C++ Builder version 6, so a 6 years old compiler
> at best.
>
>> I have a standard response I trot out. Where's your stop-watch?
>> Where's your performance spec? First measure how much time
>> various parts of the code take to run. Then look to your spec
>> to see if that is acceptable. If you've made the spec, relax!
>> And don't sweat the stuff that takes a small part of the time.
>> Socks

>
> The "stop-watch" is outside this algorithm, but currently this code
> fragment dominates execution time. Changing the calculation of "trans"
> from an if/then based structure to the current gobbledygook decreased
> run time by approximately 15%, probably by eliminating a few branches
> and a temporary or two. I compare the output to the original
> implementation for a large dataset, so I am sure that behaviour was
> not changed in any obvious way.I am also pursuing other ways to
> decrease the data set that the algorithm iterates over, but that's an
> entirely different discussion.
>
> Of course I could just get some hotshot to hand-assy it for me, but
> then I would be locked-in to that particular "magic" implementation.
> Further, I doubt that solution would be as easy to hand-unroll as the
> C/C++ version is.
>
> Anders

--
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Daniel Pitts, Apr 6, 2008
12. ### Martin YorkGuest

On Apr 6, 2:43 pm, Daniel Pitts
<> wrote:

> Array lookups aren't terribly expensive, but they do have a cost. if
> you can save a reference to distmatrix[i-1] (since its used in a few
> places), that might shave a nanosecond or two.
>
> Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Not sure I buy that. Could be wrong!

Of course we are getting down to machine specific behavior at this
point. But accessing a memory location via an array or reference would
be just as expensive as most memory access instructions (Its been a
while so things may have changed) allow a major pointer and and offset
as part of the same instruction.

Actually the expensive bit is not in the memory retrieval but in
is getting cache misses the processor stall time for this is many
orders of magnitude greater than any theoretical difference between an

As a result I think any changes to code for a perceived code
performance increase must be significant before you consider change to
performance increases on a particular data size may evaporate when
used on a new data size because of changes in cache traversal/usage.

Martin York, Apr 7, 2008
13. ### ppiGuest

On Apr 1, 4:42 am, wrote:
> Hi,
>
> I have a tight loop where I basically do this over and over (millions
> of times):
>
> int cell = <calculation>
> const int above = <calculation>
> if (cell > above) cell = above;
> const int below = <calculation>
> if (cell > below) cell = below;
> const int trans = <calculation>
> if (cell > trans) cell = trans;
>
> Is there a faster way to do this?
>
> Sincerely,
> Anders

you may get better results by helping the compiler filling up the
pipeline:

int a = cell > above ? above : cell;
int b = below > trans ? trans : below;
cell = a < b ? b : a;

ideally 'a' and can 'b' be computed efficiently: they are both
independent and cached in registers.
the computation, namely that his free to compute a or b whenever it
wants, in this order or not.
Keeping on re-assigning the same variable again and again just make
the program instructions code stuck in the pipeline.
If you can write something like the previous code, you should notice a
speed-up.
This is nice feature off powerpc and AMD/Intel pentium processors.

You may be interested by:
- http://www.agner.org/optimize/optimizing_cpp.pdf
- http://developer.amd.com/TechnicalArticles/Articles/pages/7162004127.aspx
- http://developer.amd.com/TechnicalArticles/Articles/pages/6212004126.aspx

Even if the last 2 links are from the amd website, they are still
relevant for general optimization.

Unfortunately this means that you do have a good compiler and that the
code/algorithm you use can actually make this possible.

Hope it will help,
-- paulo

ppi, Apr 7, 2008
14. ### James KanzeGuest

On Apr 7, 2:15 am, Martin York <> wrote:
> On Apr 6, 2:43 pm, Daniel Pitts

> <> wrote:
> > Array lookups aren't terribly expensive, but they do have a cost. if
> > you can save a reference to distmatrix[i-1] (since its used in a few
> > places), that might shave a nanosecond or two.

> > Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

> Not sure I buy that. Could be wrong!

More to the point (and this applies to the last response by ppi
as well, concerning pipeline stall): the compiler knows this
better than you, and turning on optimization will normally
result in the best solution here. Compilers have been caching
array accesses and common subexpressions for several decades
now, and all modern compilers know about pipelines, memory
caches, etc., and take them into account when optimizing. All
such micro-optimizations do in source code is make it more
convoluted---more difficult for the human reader to understand,
and more difficult for the compiler to optimize.

--
James Kanze (GABI Software) email:
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

James Kanze, Apr 8, 2008
15. ### Guest

Hi,

Well, I can tell you that manual loop unrolling for one special case
gave a massive speedup, and it now runs in about 25% of the time it
used to. I may attempt the approach of helping the compiler bound the
life of temporaries by using curlies, but initial results indicate
that this is detrimental to performance. Next I will probably try to
eliminate named temporaries, on the assumption that the compiler is
able to identify the references easier withour the aliassing
introduced by named temps.

Still, so far I have gotten some nice speedups. They are not related
to the issue I brought up here, but the solutions are in many cases
inspired by the discussion, so you have all helped me in various
ways

Thank you all!

Anders

, Apr 9, 2008