Fastest way to get minimum of four integer values?

  • Thread starter anders.johansen
  • Start date
A

anders.johansen

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Sincerely,
Anders
 
K

kasthurirangan.balaji

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Sincerely,
  Anders

the if could be replaced like
cell=(cell > trans ? trans : cell);

By replacing i cannot say what would be the improvement. There are
other factors to be considered.
moreover, you haven't said anything about the <calculations>. you may
also want to consider running in parallel. all depends on what you are
trying to achieve, about which you haven't said much.

Thanks,
Balaji.
 
M

Martin

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Sincerely,
  Anders

One obvious improvement is moving all <calculations> (or at least some
of them) out of the loop, if it's possible. If no - IMO there is no
considerable improvement.
 
K

Kai-Uwe Bux

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Not from an algorithmic point of view. To find the minimum of four
values will require computing all of them and three additional comparisons.

On the other hand, there could be bit-fiddling techniques to find the
minimum without using branch operations. This can lead to better pipelining
on some processors. See, for instance:

www.cellperformance.com/articles/2006/04/benefits_to_branch_elimination.html

YMMV


Best

Kai-Uwe Bux
 
L

Lionel B

the if could be replaced like
cell=(cell > trans ? trans : cell);

Pointless. That would be no faster, is not (to my mind) any clearer and
potentially involves a redundant self-assignment of cell to itself
(although an optimising compiler would probably not generate code for
this).

[...]
 
Y

Yannick Tremblay

Pointless. That would be no faster, is not (to my mind) any clearer and
potentially involves a redundant self-assignment of cell to itself
(although an optimising compiler would probably not generate code for
this).

Agree, any decent compiler should generate the exact same machine code for:

cell = ( cell > trans ? trans : cell);

and

if( cell > trans )
cell = trans;
else
cell = cell;

The fact that the first one can be written is C on one line is of no
relevance to the compiler.

Yan
 
P

Puppet_Sock

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Question: In the above psuedo code, the various <calculation>
entries use a lot of time, yes? That is, the various if-greater
tests are a tiny portion of the overall run-time, yes?
So, if the calcs happen every time through this loop, you
probably will find more opportunity for optimizing in those
calcs rather than the comparatively small potential in
the if-greater tests.

Also: Every time somebody starts asking about "make it faster"
I have a standard response I trot out. Where's your stop-watch?
Where's your performance spec? First measure how much time
various parts of the code take to run. Then look to your spec
to see if that is acceptable. If you've made the spec, relax!
And don't sweat the stuff that takes a small part of the time.
Socks
 
A

anders.johansen

First of all, thanks to all who answered.

I was kinda hoping somebody had a magic bit-twiddeling trick up their
sleeve for this. Never hurts to ask ;)

More info follows and responses to some concrete questions.

Question: In the above psuedo code, the various <calculation>
entries use a lot of time, yes? That is, the various if-greater
tests are a tiny portion of the overall run-time, yes?
So, if the calcs happen every time through this loop, you
probably will find more opportunity for optimizing in those
calcs rather than the comparatively small potential in
the if-greater tests.

Actually the calculations are mainly table look-ups. It looks like
this:

int thismin = 100

<...>

<loop starts>

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
const int above (distmatrix[i-1][2]+1);
if (cell>above) cell=above;
const int left (distmatrix[1]+1);
if (cell>left) cell=left;
const int trans (
distmatrix[i-2][0] + 1 +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);
if (cell>trans) cell=trans;
distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

distmatrix is a 2-D array of int.

acceptmatrix is a 3D array of bool.

i and j are unsigned int loop indexes

source and target are const unsigned char * const

An additional question is whether the use of named const temporaries
is detrimental. Should I perhaps instead hand-inline the table
lookups, and let the compiler find the common expressions and
optimize, or...

I don't know how smart the compiler is, but I suspect "not too
bright". It's Borland C++ Builder version 6, so a 6 years old compiler
at best.
Also: Every time somebody starts asking about "make it faster"
I have a standard response I trot out. Where's your stop-watch?
Where's your performance spec? First measure how much time
various parts of the code take to run. Then look to your spec
to see if that is acceptable. If you've made the spec, relax!
And don't sweat the stuff that takes a small part of the time.
Socks

The "stop-watch" is outside this algorithm, but currently this code
fragment dominates execution time. Changing the calculation of "trans"
from an if/then based structure to the current gobbledygook decreased
run time by approximately 15%, probably by eliminating a few branches
and a temporary or two. I compare the output to the original
implementation for a large dataset, so I am sure that behaviour was
not changed in any obvious way.I am also pursuing other ways to
decrease the data set that the algorithm iterates over, but that's an
entirely different discussion.

Of course I could just get some hotshot to hand-assy it for me, but
then I would be locked-in to that particular "magic" implementation.
Further, I doubt that solution would be as easy to hand-unroll as the
C/C++ version is.

Anders
 
M

Michael DOUBEZ

(e-mail address removed) a écrit :
First of all, thanks to all who answered.

I was kinda hoping somebody had a magic bit-twiddeling trick up their
sleeve for this. Never hurts to ask ;)

More info follows and responses to some concrete questions.
[snip]

int thismin = 100

<...>

<loop starts>

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
const int above (distmatrix[i-1][2]+1);
if (cell>above) cell=above;
const int left (distmatrix[1]+1);
if (cell>left) cell=left;
const int trans (
distmatrix[i-2][0] + 1 +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);


Funny that you add integers and boolean.
What value do you expect from a conversion from bool to int ?
Or is the code incomplete ?

if (cell>trans) cell=trans;
distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

distmatrix is a 2-D array of int.

acceptmatrix is a 3D array of bool.

i and j are unsigned int loop indexes


I see no j variable in your code.
source and target are const unsigned char * const

An additional question is whether the use of named const temporaries
is detrimental. Should I perhaps instead hand-inline the table
lookups, and let the compiler find the common expressions and
optimize, or...

I don't know how smart the compiler is, but I suspect "not too
bright". It's Borland C++ Builder version 6, so a 6 years old compiler
at best.

In this case I suggest that you get a look at the assembly generated by
the compiler. And depending on the operations that seems consuming, you
might want to tinker a bit with compiler's directives.

Here is how I would rewrite the code you gave in order to minimize the
number of operation but I wouldn't expect to outsmart any modern compiler.

<loop starts>

int cell=distmatrix[i-1][1];
const char source_im1(source[i-1]); //reused later
const char target_1 (target[1]); //reused later
if(!acceptMatrix[false][src_im1][target_1])
{ //increment cell only in given case
++cell;
}

//change here to gain additions - might lose in memory load/unload
int cell2 (distmatrix[i-1][2]);
const int left (distmatrix[1]);
if (cell2>left) cell2=left;

int trans (distmatrix[i-2][0]);
if(!acceptMatrix[false][source[i-2]][target[1]])
{
++trans;
}
if(!acceptMatrix[true][source_im1][target_1])
{
++trans;
}
if (cell2>trans) cell2=trans;
//addition we add not done before
++cell2
if (cell>cell2) cell=cell2;

distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

Michael
 
A

anders.johansen

Hmmm,

How about this:

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
// Newval replaces above. All add one instrs. moved down
int newval (distmatrix[i-1][2]);
const int left (distmatrix[1]);
const int trans (
distmatrix[i-2][0] +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);
if (left < newval) newval = left;
if (trans < newval) newval = trans;
// Add one moved here
if (++newval < cell) cell=newval;
if (distmatrix[2]=cell < thismin) thismin=cell;

This saves two "+ 1" instructions, and replaces the remaining one with
a preincrement. The ++newval could (should?) be moved out of the if
statement for clearer code.
 
D

Daniel Pitts

Actually the calculations are mainly table look-ups. It looks like
this:

int thismin = 100

<...>

<loop starts>

int cell = (acceptMatrix[false][source[i-1]][target[1]]) ?
distmatrix[i-1][1] : distmatrix[i-1][1]+1 ;
const int above (distmatrix[i-1][2]+1);
if (cell>above) cell=above;
const int left (distmatrix[1]+1);
if (cell>left) cell=left;
const int trans (
distmatrix[i-2][0] + 1 +
(!acceptMatrix[false][source[i-2]][target[1]]) +
(!acceptMatrix[true][source[i-1]][target[0]])
);
if (cell>trans) cell=trans;
distmatrix[2]=cell;
if (cell<thismin) thismin=cell;

<loop ends>

Array lookups aren't terribly expensive, but they do have a cost. if
you can save a reference to distmatrix[i-1] (since its used in a few
places), that might shave a nanosecond or two. If you have nested
loops, look for calculations or references you can do outside of the
inner most loop (or outside of both loops).

For instance, if "i" is from your outer loop, you could "precalculate"
distmatrix, distmatrix[i-1], dismatrix[i-2], source[i-1],
source[i-2], etc...

Chances are, you'll have more luck finding a better algorithm than
you'll have shaving significant time off by optimizing this algorithm.
distmatrix is a 2-D array of int.

acceptmatrix is a 3D array of bool.

i and j are unsigned int loop indexes

source and target are const unsigned char * const

An additional question is whether the use of named const temporaries
is detrimental. Should I perhaps instead hand-inline the table
lookups, and let the compiler find the common expressions and
optimize, or...
It shouldn't really hurt, they would probably compile to the same thing.
 
M

Martin York

Array lookups aren't terribly expensive, but they do have a cost. if
you can save a reference to distmatrix[i-1] (since its used in a few
places), that might shave a nanosecond or two.

Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>

Not sure I buy that. Could be wrong!

Of course we are getting down to machine specific behavior at this
point. But accessing a memory location via an array or reference would
be just as expensive as most memory access instructions (Its been a
while so things may have changed) allow a major pointer and and offset
as part of the same instruction.

Actually the expensive bit is not in the memory retrieval but in
loading the memory from main memory into the local cache. If your code
is getting cache misses the processor stall time for this is many
orders of magnitude greater than any theoretical difference between an
array verses reference read.

As a result I think any changes to code for a perceived code
performance increase must be significant before you consider change to
the code that makes it less readable (read maintainable) as any
performance increases on a particular data size may evaporate when
used on a new data size because of changes in cache traversal/usage.
 
P

ppi

Hi,

I have a tight loop where I basically do this over and over (millions
of times):

int cell = <calculation>
const int above = <calculation>
if (cell > above) cell = above;
const int below = <calculation>
if (cell > below) cell = below;
const int trans = <calculation>
if (cell > trans) cell = trans;

Is there a faster way to do this?

Sincerely,
Anders

you may get better results by helping the compiler filling up the
pipeline:

int a = cell > above ? above : cell;
int b = below > trans ? trans : below;
cell = a < b ? b : a;

ideally 'a' and can 'b' be computed efficiently: they are both
independent and cached in registers.
By doing so you provide the compiler with more information regarding
the computation, namely that his free to compute a or b whenever it
wants, in this order or not.
Keeping on re-assigning the same variable again and again just make
the program instructions code stuck in the pipeline.
If you can write something like the previous code, you should notice a
speed-up.
This is nice feature off powerpc and AMD/Intel pentium processors.

You may be interested by:
- http://www.agner.org/optimize/optimizing_cpp.pdf
- http://developer.amd.com/TechnicalArticles/Articles/pages/7162004127.aspx
- http://developer.amd.com/TechnicalArticles/Articles/pages/6212004126.aspx

Even if the last 2 links are from the amd website, they are still
relevant for general optimization.

Unfortunately this means that you do have a good compiler and that the
code/algorithm you use can actually make this possible.

Hope it will help,
-- paulo
 
J

James Kanze

Array lookups aren't terribly expensive, but they do have a cost. if
you can save a reference to distmatrix[i-1] (since its used in a few
places), that might shave a nanosecond or two.
Daniel Pitts' Tech Blog: <http://virtualinfinity.net/wordpress/>
Not sure I buy that. Could be wrong!

More to the point (and this applies to the last response by ppi
as well, concerning pipeline stall): the compiler knows this
better than you, and turning on optimization will normally
result in the best solution here. Compilers have been caching
array accesses and common subexpressions for several decades
now, and all modern compilers know about pipelines, memory
caches, etc., and take them into account when optimizing. All
such micro-optimizations do in source code is make it more
convoluted---more difficult for the human reader to understand,
and more difficult for the compiler to optimize.
 
A

anders.johansen

Hi,

Well, I can tell you that manual loop unrolling for one special case
gave a massive speedup, and it now runs in about 25% of the time it
used to. I may attempt the approach of helping the compiler bound the
life of temporaries by using curlies, but initial results indicate
that this is detrimental to performance. Next I will probably try to
eliminate named temporaries, on the assumption that the compiler is
able to identify the references easier withour the aliassing
introduced by named temps.

Still, so far I have gotten some nice speedups. They are not related
to the issue I brought up here, but the solutions are in many cases
inspired by the discussion, so you have all helped me in various
ways ;)

Thank you all!

Anders
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,756
Messages
2,569,535
Members
45,008
Latest member
obedient dusk

Latest Threads

Top