Help in optimizing branches

J

John Malek

Hi,

I ran a profiler against this complex app that I'm trying to opimize.
This is an application I'm doing to test image processing. Even
though it does a lot of computation, two simple lines take 30% of the
running times! Both these lines are from Intel's OpenCV library.
Note that mhi, mask8u, and mask are arrays with one entry per pixel in
a 640x480 image.

If anyone has any hints on how to optimize this, it would be greatly
appreciated.

Thanks,
John


const int cts = (int&)ts;
for( y = 0; y < mhi->rows; y++ )
{
int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;

for( x = 0; x said:
if( mhi_row[x] == cts && mask8u_row[x] == 0 )
THE LINE ABOVE TAKES 20% of the time


uchar* mask_row = mask->data.ptr + mask->step;
for( i = 1; i said:
mask_row[0] = mask_row[size.width+1] = (uchar)1;
THE LINE ABOVE TAKES 10% of the time
 
G

Gianni Mariani

John said:
Hi,

I ran a profiler against this complex app that I'm trying to opimize.
This is an application I'm doing to test image processing. Even
though it does a lot of computation, two simple lines take 30% of the
running times! Both these lines are from Intel's OpenCV library.
Note that mhi, mask8u, and mask are arrays with one entry per pixel in
a 640x480 image.

If anyone has any hints on how to optimize this, it would be greatly
appreciated.

This question is off-topic for comp.std.c++ - try comp.programming.

However, start by posting the whole routine or send us ap pointer to the
library.

As for some possible answers:-

You could try folding some constants out of the loop but it's possible
that the optimizer has already done this. The other thing is loop
unrolling. Finally you need to look at cache.
const int cts = (int&)ts;
for( y = 0; y < mhi->rows; y++ )
{
int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;
if( mhi_row[x] == cts && mask8u_row[x] == 0 )

THE LINE ABOVE TAKES 20% of the time


uchar* mask_row = mask->data.ptr + mask->step;

add:
uchar* mask_row_plus_width_plus_1 = mask_row + size.width+1;
int step = mask->step;
for( i = 1; i <= size.height; i++, mask_row += mask->step )
// comparison with 0 is faster - i is only used to limit the loop
// count so you can reverse the loop.
replace:

for(
i = size.height;
i >= 0;
i--,
mask_row += step,
mask_row_plus_width_plus_1 += step
)
// you could theoretically also unroll this loop easily
{
mask_row[0] = mask_row[size.width+1] = (uchar)1;

replace:
mask_row_plus_width_plus_1[0] = (uchar)1;
mask_row[0] = (uchar)1;
THE LINE ABOVE TAKES 10% of the time

Both of these seem like cache thrashers depending on the value of
"step". Basically the cache optimizations are changing the algorithm to
limit the memory footprint to "blocks" at a time. Hence the name
"blocking". This may require a change in the data structure. That's
why many image algorithms work with "tiles".
 
N

Nick Savoiu

John Malek said:
Hi,

I ran a profiler against this complex app that I'm trying to opimize.
This is an application I'm doing to test image processing. Even
though it does a lot of computation, two simple lines take 30% of the
running times! Both these lines are from Intel's OpenCV library.
Note that mhi, mask8u, and mask are arrays with one entry per pixel in
a 640x480 image.

If anyone has any hints on how to optimize this, it would be greatly
appreciated.

John,

I don't think you can expect much improvement.

Those lines are in nested loops and will get executed many times. You can
try to do some constant propagation, loop invariant code motions by hand but
probably the compiler can do that too if the code's not too complicated.

Nick
 
D

Dan McLeran

If anyone has any hints on how to optimize this, it would be greatly
appreciated.

Thanks,
John


const int cts = (int&)ts;
for( y = 0; y < mhi->rows; y++ )
{
int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);
uchar* mask8u_row = mask8u->data.ptr + (y+1)*mask8u->step + 1;

for( x = 0; x said:
if( mhi_row[x] == cts && mask8u_row[x] == 0 )
THE LINE ABOVE TAKES 20% of the time


uchar* mask_row = mask->data.ptr + mask->step;
for( i = 1; i said:
mask_row[0] = mask_row[size.width+1] = (uchar)1;
THE LINE ABOVE TAKES 10% of the time

I'm wondering if moving the pointer dereference out of the loop logic
would help. I assume that mhi->cols is not changing within the loop?
Try moving this out of the for loop. The compiler may already be doing
this for you. A quick look at the asm output would tell whether this
could buy you some time.
 
R

Ron Natalie

John Malek said:
int* mhi_row = (int*)(mhi->data.ptr + y*mhi->step);

Just what is mhi->data.ptr? Hopefully, it's something that converts
to int* nicely.

if( mhi_row[x] == cts && mask8u_row[x] == 0 )

Depending on the platform, using pointers might be faster:
if (*mhi_row++ == cts && *mask8u_row++ == 0)
you may need to put the increments elsewhere depending on what the rest
of the code does with these values.
uchar* mask_row = mask->data.ptr + mask->step;
for( i = 1; i said:
mask_row[0] = mask_row[size.width+1] = (uchar)1;

You might precompute size.width+1 into a local variable rather than preforming
the addition each time, this looks invariant to the loop. There's no reason to cast
the literal 1.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top