valarray performance

T

Tom

How could adding a local variable to a routine affect the performance
of a valarray in a subsequent call which is also a local.

I have something like the following...
main() {
.....
foo();
}

foo() {
// double a; // if declared then the performance of the valarray
later on is affected, faster, much faster
someFunctionWhichIndirectlyCallsOffendingValarrayCodeManyTimes();
}

OffendingValarrayCode(Widget w) {
valarray<double> params(0., 75);
valarray<double> coeff(0., 75);
for(...) {
// calculate params & coeff filling w/ new values each iter
double r = (params * coeff).sum(); // THIS LINE was the cause
of my grief.
}
}

So I observed several things
1. If I added the local before the call my cumulative run-time to
calculate r is ~14 seconds
2. Without that local declaration the cumulative run-time is about 83
seconds
3. Other combinations of locals could affect as well but I could see
no pattern except time was 83 or 14 seconds
4. Running with a memory bounds/error checker yields no clues.
5. Finally if I correct the calculation of r to the following my
cumulative run-time to calc r went down to 3 seconds
regardless of those locals.
for(size_t k=0, r=0.; k<params.size(); ++k) {
r += params[k] * coeff[k];
}
6. All runs produce identical results which happen to be a matrix 180
x 55 where each cell determined by calculation of r.

So I am not sure how comfortable I should be, is it possible that the
local variable affected something like the cache/paging in some way
which triggered the 83 second run. I have not been able to replicate
this in a simple example so I guess it is possible there is still some
underlying bug but if there were I would think my results would vary.

Thanks,
Tom
 
V

Vaclav Haisman

Tom wrote, On 4.8.2010 15:45:
How could adding a local variable to a routine affect the performance
of a valarray in a subsequent call which is also a local.

[...]

OffendingValarrayCode(Widget w) {
valarray<double> params(0., 75);
valarray<double> coeff(0., 75);
for(...) {
// calculate params & coeff filling w/ new values each iter
double r = (params * coeff).sum(); // THIS LINE was the cause
of my grief.
This creates a temporary each loop and promptly destroys it again. That can
be costly. Try rewriting the code like this:

valarray<double> params(0., 75);
valarray<double> coeff(0., 75);
valarray<double> prod(0., 75);

for(...) {
// calculate params & coeff filling w/ new values each iter
prod = params * coeff;
double r = prod.sum();
}

Or maybe even like this:

for(...) {
// calculate params & coeff filling w/ new values each iter
prod = params;
prod *= coeff;
double r = prod.sum();
}
 
T

Tom

This creates a temporary each loop and promptly destroys it again. That can
be costly. Try  rewriting the code like this:

Yes I realize this but perhaps I was not clear with my solution.
5. Finally if I correct the calculation of r to the following my
cumulative run-time to calc r went down to 3 seconds
regardless of those locals.
for(size_t k=0, r=0.; k<params.size(); ++k) {
r += params[k] * coeff[k];
}
But now I have a very reasonable run time and no explanation of why
some local variable being declared way up the stack
affects the prior code in such a dramatic way.
 
F

Francesco S. Carta

This creates a temporary each loop and promptly destroys it again. That can
be costly. Try rewriting the code like this:

Yes I realize this but perhaps I was not clear with my solution.
5. Finally if I correct the calculation of r to the following my
cumulative run-time to calc r went down to 3 seconds
regardless of those locals.
for(size_t k=0, r=0.; k<params.size(); ++k) {
r += params[k] * coeff[k];
}
But now I have a very reasonable run time and no explanation of why
some local variable being declared way up the stack
affects the prior code in such a dramatic way.

I have no idea about the issue of non-related changes that affect the
performance of that loop - seems something quite weird.

Besides, beware that your "size_t k=0, r=0.;" statement in the for loop
is not changing the value of any preexisting "r" object, but it's
declaring a size_t instance - which, by the way, will not be visible
outside of the loop.

Additionally, to squeeze out the most from that loop, consider storing
the valarray size in a further variable and check that variable against
k. With all optimizations turned on, MinGW 4.4.0 still generates faster
code when the loop doesn't repeatedly call the size() method.
 
T

Tom

I have no idea about the issue of non-related changes that affect the
performance of that loop - seems something quite weird.

Besides, beware that your "size_t k=0, r=0.;" statement in the for loop
is not changing the value of any preexisting "r" object, but it's
declaring a size_t instance - which, by the way, will not be visible
outside of the loop.
Sorry, that should be
double r=0.;
for(size_t k=0; k<params.size(); ++k) {
r += params[k] * coeff[k];
}
Additionally, to squeeze out the most from that loop, consider storing
the valarray size in a further variable and check that variable against
k. With all optimizations turned on, MinGW 4.4.0 still generates faster
code when the loop doesn't repeatedly call the size() method.

I will eliminate the repeating size call, thanks.
 
S

SG

Sorry, that should be
     double r=0.;
     for(size_t k=0; k<params.size(); ++k) {
          r += params[k] * coeff[k];
     }

Right. The problem with valarray is that its nearly useless. Even if
you eliminate all the temporaries and use operators like += and *=
it's still not very efficient. This is because of many tiny loops that
all do very little instead of one big loop that does all the
calculations in one step. Memory access patterns and the cache should
be considered here. See also "expression templates" if you dare. ;-)

Cheers!
SG
 
F

Francesco S. Carta

I have no idea about the issue of non-related changes that affect the
performance of that loop - seems something quite weird.

Besides, beware that your "size_t k=0, r=0.;" statement in the for loop
is not changing the value of any preexisting "r" object, but it's
declaring a size_t instance - which, by the way, will not be visible
outside of the loop.
Sorry, that should be
double r=0.;
for(size_t k=0; k<params.size(); ++k) {
r += params[k] * coeff[k];
}
Additionally, to squeeze out the most from that loop, consider storing
the valarray size in a further variable and check that variable against
k. With all optimizations turned on, MinGW 4.4.0 still generates faster
code when the loop doesn't repeatedly call the size() method.

I will eliminate the repeating size call, thanks.

Digging it further, the most efficient construct with my toolset seems
to be this one:

double r = 0;
size_t count = params.size();
if(count) {
size_t i = 0;
do {
r += params * coeff;
} while(++i < count);
}

Test your code using the above, maybe you'll save some more CPU ticks.

I'm not so sure if the Duff's device could give another boost here... I
think it won't, though I'd have to test it.
 
J

Jonathan Lee

But now I have a very reasonable run time and no explanation of why
some local variable being declared way up the stack
affects the prior code in such a dramatic way.

Can you post a small compilable example that demonstrates this?
I tried the following, but could not create the effect you describe.

--Jonathan

#include <valarray>

void off_val_code() {
std::valarray<double> params(1., 75);
std::valarray<double> coeff(1., 75);
for (int i = 0; i < 1000000; ++i) {
double r = (params * coeff).sum();
if (r > 1.0) r = 1.0 / r;
for (int j = 0; j < 75; ++j) {
params[j] = r;
coeff[j] = sqrt(r);
}
}
}

void foo() {
// double x; // no effect whatsoever
off_val_code();
}

int main() {
foo();
return 0;
}
 
T

Tom Depke

     double r = 0;
     size_t count = params.size();
     if(count) {
         size_t i = 0;
         do {
             r += params * coeff;
         } while(++i < count);
     }

Test your code using the above, maybe you'll save some more CPU ticks.

I just tried this, it went from 3.3 seconds to 2.3, very nice. I think
that the feeling in the
group was that the compiler would take care of these optimizations,
obviously not the case.

Thanks,
Tom
 
F

Francesco S. Carta

double r = 0;
size_t count = params.size();
if(count) {
size_t i = 0;
do {
r += params * coeff;
} while(++i< count);
}

Test your code using the above, maybe you'll save some more CPU ticks.

I just tried this, it went from 3.3 seconds to 2.3, very nice. I think
that the feeling in the
group was that the compiler would take care of these optimizations,
obviously not the case.


Actually, it should be the case - with my compiler it seems to be so - I
realized it again, just now.

Anyway, testing those mods is a practical way to check how much can be
squeezed out and how much the compiler can optimize them for us.

Try turning on all speed optimizations: your original code (including
the repeated call to size()) should perform as well as the mods I
proposed - any minimal difference should be discarded as "performing as
well as", and repeated tests could lead to different "winning" versions
each time, enforcing even more the point above.

If your compiler isn't able to optimize your original code that much,
you can resort to my mod - although in that case the best pick would be
to get a better compiler ;-)
 
T

Tom Depke

If your compiler isn't able to optimize your original code that much,
you can resort to my mod - although in that case the best pick would be
to get a better compiler ;-)
So after playing around with various combination's of optimizations I
got the following:
1. Original code which created the temp valarray from 14 -> 5 seconds
or
83 -> 5 when some combination of variables declared else where but
seemingly innocent
triggers some poor performance in the cache perhaps.
2. My solution which avoids the temp valarray but calls size in for
loop 3.3 -> 1.0 seconds
3. Using Francesco's solution 2.3 -> 1.0

BTW our original compiler VS2008 setting was set to favor speed(O2),
but if I change that to full optimization(Ox)
I get these final results.

Thanks for all the input,
Tom
 
F

Francesco S. Carta

So after playing around with various combination's of optimizations I
got the following:
1. Original code which created the temp valarray from 14 -> 5 seconds
or
83 -> 5 when some combination of variables declared else where but
seemingly innocent
triggers some poor performance in the cache perhaps.
2. My solution which avoids the temp valarray but calls size in for
loop 3.3 -> 1.0 seconds
3. Using Francesco's solution 2.3 -> 1.0

BTW our original compiler VS2008 setting was set to favor speed(O2),
but if I change that to full optimization(Ox)
I get these final results.

Thanks for all the input,
Tom

You're welcome, I'm glad to read that your compiler is good enough at
optimizing the for-loop, because it's way more straightforward and
should be preferred to the do-while-loop.

<OT && self-advisory>
A minor nit: signatures are better snipped off from quotes, but Google
Groups doesn't help you there. You might want to either get a decent
newsreader or snip them out manually. As a last resort, if you happen to
use Firefox, you can install the GreaseMonkey extension and use the
script accessible from the first link in my signature, which snips them
out for you and does some other handy things such as adding your sig and
starting follow-ups as bottom-posts - not that I earn money from it, of
course, it's just that I have been forced to use GG in the past and I
really disliked the experience. That script somehow mitigates the problem.
</OT && self-advisory>

Have fun!
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,432
Messages
2,571,680
Members
48,796
Latest member
Greg L.

Latest Threads

Top