Over-optimization


Bonj

When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.
<pseudocode>
#ifdef USE_MY_FUNC
#define testfunc(x) myfunc(x)
#else
#define testfunc(x) runtimefunc(x)
#endif
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = testfunc(argv[1]);
QPC(mid);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = runtimefunc(argv[1]);
QPC(end);
printf("operation took %I64u\n", mid - start);
printf("control took %I64u\n", end - mid);
</pseudocode>
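For reference: QPC() is shorthand for a high-resolution timer read; on Windows it presumably wraps QueryPerformanceCounter (an assumption, since the actual macro isn't shown in the post). A minimal self-contained expansion might look like this:

#include <windows.h>
#include <stdio.h>

/* assumed expansion of the QPC() shorthand used above */
#define QPC(t) QueryPerformanceCounter(&(t))

int main(void)
{
    LARGE_INTEGER start, end;
    volatile long sink = 0;
    long i;

    QPC(start);
    for (i = 0; i < 1000000L; i++)
        sink += i;                /* stand-in for the work being timed */
    QPC(end);

    /* QuadPart holds the 64-bit tick count, hence the %I64u format */
    printf("took %I64u ticks\n",
           (unsigned __int64)(end.QuadPart - start.QuadPart));
    return 0;
}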

I think what is happening is that with /Ox on, the compiler has completely
done away with the first (BIGNUMBER - 1) iterations, knowing that the only
thing that changes is 'dummyvar', etc.
My dilemma is this:
I don't want to force it to "use" the value on every iteration, say by
printing it to the screen, as that would heavily dilute the timing of my
function, i.e. most of the time in each iteration would be spent printing
rather than running the function being tested.
I also don't want to compile with debug settings, as that would make the
compiler completely dumb and could create the illusion that my own function
is better than the standard implementation when perhaps it isn't once my
program is compiled in release mode.

So, my question is: which compiler settings are best, so that the compiler
isn't completely dumb in terms of optimization, but isn't an absolute
smartass either? (i.e. as close as possible to release mode, to give a fair
test, but with the one or more settings that let it eliminate calls entirely
switched off.)
In other words: make the calls that *I'm* telling it to make, but make them
as fast as possible. I'm telling it what I want it actually to *do*, not
just what end result I want.

Hope this makes sense

Thanks
 

Artie Gold

Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.

Please be advised that you're *off topic* for both c.l.c and c.l.c++ (as
it has to do with an implementation rather than either language itself).
Find and read the appropriate FAQs before posting in either group again
(it's the Usenet way).

That said, why not the following:
<pseudocode>
#ifdef USE_MY_FUNC
#define testfunc(x) myfunc(x)
#else
#define testfunc(x) runtimefunc(x)
#endif
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += testfunc(argv[1]);
QPC(mid);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += runtimefunc(argv[1]);
QPC(end);
printf("operation took %I64u\n", mid - start);
printf("control took %I64u\n", end - mid);
</pseudocode>

This would ensure that the function will be called each time through the loop.

I think what is happening is that with /Ox on, the compiler has completely
done away with the first (BIGNUMBER - 1) iterations, knowing that the only
thing that changes is 'dummyvar', etc.
My dilemma is this:
I don't want to force it to "use" the value on every iteration, say by
printing it to the screen, as that would heavily dilute the timing of my
function, i.e. most of the time in each iteration would be spent printing
rather than running the function being tested.
I also don't want to compile with debug settings, as that would make the
compiler completely dumb and could create the illusion that my own function
is better than the standard implementation when perhaps it isn't once my
program is compiled in release mode.

So, my question is: which compiler settings are best, so that the compiler
isn't completely dumb in terms of optimization, but isn't an absolute
smartass either? (i.e. as close as possible to release mode, to give a fair
test, but with the one or more settings that let it eliminate calls entirely
switched off.)
In other words: make the calls that *I'm* telling it to make, but make them
as fast as possible. I'm telling it what I want it actually to *do*, not
just what end result I want.

HTH,
--ag
 

Eric Sosman

Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.
[...]

First, please see Artie Gold's comments on topicality
and his suggestion on how to reduce the optimization.

Second, another method that often defeats aggressive
optimizers is to compile your function separately from the
timing harness. When the compiler processes the timing
program it does not "see" your function and thus can't
detect that it has no side effects, can't (usually) in-line
it, and so on.
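A minimal sketch of that split (the file names, the myfunc name and its string-counting body are hypothetical stand-ins, not the actual function under test):

/* myfunc.c -- built as its own translation unit; hypothetical stand-in body */
#include <stddef.h>

size_t myfunc(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')   /* any work will do; the point is that the compiler */
        n++;               /* building the harness never sees this body        */
    return n;
}

/* harness.c -- only the declaration is visible here, so the compiler cannot
   prove the call is side-effect free or inline it (unless link-time or
   whole-program optimization is enabled) */
#include <stdio.h>
#include <stddef.h>

extern size_t myfunc(const char *s);

int main(int argc, char **argv)
{
    size_t total = 0;
    long i;

    if (argc < 2)
        return 1;
    for (i = 0; i < 1000000L; i++)
        total += myfunc(argv[1]);   /* a real call every iteration */
    printf("%lu\n", (unsigned long)total);
    return 0;
}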

Third, still another technique is to have the timing
program use a function pointer to call the functions being
tested, and to set the pointer's value based on information
not available at compile time. For example,

extern void func1(void);
extern void func2(void);

int main(int argc, char **argv) {
    void (*func)(void) = (argc == 1) ? func1 : func2;
    /* now run your loop, calling `func' each time */
    return 0;
}

Finally, it is often a good idea to run the tested
function at least once before you start the timing loop.
That helps to ensure that its code pages and whatever data
pages it references are memory-resident before you begin;
any paging or address-mapping or MMU adjustments or other
miscellany that occur only on the very first call will be
less likely to pollute your timing results.
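In the original pseudocode that might look roughly like this (QPC, testfunc, dummyvar, BIGNUMBER and argv are the placeholders from the first post; only the untimed warm-up call is new):

dummyvar = testfunc(argv[1]);   /* warm-up: untimed call so code and data
                                   pages are already resident */
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = testfunc(argv[1]);
QPC(mid);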

Obtaining a timing that is accurate and unbiased *and*
useful can be a surprisingly tricky business.
 

Bo Persson

Artie Gold said:
Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.

Please be advised that you're *off topic* for both c.l.c and c.l.c++ (as
it has to do with an implementation rather than either language itself).
Find and read the appropriate FAQs before posting in either group again
(it's the Usenet way).

That said, why not the following:
<pseudocode>
#ifdef USE_MY_FUNC
#define testfunc(x) myfunc(x)
#else
#define testfunc(x) runtimefunc(x)
#endif
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += testfunc(argv[1]);
QPC(mid);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += runtimefunc(argv[1]);
</pseudocode>

This would ensure that the function will be called each time through the loop.

No, you can't be sure.

If the compiler is smart enough to realize that runtimefunc always
returns the same result, it might also be smart enough to multiply that
result by BIGNUMBER.


Bo Persson
 

Bo Persson

Eric Sosman said:
Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.
[...]

First, please see Artie Gold's comments on topicality
and his suggestion on how to reduce the optimization.

Second, another method that often defeats aggressive
optimizers is to compile your function separately from the
timing harness. When the compiler processes the timing
program it does not "see" your function and thus can't
detect that it has no side effects, can't (usually) in-line
it, and so on.

That doesn't work well for compilers with global or link-time
optimizers. MSVC7.x comes to mind...
Third, still another technique is to have the timing
program use a function pointer to call the functions being
tested, and to set the pointer's value based on information
not available at compile time. For example,

extern void func1(void);
extern void func2(void);

int main(int argc, char **argv) {
    void (*func)(void) = (argc == 1) ? func1 : func2;
    /* now run your loop, calling `func' each time */
    return 0;
}

But now you are testing how fast the functions are running when they are
not "properly" optimized. That greatly reduces the value of the test.

Finally, it is often a good idea to run the tested
function at least once before you start the timing loop.
That helps to ensure that its code pages and whatever data
pages it references are memory-resident before you begin;
any paging or address-mapping or MMU adjustments or other
miscellany that occur only on the very first call will be
less likely to pollute your timing results.

This also indicates that the functions might behave differently when run
in a real program. That is where you have to test them eventually.

Obtaining a timing that is accurate and unbiased *and*
useful can be a surprisingly tricky business.

Indeed!


Bo Persson
 

Bonj

Exactly. Only printing or writing to a file would do, and both are too slow,
so they would outweigh the algorithm 99 to 1 and dilute the figures.

Bo Persson said:
Artie Gold said:
Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.

Please be advised that you're *off topic* for both c.l.c and c.l.c++ (as
it has to do with an implementation rather than either language itself).
Find and read the appropriate FAQs before posting in either group again
(it's the Usenet way).

That said, why not the following:
<pseudocode>
#ifdef USE_MY_FUNC
#define testfunc(x) myfunc(x)
#else
#define testfunc(x) runtimefunc(x)
#endif
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += testfunc(argv[1]);
QPC(mid);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += runtimefunc(argv[1]);
</pseudocode>

This would ensure that the function will be called each time through the loop.

No, you can't be sure.

If the compiler is smart enough to realize that runtimefunc always returns
the same result, it might also be smart enough to multiply that result by
BIGNUMBER.


Bo Persson
 

Bonj

Finally, it is often a good idea to run the tested
function at least once before you start the timing loop.
That helps to ensure that its code pages and whatever data
pages it references are memory-resident before you begin;
any paging or address-mapping or MMU adjustments or other
miscellany that occur only on the very first call will be
less likely to pollute your timing results.

That's the purpose of the second operation - the loop from QPC(mid) to
QPC(end) is the control, which should remain approximately the same.
 

Dik T. Winter

That's the purpose of the second operation - the loop from QPC(mid) to
QPC(end) is the control, which should remain approximately the same.

What I do not understand is why you optimise the loop itself. In
general the calling sequence cannot be improved very much. You
should just compile your function fully optimised, and the calling
loops unoptimised.
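With MSVC (which the /Ox switch suggests), one way to sketch that split within a single source file is the optimize pragma, leaving only the timing loop unoptimized while everything else keeps /Ox; myfunc here is a hypothetical stand-in for the function under test:

/* the function under test lives in a file (or region) still compiled with /Ox */
extern unsigned long myfunc(const char *s);   /* hypothetical signature */

#pragma optimize("", off)   /* MSVC: turn optimization off for what follows */
unsigned long run_timing_loop(const char *arg, long iterations)
{
    unsigned long dummy = 0;
    long i;

    for (i = 0; i < iterations; i++)
        dummy += myfunc(arg);   /* the loop and the call are emitted as written */
    return dummy;
}
#pragma optimize("", on)    /* restore the settings from the command line */

The project-level equivalent is simply to compile the harness file with /Od and the file containing the function under test with /Ox.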
 

Peter Nyström

Bonj said:
When benchmarking one of my own functions against the equivalent runtime
function, I found that with /Ox on, the compiler optimized the call to my
function away so that it wasn't called *at all* after the first iteration.
<pseudocode>
#ifdef USE_MY_FUNC
#define testfunc(x) myfunc(x)
#else
#define testfunc(x) runtimefunc(x)
#endif
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = testfunc(argv[1]);
QPC(mid);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = runtimefunc(argv[1]);
QPC(end);
printf("operation took %I64u\n", mid - start);
printf("control took %I64u\n", end - mid);
</pseudocode>

I think what is happening is that with /Ox on, the compiler has completely
done away with the first (BIGNUMBER - 1) iterations, knowing that the only
thing that changes is 'dummyvar', etc.
My dilemma is this:
I don't want to force it to "use" the value on every iteration, say by
printing it to the screen, as that would heavily dilute the timing of my
function, i.e. most of the time in each iteration would be spent printing
rather than running the function being tested.
I also don't want to compile with debug settings, as that would make the
compiler completely dumb and could create the illusion that my own function
is better than the standard implementation when perhaps it isn't once my
program is compiled in release mode.

So, my question is: which compiler settings are best, so that the compiler
isn't completely dumb in terms of optimization, but isn't an absolute
smartass either? (i.e. as close as possible to release mode, to give a fair
test, but with the one or more settings that let it eliminate calls entirely
switched off.)
In other words: make the calls that *I'm* telling it to make, but make them
as fast as possible. I'm telling it what I want it actually to *do*, not
just what end result I want.

Hope this makes sense

Thanks

You can always make 'dummyvar' volatile. Then the compiler cannot remove
any access to it.
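A sketch of that change, reusing the placeholders from the original pseudocode (the type of dummyvar is assumed; the volatile qualifier is the point):

volatile unsigned long dummyvar = 0;   /* every store to dummyvar must now happen */

QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = testfunc(argv[1]);      /* the assignment can no longer be discarded */
QPC(mid);

Note that this guarantees the stores, not necessarily the calls: a compiler that can prove testfunc always returns the same value could still compute it once and merely repeat the store, much as Bo Persson pointed out for the += version.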

//Peter
 

Bonj

Well, I do optimize the function itself.
But I don't want to un-optimize the loop entirely, because that would be
unfair on the runtime function, which is called directly from within the
loop.
Although I suppose I could put it in a separate compilation unit and just
expose it through delegation - the only reason I don't want to do that is
that it would be compiled slightly differently from the way it would be in
real life, though probably not much different speed-wise.
 

websnarf

Try copying argv[1] to a volatile char * variable and use that instead
of argv[1] directly. Then the compiler will not be able to optimize
out the call (but it will also force it to perform a redundant reload
of the address -- so depending on the runtime function you are testing,
this can artificially skew your results).
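One reading of that suggestion, again with the thread's placeholders (putting the volatile qualifier on the pointer itself is what forces the address to be reloaded each time):

char * volatile arg = argv[1];   /* the pointer is volatile, so it is re-read
                                    before every call instead of being treated
                                    as a known constant */
QPC(start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar = testfunc(arg);    /* the argument might differ, so the call stays */
QPC(mid);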
 

Dik T. Winter

Well, I do optimize the function itself.
But I don't want to un-optimize the loop entirely, because that would be
unfair on the runtime function, which is called directly from within the
loop.
Although I suppose I could put it in a separate compilation unit and just
expose it through delegation - the only reason I don't want to do that is
that it would be compiled slightly differently from the way it would be in
real life, though probably not much different speed-wise.

On most systems it will not matter whether the function is given in another
unit or in the same unit. Optimising timing loops nearly always leads to
misleading results, especially with the current aggressive optimisers; that
is why those loops should be compiled as written. What you could do to
eliminate the time for the loop itself (including the calls to the function)
is add a loop with a call to an empty function. You can subtract that from
the timings for your real functions, and then you will have a fair comparison.
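A sketch of that baseline measurement (emptyfunc and the base_start/base_end timestamps are made-up names; QPC, dummyvar, BIGNUMBER and argv come from the original pseudocode, and emptyfunc should live in a separately compiled unit so the call itself survives):

/* emptyfunc.c -- does nothing, compiled separately so the call is not removed */
unsigned long emptyfunc(const char *s)
{
    (void)s;
    return 0;
}

/* in the timing harness: measure loop + call overhead alone */
QPC(base_start);
for(long i = 1; i < BIGNUMBER; i++)
    dummyvar += emptyfunc(argv[1]);
QPC(base_end);

/* subtract (base_end - base_start) from each function's measured time to
   approximate the cost of the function body on its own */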
 
