How to make it faster?


Peter Nilsson

Keith said:
Check your compiler's documentation. It's likely that "-O2"
is not the highest optimization level available. (gcc supports
at least "-O3", plus a number of other options for specific
optimizations that aren't necessarily included in "-ON" for
whatever value of N.)

The way I see it, implementations don't have optimisation
levels per se. They simply allow more techniques to be
applied in the hopes of achieving faster code. Optimisation
remains inherently trial and error. So-called 'higher' settings
can actually produce _less_ efficient code than 'lower'
settings in some cases.
 

Paul Hsieh

I tried using "#define ITER 64". Overflow occurred in this setting.
I got "1.#J". I don't know what "1.#J" stands for?

It's a kind of NaN which is in fact an overflow condition of some
kind. In that case, you should try 4 or 8. But the key is to use a
power of 2. This will allow your compiler to use an "as-if" rule to
reduce the cost of executing the "%" operator. Otherwise you can use
a target counter:

if (i >= tgtI) {
    tgtI += ITER;
    prodn /= prodd;
    prodd = 1.0;
}

That would probably work just as well.
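To illustrate the power-of-two point above: when ITER is a power of two and
i is non-negative, i % ITER is the same as i & (ITER - 1), which is what lets
the compiler drop the division. A minimal sketch (the loop body is only there
to exercise the identity):

#include <stdio.h>

#define ITER 8                       /* a power of two */

int main(void)
{
    unsigned long i;

    for (i = 0; i < 1000; i++) {
        /* For non-negative i these two expressions are equivalent, so a
           compiler can turn the '%' into a cheap bitwise AND. */
        if ((i % ITER) != (i & (ITER - 1)))
            printf("mismatch at %lu\n", i);
    }
    printf("done\n");
    return 0;
}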
[...] When I changed
the type of sump and sumd to long double, the problem disappeared.
The program saved 1 sec because of the larger ITER.

I also tried another way:

#define ITER 64

register double sum = 0.0;

if (i % ITER == 0) {
    sum += log(prodn/prodd); /* introducing log, but not so frequent */
    prodn = 1.0;
    prodd = 1.0;
}

In this way, the running time remained the same: 11 secs before setting
-O2, and 7 secs after setting -O2. However, the return value was a
little different.
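For context, here is a minimal self-contained sketch of that periodic-flush
approach, using the names ITER, prodn, prodd and sum from the snippets above;
the per-iteration terms are made up purely for illustration, since the real
computation isn't shown here:

#include <math.h>
#include <stdio.h>

#define ITER 64
#define N    1000000L

int main(void)
{
    double prodn = 1.0, prodd = 1.0;  /* running numerator/denominator */
    double sum = 0.0;
    long i;

    for (i = 1; i <= N; i++) {
        /* hypothetical terms; the real program computes something else */
        prodn *= 1.0 + 1.0 / (double)i;
        prodd *= 1.0 + 0.5 / (double)i;

        /* every ITER iterations, fold the partial products into the sum
           so prodn and prodd never grow large enough to overflow */
        if (i % ITER == 0) {
            sum += log(prodn / prodd);
            prodn = 1.0;
            prodd = 1.0;
        }
    }
    /* fold in whatever is left over from the last partial block */
    sum += log(prodn / prodd);

    printf("sum = %.15g\n", sum);
    return 0;
}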

Well, I would suggest trying other values of ITER first. Unfortunately,
long double sounds like a larger and likely more expensive data type,
which may well have an impact on the performance of the rest of your
code. So I would recommend against doing that.
 

christian.bau

And what was all that stuff about a = 0, 0.25, 0.5, 0.75?


In the following function foo, the values of arguments g0, g1 and g2
are limited to 0, 0.25, 0.5, or 0.75.  But I guess this property has
no use because exp1 and exp2 change in every call.  However, exp0 does
remain the same for all calls.


How big is sz? There are only 64 possible combinations of the values
of g0, g1, g2. If sz > 64, you can pre-calculate quite a
few things for each of the 64 possible combinations, and instead of
three floating-point values for each i you could pass in a single
unsigned char which specifies which combination is used.
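A hedged sketch of that idea: since each of g0, g1 and g2 can only take four
values, the three of them fit in a 6-bit code (two bits each), and anything
that depends only on (g0, g1, g2) can be precomputed once into a 64-entry
table. What is stored per combination is a placeholder here, not the real
computation:

#include <stdio.h>

/* Each g value is one of 0.0, 0.25, 0.5, 0.75 -> a 2-bit code 0..3. */
static unsigned encode_g(double g)
{
    return (unsigned)(g * 4.0);      /* 0.0->0, 0.25->1, 0.5->2, 0.75->3 */
}

/* 4*4*4 = 64 possible (g0, g1, g2) combinations. */
static double table[64];

int main(void)
{
    double exp0 = 1.5;               /* assumed constant across calls */
    unsigned c0, c1, c2;

    /* Precompute once: whatever expensive quantity depends only on
       (g0, g1, g2) and exp0 -- the formula here is just a placeholder. */
    for (c0 = 0; c0 < 4; c0++)
        for (c1 = 0; c1 < 4; c1++)
            for (c2 = 0; c2 < 4; c2++)
                table[(c0 << 4) | (c1 << 2) | c2] =
                    exp0 * (c0 + c1 + c2) * 0.25;

    /* Per element, pass a single code instead of three doubles ... */
    {
        double g0 = 0.25, g1 = 0.75, g2 = 0.0;
        unsigned char code = (unsigned char)
            ((encode_g(g0) << 4) | (encode_g(g1) << 2) | encode_g(g2));
        /* ... and look the precomputed value up instead of recomputing it. */
        printf("precomputed = %g\n", table[code]);
    }
    return 0;
}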
 

christian.bau

Glad to hear it.  Transcendental calls are the worst in terms of
performance.  After that come divides and modulos.  After that come
branch mispredictions.  If you can optimize all that, you will at
least be within 2x-4x of optimal (barring algorithm and IO
improvements) pretty much all the time.

Usually the total cost is computational cost plus cost of data access.
Since the computational cost has by now been reduced by a factor of ten,
it would now be a good idea to look at things like cache usage, but
that would require a bigger picture.

It was mentioned that g is always one of 0.0, 0.25, 0.5 and 0.75.
A trivial change (changing these arrays from double to float) would
save 25% of memory bandwidth. A simple experiment would be to measure
the time of one program run, then do the same number of calls to foo,
but always with the same set of data. If that is a lot faster, then
cache improvements can help.
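A rough way to set up that experiment, sketched with clock(); the real foo()
and its data layout aren't shown in this thread, so the stand-in function,
array sizes and call count below are all assumptions:

#include <stdio.h>
#include <time.h>

#define CALLS 2000000L
#define BIG   (1L << 20)             /* large enough not to stay in cache */

/* Stand-in for the real foo(): just touches three g values. */
static double foo(double a, double b, double c)
{
    return a + 2.0 * b + 3.0 * c;
}

static double g0[BIG], g1[BIG], g2[BIG];   /* hypothetical input arrays */

int main(void)
{
    double acc = 0.0;
    long i;
    clock_t t0, t1, t2;

    t0 = clock();
    for (i = 0; i < CALLS; i++)      /* varied data: streams through memory */
        acc += foo(g0[i & (BIG - 1)], g1[i & (BIG - 1)], g2[i & (BIG - 1)]);
    t1 = clock();
    for (i = 0; i < CALLS; i++)      /* same data every call: cache-resident */
        acc += foo(g0[0], g1[0], g2[0]);
    t2 = clock();

    printf("varied data: %.2f s, same data: %.2f s (acc=%g)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, acc);
    return 0;
}

If the "same data" loop is much faster than the "varied data" loop, memory
access rather than computation is likely the remaining bottleneck.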
 

user923005


If the problem were scaled so that g contained characters 0, 1, 2, 3
(which represent multiples of 0.25), it would reduce the requirement
from a 4-byte float to a 1-byte char per element (assuming a 32-bit
float and an 8-bit char).
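A small sketch of that representation (the array contents here are made up):
each g is stored as a one-byte code 0..3, and the float value is recovered
either from a four-entry table or by multiplying by 0.25f:

#include <stdio.h>

/* The four possible g values, indexed by the 2-bit code 0..3. */
static const float g_value[4] = { 0.0f, 0.25f, 0.5f, 0.75f };

int main(void)
{
    /* Hypothetical data: g stored as 1-byte codes instead of 4-byte floats. */
    unsigned char g_code[8] = { 0, 1, 2, 3, 3, 2, 1, 0 };
    int i;

    for (i = 0; i < 8; i++) {
        float g = g_value[g_code[i]];        /* or: g_code[i] * 0.25f */
        printf("code %u -> g = %.2f\n", (unsigned)g_code[i], g);
    }
    return 0;
}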
 
