Strange performance behavior of a mathematical method: why?

Dimitri Ognibene

I've found a difference of about 10% in the execution time of two versions
of this method:

public double[] evaluate(double input[]) {
    double a;
    // Alternatives tried for the input copy:
    // System.arraycopy(input, 0, activation[0], 0, input.length);
    // for (int i = 0; i < input.length; i++)
    //     activation[0][i] = input[i];
    activation[0] = input;

    // Propagate the activations forward through the layers.
    for (int i = 1; i < layers; i++) {
        double activation_col[] = activation[i - 1];   // previous layer's output
        double activation_col_res[] = activation[i];   // this layer's output
        double weight_matr[][] = weight[i - 1];
        for (int j = 0; j < activation_col_res.length; j++) {
            double weight_col[] = weight_matr[j];
            double acc = 0;
            for (int k = 0; k < activation_col.length; k++) {
                // ************* variant goes here *************
            }
            activation_col_res[j] = g(acc);   // apply the activation function
        }
    }
    setChanged();
    notifyObservers();

    return activation[layers - 1];
}

variant 1:

    a = activation_col[k];
    a *= weight_col[k];
    acc += a;

variant 2:

    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

Does anyone have tips on how to always write the faster code?

thanks
Dimitri
 
Diomidis Spinellis

Dimitri said:
I've found a difference of about 10% in the execution time of two versions
of this method: [...]
Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

If you are curious, look at the generated Java bytecodes (use javap). But
the difference isn't important: another compiler, JIT, or processor
architecture could give you different results.
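
For example, assuming the class containing evaluate() is named Network (a
name made up here for illustration) and has been compiled to Network.class:

    javap -c Network

The -c flag disassembles each method to its bytecode, so you can compare
the instruction sequences the two variants produce.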
Does anyone have tips on how to always write the faster code?

- Measure before optimizing
- Focus on algorithmic improvements
 
Dimitri Ognibene

What do you think of this method implementation? Any tips?
Must I move to C++ and assembly to get better performance?
Note that the g method called inside the loop is final, and the compiler
does inline it, or so it appears from the profiling data.
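
(To illustrate: a small final method along these lines is the sort of thing
the JIT inlines readily. The actual g is never shown in this thread; the
sigmoid body here is only a placeholder.)

    // Hypothetical: g's body is not shown in the thread; a sigmoid is assumed.
    // A small final method like this is an easy inlining target for the JIT.
    private final double g(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }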
 
Diomidis Spinellis

Dimitri said:
What do you think of this method implementation? Any tips?
Must I move to C++ and assembly to get better performance?
Note that the g method called inside the loop is final, and the compiler
does inline it, or so it appears from the profiling data.

From what I can see, this particular code manipulates floating-point
numbers. This is one of the (not common) cases where a good Java
compiler and JIT-based runtime system should be able to give you the
same performance as you would achieve with (say) C.

You are unlikely to gain anything from moving to assembly, unless you
can utilize instructions that compilers don't typically support. For
example, I see that your code calculates some type of dot product: if
your data allows it, you could gain by using Intel's SSE SIMD extensions
or AMD's 3DNow. Another possibility would be to have your code executed
by a 3D graphics card's hardware.
 
Dimitri Ognibene

Thanks Diomidis,
do you have any link where I can find such an approach?
Thanks
 
Chris Uppal

Dimitri said:
I've found a difference of about 10% in the execution time of two versions
of this method: [...]
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.

It's very difficult to estimate exactly what will have marginal effects at this
level of optimisation. The JIT is pretty clever, the CPU does lots of
optimisations of its own, cache effects and alignment effects combine to
confound estimation too. I have no idea what the explanation might be here; it
could be an effect of the (JITed form of the) first expression being able to
make better use of multiple arithmetic units within the CPU, but that's nothing
more than a guess.
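
To illustrate that guess, here is a minimal sketch (the method name is made
up) of the kind of transformation that lets two arithmetic units work in
parallel: splitting the accumulation across two independent accumulators
breaks the serial dependency on acc:

    // Hypothetical illustration: two independent accumulators let the CPU
    // overlap the additions instead of serialising them through 'acc'.
    static double dotUnrolled(double[] x, double[] y) {
        double acc0 = 0.0, acc1 = 0.0;
        int k = 0;
        for (; k + 1 < x.length; k += 2) {
            acc0 += x[k] * y[k];
            acc1 += x[k + 1] * y[k + 1];
        }
        if (k < x.length) {               // handle an odd-length tail
            acc0 += x[k] * y[k];
        }
        return acc0 + acc1;
    }

Note that this reassociates the floating-point additions, so the result can
differ from the sequential sum in the last few bits.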

You don't mention whether you are using the -client or -server JVM, so I'm
guessing that you aren't aware of how that choice affects the kind of
optimisations that the JIT will do. Rule of thumb: -server is a good deal
better at optimising arithmetic code. FWIW, on the only micro-benchmark I've
run recently (and one micro-benchmark means almost nothing on its own), the
server JVM was doing essentially the same optimisations as the best C++
compiler I have access to (the one in MS VS.2003).
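
For example, assuming your benchmark's main class is called Benchmark (a
made-up name):

    java -server Benchmark
    java -client Benchmark

The first selects the optimising server JIT, the second the faster-starting
client JIT.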

Similarly, you don't mention what CPU you are running on, so I'm guessing that
you are not aware of how details of chip architecture affect how fast a
specific expression of an algorithm can run. Agner Fog has a fascinating (but
long) guide to optimisation for Pentium family processors on this page:
http://www.agner.org/assem/
which also has some links. If nothing else, it should put you off the idea of
re-implementing in assembly ;-)

-- chris
 
Dimitri Ognibene

Thanks very much Chris,
however I don't want to rewrite it in assembly, but rather build a JNI
interface to math kernel libraries...
I'll look at the links that you and Diomidis suggest, and after some
trials I'll decide.
Thanks again,
Dimitri
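
(A minimal sketch of what such a JNI binding could look like; the class,
method, and library names are made up, and the native side would have to be
implemented in C against whatever kernel library is chosen:)

    // Hypothetical JNI wrapper: the Java side declares a native dot product
    // and loads the shared library that implements it.
    public class NativeMath {
        static {
            System.loadLibrary("nativemath");   // libnativemath.so / nativemath.dll
        }

        // Implemented in C via JNI, e.g. delegating to an optimized BLAS routine.
        public static native double dot(double[] x, double[] y);
    }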
 
Robert Klemme

Dimitri said:
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

How exactly did you measure performance? Did you give the JVM some warm-up
runs before doing the real measurement?

Cheers

robert
 
Dimitri Ognibene

Yes I do,
I've tried to wait a little and these are the "steady" values.
However, I was not using the -server option... I'll post the results
later.
Thanks, Dimitri
 
Dimitri Ognibene

OK, you were right: with the -server option the two versions are about two
times faster, and they have the same speed, 2.70 ms average.
Thanks
 
Robert Klemme

Dimitri said:
Yes I do,
I've tried to wait a little and these are the "steady" values.

What exactly do you mean by "wait a little"? In order for the JVM's
optimization to kick in you have to actually execute the code several
times before you start measuring. Also you should execute every variant
several times to get better measurement accuracy.
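
Something like this minimal harness, for instance (the Network type, method
name, and run counts are arbitrary choices for illustration):

    // Warm up so the JIT compiles evaluate(), then time repeated calls.
    static double averageMillis(Network net, double[] input) {
        for (int i = 0; i < 1000; i++) {       // warm-up: trigger JIT compilation
            net.evaluate(input);
        }
        final int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            net.evaluate(input);
        }
        return (System.nanoTime() - start) / (runs * 1e6);  // mean ms per call
    }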

robert
 
Rob@Bedford

When I was working on some code optimization I read somewhere that
copying global variables to local variables does increase performance.
Don't ask me how or why, but it's a behind-the-scenes thing. I will try to
find the article and post a link.
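
The idea looks like this (a sketch with made-up names; Java has no true
globals, so "global" here means an instance field):

    class Summer {
        private double[] data;                 // the "global" (instance field)

        double sumViaField() {
            double acc = 0;
            for (int i = 0; i < data.length; i++) {
                acc += data[i];                // field load on every access
            }
            return acc;
        }

        double sumViaLocal() {
            double[] local = data;             // hoist the field into a local once
            double acc = 0;
            for (int i = 0; i < local.length; i++) {
                acc += local[i];               // locals map readily to registers
            }
            return acc;
        }
    }

Incidentally, this is exactly what the posted evaluate() already does with
activation_col and weight_col.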
 
Dimitri Ognibene

this paper is very interesting, like other resources on gpgpu.org,
but do you, or any one else, know of an interface for java like brook
for c++ to use the gpu as a vector coprocessor?
Thanks
Dimitri
 
