Strange performance behavior of a mathematical method: why?

Dimitri Ognibene

I've found a difference of about 10% in the execution time of two versions
of this method:

public double[] evaluate(double input[]) {
    double a;
    // Alternatives tried for the input copy:
    // System.arraycopy(input, 0, activation[0], 0, input.length);
    // for (int i = 0; i < input.length; i++)
    //     activation[0][i] = input[i];
    activation[0] = input;

    // Propagate the activations forward through the layers.
    for (int i = 1; i < layers; i++) {
        double activation_col[] = activation[i - 1];   // previous layer's output
        double activation_col_res[] = activation[i];   // this layer's output
        double weight_matr[][] = weight[i - 1];
        for (int j = 0; j < activation_col_res.length; j++) {
            double weight_col[] = weight_matr[j];
            double acc = 0;
            for (int k = 0; k < activation_col.length; k++) {
                // ************* variant goes here *************
            }
            activation_col_res[j] = g(acc);   // apply the activation function
        }
    }
    setChanged();
    notifyObservers();

    return activation[layers - 1];
}

variant 1:

    a = activation_col[k];
    a *= weight_col[k];
    acc += a;

variant 2:

    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

Does anyone have tips on how to always write the faster code?

thanks
Dimitri
 
Diomidis Spinellis

Dimitri said:
I've found a difference of about 10% in the execution time of two versions
of this method: [...]
Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

If you are curious, look at the generated Java bytecodes (use javap). But
the difference isn't important: another compiler, JIT, or processor
architecture could give you different results.
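
For example, assuming the class containing evaluate() is named Network (a
name made up here for illustration) and has been compiled to Network.class:

    javap -c Network

The -c flag disassembles each method to its bytecode, so you can compare
the instruction sequences the two variants produce.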
Does anyone have tips on how to always write the faster code?

- Measure before optimizing
- Focus on algorithmic improvements
 
Dimitri Ognibene

What do you think of this method implementation? Any tips?
Must I move to C++ and assembly to get better performance?
Note that the g method called inside the loop is final, and the compiler
does inline it, or so it appears from the profiling data.
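
(To illustrate: a small final method along these lines is the sort of thing
the JIT inlines readily. The actual g is never shown in this thread; the
sigmoid body here is only a placeholder.)

    // Hypothetical: g's body is not shown in the thread; a sigmoid is assumed.
    // A small final method like this is an easy inlining target for the JIT.
    private final double g(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }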
 
Diomidis Spinellis

Dimitri said:
What do you think of this method implementation? Any tips?
Must I move to C++ and assembly to get better performance?
Note that the g method called inside the loop is final, and the compiler
does inline it, or so it appears from the profiling data.

From what I can see, this particular code manipulates floating-point
numbers. This is one of the (not common) cases where a good Java
compiler and JIT-based runtime system should be able to give you the
same performance as you would achieve with (say) C.

You are unlikely to gain anything from moving to assembly, unless you
can utilize instructions that compilers don't typically support. For
example, I see that your code calculates some type of dot product: if
your data allows it, you could gain by using Intel's SSE SIMD extensions
or AMD's 3DNow. Another possibility would be to have your code executed
by a 3D graphics card's hardware.
 
Dimitri Ognibene

Thanks Diomidis,
do you have any link where I can find such an approach?
Thanks
 
Chris Uppal

Dimitri said:
I've found a difference of about 10% in the execution time of two versions
of this method: [...]
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.

It's very difficult to estimate exactly what will have marginal effects at this
level of optimisation. The JIT is pretty clever, the CPU does lots of
optimisations of its own, cache effects and alignment effects combine to
confound estimation too. I have no idea what the explanation might be here; it
could be an effect of the (JITed form of the) first expression being able to
make better use of multiple arithmetic units within the CPU, but that's nothing
more than a guess.
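
To illustrate that guess, here is a minimal sketch (the method name is made
up) of the kind of transformation that lets two arithmetic units work in
parallel: splitting the accumulation across two independent accumulators
breaks the serial dependency on acc:

    // Hypothetical illustration: two independent accumulators let the CPU
    // overlap the additions instead of serialising them through 'acc'.
    static double dotUnrolled(double[] x, double[] y) {
        double acc0 = 0.0, acc1 = 0.0;
        int k = 0;
        for (; k + 1 < x.length; k += 2) {
            acc0 += x[k] * y[k];
            acc1 += x[k + 1] * y[k + 1];
        }
        if (k < x.length) {               // handle an odd-length tail
            acc0 += x[k] * y[k];
        }
        return acc0 + acc1;
    }

Note that this reassociates the floating-point additions, so the result can
differ from the sequential sum in the last few bits.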

You don't mention whether you are using the -client or -server JVM, so I'm
guessing that you aren't aware of how that choice affects the kind of
optimisations that the JIT will do. Rule of thumb: -server is a good deal
better at optimising arithmetic code. FWIW, on the only micro-benchmark I've
run recently (and one micro-benchmark means almost nothing on its own), the
server JVM was doing essentially the same optimisations as the best C++
compiler I have access to (the one in MS VS.2003).
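
For example, assuming your benchmark's main class is called Benchmark (a
made-up name):

    java -server Benchmark
    java -client Benchmark

The first selects the optimising server JIT, the second the faster-starting
client JIT.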

Similarly, you don't mention what CPU you are running on, so I'm guessing that
you are not aware of how details of chip architecture affect how fast a
specific expression of an algorithm can run. Agner Fog has a fascinating (but
long) guide to optimisation for Pentium family processors on this page:
http://www.agner.org/assem/
which also has some links. If nothing else, it should put you off the idea of
re-implementing in assembly ;-)

-- chris
 
Dimitri Ognibene

Thanks very much Chris,
however I don't want to rewrite it in assembly, but rather build a JNI
interface to math kernel libraries...
I'll look at the links that you and Diomidis suggest, and after some
trials I'll decide.
Thanks again,
Dimitri
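
(A minimal sketch of what such a JNI binding could look like; the class,
method, and library names are made up, and the native side would have to be
implemented in C against whatever kernel library is chosen:)

    // Hypothetical JNI wrapper: the Java side declares a native dot product
    // and loads the shared library that implements it.
    public class NativeMath {
        static {
            System.loadLibrary("nativemath");   // libnativemath.so / nativemath.dll
        }

        // Implemented in C via JNI, e.g. delegating to an optimized BLAS routine.
        public static native double dot(double[] x, double[] y);
    }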
 
Robert Klemme

Dimitri said:
variant 1:
    a = activation_col[k];
    a *= weight_col[k];
    acc += a;
variant 2:
    acc += activation_col[k] * weight_col[k];

Variant 1 is 10% faster than variant 2.
I have a weight matrix of about 1200x400 elements.
Variant 1 average: 4.82 ms
Variant 2 average: 5.24 ms

Does this performance difference make any sense?

How exactly did you measure performance? Did you give the JVM some warm-up
runs before doing the real measurement?

Cheers

robert
 
Dimitri Ognibene

Yes I do,
I've tried to wait a little and these are the "steady" values.
However, I was not using the -server option... I'll post the results
later.
Thanks, Dimitri
 
Dimitri Ognibene

OK, you were right: with the -server option the two versions are about two
times faster, and they have the same speed, 2.70 ms average.
Thanks
 
Robert Klemme

Dimitri said:
Yes I do,
I've tried to wait a little and these are the "steady" values.

What exactly do you mean by "wait a little"? In order for the JVM's
optimization to kick in you have to actually execute the code several
times before you start measuring. Also you should execute every variant
several times to get better measurement accuracy.
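
Something like this minimal harness, for instance (the Network type, method
name, and run counts are arbitrary choices for illustration):

    // Warm up so the JIT compiles evaluate(), then time repeated calls.
    static double averageMillis(Network net, double[] input) {
        for (int i = 0; i < 1000; i++) {       // warm-up: trigger JIT compilation
            net.evaluate(input);
        }
        final int runs = 100;
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            net.evaluate(input);
        }
        return (System.nanoTime() - start) / (runs * 1e6);  // mean ms per call
    }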

robert
 
Rob@Bedford

When I was working on some code optimization I read somewhere that
copying global variables to local variables does increase performance.
Don't ask me how or why, but it's a behind-the-scenes thing. I will try to
find the article and post a link.
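
The idea looks like this (a sketch with made-up names; Java has no true
globals, so "global" here means an instance field):

    class Summer {
        private double[] data;                 // the "global" (instance field)

        double sumViaField() {
            double acc = 0;
            for (int i = 0; i < data.length; i++) {
                acc += data[i];                // field load on every access
            }
            return acc;
        }

        double sumViaLocal() {
            double[] local = data;             // hoist the field into a local once
            double acc = 0;
            for (int i = 0; i < local.length; i++) {
                acc += local[i];               // locals map readily to registers
            }
            return acc;
        }
    }

Incidentally, this is exactly what the posted evaluate() already does with
activation_col and weight_col.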
 
Dimitri Ognibene

this paper is very interesting, like other resources on gpgpu.org,
but do you, or any one else, know of an interface for java like brook
for c++ to use the gpu as a vector coprocessor?
Thanks
Dimitri
 
