Java loop optimization

star

Hi,
I have a problem optimizing a loop in Java.
My code looks like this:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}

The problem is that m depends on some input data in the application and it can be very big, for example 150000. In that case the loops take 2 hours, which is unacceptable for the situation.
I also can't use native code.
If anyone has any suggestions, please write.
Thanks in advance
 
Ingo R. Homann

Hi,

Perhaps the following will speed things up a bit:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        // Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
        double x = (b[k] - b[j]) / d; // or int or ...?
        Array[j] += Math.exp(-0.5 * x * x) / c;
    }
}

Another possibility might be to use a lookup table instead of Math.exp, or even better for the whole term x -> Math.exp(-0.5*x*x)/c. Of course, it also depends on the precision you need.
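A minimal sketch of that lookup-table idea might look like this (the class name, table size, and range cutoff are illustrative assumptions, not from the thread). It tabulates the whole term exp(-0.5*x*x)/c on a fixed grid and interpolates linearly, using the facts that the term is even in x and effectively zero for large |x|:

```java
// Hypothetical sketch of the lookup-table idea; names and sizes are
// illustrative. Tabulates f(x) = exp(-0.5*x*x)/c on [0, XMAX] and
// interpolates linearly; f is even in x, and negligible for |x| >= 6.
class ExpTable {
    static final double XMAX = 6.0;
    static final int N = 4096;
    static final double STEP = XMAX / N;
    static final double[] TABLE = new double[N + 1];

    static void init(double c) {
        for (int i = 0; i <= N; i++) {
            double x = i * STEP;
            TABLE[i] = Math.exp(-0.5 * x * x) / c;
        }
    }

    static double lookup(double x) {
        double ax = Math.abs(x);        // f(x) == f(-x)
        if (ax >= XMAX) return 0.0;     // tail treated as zero
        double pos = ax / STEP;
        int i = (int) pos;
        // linear interpolation between neighbouring table entries
        return TABLE[i] + (pos - i) * (TABLE[i + 1] - TABLE[i]);
    }
}
```

Whether the table actually beats Math.exp depends on cache behaviour, and as noted above, on the precision needed.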

Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

star said:
In that case the loops take 2 hours, which is unacceptable for the situation.

What would be acceptable? 1 hour? 5 minutes?

Ciao,
Ingo
 
Robert Klemme

Ingo said:
Perhaps the following will speed things up a bit:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        // Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
        double x = (b[k] - b[j]) / d; // or int or ...?
        Array[j] += Math.exp(-0.5 * x * x) / c;
    }
}

Another possibility might be to use a lookup table instead of Math.exp, or even better for the whole term x -> Math.exp(-0.5*x*x)/c. Of course, it also depends on the precision you need.

Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

And you can extract b[k] from the inner loop into the outer loop (if the
JVM doesn't do this already) to save a lot of array bounds checking.
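That hoisting might be sketched as follows (the class name, method name, and signature are assumptions for the example, not from the thread):

```java
// Illustrative sketch of hoisting b[k] out of the inner loop, as
// suggested: b[k] is read (and bounds-checked) once per outer
// iteration instead of once per inner iteration.
class Hoisted {
    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        for (int k = m; --k >= 0;) {
            final double bk = b[k];   // hoisted; final also documents intent
            for (int j = m; --j >= 0;) {
                double x = (bk - b[j]) / d;
                array[j] += Math.exp(-0.5 * x * x) / c;
            }
        }
    }
}
```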

Did you profile this?

robert
 
megagurka

Ingo R. Homann wrote:
Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

Try using floats if the precision is sufficient. Also, rearranging the loops helps avoid the division and some array accesses. Here's the code:

float y = -0.5f / d / d;

for (int j = 0; j < m; j++) {
    float v = 0;
    for (int k = 0; k < m; k++) {
        float x = b[k] - b[j];
        v += Math.exp(y * x * x);
    }
    Array[j] += v / c;
}

The Math.exp() method will consume about 90% of the execution time, so it needs to be replaced if you want to optimize the code further.

/JN
 
Ingo R. Homann

Hi,

Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds checks, AFAIK.

Robert said:
Did you profile this?

No, of course; this is left to the OP. He asked for "any suggestions", not for a complete solution. ;-)

Ciao,
Ingo
 
Patricia Shanahan

star said:
Hi,
I have a problem optimizing a loop in Java.
My code looks like this:
for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}
The problem is that m depends on some input data in the application and it can be very big, for example 150000. In that case the loops take 2 hours, which is unacceptable for the situation.
I also can't use native code.
If anyone has any suggestions, please write.
Thanks in advance

The first question is whether the bottleneck is the arithmetic, or the
memory accesses. Try substituting "Array[j] = b[k]-b[j];". If that is
significantly faster, apply some of the already posted ideas for
improving the arithmetic.

If not, you have a memory access problem.

I'd work with counting up rather than counting down, and separate out the increment instead of combining it with the test, just to make things easier for both me and the compiler. It may even be faster, because it is more idiomatic and hardware prefetchers often work better for forward scans of memory, but it will probably perform about the same.

The simplest change is to just exchange the two loops, so that you get
maximum reuse out of each element of Array once you get it into cache.
However, you may need to do loop blocking, so that you work for a while
with a couple of subranges of b elements, small enough to fit in caches.
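The blocking idea might be sketched like this (the class name, method name, and block size are illustrative assumptions; the right tile size depends on the actual cache sizes):

```java
// Illustrative sketch of loop blocking: process b and the output array
// in tiles so that each tile stays cache-resident while it is reused.
// BLOCK is a guess and would need tuning for the real cache hierarchy.
class Blocked {
    static final int BLOCK = 1024;

    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        double f = -0.5 / (d * d);
        for (int j0 = 0; j0 < m; j0 += BLOCK) {
            int jEnd = Math.min(j0 + BLOCK, m);
            for (int k0 = 0; k0 < m; k0 += BLOCK) {
                int kEnd = Math.min(k0 + BLOCK, m);
                for (int j = j0; j < jEnd; j++) {
                    double bj = b[j];
                    double sum = 0.0;
                    for (int k = k0; k < kEnd; k++) {
                        double x = b[k] - bj;
                        sum += Math.exp(f * x * x);
                    }
                    array[j] += sum / c;   // accumulate one (j, k) tile
                }
            }
        }
    }
}
```

The arithmetic is unchanged; only the traversal order differs, so each subrange of b is reused from cache across a whole tile of output elements.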

Patricia
 
Skip

Ingo R. Homann said:
Hi,

Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the
JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds-checks, AFAIK.

The client VM (up to 1.5) *cannot* do bounds-check removal. Only the server VM is able to do that.
 
Chris Uppal

star said:
for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}

Adding to everyone else's suggestions, you can also move the / c (and an array
access) out of the inner loop, resulting in:

double f = -0.5 / d / d;
for (int j = m; --j >= 0;)
{
    double bj = b[j];
    double sum = 0.0;
    for (int k = m; --k >= 0;)
    {
        double incr = b[k] - bj;
        sum += Math.exp(f * incr * incr);
    }
    output[j] = sum / c;
}

I didn't include Patricia's suggestion of looping forwards since I found it ran marginally slower on my Pentium-based machine. Interestingly, I found that for this problem (with code added to allow the JITer to warm up before running the timing tests) the -client JVM was consistently a little faster than the -server version.

The above code runs somewhat faster -- in my test with m=10000 it ran in around
22 seconds rather than 35. Extrapolating that up to m=150K, suggests an
improvement from 2.2 hours to 1.4 hours -- useful but not sparkling...

The remaining time is completely dominated (~ 96%) by the cost of the m**2
calls to Math.exp() so there's no scope at all for more optimisation that
doesn't involve either cutting down the number of exp() calls, or replacing
them with something faster.

-- chris
 
Skip

Chris Uppal said:
Adding to everyone else's suggestions, you can also move the / c (and an array access) out of the inner loop, resulting in:

double f = -0.5 / d / d;
for (int j = m; --j >= 0;)
{
    double bj = b[j];
    double sum = 0.0;
    for (int k = m; --k >= 0;)
    {
        double incr = b[k] - bj;
        sum += Math.exp(f * incr * incr);
    }
    output[j] = sum / c;
}

make that:

output[j] += sum / c;

The original piece of code does not assume that the array is filled with
zeroes.
 
Robert Klemme

Ingo said:
Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds checks, AFAIK.

It would not hurt to do the extraction anyway. It also has a bit of a documentation advantage, especially when using "final".

Ingo said:
No, of course; this is left to the OP. He asked for "any suggestions", not for a complete solution. ;-)

I'm sorry, I didn't expect *you* to do the profiling. :)

robert
 
Thomas Hawtin

Chris said:
The remaining time is completely dominated (~ 96%) by the cost of the m**2
calls to Math.exp() so there's no scope at all for more optimisation that
doesn't involve either cutting down the number of exp() calls, or replacing
them with something faster.

I wouldn't be surprised if moving to non-x86 improved performance by an
order of magnitude.

The code may become significantly more accurate if the exp is replaced by expm1 and m is then added to the sum *after* the loop. It's better to add a bundle of numbers that are close to zero than a similar set with an offset.
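A sketch of that transformation for one row of the computation (the class and method names are illustrative): since exp(t) = expm1(t) + 1, the sum of m exp() terms equals the sum of m expm1() terms plus m, and the expm1 values, being close to zero, lose fewer low-order bits in the running sum.

```java
// Illustrative sketch of the expm1 rewrite for one inner loop.
// sum_k exp(t_k) == sum_k expm1(t_k) + m, with the m deferred 1s
// added once at the end instead of being folded into every addend.
class Expm1Sum {
    static double rowSum(double[] b, double bj, double f) {
        double sum = 0.0;
        for (double bk : b) {
            double x = bk - bj;
            sum += Math.expm1(f * x * x);   // exp(f*x*x) - 1, accurately
        }
        return sum + b.length;              // add the deferred 1s once
    }
}
```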

Tom Hawtin
 
Roedy Green

megagurka said:
Try using floats if the precision is sufficient.

I was looking at the principles of operation for the Pentium. Storing/popping a double or float takes 3 cycles, but storing an 80-bit intermediate takes 7 cycles. So you pay no speed penalty for using floats, and you get the advantage of compactness, which may mean better CPU cache, SRAM cache, and swapping behaviour.
 
Roedy Green

Chris Uppal said:
The remaining time is completely dominated (~96%) by the cost of the m**2 calls to Math.exp(), so there's no scope at all for more optimisation that doesn't involve either cutting down the number of exp() calls or replacing them with something faster.

Are there any symmetries to be exploited? You compute just one part
of the solution and propagate it to the other cells.
 
Patricia Shanahan

Roedy said:
Are there any symmetries to be exploited? You compute just one part
of the solution and propagate it to the other cells.

Good point. It should be possible to get rid of half the Math.exp calls.

The squares of (b[j]-b[k]) and (b[k]-b[j]) are equal. In terms of Chris'
sample code, once Math.exp(f * incr * incr) has been calculated, it
should be added to both Array[j] and Array[k].
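In code, the halving might look like this sketch (class and method names are illustrative; this is the simple triangular version, before any blocking):

```java
// Illustrative sketch of the symmetry: (b[j]-b[k])^2 == (b[k]-b[j])^2,
// so each Math.exp result is credited to both array[j] and array[k].
// Only the strict upper triangle is computed; the k == j diagonal
// contributes exp(0)/c = 1/c to each element exactly once.
class Symmetric {
    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        double f = -0.5 / (d * d);
        for (int j = 0; j < m; j++) {
            array[j] += 1.0 / c;              // diagonal term
            double bj = b[j];
            for (int k = j + 1; k < m; k++) {
                double x = b[k] - bj;
                double e = Math.exp(f * x * x) / c;
                array[j] += e;
                array[k] += e;                // the mirrored term, for free
            }
        }
    }
}
```

This does m*(m-1)/2 exp() calls instead of m*m, at the price of scattered writes to array[k].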

If that brings in too much cache missing, it may be time for loop blocking.

Patricia
 
Roedy Green

Chris Uppal said:
The remaining time is completely dominated (~96%) by the cost of the m**2 calls to Math.exp(), so there's no scope at all for more optimisation that doesn't involve either cutting down the number of exp() calls or replacing them with something faster.

If you want to try something drastic, here is what you can do.

You need a new fast exp() function. It only has to work over a narrow domain compared to Sun's exp(). That means you don't need as high-order a polynomial to calculate it. Further, you likely don't need it accurate to 1 ulp.

So, given the constraints on the domain and accuracy, you do a least-squares Chebyshev approximation of order 2, 3, 4, etc. until you find one that is accurate enough.

Another approach: if your domain is small enough, you could build a table of values/derivatives and do an interpolation, or use a very low-order Chebyshev polynomial specific to that band.
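As a toy illustration of the polynomial route (the coefficients below are plain Taylor terms, not a fitted Chebyshev/minimax polynomial, so they are only decent very close to zero; a real replacement would fit coefficients to the actual input band as described above):

```java
// Toy sketch only: a degree-4 polynomial for exp(t) evaluated in Horner
// form. These are Taylor coefficients, adequate only for small |t|; a
// least-squares/Chebyshev fit over the real domain buys far more
// accuracy for the same degree.
class FastExp {
    static double fastExp(double t) {
        return 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0 + t * (1.0 / 24.0))));
    }
}
```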
 
Roedy Green

Skip said:
The original piece of code does not assume that the array is filled with zeroes.

Java arrays always start life zeroed, though we don't know for sure that this array was freshly allocated.
 
Roedy Green

Patricia said:
If that brings in too much cache missing, it may be time for loop blocking.

OK, we'll bite: what is loop blocking? Do you mean arranging the work so that it operates on one tile of RAM before moving on to the next tile?
 
