Java loop optimization

star

Hi,
I have a problem optimizing a loop in Java.
My code looks like this:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}

The problem is that m depends on some input data in the application and it can be very big, for example 150000. In that case the loops take 2 hours, which is unacceptable for the situation.
I also can't use native code.
If anyone has any suggestions, please write.
Thanks in advance
 
Ingo R. Homann

Hi,

Perhaps the following will speed things up a bit:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        // Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
        double x = (b[k] - b[j]) / d; // or int or ...?
        Array[j] += Math.exp(-0.5 * x * x) / c;
    }
}

Another possibility might be to use a lookup table instead of Math.exp, or even better for the whole term x -> Math.exp(-0.5*x*x)/c. Of course, it also depends on the precision you need.
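A minimal sketch of that lookup-table idea might look like this (the class name, table size, and range cutoff are illustrative assumptions, not from the thread). It tabulates the whole term exp(-0.5*x*x)/c on a fixed grid and interpolates linearly, using the facts that the term is even in x and effectively zero for large |x|:

```java
// Hypothetical sketch of the lookup-table idea; names and sizes are
// illustrative. Tabulates f(x) = exp(-0.5*x*x)/c on [0, XMAX] and
// interpolates linearly; f is even in x, and negligible for |x| >= 6.
class ExpTable {
    static final double XMAX = 6.0;
    static final int N = 4096;
    static final double STEP = XMAX / N;
    static final double[] TABLE = new double[N + 1];

    static void init(double c) {
        for (int i = 0; i <= N; i++) {
            double x = i * STEP;
            TABLE[i] = Math.exp(-0.5 * x * x) / c;
        }
    }

    static double lookup(double x) {
        double ax = Math.abs(x);        // f(x) == f(-x)
        if (ax >= XMAX) return 0.0;     // tail treated as zero
        double pos = ax / STEP;
        int i = (int) pos;
        // linear interpolation between neighbouring table entries
        return TABLE[i] + (pos - i) * (TABLE[i + 1] - TABLE[i]);
    }
}
```

Whether the table actually beats Math.exp depends on cache behaviour, and as noted above, on the precision needed.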

Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

star said:
In that case the loops take 2 hours, which is unacceptable for the situation.

What would be acceptable? 1 hour? 5 minutes?

Ciao,
Ingo
 
Robert Klemme

Ingo said:
Perhaps the following will speed things up a bit:

for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        // Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
        double x = (b[k] - b[j]) / d; // or int or ...?
        Array[j] += Math.exp(-0.5 * x * x) / c;
    }
}

Another possibility might be to use a lookup table instead of Math.exp, or even better for the whole term x -> Math.exp(-0.5*x*x)/c. Of course, it also depends on the precision you need.

Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

And you can extract b[k] from the inner loop into the outer loop (if the
JVM doesn't do this already) to save a lot of array bounds checking.
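That hoisting might be sketched as follows (the class name, method name, and signature are assumptions for the example, not from the thread):

```java
// Illustrative sketch of hoisting b[k] out of the inner loop, as
// suggested: b[k] is read (and bounds-checked) once per outer
// iteration instead of once per inner iteration.
class Hoisted {
    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        for (int k = m; --k >= 0;) {
            final double bk = b[k];   // hoisted; final also documents intent
            for (int j = m; --j >= 0;) {
                double x = (bk - b[j]) / d;
                array[j] += Math.exp(-0.5 * x * x) / c;
            }
        }
    }
}
```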

Did you profile this?

robert
 
megagurka

Ingo R. Homann wrote:
Oh, and you can also precalculate the constant factor involving d outside the loop:

double f = -0.5 / d / d;

double x = b[k] - b[j];
Array[j] += Math.exp(f * x * x) / c;

Try using floats if the precision is sufficient. Also, rearranging the loops helps avoid the division and some array accesses. Here's the code:

float y = -0.5f / d / d;

for (int j = 0; j < m; j++) {
    float v = 0;
    for (int k = 0; k < m; k++) {
        float x = b[k] - b[j];
        v += Math.exp(y * x * x);
    }
    Array[j] += v / c;
}

The Math.exp() method will consume about 90% of the execution time, so it needs to be replaced if you want to optimize the code further.

/JN
 
Ingo R. Homann

Hi,

Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds checks, AFAIK.

Robert said:
Did you profile this?

No, of course; this is left to the OP. He asked for "any suggestions", not for a complete solution. ;-)

Ciao,
Ingo
 
Patricia Shanahan

star said:
Hi,
I have a problem optimizing a loop in Java.
My code looks like this:
for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}
The problem is that m depends on some input data in the application and it can be very big, for example 150000. In that case the loops take 2 hours, which is unacceptable for the situation.
I also can't use native code.
If anyone has any suggestions, please write.
Thanks in advance

The first question is whether the bottleneck is the arithmetic, or the
memory accesses. Try substituting "Array[j] = b[k]-b[j];". If that is
significantly faster, apply some of the already posted ideas for
improving the arithmetic.

If not, you have a memory access problem.

I'd work with counting up rather than counting down, and separate out the increment instead of combining it with the test, just to make things easier for both me and the compiler. It may even be faster, because it is more idiomatic and hardware prefetchers often work better for forward scans of memory, but it will probably perform about the same.

The simplest change is to just exchange the two loops, so that you get
maximum reuse out of each element of Array once you get it into cache.
However, you may need to do loop blocking, so that you work for a while
with a couple of subranges of b elements, small enough to fit in caches.
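The blocking idea might be sketched like this (the class name, method name, and block size are illustrative assumptions; the right tile size depends on the actual cache sizes):

```java
// Illustrative sketch of loop blocking: process b and the output array
// in tiles so that each tile stays cache-resident while it is reused.
// BLOCK is a guess and would need tuning for the real cache hierarchy.
class Blocked {
    static final int BLOCK = 1024;

    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        double f = -0.5 / (d * d);
        for (int j0 = 0; j0 < m; j0 += BLOCK) {
            int jEnd = Math.min(j0 + BLOCK, m);
            for (int k0 = 0; k0 < m; k0 += BLOCK) {
                int kEnd = Math.min(k0 + BLOCK, m);
                for (int j = j0; j < jEnd; j++) {
                    double bj = b[j];
                    double sum = 0.0;
                    for (int k = k0; k < kEnd; k++) {
                        double x = b[k] - bj;
                        sum += Math.exp(f * x * x);
                    }
                    array[j] += sum / c;   // accumulate one (j, k) tile
                }
            }
        }
    }
}
```

The arithmetic is unchanged; only the traversal order differs, so each subrange of b is reused from cache across a whole tile of output elements.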

Patricia
 
Skip

Ingo R. Homann said:
Hi,

Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the
JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds-checks, AFAIK.

The client VM (up to 1.5) *cannot* do bounds-check removal. Only the server VM is able to do that.
 
Chris Uppal

star said:
for (k = m; --k >= 0;) {
    for (j = m; --j >= 0;) {
        Array[j] += Math.exp(-0.5 * Math.pow((b[k] - b[j]) / d, 2)) / c;
    }
}

Adding to everyone else's suggestions, you can also move the / c (and an array
access) out of the inner loop, resulting in:

double f = -0.5 / d / d;
for (int j = m; --j >= 0;)
{
    double bj = b[j];
    double sum = 0.0;
    for (int k = m; --k >= 0;)
    {
        double incr = b[k] - bj;
        sum += Math.exp(f * incr * incr);
    }
    output[j] = sum / c;
}

I didn't include Patricia's suggestion of looping forwards since I found it ran marginally slower on my Pentium-based machine. Interestingly, I found that for this problem (with code added to allow the JITer to warm up before running the timing tests) the -client JVM was consistently a little faster than the -server version.

The above code runs somewhat faster -- in my test with m=10000 it ran in around
22 seconds rather than 35. Extrapolating that up to m=150K, suggests an
improvement from 2.2 hours to 1.4 hours -- useful but not sparkling...

The remaining time is completely dominated (~ 96%) by the cost of the m**2
calls to Math.exp() so there's no scope at all for more optimisation that
doesn't involve either cutting down the number of exp() calls, or replacing
them with something faster.

-- chris
 
Skip

Chris Uppal said:
Adding to everyone else's suggestions, you can also move the / c (and an array access) out of the inner loop, resulting in:

double f = -0.5 / d / d;
for (int j = m; --j >= 0;)
{
    double bj = b[j];
    double sum = 0.0;
    for (int k = m; --k >= 0;)
    {
        double incr = b[k] - bj;
        sum += Math.exp(f * incr * incr);
    }
    output[j] = sum / c;
}

make that:

output[j] += sum / c;

The original piece of code does not assume that the array is filled with
zeroes.
 
Robert Klemme

Ingo said:
Robert said:
And you can extract b[k] from the inner loop into the outer loop (if the JVM doesn't do this already) to save a lot of array bounds checking.

In such easy cases, the JIT should remove the bounds checks, AFAIK.

It would not hurt to do the extraction anyway. It also has a bit of a documentation advantage, especially when using "final".

Ingo said:
No, of course; this is left to the OP. He asked for "any suggestions", not for a complete solution. ;-)

I'm sorry, I didn't expect *you* to do the profiling. :)

robert
 
Thomas Hawtin

Chris said:
The remaining time is completely dominated (~ 96%) by the cost of the m**2
calls to Math.exp() so there's no scope at all for more optimisation that
doesn't involve either cutting down the number of exp() calls, or replacing
them with something faster.

I wouldn't be surprised if moving to non-x86 improved performance by an
order of magnitude.

The code may become significantly more accurate if the exp is replaced by expm1 and m is then added to the sum *after* the loop. It's better to add a bundle of numbers that are close to zero than a similar set with an offset.
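A sketch of that transformation for one row of the computation (the class and method names are illustrative): since exp(t) = expm1(t) + 1, the sum of m exp() terms equals the sum of m expm1() terms plus m, and the expm1 values, being close to zero, lose fewer low-order bits in the running sum.

```java
// Illustrative sketch of the expm1 rewrite for one inner loop.
// sum_k exp(t_k) == sum_k expm1(t_k) + m, with the m deferred 1s
// added once at the end instead of being folded into every addend.
class Expm1Sum {
    static double rowSum(double[] b, double bj, double f) {
        double sum = 0.0;
        for (double bk : b) {
            double x = bk - bj;
            sum += Math.expm1(f * x * x);   // exp(f*x*x) - 1, accurately
        }
        return sum + b.length;              // add the deferred 1s once
    }
}
```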

Tom Hawtin
 
Roedy Green

megagurka said:
Try using floats if the precision is sufficient.

I was looking at the principles of operation for the Pentium. Storing/popping a double or float takes 3 cycles, but storing an 80-bit intermediate takes 7 cycles. So you pay no speed penalty for using floats, and you get the advantage of compactness, which may mean better CPU cache, SRAM cache, and swapping behaviour.
 
Roedy Green

Chris Uppal said:
The remaining time is completely dominated (~96%) by the cost of the m**2 calls to Math.exp(), so there's no scope at all for more optimisation that doesn't involve either cutting down the number of exp() calls or replacing them with something faster.

Are there any symmetries to be exploited? You compute just one part
of the solution and propagate it to the other cells.
 
Patricia Shanahan

Roedy said:
Are there any symmetries to be exploited? You compute just one part
of the solution and propagate it to the other cells.

Good point. It should be possible to get rid of half the Math.exp calls.

The squares of (b[j]-b[k]) and (b[k]-b[j]) are equal. In terms of Chris'
sample code, once Math.exp(f * incr * incr) has been calculated, it
should be added to both Array[j] and Array[k].
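In code, the halving might look like this sketch (class and method names are illustrative; this is the simple triangular version, before any blocking):

```java
// Illustrative sketch of the symmetry: (b[j]-b[k])^2 == (b[k]-b[j])^2,
// so each Math.exp result is credited to both array[j] and array[k].
// Only the strict upper triangle is computed; the k == j diagonal
// contributes exp(0)/c = 1/c to each element exactly once.
class Symmetric {
    static void accumulate(double[] array, double[] b, double d, double c) {
        int m = b.length;
        double f = -0.5 / (d * d);
        for (int j = 0; j < m; j++) {
            array[j] += 1.0 / c;              // diagonal term
            double bj = b[j];
            for (int k = j + 1; k < m; k++) {
                double x = b[k] - bj;
                double e = Math.exp(f * x * x) / c;
                array[j] += e;
                array[k] += e;                // the mirrored term, for free
            }
        }
    }
}
```

This does m*(m-1)/2 exp() calls instead of m*m, at the price of scattered writes to array[k].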

If that brings in too much cache missing, it may be time for loop blocking.

Patricia
 
Roedy Green

Chris Uppal said:
The remaining time is completely dominated (~96%) by the cost of the m**2 calls to Math.exp(), so there's no scope at all for more optimisation that doesn't involve either cutting down the number of exp() calls or replacing them with something faster.

If you want to try something drastic, here is what you can do.

You need a new fast exp() function. It only has to work over a narrow domain compared to Sun's exp(). That means you don't need as high-order a polynomial to calculate it. Further, you likely don't need it accurate to 1 ulp.

So, given the constraints on the domain and accuracy, you do a least-squares Chebyshev approximation of order 2, 3, 4, etc. until you find one that is accurate enough.

Another approach: if your domain is small enough, you could build a table of values/derivatives and do an interpolation, or use a very low-order Chebyshev polynomial specific to that band.
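As a toy illustration of the polynomial route (the coefficients below are plain Taylor terms, not a fitted Chebyshev/minimax polynomial, so they are only decent very close to zero; a real replacement would fit coefficients to the actual input band as described above):

```java
// Toy sketch only: a degree-4 polynomial for exp(t) evaluated in Horner
// form. These are Taylor coefficients, adequate only for small |t|; a
// least-squares/Chebyshev fit over the real domain buys far more
// accuracy for the same degree.
class FastExp {
    static double fastExp(double t) {
        return 1.0 + t * (1.0 + t * (0.5 + t * (1.0 / 6.0 + t * (1.0 / 24.0))));
    }
}
```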
 
Roedy Green

Skip said:
The original piece of code does not assume that the array is filled with zeroes.

Java arrays always start life zeroed, though we don't know for sure that this array was freshly allocated.
 
Roedy Green

Patricia said:
If that brings in too much cache missing, it may be time for loop blocking.

OK, we'll bite: what is loop blocking? Do you mean arranging the work so that it operates on one tile of RAM before moving on to the next tile?
 
