Java loop optimization

Patricia Shanahan

Roedy said:
OK, we'll bite. What is loop blocking? You mean arranging the work to
work on a tile of RAM before moving on to the next tile.

Correct in principle. Usually, one thinks in terms of tiles, because the
commonest cause of nested loops is calculations on multidimensional arrays.

In this case, one would pick a range of values for each of k and j, such
that all Array and b elements in either range fit in cache, and then do
ALL the calculations involving pairs of k and j values in those ranges.

If the slice size is s, the innermost pair of loops do s*s calculations,
for the price of s cache misses. You want s to be as large as possible
without triggering extra cache misses.
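
A minimal sketch of that blocking, applied to this thread's kernel (b[],
output[], f and m as in the other posts; the tile size S is an assumption,
to be tuned so that a tile's worth of data stays cache-resident):

=============
final int S = 1024; // tile size: tune to the cache

for (int jt = 0; jt < m; jt += S)
{
    int jEnd = Math.min(jt + S, m);
    for (int kt = 0; kt < m; kt += S)
    {
        int kEnd = Math.min(kt + S, m);
        // every (j, k) pair in this tile touches only ~2*S array elements
        for (int j = jt; j < jEnd; j++)
        {
            double bJ = b[j];
            double sum = 0.0;
            for (int k = kt; k < kEnd; k++)
            {
                double diff = b[k] - bJ;
                sum += Math.exp(f * diff * diff);
            }
            output[j] += sum; // output[] assumed zeroed beforehand
        }
    }
}
=============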

Patricia
 
Chris Uppal

Patricia said:
The squares of (b[j]-b[k]) and (b[k]-b[j]) are equal. In terms of Chris'
sample code, once Math.exp(f * incr * incr) has been calculated, it
should be added to both Array[j] and Array[k].

Nice.

With the code now reading:

=============
double f = -0.5 / d / d;

for (int j = m; --j >= 0;)
{
    double inputJ = input[j];
    double sumJ = 1.0;
    for (int k = m; --k > j;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        sumJ += incr;
        output[k] += incr;
    }
    output[j] = sumJ;
}

for (int j = m; --j >= 0;)
    output[j] /= c;
=============

(Where I've renamed the original b[] and Array[] to input[] and output[]
respectively. I'm being explicit about assuming the output is initially
zeroed. If that assumption doesn't hold, the OP could either add my output[]
to his Array[], or reinstate the division by c that I've moved out of the
inner loop.)

BTW, by initialising sumJ to 1 (since exp(0) == 1) and terminating the inner
loop one step early, we avoid computing exp() for the j==k case.

With the conditions as before, m=10000, the run time comes down to about 12.5
seconds for this new code. Scaling up to m=150000 should now run in about 50
minutes (on my machine) assuming that cache effects do not seriously affect the
scaling (I have no evidence either way, and am not about to try a 50 minute run
just to find out ;-)

The output[] array is in fact palindromic, but the only algorithm I've been
able to think of to take advantage of the fact is the following very messy
code:

=============
double f = -0.5 / d / d;
int halfway = (m - 1) / 2;

for (int j = m; --j >= halfway;)
{
    double inputJ = input[j];
    double sumJ = 1.0;
    for (int k = m; --k > j;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        sumJ += incr;
        output[k] += incr;
    }
    output[j] = sumJ;
}

for (int j = halfway; --j >= 0;)
{
    double inputJ = input[j];
    int stop = Math.max(j, halfway - 1);
    for (int k = m; --k > stop;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        output[k] += incr;
    }
}

for (int j = m; --j >= halfway;)
    output[j] /= c;

for (int j = 0; j < halfway; j++)
    output[j] = output[m - j - 1];
=============

I have a lot of trouble with fencepost errors at the best of times, and in code
like the above I have /no/ confidence that I've got the details right (though
it does /appear/ to work correctly). Anyway, running that with m=10000 takes
8.5 seconds, which would suggest that my machine might be able to complete the
m=150K case in around half an hour.

-- chris
 
Chris Uppal

Thomas said:
I wouldn't be surprised if moving to non-x86 improved performance by an
order of magnitude.

Is x86 floating point so /very/ slow? I hadn't realised that (and don't have
any non-x86 machines to try :-(

-- chris
 
megagurka

Another suggestion is to convert your input values to integer fixed
point values. You could then use a lookup table for the exp() function.
The precision could be 16 bits or higher, depending on the amount of
memory you have available for the lookup table. The loop would become
something like:

int[] bInt = <b[] converted to ints>;

for (int j = 0; j < m; j++) {
    int v = 0;

    for (int k = 0; k < m; k++) {
        v += expTable[Math.abs(bInt[k] - bInt[j])];
    }

    Array[j] += (double) v * someConstant;
}

This code would run MUCH faster than the floating point variant. Maybe
someone has the time and energy to make a comparison. ;-)
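
For concreteness, the table-building step might look something like the
sketch below (the step size QUANTUM, the fixed-point SCALE and the long
accumulator are my assumptions, not part of the original suggestion; b[],
Array[], f, c and m are as elsewhere in the thread):

final double QUANTUM = 1e-3; // assumed quantisation step
final int SCALE = 1 << 16;   // assumed fixed-point scale (~16 bits)

// Quantise the inputs and find their range.
int[] bInt = new int[m];
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
for (int i = 0; i < m; i++) {
    bInt[i] = (int) Math.round(b[i] / QUANTUM);
    min = Math.min(min, bInt[i]);
    max = Math.max(max, bInt[i]);
}

// expTable[i] approximates exp(f * (i*QUANTUM)^2) in fixed point.
int[] expTable = new int[max - min + 1];
for (int i = 0; i < expTable.length; i++) {
    double x = i * QUANTUM;
    expTable[i] = (int) Math.round(Math.exp(f * x * x) * SCALE);
}

for (int j = 0; j < m; j++) {
    long v = 0; // long: m * SCALE would overflow an int at m = 150,000

    for (int k = 0; k < m; k++) {
        v += expTable[Math.abs(bInt[k] - bInt[j])];
    }

    Array[j] += (double) v / (SCALE * c);
}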

/JN
 
Patricia Shanahan

Chris Uppal wrote:
....
With the conditions as before, m=10000, the run time comes down to about 12.5
seconds for this new code. Scaling up to m=150000 should now run in about 50
minutes (on my machine) assuming that cache effects do not seriously affect the
scaling (I have no evidence either way, and am not about to try a 50 minute run
just to find out ;-)
....

Could you send me your test/timing harness, so that I can try it out?
I'll have a long enough block of time when I won't need my computer for
anything else.

The total data for m=150,000 is 2*8*150,000 = 2,400,000 bytes,
which is probably a bit big for effective caching, but without a test it
is hard to tell whether cache effects will matter or not.

Patricia
 
Thomas Hawtin

Chris said:
Is x86 floating point so /very/ slow ? I hadn't realised that (and don't have
any non-x86 machines to try :-(

I've never done much fp stuff. However, my understanding is that some
functions (notably trig, which is very similar to exp) can with some
values be insufficiently accurate for the Java specs as laid down by
Sun. Code necessary to correct for the error is not fast.

If the specs for Math.toRadians/toDegrees (which are really simple) were
as strict as those for trig, Sun's code would just fail. The function
implementations will also overflow needlessly.

Tom Hawtin
 
Mark Thornton

Thomas said:
I've never done much fp stuff. However, my understanding is that some
functions (notably trig, which is very similar to exp) can with some
values be insufficiently accurate for the Java specs as laid down by
Sun. Code necessary to correct for the error is not fast.

The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.
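
The effect of reduction accuracy is easy to see even from pure Java, by
reducing a large argument with the 53-bit double constant Math.PI instead
of letting Math.sin() do the reduction properly (a quick illustration;
the value 1e10 is just an example):

double x = 1.0e10;
double naive = Math.sin(x % (2.0 * Math.PI)); // reduce with 53-bit PI
double exact = Math.sin(x);                   // correctly reduced by the library
System.out.println(naive + " vs " + exact);   // the two visibly disagree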

Mark Thornton
 
Chris Uppal

Mark said:
The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.

Ah! Very interesting.

BTW, do you have a reference for that? (I don't doubt you, I just want to find
more background info if any's available.)

-- chris
 
Thomas Hawtin

Mark said:
The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.

As I say, I'm not an expert in this (approximate) field. I have just
done some very basic microbenchmarks to investigate what happens.

#include <math.h>
#include <stdio.h>

int main() {
    double f = 0;
    for (int ct = 1; ct <= 10*1000*1000; ++ct) {
        f += sin(1/(double)ct);
    }
    printf("%e\n", f);
    return 0;
}

import static java.lang.Math.*;

class Sin {
    public static void main(String[] args) {
        for (int ct = 0; ct < 3; ++ct) {
            long start = System.nanoTime();
            double f = run();
            long end = System.nanoTime();
            System.out.println(end - start);
            System.out.println(f);
        }
    }

    private static double run() {
        double f = 0;
        for (int ct = 1; ct <= 10*1000*1000; ++ct) {
            f += sin(1/(double)ct);
        }
        return f;
    }
}

Strangely enough, g++ (GCC) 4.0.0 20050519 (Red Hat 4.0.0-8), Sun
1.6.0-ea-b51 and Sun 1.5.0_05-b05 gave me quite similar results. (Using
no options with java and -O2 with g++ (-march=pentium2 slowed it down).)
Still, they all take over 100 cycles/iteration.

For large values of x the dominant terms of the power series expansion
of e^x move down the sequence. Presumably the small terms should be
added together first. Is there an analogue of the PI/4 normalisation in
however exp is really calculated? Or does extended precision take care
of getting to within 1 ulp, even for negative values?

And is it still true that Java is slow for trig, on x86, for small
arguments?

Tom Hawtin
 
Chris Uppal

I said:
The output[] array is in fact palindromic, but the only algorithm I've
been able to think of to take advantage of the fact is the following very
messy code:

Patricia has pointed out that this is only true as a result of the over-regular
input data I was using, and does not hold in general (I /knew/ I should have
tried random input...).

So the "very messy code" should be ignored.

Many thanks for the correction, Patricia.

-- chris
 
Roedy Green

Thomas said:
For large values of x the dominant terms of the power series expansion
of e^x move down the sequence. Presumably the small terms should be
added together first.

There is no FEXP hardware instruction. There are FYL2X and FYL2XP1,
which compute y * log2(x) and y * log2(x+1), and F2XM1, which computes
2^x - 1.


Here is how I implemented e^x in orgasm, a postfix assembler that is
part of BBL Forth. First I implemented 2^x (^ means "to the power of").
Note there are no polynomial approximations or loops.

CODE $2^X ( F -- H )
\ calc 2**X where X is TOS
\ works if depth 1..6
\ save present NPX control word
CS: CTRL-WORD #) FSTCW
\ save NPX control word for modification
CS: TEMP/CTRL-WORD #) FSTCW

\ change rounding mode to round down
F3FF # CS: TEMP/CTRL-WORD #) AND
0400 # CS: TEMP/CTRL-WORD #) OR
CS: TEMP/CTRL-WORD #) FLDCW

\ st(0) = st(1)
ST(0) FLD
FRNDINT
\ restore the original control word
CS: CTRL-WORD #) FLDCW

\ get fraction part of X
ST(0) ST(1) FSUB
ST(1) FXCH

\ divide fraction by 1/2
CS: DWORD $HALF #) FLD
ST(1) FXCH
FPREM
\ was fraction > 1/2?
CS: FSTATUS-WORD #) FSTSW
ST(1) FSTP
F2XM1
FLD1
ST(0) ST(1) FADDP
02 # CS: BYTE FSTATUS-WORD 1+ #) TEST

0<>, IF,
\ fraction was greater than 1/2
FLD1
ST(0) ST(0) FADD
FSQRT
ST(0) ST(1) FMULP
THEN,
\ fraction was less than or equal to 1/2
\ guaranteed integer for 2**X
FSCALE
ST(1) FSTP
RET ;C

Now that I have that, I can do e^x:

CODE FE^X ( X -- e^X ) \ e.g. e**X
\ works if depth 1..6
FLDL2E
ST(0) ST(1) FMULP
' $2^X CALL
NEXT
;C
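
For anyone more comfortable in Java than in x87 assembly, the identity the
code above exploits, e^x = 2^(x * log2(e)) with the exponent split into
integer and fractional parts, can be sketched like this (Math.scalb stands
in for FSCALE, Math.expm1 for F2XM1, and the sqrt(2) branch mirrors the
FPREM step; this mirrors the structure, not the precision):

static double expViaPow2(double x) {
    double y = x * 1.4426950408889634;    // x * log2(e), as FLDL2E/FMULP
    double n = Math.floor(y);             // integer part, as FRNDINT rounding down
    double f = y - n;                     // fractional part, in [0, 1)
    double r = f % 0.5;                   // FPREM: reduce the fraction mod 1/2
    double p = 1.0 + Math.expm1(r * Math.log(2.0)); // 2^r, as F2XM1 + 1
    if (f >= 0.5) {
        p *= Math.sqrt(2.0);              // the "fraction was greater than 1/2" branch
    }
    return Math.scalb(p, (int) n);        // FSCALE: multiply by 2^n
}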
 
ChrisWSU

The problem, I would say, has more to do with the fact that you're using
an algorithm with a quadratic runtime. You can optimize it as much as you
want, but the growth rate is what it is. I'd spend more time looking for
a different solution to the problem, as opposed to trying to make it do a
division in 20 nanoseconds instead of 30. If there's no other possible
way to solve it and a small % difference is important, then worry about
optimizing. Although, to be completely fair, there are instances where a
slight speed increase is needed, and it can be real fun to try ;)

remember:
"Premature optimization is the root of all evil"
O( n^k / 5 ) == O(n^k) aka slow is slow
 
Roedy Green

"Premature optimization is the root of all evil"

In this case it was determined that optimisation was needed. It was
not premature. Optimisation is not evil, just premature optimisation.


In this case we managed to shave well over 50% of the time off the
loop and provided hints to go well beyond that. We did not change the
order of the problem, but the size of his problem is fixed, so it does
not matter. What matters to him is not having to sit around for 2
hours while a little matrix initialises.
 
star

Hi guys,
I really want to thank you for all your suggestions. The application now
runs about 50% faster. I tried almost everything that you suggested; I
even called a native method in C, using code that calculates exp about
twice as fast as C's exp does (just to see how far off I was), and I got
almost the same result. So I assume there is not much more that can be
done.

Thanks to all
 
megagurka

star said:
So I assume there is not much more that can be done.

As I and others have pointed out, if you're willing to sacrifice some
precision, you can speed up the algorithm substantially.

/JN
 
