Java loop optimization

Patricia Shanahan

Roedy said:
OK, we'll bite. What is loop blocking? You mean arranging the work to
work on a tile of RAM before moving on to the next tile.

Correct in principle. Usually, one thinks in terms of tiles, because the
commonest cause of nested loops is calculations on multidimensional arrays.

In this case, one would pick a range of values for each of k and j, such
that all Array and b elements in either range fit in cache, and then do
ALL the calculations involving pairs of k and j values in those ranges.

If the slice size is s, the innermost pair of loops do s*s calculations,
for the price of s cache misses. You want s to be as large as possible
without triggering extra cache misses.
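
A minimal sketch of that blocking, applied to this thread's kernel (b[],
output[], f and m as in the other posts; the tile size S is an assumption,
to be tuned so that a tile's worth of data stays cache-resident):

=============
final int S = 1024; // tile size: tune to the cache

for (int jt = 0; jt < m; jt += S)
{
    int jEnd = Math.min(jt + S, m);
    for (int kt = 0; kt < m; kt += S)
    {
        int kEnd = Math.min(kt + S, m);
        // every (j, k) pair in this tile touches only ~2*S array elements
        for (int j = jt; j < jEnd; j++)
        {
            double bJ = b[j];
            double sum = 0.0;
            for (int k = kt; k < kEnd; k++)
            {
                double diff = b[k] - bJ;
                sum += Math.exp(f * diff * diff);
            }
            output[j] += sum; // output[] assumed zeroed beforehand
        }
    }
}
=============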

Patricia
 
Chris Uppal

Patricia said:
The squares of (b[j]-b[k]) and (b[k]-b[j]) are equal. In terms of Chris'
sample code, once Math.exp(f * incr * incr) has been calculated, it
should be added to both Array[j] and Array[k].

Nice.

With the code now reading:

=============
double f = -0.5 / d / d;

for (int j = m; --j >= 0;)
{
    double inputJ = input[j];
    double sumJ = 1.0;
    for (int k = m; --k > j;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        sumJ += incr;
        output[k] += incr;
    }
    output[j] = sumJ;
}

for (int j = m; --j >= 0;)
    output[j] /= c;
=============

(Where I've renamed the original b[] and Array[] to input[] and output[]
respectively. I'm being explicit about assuming the output is initially
zeroed. If that assumption doesn't hold, the OP could either add my output[]
to his Array[], or reinstate the division by c that I've moved out of the
inner loop.)

BTW, by initialising sumJ to 1 (since exp(0) == 1) and terminating the inner
loop one step early, we avoid computing exp() for the j==k case.

With the conditions as before, m=10000, the run time comes down to about 12.5
seconds for this new code. Scaling up to m=150000 should now run in about 50
minutes (on my machine) assuming that cache effects do not seriously affect the
scaling (I have no evidence either way, and am not about to try a 50 minute run
just to find out ;-)

The output[] array is in fact palindromic, but the only algorithm I've been
able to think of to take advantage of the fact is the following very messy
code:

=============
double f = -0.5 / d / d;
int halfway = (m - 1) / 2;

for (int j = m; --j >= halfway;)
{
    double inputJ = input[j];
    double sumJ = 1.0;
    for (int k = m; --k > j;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        sumJ += incr;
        output[k] += incr;
    }
    output[j] = sumJ;
}

for (int j = halfway; --j >= 0;)
{
    double inputJ = input[j];
    int stop = Math.max(j, halfway - 1);
    for (int k = m; --k > stop;)
    {
        double diff = input[k] - inputJ;
        double incr = Math.exp(f * diff * diff);
        output[k] += incr;
    }
}

for (int j = m; --j >= halfway;)
    output[j] /= c;

for (int j = 0; j < halfway; j++)
    output[j] = output[m - j - 1];
=============

I have a lot of trouble with fencepost errors at the best of times, and in code
like the above I have /no/ confidence that I've got the details right (though
it does /appear/ to work correctly). Anyway, running that with m=10000 takes
8.5 seconds, which would suggest that my machine might be able to complete the
m=150K case in around half an hour.

-- chris
 
Chris Uppal

Thomas said:
I wouldn't be surprised if moving to non-x86 improved performance by an
order of magnitude.

Is x86 floating point so /very/ slow? I hadn't realised that (and don't have
any non-x86 machines to try :-(

-- chris
 
megagurka

Another suggestion is to convert your input values to integer fixed
point values. You could then use a lookup table for the exp() function.
The precision could be 16 bits or higher, depending on the amount of
memory you have available for the lookup table. The loop would become
something like:

int[] bInt = <b[] converted to ints>;

for (int j = 0; j < m; j++) {
    int v = 0;

    for (int k = 0; k < m; k++) {
        v += expTable[Math.abs(bInt[k] - bInt[j])];
    }

    Array[j] += (double) v * someConstant;
}

This code would run MUCH faster than the floating point variant. Maybe
someone has the time and energy to make a comparison. ;-)
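
For concreteness, the table-building step might look something like the
sketch below (the step size QUANTUM, the fixed-point SCALE and the long
accumulator are my assumptions, not part of the original suggestion; b[],
Array[], f, c and m are as elsewhere in the thread):

final double QUANTUM = 1e-3; // assumed quantisation step
final int SCALE = 1 << 16;   // assumed fixed-point scale (~16 bits)

// Quantise the inputs and find their range.
int[] bInt = new int[m];
int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
for (int i = 0; i < m; i++) {
    bInt[i] = (int) Math.round(b[i] / QUANTUM);
    min = Math.min(min, bInt[i]);
    max = Math.max(max, bInt[i]);
}

// expTable[i] approximates exp(f * (i*QUANTUM)^2) in fixed point.
int[] expTable = new int[max - min + 1];
for (int i = 0; i < expTable.length; i++) {
    double x = i * QUANTUM;
    expTable[i] = (int) Math.round(Math.exp(f * x * x) * SCALE);
}

for (int j = 0; j < m; j++) {
    long v = 0; // long: m * SCALE would overflow an int at m = 150,000

    for (int k = 0; k < m; k++) {
        v += expTable[Math.abs(bInt[k] - bInt[j])];
    }

    Array[j] += (double) v / (SCALE * c);
}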

/JN
 
Patricia Shanahan

Chris Uppal wrote:
....
With the conditions as before, m=10000, the run time comes down to about 12.5
seconds for this new code. Scaling up to m=150000 should now run in about 50
minutes (on my machine) assuming that cache effects do not seriously affect the
scaling (I have no evidence either way, and am not about to try a 50 minute run
just to find out ;-)
....

Could you send me your test/timing harness, so that I can try it out?
I'll have a long enough block of time when I won't need my computer for
anything else.

The total data for m=150,000 is 2*8*150,000 = 2,400,000 bytes,
which is probably a bit big for effective caching, but without a test it
is hard to tell whether cache effects will matter or not.

Patricia
 
Thomas Hawtin

Chris said:
Is x86 floating point so /very/ slow ? I hadn't realised that (and don't have
any non-x86 machines to try :-(

I've never done much fp stuff. However, my understanding is that some
functions (notably trig, which is very similar to exp) can with some
values be insufficiently accurate for the Java specs as laid down by
Sun. Code necessary to correct for the error is not fast.

If the specs for Math.toRadians/toDegrees (which are really simple) were
as strict as those for trig, Sun's code would just fail. The function
implementations will also overflow needlessly.

Tom Hawtin
 
Mark Thornton

Thomas said:
I've never done much fp stuff. However, my understanding is that some
functions (notably trig, which is very similar to exp) can with some
values be insufficiently accurate for the Java specs as laid down by
Sun. Code necessary to correct for the error is not fast.

The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.
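
The effect of reduction accuracy is easy to see even from pure Java, by
reducing a large argument with the 53-bit double constant Math.PI instead
of letting Math.sin() do the reduction properly (a quick illustration;
the value 1e10 is just an example):

double x = 1.0e10;
double naive = Math.sin(x % (2.0 * Math.PI)); // reduce with 53-bit PI
double exact = Math.sin(x);                   // correctly reduced by the library
System.out.println(naive + " vs " + exact);   // the two visibly disagree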

Mark Thornton
 
Chris Uppal

Mark said:
The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.

Ah! Very interesting.

BTW, do you have a reference for that? (I don't doubt you, I just want to find
more background info if any's available.)

-- chris
 
Thomas Hawtin

Mark said:
The problem with trig performance is the accuracy of reducing arguments
to +-PI/4. The Intel hardware uses a 66 bit approximation of PI which
isn't adequate to meet the Java spec. This problem is specific to trig
methods and does not apply to exp.

As I say, I'm not an expert in this (approximate) field. I have just
done some very basic microbenchmarks to investigate what happens.

#include <math.h>
#include <stdio.h>

int main() {
    double f = 0;
    for (int ct = 1; ct <= 10*1000*1000; ++ct) {
        f += sin(1/(double)ct);
    }
    printf("%e\n", f);
    return 0;
}

import static java.lang.Math.*;

class Sin {
    public static void main(String[] args) {
        for (int ct = 0; ct < 3; ++ct) {
            long start = System.nanoTime();
            double f = run();
            long end = System.nanoTime();
            System.out.println(end - start);
            System.out.println(f);
        }
    }

    private static double run() {
        double f = 0;
        for (int ct = 1; ct <= 10*1000*1000; ++ct) {
            f += sin(1/(double)ct);
        }
        return f;
    }
}

Strangely enough, g++ (GCC) 4.0.0 20050519 (Red Hat 4.0.0-8), Sun
1.6.0-ea-b51 and Sun 1.5.0_05-b05 gave me quite similar results. (Using
no options with java and -O2 with g++ (-march=pentium2 slowed it down).)
Still, they all take over 100 cycles/iteration.

For large values of x the dominant terms of the power series expansion
of e^x move down the sequence. Presumably the small terms should be
added together first. Is there an analogue of the PI/4 normalisation in
however exp is really calculated? Or does extended precision take care
of getting to within 1 ulp, even for negative values?

And is it still true that Java is slow for trig, on x86, for small
arguments?

Tom Hawtin
 
Chris Uppal

I said:
The output[] array is in fact palindromic, but the only algorithm I've
been able to think of to take advantage of the fact is the following very
messy code:

Patricia has pointed out that this is only true as a result of the over-regular
input data I was using, and does not hold in general (I /knew/ I should have
tried random input...).

So the "very messy code" should be ignored.

Many thanks for the correction, Patricia.

-- chris
 
Roedy Green

Thomas said:
For large values of x the dominant terms of the power series expansion
of e^x move down the sequence. Presumably the small terms should be
added together first.

There is no FEXP hardware instruction. There are FYL2X and FYL2XP1,
which compute y * log2(x) and y * log2(x+1), and F2XM1, which computes
2^x - 1.


Here is how I implemented e^x in orgasm, a postfix assembler that is
part of BBL Forth. First I implemented 2^x (^ means "to the power of").
Note there are no polynomial approximations or loops.

CODE $2^X ( F -- H )
\ calc 2**X where X is TOS
\ works if depth 1..6
\ save present NPX control word
CS: CTRL-WORD #) FSTCW
\ save NPX control word for modification
CS: TEMP/CTRL-WORD #) FSTCW

\ change rounding mode to round down
F3FF # CS: TEMP/CTRL-WORD #) AND
0400 # CS: TEMP/CTRL-WORD #) OR
CS: TEMP/CTRL-WORD #) FLDCW

\ st(0) = st(1)
ST(0) FLD
FRNDINT
\ restore the original control word
CS: CTRL-WORD #) FLDCW

\ get fraction part of X
ST(0) ST(1) FSUB
ST(1) FXCH

\ divide fraction by 1/2
CS: DWORD $HALF #) FLD
ST(1) FXCH
FPREM
\ was fraction > 1/2?
CS: FSTATUS-WORD #) FSTSW
ST(1) FSTP
F2XM1
FLD1
ST(0) ST(1) FADDP
02 # CS: BYTE FSTATUS-WORD 1+ #) TEST

0<>, IF,
\ fraction was greater than 1/2
FLD1
ST(0) ST(0) FADD
FSQRT
ST(0) ST(1) FMULP
THEN,
\ fraction was less than or equal to 1/2
\ guaranteed integer for 2**X
FSCALE
ST(1) FSTP
RET ;C

Now that I have that, I can do e^x:

CODE FE^X ( X -- e^X ) \ e.g. e**X
\ works if depth 1..6
FLDL2E
ST(0) ST(1) FMULP
' $2^X CALL
NEXT
;C
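
For anyone more comfortable in Java than in x87 assembly, the identity the
code above exploits, e^x = 2^(x * log2(e)) with the exponent split into
integer and fractional parts, can be sketched like this (Math.scalb stands
in for FSCALE, Math.expm1 for F2XM1, and the sqrt(2) branch mirrors the
FPREM step; this mirrors the structure, not the precision):

static double expViaPow2(double x) {
    double y = x * 1.4426950408889634;    // x * log2(e), as FLDL2E/FMULP
    double n = Math.floor(y);             // integer part, as FRNDINT rounding down
    double f = y - n;                     // fractional part, in [0, 1)
    double r = f % 0.5;                   // FPREM: reduce the fraction mod 1/2
    double p = 1.0 + Math.expm1(r * Math.log(2.0)); // 2^r, as F2XM1 + 1
    if (f >= 0.5) {
        p *= Math.sqrt(2.0);              // the "fraction was greater than 1/2" branch
    }
    return Math.scalb(p, (int) n);        // FSCALE: multiply by 2^n
}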
 
ChrisWSU

The problem, I would say, has more to do with the fact that you're using
an algorithm with a quadratic runtime. You can optimize it as much as you
want, but the growth rate is what it is. I'd spend more time looking for
a different solution to the problem, as opposed to trying to make it do a
division in 20 nanoseconds instead of 30. If there's no other possible
way to solve it and a small % difference is important, then worry about
optimizing. Although, to be completely fair, there are instances where a
slight speed increase is needed, and it can be real fun to try ;)

remember:
"Premature optimization is the root of all evil"
O( n^k / 5 ) == O(n^k) aka slow is slow
 
Roedy Green

"Premature optimization is the root of all evil"

In this case it was determined that optimisation was needed. It was
not premature. Optimisation is not evil, just premature optimisation.


In this case we managed to shave well over 50% of the time off the
loop and provided hints to go well beyond that. We did not change the
order of the problem, but the size of his problem is fixed, so it does
not matter. What matters to him is not having to sit around for 2
hours while a little matrix initialises.
 
star

Hi guys,
I really want to thank you for all your suggestions. The application now
runs about 50% faster. I tried almost everything that you suggested; I
even called a native method in C, using code that calculates exp about
twice as fast as C's exp does (just to see how far off I was), and I got
almost the same result. So I assume there is not much more that can be
done.

Thanks to all
 
megagurka

star said:
So I assume there is not much more that can be done.

As I and others have pointed out, if you're willing to sacrifice some
precision, you can speed up the algorithm substantially.

/JN
 
