taking advantage of SSE


pfalstad

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?
 

Roland

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?

Unless the JRE has been compiled to take advantage of the SSE
instructions, I doubt that your applet will benefit from them. AFAIK, the
JREs that Sun offers are compiled for the lowest common denominator of
instructions available on Intel/AMD processors, and therefore don't
take advantage of SSE, MMX or other extensions.
Of course, I could be wrong, and it may well be possible that the JIT
compiler, or rather the bytecode-to-native-code translator, does generate
SSE instructions (after detecting that the processor supports them).

Have you done some profiling on your code to see if the nested loop
actually is a serious bottleneck?
--
Regards,

Roland de Ruiter
 

Roland

here's where I saw that Java2 supports SSE/SSE2:

http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#7




no profiling, but I'm positive it's the bottleneck.. :)
Well, I stand corrected. [A 1.5-1.6x performance increase between Java
1.4.1 and 1.4.2 in that benchmark is nice, but not that dramatic.]
Then I guess your loop might benefit from the SSE instructions (if
your applet runs on an SSE-supporting JRE and processor).
--
Regards,

Roland de Ruiter
 

Patricia Shanahan

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?


I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.

Optimizers usually do better on loops that don't contain if
statements. Is there any way you can avoid the exceptional
issue? Pretend the rare/messy cases are normal and fix up
later? It might be worth measuring without the exceptional
test to see if you get any gain.
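A sketch of that two-pass idea, with illustrative names and a simplified
stand-in update (the real relaxation and rare-case handling come from your
own code, not from this sketch):

```java
// Sketch: hoist the rare-case branch out of the hot loop by running the
// common case unconditionally, then patching the exceptional cells in a
// second, branchy pass. The zeroing below is a stand-in for real handling.
class BranchFreePass {
    static void relax(float[][] f, boolean[][] exceptional, float damp) {
        int max = f.length;
        // Pass 1: branch-free interior update, common case only.
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                f[i][j] = (f[i-1][j] + f[i+1][j]
                         + f[i][j-1] + f[i][j+1]) * 0.25f * damp;
        // Pass 2: revisit only the rare exceptional cells.
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                if (exceptional[i][j])
                    f[i][j] = 0f; // stand-in for the real rare-case fix-up
    }
}
```

If the exceptional cells are truly rare, the second pass touches almost
nothing and the first pass stays branch-free for the optimizer.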

If max is large, you might gain by tiling the loops, working
on rectangular sections small enough to fit in cache.
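What tiling could look like here, as a sketch only: the `BLOCK` size is a
tuning guess and the loop body is a stand-in for the real relaxation.

```java
// Sketch of loop tiling (cache blocking): walk the grid in BLOCK x BLOCK
// tiles so the rows of a tile stay resident in cache while reused.
// BLOCK is a tuning parameter; 32 is just a common starting point.
class TiledLoop {
    static final int BLOCK = 32;
    static float sum(float[][] f) {          // stand-in for the real update
        int max = f.length;
        float total = 0f;
        for (int jj = 0; jj < max; jj += BLOCK)
            for (int ii = 0; ii < max; ii += BLOCK)
                for (int j = jj; j < Math.min(jj + BLOCK, max); j++)
                    for (int i = ii; i < Math.min(ii + BLOCK, max); i++)
                        total += f[i][j];    // real code would relax f[i][j] here
        return total;
    }
}
```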

Patricia
 

pfalstad

I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.

That's not necessary, though; in fact, I really should be looking at
the values from the previous iteration. If it would make the code
faster, I could easily fix that.
Optimizers usually do better on loops that don't contain if
statements. Is there any way you can avoid the exceptional
issue? Pretend the rare/messy cases are normal and fix up
later?

sure..
 

Patricia Shanahan

I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.


That's not necessary, though; in fact, I really should be looking at
the values from the previous iteration. If it would make the code
faster, I could easily fix that.

Unfortunately, there does not seem to be any way to find out
exactly what is really being executed, so you are probably
stuck with some trial and error. If breaking the dependency
enables SSE it should help, but if the JIT does not use SSE
anyway, you may find it makes things worse by increasing
cache misses.
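Reading only the previous iteration's values, as suggested above, breaks the
dependency Jacobi-style. A sketch, with illustrative names and the extra
funci/damp terms omitted:

```java
// Sketch: break the loop-carried dependency by reading only from the
// previous iteration's array and writing into a fresh one; the caller
// swaps buffers between steps. Every read touches prev, never next, so
// the inner-loop iterations are independent (and vectorizable in theory).
class DoubleBuffer {
    static float[][] step(float[][] prev) {
        int max = prev.length;
        float[][] next = new float[max][max];
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                next[i][j] = (prev[i-1][j] + prev[i+1][j]
                            + prev[i][j-1] + prev[i][j+1]) * 0.25f;
        return next; // caller swaps: prev = step(prev)
    }
}
```

The cost is a second array and an extra full sweep of memory traffic, which
is exactly where the cache-miss worry above comes in.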

Patricia
 

pfalstad

If breaking the dependency
enables SSE it should help, but if the JIT does not use SSE
anyway, you may find it makes things worse by increasing
cache misses.

well guess what, I broke the dependency, and it made things worse,
probably by increasing cache misses. My loop may be too complicated for
the compiler to figure out how to use SSE anyway; I could do some more
trial and error at some point.

Even better, though, I discovered that replacing the two-dimensional
array by a one-dimensional array resulted in a ridiculous increase in
speed; the loop runs in about 1/5 the time. Wow!
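The flattening described above, sketched with illustrative names: one
contiguous `float[max*max]` indexed as `i*max + j` replaces `float[max][max]`
(in Java, an array of separately allocated row objects), removing a pointer
dereference per access and improving locality; the neighbor offsets become
+-1 and +-max.

```java
// Sketch: in-place relaxation over a flattened grid. A cell (i, j) lives
// at f[i*max + j]; its i-neighbors are +-max away, its j-neighbors +-1.
class FlatGrid {
    static void relax(float[] f, int max) {
        for (int i = 1; i < max - 1; i++) {
            int row = i * max;
            for (int j = 1; j < max - 1; j++) {
                int idx = row + j;
                f[idx] = (f[idx - max] + f[idx + max]      // i-1, i+1
                        + f[idx - 1]   + f[idx + 1]) * 0.25f; // j-1, j+1
            }
        }
    }
}
```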
 
