taking advantage of SSE


pfalstad

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?
 

Roland

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?

Unless the JRE has been compiled to take advantage of the SSE
instructions, I doubt that your applet will benefit from them. AFAIK, the
JREs that Sun offers are compiled for the lowest common denominator of
instructions available on Intel/AMD processors, and therefore don't
take advantage of SSE, MMX or other extensions.
Of course, I could be wrong, and it may well be possible that the JIT
compiler, or rather the bytecode-to-native-code translator, does generate
SSE instructions (after detecting that the processor supports them).

Have you done some profiling on your code to see if the nested loop
actually is a serious bottleneck?
--
Regards,

Roland de Ruiter
 

Roland

here's where I saw that Java2 supports SSE/SSE2:

http://java.sun.com/j2se/1.4.2/1.4.2_whitepaper.html#7




no profiling, but I'm positive it's the bottleneck.. :)
Well, I stand corrected. [A 1.5-1.6x performance increase between Java
1.4.1 and 1.4.2 in that benchmark is nice, but not that dramatic.]
Then I guess your loop might benefit from the SSE instructions (if
your applet runs on an SSE-supporting JRE and processor).
--
Regards,

Roland de Ruiter
 

Patricia Shanahan

Hi, is there any documentation out there which will tell me how to best
take advantage of SSE/SSE2? Is there any way to figure out if my java
applet is using SSE instructions or not?

I have a loop that looks like this:

for (j = 0; j < max; j++) {
    for (i = 0; i < max; i++) {
        float previ = func[i-1][j];
        float nexti = func[i+1][j];
        float prevj = func[i][j-1];
        float nextj = func[i][j+1];
        float basis = (nexti + previ + nextj + prevj) * .25f;
        if (exceptional[i][j]) { ... handle rare/messy cases ... }
        float a = (func[i][j] - basis) * damp[i][j];
        float b = funci[i][j] * damp[i][j];
        func[i][j]  = basis + a*const1 - b*const2;
        funci[i][j] = b*const1 + a*const2;
    }
}

This loop seems like it could take advantage of SSE, but I doubt the
java compiler is smart enough to figure out how to do it without my
help. I also have no way of knowing what the JIT is doing internally.
Does anyone have any ideas on how I can best optimize this loop (aside
from trial-and-error)?


I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.

Optimizers usually do better on loops that don't contain if
statements. Is there any way you can avoid the exceptional
issue? Pretend the rare/messy cases are normal and fix up
later? It might be worth measuring without the exceptional
test to see if you get any gain.
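A sketch of that two-pass idea, with illustrative names and a simplified
stand-in update (the real relaxation and rare-case handling come from your
own code, not from this sketch):

```java
// Sketch: hoist the rare-case branch out of the hot loop by running the
// common case unconditionally, then patching the exceptional cells in a
// second, branchy pass. The zeroing below is a stand-in for real handling.
class BranchFreePass {
    static void relax(float[][] f, boolean[][] exceptional, float damp) {
        int max = f.length;
        // Pass 1: branch-free interior update, common case only.
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                f[i][j] = (f[i-1][j] + f[i+1][j]
                         + f[i][j-1] + f[i][j+1]) * 0.25f * damp;
        // Pass 2: revisit only the rare exceptional cells.
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                if (exceptional[i][j])
                    f[i][j] = 0f; // stand-in for the real rare-case fix-up
    }
}
```

If the exceptional cells are truly rare, the second pass touches almost
nothing and the first pass stays branch-free for the optimizer.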

If max is large, you might gain by tiling the loops, working
on rectangular sections small enough to fit in cache.
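What tiling could look like here, as a sketch only: the `BLOCK` size is a
tuning guess and the loop body is a stand-in for the real relaxation.

```java
// Sketch of loop tiling (cache blocking): walk the grid in BLOCK x BLOCK
// tiles so the rows of a tile stay resident in cache while reused.
// BLOCK is a tuning parameter; 32 is just a common starting point.
class TiledLoop {
    static final int BLOCK = 32;
    static float sum(float[][] f) {          // stand-in for the real update
        int max = f.length;
        float total = 0f;
        for (int jj = 0; jj < max; jj += BLOCK)
            for (int ii = 0; ii < max; ii += BLOCK)
                for (int j = jj; j < Math.min(jj + BLOCK, max); j++)
                    for (int i = ii; i < Math.min(ii + BLOCK, max); i++)
                        total += f[i][j];    // real code would relax f[i][j] here
        return total;
    }
}
```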

Patricia
 

pfalstad

I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.

That's not necessary, though; in fact, I really should be looking at
the values from the previous iteration. If it would make the code
faster, I could easily fix that.
Optimizers usually do better on loops that don't contain if
statements. Is there any way you can avoid the exceptional
issue? Pretend the rare/messy cases are normal and fix up
later?

sure..
 

Patricia Shanahan

I can't answer your specific question. However, this type of
code is generally difficult to vectorize or parallelize
because you have dependencies between consecutive iterations
of the inner loop. The value of func[n-1][m] is generated
near the end of the i=n-1,j=m iteration, and used in the
first calculation in the i=n,j=m iteration.


That's not necessary, though; in fact, I really should be looking at
the values from the previous iteration. If it would make the code
faster, I could easily fix that.

Unfortunately, there does not seem to be any way to find out
exactly what is really being executed, so you are probably
stuck with some trial and error. If breaking the dependency
enables SSE it should help, but if the JIT does not use SSE
anyway, you may find it makes things worse by increasing
cache misses.
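Reading only the previous iteration's values, as suggested above, breaks the
dependency Jacobi-style. A sketch, with illustrative names and the extra
funci/damp terms omitted:

```java
// Sketch: break the loop-carried dependency by reading only from the
// previous iteration's array and writing into a fresh one; the caller
// swaps buffers between steps. Every read touches prev, never next, so
// the inner-loop iterations are independent (and vectorizable in theory).
class DoubleBuffer {
    static float[][] step(float[][] prev) {
        int max = prev.length;
        float[][] next = new float[max][max];
        for (int i = 1; i < max - 1; i++)
            for (int j = 1; j < max - 1; j++)
                next[i][j] = (prev[i-1][j] + prev[i+1][j]
                            + prev[i][j-1] + prev[i][j+1]) * 0.25f;
        return next; // caller swaps: prev = step(prev)
    }
}
```

The cost is a second array and an extra full sweep of memory traffic, which
is exactly where the cache-miss worry above comes in.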

Patricia
 

pfalstad

If breaking the dependency
enables SSE it should help, but if the JIT does not use SSE
anyway, you may find it makes things worse by increasing
cache misses.

well guess what, I broke the dependency, and it made things worse,
probably by increasing cache misses. My loop may be too complicated for
the compiler to figure out how to use SSE anyway; I could do some more
trial and error at some point.

Even better, though, I discovered that replacing the two-dimensional
array by a one-dimensional array resulted in a ridiculous increase in
speed; the loop runs in about 1/5 the time. Wow!
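The flattening described above, sketched with illustrative names: one
contiguous `float[max*max]` indexed as `i*max + j` replaces `float[max][max]`
(in Java, an array of separately allocated row objects), removing a pointer
dereference per access and improving locality; the neighbor offsets become
+-1 and +-max.

```java
// Sketch: in-place relaxation over a flattened grid. A cell (i, j) lives
// at f[i*max + j]; its i-neighbors are +-max away, its j-neighbors +-1.
class FlatGrid {
    static void relax(float[] f, int max) {
        for (int i = 1; i < max - 1; i++) {
            int row = i * max;
            for (int j = 1; j < max - 1; j++) {
                int idx = row + j;
                f[idx] = (f[idx - max] + f[idx + max]      // i-1, i+1
                        + f[idx - 1]   + f[idx + 1]) * 0.25f; // j-1, j+1
            }
        }
    }
}
```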
 
