Multithreading / Scalability

Philipp Kayser

Hi,

I wrote a small program to test scalability in a multiprocessor
environment (in my case an Athlon 64 X2). I included the source below.

To my surprise the calculation does not run faster with 2 threads
(which should be the case if I have 2 processors) but runs 5 times
slower (e.g. 15s instead of 3s)! Everything else seems to be okay: if
I use 1 thread, I see a total CPU usage of 50%; with two threads I get
100%.
The best part: if I limit the JVM to one processor by setting the
CPU affinity for the process, but still use 2 threads, the calculation
runs only 2 times slower (6s instead of 3s for one thread).

My current diagnosis is that it may have something to do with
the CPU cache. I searched the Internet for similar problems and found
the terms "CPU cache thrashing" and "ping-pong effect": the processors
always switch between the two threads and by doing so their CPU cache
gets flushed every time.

Does anyone have an idea?

Best regards,
Philipp Kayser.

import java.util.Date;

public class Test
{
    private final int number_of_threads = 2;
    private static int number_of_finished_threads;
    private Thread threads[] = new Thread[number_of_threads];
    double result[] = new double[number_of_threads];
    private Thread main_thread;

    private class CalculationThread extends Thread
    {
        int thread_number;

        CalculationThread(int n)
        {
            super();
            thread_number = n;
        }

        public void run()
        {
            try
            {
                synchronized (this)
                {
                    while (true)
                    {
                        wait();

                        int n = 0;

                        for (n = 0; n < 600000000; n++)
                            if (n % number_of_threads == thread_number)
                                result[thread_number] += Math.sqrt(n);

                        synchronized (main_thread)
                        {
                            number_of_finished_threads++;
                            main_thread.notify();
                        }
                    }
                }
            }
            catch (InterruptedException e)
            {
            }
        }
    }

    private void multithreaded_calculation()
    {
        synchronized (main_thread)
        {
            number_of_finished_threads = 0;

            int i;
            for (i = 0; i < number_of_threads; i++)
            {
                synchronized (threads[i])
                {
                    threads[i].notify();
                }
            }

            do
            {
                try
                {
                    main_thread.wait();
                }
                catch (InterruptedException e)
                {
                }
            }
            while (number_of_finished_threads < number_of_threads);

            double total_result = 0;
            for (i = 0; i < number_of_threads; i++)
                total_result += result[i];

            System.out.println(total_result);
        }
    }

    private void test()
    {
        main_thread = Thread.currentThread();

        int i;
        for (i = 0; i < number_of_threads; i++)
        {
            threads[i] = new CalculationThread(i);
            threads[i].setPriority(Thread.NORM_PRIORITY);
            threads[i].setDaemon(true);
            threads[i].start();
        }

        try
        {
            Thread.sleep(1000);
        }
        catch (InterruptedException e)
        {
        }

        long t0 = new Date().getTime();

        multithreaded_calculation();

        long t1 = new Date().getTime();

        System.out.println(((double)t1 - t0)/1000);
    }

    public static void main(String[] args) {
        new Test().test();
    }
}
 
Thomas Hawtin

Philipp said:
I wrote a small program to test scalability in a multiprocessor
environment (in my case an Athlon 64 X2). I included the source below.

To my surprise the calculation does not run faster with 2 threads
(which should be the case if I have 2 processors) but runs 5 times
slower (e.g. 15s instead of 3s)! Everything else seems to be okay: if
I use 1 thread, I see a total CPU usage of 50%; with two threads I get
100%.
The best part: if I limit the JVM to one processor by setting the
CPU affinity for the process, but still use 2 threads, the calculation
runs only 2 times slower (6s instead of 3s for one thread).
My current diagnosis is that it may have something to do with
the CPU cache. I searched the Internet for similar problems and found
the terms "CPU cache thrashing" and "ping-pong effect": the processors
always switch between the two threads and by doing so their CPU cache
gets flushed every time.

Highly unlikely. Your program actively uses very little memory, so it
isn't going to be a problem with cache capacity.
private class CalculationThread extends Thread

It is rarely a good idea to override Thread. There is no need for
inheritance.
public void run()
{
try
{
synchronized (this)
{

Using a complex object like Thread as a lock can be a bad idea. Perhaps
the class itself uses it as a lock. In the case of Thread, Sun JRE does.
while (true)
{
wait();

When they say always put wait in a while loop, don't do it like this.
You are not guaranteed that wait() won't wake up early (a "spurious
wakeup"), and a notify() sent before the thread reaches wait() is
simply lost. Check a condition in an enclosing while loop.
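
Something along these lines, say (just a sketch -- the WorkGate class
and its fields are invented, not from your code):

class WorkGate
{
    private final Object lock = new Object();
    private boolean workAvailable;

    // Guarded wait: re-check the condition after every wakeup,
    // because wait() can return spuriously, and a notify() sent
    // before the waiter arrives would otherwise be lost.
    void awaitWork() throws InterruptedException
    {
        synchronized (lock)
        {
            while (!workAvailable)
                lock.wait();
            workAvailable = false; // consume the signal
        }
    }

    void signalWork()
    {
        synchronized (lock)
        {
            workAvailable = true;
            lock.notifyAll();
        }
    }
}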
int n = 0;

for (n = 0; n < 600000000; n++)
    if (n % number_of_threads == thread_number)
        result[thread_number] += Math.sqrt(n);

Although you would expect the square root to dominate, this loop is
simpler if number_of_threads is one. Exactly how it will get optimised,
I have no idea.

Some multicore chips share FPUs. Niagara does, but I don't think Athlons
do. Such code is obviously not going to work so well on shared-FPU
processors.

threads[i].setDaemon(true);


Making the threads daemon only masks your bug above, where threads do
not exit.
threads[i].start();
}

try
{
Thread.sleep(1000);
}
catch (InterruptedException e)
{
}


Looks odd. I guess you don't want to include the thread start-up time.
The other issue is that you have a potential deadlock: if a notify()
fires before the corresponding thread has reached its wait(), the
notification is lost and both sides block forever.
long t0 = new Date().getTime();

System.currentTimeMillis() is more conventional. Or System.nanoTime()
from 1.5.

Tom Hawtin
 
Philipp Kayser

Hi Thomas,

thanks for your response.
It is rarely a good idea to override Thread. There is no need for
inheritance.
...
Using a complex object like Thread as a lock can be a bad idea. Perhaps
the class itself uses it as a lock. In the case of Thread, Sun JRE does.
I changed class CalculationThread to implement Runnable and also
changed main_thread to be of type Object, so I removed all locks on
Thread objects. Same result as before.
for (n = 0; n < 600000000; n++)
    if (n % number_of_threads == thread_number)
        result[thread_number] += Math.sqrt(n);
Although you would expect the square root to dominate, this loop is
simpler if number_of_threads is one. Exactly how it will get optimised,
I have no idea.
I also considered this. I changed the loop to:

for (n = 0; n < 30000; n++)
    if (n % number_of_threads == thread_number)
    {
        double a = 2;
        for (int p = 0; p < n; p++)
        {
            a += Math.sqrt(a);
            result[thread_number] += a;
        }
    }

I don't think the inner loop can be optimized away. The results changed a bit:

2 CPUs / 1 Threads : 8.75s
2 CPUs / 2 Threads : 9.35s

Although the result is better than before, 2 CPUs are again slower than
1 CPU. I would expect 2 CPUs to be nearly 2x faster than 1 CPU for such
a simple algorithm.


Best regards,
Philipp Kayser.
 
Daniel Dyer

I don't think the inner loop can be optimized away. The results changed a bit:

2 CPUs / 1 Threads : 8.75s
2 CPUs / 2 Threads : 9.35s

Although the result is better than before, 2 CPUs are again slower than
1 CPU. I would expect 2 CPUs to be nearly 2x faster than 1 CPU for such
a simple algorithm.


Best regards,
Philipp Kayser.

Are you running your app with the -server option? Is there any
significant difference between performance with the server and client VMs?

Dan.
 
Philipp Kayser

Hi again,

I think I found the problem. I changed the loop again to:

for (n = 0; n < 10000; n++)
    if (n % number_of_threads == thread_number)
    {
        double a = 2;
        for (int p = 0; p < n; p++)
        {
            a += Math.sin(a);
        }
        result[thread_number] += a;
    }

Now I get satisfying results:

2 CPUs / 1 Thread : 10.7s
2 CPUs / 2 Threads : 5.469s

I think the problem is the read/write access to the result array. If the
result array is changed on one CPU, the corresponding cache line on the
second CPU is invalidated. By pulling the result assignment out of the
inner loop, the number of cache invalidations is greatly reduced.

Best regards,
Philipp Kayser.
 
blmblm

Highly unlikely. Your program actively uses very little memory, so it
isn't going to be a problem with cache capacity.

But there might be a problem with something called "false sharing",
which happens when different threads access variables that are close
together in memory -- as happens here with the "result" array.

I'm not sure I want to try to explain it (don't have time and might get
it wrong anyway), but a Google search for "cache" and "false sharing"
brings up some useful links.
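
The usual workaround, for what it's worth, is to pad the shared array
so each thread's slot sits on its own cache line. A rough sketch (the
class is made up, not Philipp's code, and PADDING assumes 64-byte cache
lines, i.e. 8 doubles per line -- check your CPU):

public class PaddedTest
{
    // assumes 64-byte cache lines: 8 doubles per line
    static final int PADDING = 8;

    public static void main(String[] args) throws InterruptedException
    {
        final int number_of_threads = 2;
        // one cache line per thread instead of adjacent slots
        final double[] result = new double[number_of_threads * PADDING];

        Thread[] threads = new Thread[number_of_threads];
        for (int i = 0; i < number_of_threads; i++)
        {
            final int tn = i;
            threads[i] = new Thread(new Runnable()
            {
                public void run()
                {
                    for (int n = tn; n < 600000000; n += number_of_threads)
                        result[tn * PADDING] += Math.sqrt(n);
                }
            });
            threads[i].start();
        }
        for (int i = 0; i < number_of_threads; i++)
            threads[i].join();

        double total = 0;
        for (int i = 0; i < number_of_threads; i++)
            total += result[i * PADDING];
        System.out.println(total);
    }
}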

[ snip ]

You could simplify this to

for (n = number_of_threads; n < 600000000; n+=number_of_threads)

which I'd think would help a little.

[ snip ]
 
blmblm

Hi again,

I think I found the problem. I changed the loop again to:

for (n = 0; n < 10000; n++)
    if (n % number_of_threads == thread_number)
    {
        double a = 2;
        for (int p = 0; p < n; p++)
        {
            a += Math.sin(a);
        }
        result[thread_number] += a;
    }

Now I get satisfying results:

2 CPUs / 1 Thread : 10.7s
2 CPUs / 2 Threads : 5.469s

I think the problem is the read/write access to the result array. If the
result array is changed on one CPU, the corresponding cache line on the
second CPU is invalidated. By pulling the result assignment out of the
inner loop, the number of cache invalidations is greatly reduced.

"False sharing"! wish I'd noticed this post before composing my reply
of a few minutes ago.

But the comment (in the other post) about improving the loop might
still be worthwhile.
 
Philipp Kayser

Hi,
Are you running your app with the -server option? Is there any
significant difference between performance with the server and client VMs?
I think there is no server JVM for Windows. The "-server" option gives
me an error.

Best regards,
Philipp.
 
Daniel Dyer

Hi,

I think there is no server JVM for Windows. The "-server" option gives
me an error.

For some reason the server VM is only included with the JDK, not the JRE.
So if you use the JRE's java.exe, it will give you an error with "-server".
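
For example (the install path here is made up -- adjust to yours):

C:\jdk1.5.0\bin\java -server Test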

Dan.
 
Roedy Green

Some multicore chips share FPUs. Niagara does, but I don't think Athlons
do. Such code is obviously not going to work so well on shared-FPU
processors.

Is your Athlon hyperthreaded, a true dual-core CPU, or something in
between?

For CPU-bound work, hyperthreading won't help, since it gives you no
true extra computing horsepower.

We are going to need some new sort of number to measure CPU speed now
that raw clock speed or even aggregate MIPS has become meaningless.
 
Roedy Green

I think there is no server JVM for Windows. The "-server" option gives
me an error.

You have to use the java.exe in the JDK not the JRE. It understands
-server. I have been using it for the great signum bakeoff.
 
blmblm

Hi again,

I think I found the problem. I changed the loop again to:

for (n = 0; n < 10000; n++)
    if (n % number_of_threads == thread_number)
    {
        double a = 2;
        for (int p = 0; p < n; p++)
        {
            a += Math.sin(a);
        }
        result[thread_number] += a;
    }

Now I get satisfying results:

2 CPUs / 1 Thread : 10.7s
2 CPUs / 2 Threads : 5.469s

I think the problem is the read/write access to the result array. If the
result array is changed on one CPU, the corresponding cache line on the
second CPU is invalidated. By pulling the result assignment out of the
inner loop, the number of cache invalidations is greatly reduced.

"False sharing"! wish I'd noticed this post before composing my reply
of a few minutes ago.

Following up a little, after doing some experiments on a four-processor
machine at work ....

My idea for getting rid of the "false sharing" cache problem was
to do the original calculation summing into a local variable,
and then add into result[thread_number] at the end. That should
produce similar results to what you're doing above ....

And it does help -- performance changes from "the more threads, the
slower the program" to reasonable speedups with 2 and 4 threads,
compared to 1.

But ....
But the comment (in the other post) about improving the loop might
still be worthwhile.

By accident I discovered that making this change to the original
calculation, *instead of* making the switch to summing into a local
variable *actually produces a faster program*, with good speedups
for 2 and 4 threads.

I don't understand this at all. Maybe I've made some stupid
blunder. Otherwise -- hm, I don't know!

Below is my modified version of your code. I did make some other
changes -- simplified it to have the main thread use "join" to wait
for the calculation threads (did you know you could do that?) and
moved all thread activity into the timed part of the code, so I
don't have to do the slightly complicated stuff to wait until all
threads are started before starting the timed part ...

public class Test
{
    private static int number_of_threads;
    private Thread threads[] = new Thread[number_of_threads];
    double result[] = new double[number_of_threads];

    private class CalculationThread implements Runnable
    {
        int thread_number;

        CalculationThread(int n)
        {
            thread_number = n;
        }

        public void run()
        {
            /*
            double local_result = 0.0;
            for (int n = 0; n < 600000000; n++)
                if (n % number_of_threads == thread_number)
                    local_result += Math.sqrt(n);
            result[thread_number] += local_result;
            */
            // this actually is faster! ????
            for (int n = thread_number; n < 600000000; n += number_of_threads)
                result[thread_number] += Math.sqrt(n);
        }
    }

    private void multithreaded_calculation()
    {
        for (int i = 0; i < number_of_threads; i++)
        {
            threads[i] = new Thread(new CalculationThread(i));
            //threads[i].setPriority(Thread.NORM_PRIORITY);
            //threads[i].setDaemon(true);
            threads[i].start();
        }

        try
        {
            for (int i = 0; i < number_of_threads; i++)
                threads[i].join();
        }
        catch (InterruptedException e)
        {
        }

        double total_result = 0;
        for (int i = 0; i < number_of_threads; i++)
            total_result += result[i];

        System.out.println(total_result);
    }

    private void test()
    {
        long t0 = System.currentTimeMillis();

        multithreaded_calculation();

        long t1 = System.currentTimeMillis();

        System.out.println(((double)t1 - t0)/1000);
    }

    public static void main(String[] args) {
        number_of_threads = Integer.parseInt(args[0]);
        new Test().test();
    }
}
 
blmblm

[ snip ]
I wrote a
couple of tests of my own to try, and got more predictable results.
Although not entirely.

The first program below times the calculations as yours did, but I
think it is still not optimal for this test.

The second program calculates the number of calculation cycles per ms,
which I think is more to the point. I got very similar results to the
first test though: one thread is definitely slower than two threads,
but more than that doesn't really improve performance. On my
single-processor machine, every increase in threads reduced the number
of calculations that could be performed, although again not as
dramatically as I expected given the increase in the number of threads.

[ snip ]

Interesting second test program (test3 below). Some questions, though:

Why do you have this?

Thread.sleep(2000); // wait till all threads are created

This seems like an ugly hack to avoid writing proper code to wait
until the threads are created? Ugly hacks are not terrible in
quick-and-dirty code, but still.

Also, "calculations" is shared among threads, but you're not ensuring
one-at-a-time access with "synchronized" or some other mechanism.
Why do you think this will work? You do declare it "volatile",
but as I understand it, this only ensures atomic loads and stores,
while you also have a "++calculations". I would have said this
was not guaranteed to be atomic on all processors. No?


import java.util.concurrent.*;

public class test3 implements Runnable {
    volatile long calculations;
    volatile boolean runFlag = true;
    Object o = new Object();
    Semaphore sem;

    public test3(String[] args) {
        int numberOfThreads = Integer.parseInt(args[0]);
        Thread[] thread = new Thread[numberOfThreads];
        sem = new Semaphore(numberOfThreads);
        try {
            sem.acquire(numberOfThreads);

            for (int i=0; i<numberOfThreads; i++) {
                thread[i] = new Thread(this);
                thread[i].start();
            }
            Thread.sleep(2000); // wait till all threads are created

            long then = System.currentTimeMillis();
            sem.release(numberOfThreads);
            Thread.sleep(20000);
            runFlag = false;
            long now = System.currentTimeMillis();

            System.out.println(calculations/(now-then));
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public void run() {
        try {
            sem.acquire();
            while (runFlag) {
                double d = Math.sqrt(1234.56789);
                double t = Math.tan(d);
                ++calculations;
            }
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new test3(args);
    }
}
 
Philipp Kayser

Hi,
My idea for getting rid of the "false sharing" cache problem was
to do the original calculation summing into a local variable,
and then add into result[thread_number] at the end. That should
produce similar results to what you're doing above ....

And it does help -- performance changes from "the more threads, the
slower the program" to reasonable speedups with 2 and 4 threads,
compared to 1.
Yes, I can duplicate these results. The two thread stacks (where local
variables are held) seem to be far enough apart from each other to
avoid the "false sharing" problem.
But ....
By accident I discovered that making this change to the original
calculation, *instead of* making the switch to summing into a local
variable *actually produces a faster program*, with good speedups
for 2 and 4 threads.
I don't understand this at all. Maybe I've made some stupid
blunder. Otherwise -- hm, I don't know!
I can also duplicate this result, and I also don't have an explanation
for it. Maybe it has to do with code optimization of the now simpler
code: the JVM holds "result[thread_number]" in a machine register and
only writes the value back to memory when the calculation has finished.
Below is my modified version of your code. I did make some other
changes -- simplified to have the main thread use "join" to wait
for the calculation threads (did you know you could do that?)...
My intention was to avoid unnecessary thread creation. In this case it
is no problem because the threads are only started once, but I am
currently working on parallelising a bigger program, where this
"multithreaded_calculation" is started over and over again.
But maybe thread creation is not a big cost for the OS, so this
optimization is not really needed.
Waiting 1s for the threads to reach the wait() is very hacky, I know,
but I found no better solution (apart from the join solution).

Best regards,
Philipp.
 
blmblm

[ snip ]
But ....
By accident I discovered that making this change to the original
calculation, *instead of* making the switch to summing into a local
variable *actually produces a faster program*, with good speedups
for 2 and 4 threads.
I don't understand this at all. Maybe I've made some stupid
blunder. Otherwise -- hm, I don't know!
I can also duplicate this result. And I also don't have an explanation
for it. Maybe it has to do with code optimization because of the now
simpler code: the JVM holds "result[thread_number]" in a machine
register and later writes the value back to memory when the calculation
has finished.

That's a very plausible explanation -- it's maybe a little puzzling
that this optimization wouldn't also have been done with the original
version.

Philipp said:
My intention was to avoid unnecessary thread creation. In this case it
is no problem because the threads are only started once. But I am
currently working on parallelising a bigger program, where this
"multithreaded_calculation" is started over and over again.
But maybe thread creation is not a big cost for the OS, so this
optimization is not really needed.
Waiting 1s for the threads to reach the wait() is very hacky, I know,
but I found no better solution (apart from the join solution).

There's some nice stuff in the java.util.concurrent package, added
in Java 1.5 (5.0), for creating "thread pools", which sounds like
what you want.

Doing some quick Googling, here's a tutorial that seems reasonable,
with some examples:

http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter10/concurrencyTools.html
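
A rough sketch of how that might look for your case (the names and the
loop body here are just stand-ins, not your code): a fixed pool is
created once, each "calculation round" submits one task per thread, and
the same pool threads get reused every round instead of being created
over and over:

import java.util.*;
import java.util.concurrent.*;

public class PoolTest
{
    public static void main(String[] args) throws Exception
    {
        final int number_of_threads = 2;
        ExecutorService pool = Executors.newFixedThreadPool(number_of_threads);

        // the same pool threads are reused for every round, so the
        // per-round thread-creation cost goes away
        for (int round = 0; round < 3; round++)
        {
            List<Future<Double>> futures = new ArrayList<Future<Double>>();
            for (int i = 0; i < number_of_threads; i++)
            {
                final int tn = i;
                futures.add(pool.submit(new Callable<Double>()
                {
                    public Double call()
                    {
                        double local = 0; // local accumulator, no false sharing
                        for (int n = tn; n < 600000000; n += number_of_threads)
                            local += Math.sqrt(n);
                        return local;
                    }
                }));
            }

            double total = 0;
            for (Future<Double> f : futures)
                total += f.get(); // blocks until that task is done
            System.out.println(total);
        }

        pool.shutdown();
    }
}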
 
Nigel Wade

Philipp said:
Hi,

I wrote a small program to test scalability in a multiprocessor
environment (in my case an Athlon 64 X2). I included the source below.

To my surprise the calculation does not run faster with 2 threads
but runs 5 times slower (e.g. 15s instead of 3s)!

My current diagnosis is that it may have something to do with
the CPU cache. I searched the Internet for similar problems and found
the terms "CPU cache thrashing" and "ping-pong effect": the processors
always switch between the two threads and by doing so their CPU cache
gets flushed every time.

[ snip code ]


I would say it's almost certainly a dirty-cache problem due to SMP access to
shared memory.

Just look at what's happening in the tightest part of the loop. When one
thread writes a value into the array, the entire cache line containing that
particular element is marked dirty (don't forget that in an SMP system this
affects the cache line in the other processors too). This means that before
any other thread can read or write the same cache line, that line must be
re-read from main memory. Since the threads are writing to consecutive
elements of the array, they are almost guaranteed to all be writing to the
same cache line.

This is a standard problem in multi-threaded SMP applications. Get each
thread to keep the running value in thread-local storage (a local variable
will do), and only fill in the array element when the computation is
complete. Compare run times.
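
In the posted run() method that change might look something like this
(a sketch of the calculation only; the wait/notify handshaking stays
as posted):

double local_result = 0; // lives on this thread's stack,
                         // so no cache-line ping-pong

for (int n = 0; n < 600000000; n++)
    if (n % number_of_threads == thread_number)
        local_result += Math.sqrt(n);

result[thread_number] = local_result; // single shared write at the end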
 
Knute Johnson

Philipp said:
Hi,

I wrote a small program to test scalability in a multiprocessor
environment (in my case an Athlon 64 X2). I included the source below.

To my surprise the calculation does not run faster with 2 threads
(which should be the case if I have 2 processors) but runs 5 times
slower (e.g. 15s instead of 3s)!

[ snip code ]


Saw your post and thought it was interesting. I too duplicated your
results, but I don't have a clue as to what the problem is. I wrote a
couple of tests of my own to try, and got more predictable results.
Although not entirely.

The first program below times the calculations as yours did, but I
think it is still not optimal for this test.

The second program calculates the number of calculation cycles per ms,
which I think is more to the point. I got very similar results to the
first test though: one thread is definitely slower than two threads,
but more than that doesn't really improve performance. On my
single-processor machine, every increase in threads reduced the number
of calculations that could be performed, although again not as
dramatically as I expected given the increase in the number of threads.

I tested these on my 1.6GHz P4 and on a dual 2.8GHz Xeon. The Xeons are
dual-core, and as I added threads I could see the additional cores being
used. Another interesting thing: with one thread the processor usage
showed 25%, with two threads 50%, and so on. The P4 is running Windows
XP Home and the Xeon is running 32-bit Windows XP Pro.

I do think the comment about the floating-point processor could be
significant too.

Anyway, a very interesting post.


public class test2 implements Runnable {
    int numberOfThreads;
    int calculations = 100000000;
    Thread[] thread;

    public test2(String[] args) {
        numberOfThreads = Integer.parseInt(args[0]);
        thread = new Thread[numberOfThreads];

        long then = System.currentTimeMillis();

        for (int i=0; i<numberOfThreads; i++) {
            thread[i] = new Thread(this);
            thread[i].start();
        }

        try {
            for (int i=0; i<numberOfThreads; i++)
                thread[i].join();
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }

        System.out.println(System.currentTimeMillis() - then);
    }

    public void run() {
        double d;
        int n = calculations / numberOfThreads;
        for (int i=1; i<=n; i++) {
            d = Math.sqrt(i*1.234);
            double t = d / i;
        }
    }

    public static void main(String[] args) {
        new test2(args);
    }
}

import java.util.concurrent.*;

public class test3 implements Runnable {
    volatile long calculations;
    volatile boolean runFlag = true;
    Object o = new Object();
    Semaphore sem;

    public test3(String[] args) {
        int numberOfThreads = Integer.parseInt(args[0]);
        Thread[] thread = new Thread[numberOfThreads];
        sem = new Semaphore(numberOfThreads);
        try {
            sem.acquire(numberOfThreads);

            for (int i=0; i<numberOfThreads; i++) {
                thread[i] = new Thread(this);
                thread[i].start();
            }
            Thread.sleep(2000); // wait till all threads are created

            long then = System.currentTimeMillis();
            sem.release(numberOfThreads);
            Thread.sleep(20000);
            runFlag = false;
            long now = System.currentTimeMillis();

            System.out.println(calculations/(now-then));
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public void run() {
        try {
            sem.acquire();
            while (runFlag) {
                double d = Math.sqrt(1234.56789);
                double t = Math.tan(d);
                ++calculations;
            }
        } catch (InterruptedException ie) {
            ie.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new test3(args);
    }
}
 
Knute Johnson

Interesting second test program (test3 below). Some questions, though:

Why do you have this?

Thread.sleep(2000); // wait till all threads are created

This seems like an ugly hack to avoid writing proper code to wait
until the threads are created? Ugly hacks are not terrible in
quick-and-dirty code, but still.

Instead of the answer I want to give (which is "bite me"), I'll tell you
why that is in there. I wanted to make sure that I wasn't using up a
bunch of time in thread creation. Turns out it doesn't really make
any significant difference. And I was in a hurry, and I know it was an
ugly hack, and it was simple(ton) :).
Also, "calculations" is shared among threads, but you're not ensuring
one-at-a-time access with "synchronized" or some other mechanism.
Why do you think this will work? You do declare it "volatile",
but as I understand it, this only ensures atomic loads and stores,
while you also have a "++calculations". I would have said this
was not guaranteed to be atomic on all processors. No?

The Java Language Specification, Third Edition
17.4.4 Synchronization Order
Every execution has a synchronization order. A synchronization
order is a total order over all of the synchronization actions
of an execution. For each thread t, the synchronization order of
the synchronization actions (§17.4.2) in t is consistent with
the program order (§17.4.3) of t.

Synchronization actions induce the synchronized-with relation on
actions, defined as follows:

....
A write to a volatile variable (§8.3.1.4) v synchronizes-with all
subsequent reads of v by any thread (where subsequent is defined
according to the synchronization order).

Looks synchronized to me.
 
Thomas Hawtin

Knute said:
A write to a volatile variable (§8.3.1.4) v synchronizes-with all
subsequent reads of v by any thread (where subsequent is defined
according to the synchronization order).

Looks synchronized to me.

But increment has a read and a separate write: read the value,
increment, write the new value. Two threads can interleave between
those steps and lose an update.

Best to use java.util.concurrent.atomic.AtomicLong (from 1.5) for this
sort of thing.

http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.html#incrementAndGet()
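
A sketch of how the counter might look with it (invented class, not
Knute's code):

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounter
{
    static final AtomicLong calculations = new AtomicLong();

    public static void main(String[] args) throws InterruptedException
    {
        Runnable task = new Runnable()
        {
            public void run()
            {
                for (int i = 0; i < 1000000; i++)
                    calculations.incrementAndGet(); // atomic read-modify-write
            }
        };
        Thread a = new Thread(task);
        Thread b = new Thread(task);
        a.start(); b.start();
        a.join(); b.join();
        System.out.println(calculations.get()); // always 2000000
    }
}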

Tom Hawtin
 
Philipp Kayser

Hi,
For some reason the server VM is only included with the JDK, not the
JRE. So if you use the JRE's java.exe, it will give you an error with
"-server".
OK, this works, but I noticed no performance improvement for the
"false sharing" problem.
However, the server JVM does seem to be significantly faster (about 25%)
than the client JVM. Thanks for this hint.

Best regards,
Philipp.
 
