Multithreading / Scalability

Knute Johnson

Thomas said:
But increment has a read and a separate write. Read the value.
Increment. Write the new value.

Best to use java.util.concurrent.AtomicLong (from 1.5) for this sort of
thing.

http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.html#incrementAndGet()


Tom Hawtin

Tom:

Yes, I see what you are saying. I'll rerun those tests with it
synchronized and see what happens. If the threads were in fact being
preempted between the read and the write, then the results would show
lower apparent performance with more threads.
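
For reference, a minimal sketch of the AtomicLong approach Tom suggests;
the class name and iteration count here are illustrative, not taken from
the original program:

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounterSketch implements Runnable {
    // a single shared counter; incrementAndGet() is an atomic
    // read-modify-write, so no increments are lost between threads
    static final AtomicLong calculations = new AtomicLong();

    public void run() {
        for (int i = 0; i < 1000000; i++) {
            calculations.incrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(new AtomicCounterSketch());
        Thread b = new Thread(new AtomicCounterSketch());
        a.start();
        b.start();
        a.join();
        b.join();
        // always prints 2000000, without any explicit locking
        System.out.println(calculations.get());
    }
}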
 
Knute Johnson

I tried synchronizing the access to the calculations variable but that
made it run very poorly; the locking kept interrupting the calculations.
So I rewrote it in a different way and discovered some more interesting
things. First, this program is faster even on a single-processor machine
when more threads are used. I don't think that my XP Home box can create
100 threads, but if I watch the dual Xeon machine, Task Manager shows 100
threads.

I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

2.8 GHz dual Xeon XP Pro SP2

Threads Calculations per ms
1 2402
2 4650
3 6141
4 7187
10 7608
100 8224
200 8281

import java.util.concurrent.*;

public class test3 implements Runnable {
    public static volatile boolean runFlag = true;
    public long calculations;

    public test3() {
        // each instance starts its own worker thread immediately
        new Thread(this).start();
    }

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
    }

    public static void main(String[] args) {
        long totalCalcs = 0;
        int numberOfThreads = Integer.parseInt(args[0]);
        test3[] tests = new test3[numberOfThreads];

        long then = System.currentTimeMillis();
        for (int i = 0; i < numberOfThreads; i++) {
            tests[i] = new test3();
        }
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {
            System.out.println(ie);
        }
        tests[0].runFlag = false;   // runFlag is static, so this stops every worker
        long now = System.currentTimeMillis();

        for (int i = 0; i < numberOfThreads; i++)
            totalCalcs += tests[i].calculations;

        System.out.println(totalCalcs / (now - then));
    }
}
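
For context, a rough sketch of what the synchronized variant Knute
mentions might have looked like (the original synchronized code was not
posted, so the names and structure here are assumptions). With a single
shared counter, every increment has to take the same lock, so the worker
threads spend their time contending for it rather than calculating:

public class SyncCounterSketch implements Runnable {
    public static volatile boolean runFlag = true;
    private static final Object lock = new Object();
    private static long calculations;   // one counter shared by all workers

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            synchronized (lock) {   // every thread serializes on this lock
                ++calculations;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfThreads = Integer.parseInt(args[0]);
        for (int i = 0; i < numberOfThreads; i++) {
            new Thread(new SyncCounterSketch()).start();
        }
        Thread.sleep(10000);
        runFlag = false;
        synchronized (lock) {
            System.out.println(calculations);
        }
    }
}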
 
Alex Buell

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

1.13 GHz Pentium III, Linux 2.6.15

Threads Calculations per ms
1 2442
2 2433
3 2430
4 2409
10 2288

Is that useful for you?
 
Thomas Hawtin

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

I can't reproduce that result on either a 2.26 GHz Celeron D or a 266 MHz
PII, both running Fedora Core 4. Odd. Perhaps there is another program
eating cycles and Windows allocates more to the program with the most
threads...

Tom Hawtin
 
Philipp Kayser

Hi,

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

I can reproduce the results:

AMD 64 X2 3800+ XP Pro SP2

Threads Calculations per ms
1 3650
2 6784
3 6780
4 6800
10 7010

It is interesting that I do not get 2 x 3650 = 7300 calculations/ms for
two threads. Maybe there are still some conflicts here due to "false
sharing" between some thread stacks and/or Thread objects. And when there
are more threads, the probability of a cache invalidation is reduced, so
you get more calculations per second.

Best regards,
Philipp.
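
A minimal sketch of the false-sharing effect Philipp describes. The
class, the padding fields, and the assumption of 64-byte cache lines are
all illustrative; whether the padding actually helps depends on how the
JVM lays out the objects:

public class FalseSharingSketch {
    // each worker gets its own counter, but two counters allocated
    // back-to-back can land on the same cache line; the padding fields
    // push them apart so one core's writes do not keep invalidating the
    // line the other core is updating
    static class PaddedCounter {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7;   // pad out to roughly one cache line
    }

    public static void main(String[] args) throws InterruptedException {
        final PaddedCounter[] counters = { new PaddedCounter(), new PaddedCounter() };
        Thread[] workers = new Thread[counters.length];
        for (int i = 0; i < workers.length; i++) {
            final PaddedCounter c = counters[i];
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < 100000000; n++) {
                        c.value++;   // only this thread touches this counter
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < workers.length; i++) {
            workers[i].join();
        }
        System.out.println(counters[0].value + counters[1].value);
    }
}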
 
Philipp Kayser

Addition to my previous post:

AMD 64 X2 3800+, XP Pro SP2, affinity set to one CPU

Threads Calculations per ms
1 3660
2 3708
3 3717
4 3771
10 3775

There is also an increase here, but not as big as in your test. Maybe
it is because of the massive thread creation: the OS does not hold the
Sleep(10000) to exactly 10 s, but closer to 11 s or so.

Best regards,
Philipp.
 
Chris Uppal

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

Odd. I find the same thing (I changed your code to create the threads outside
the timed section, but to start() them inside it):

WinXP Pro, sp1. 1.5 GHz uniprocessor. JDK 1.5.0-b64:
1 2794
2 2816
3 2841
4 2834
10 2845

Win2K, sp4. 2.5 GHz uniprocessor. JDK 1.5.0_05-b05:
1 2078
2 2107
3 2104
4 2115
10 2123

On the same Win2K box as above but using VMware to run SuSE 9 + JDK 1.4.2-b28:
1 1715
2 1705
3 1783
4 1815
10 2029

Note that the last set of figures shows that it cannot be an artefact of
Windows scheduling, since Linux processes running inside VMware are not
visible to the Windows scheduler, nor is their memory use, etc.

I don't have an explanation (yet?). I rather suspect that at least part of it
is to do with when/how the JIT gets around to doing optimisation. I'll
probably investigate that later.

-- chris
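
A sketch (not Chris's actual code) of that modification to the posted
test3: the worker threads are constructed before the timed section and
only start()ed once the clock is running. The class name is illustrative:

public class StartInsideSketch implements Runnable {
    public static volatile boolean runFlag = true;
    public long calculations;

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfThreads = Integer.parseInt(args[0]);
        StartInsideSketch[] tasks = new StartInsideSketch[numberOfThreads];
        Thread[] threads = new Thread[numberOfThreads];
        for (int i = 0; i < numberOfThreads; i++) {
            tasks[i] = new StartInsideSketch();
            threads[i] = new Thread(tasks[i]);   // constructed outside the timing
        }

        long then = System.currentTimeMillis();
        for (int i = 0; i < numberOfThreads; i++) {
            threads[i].start();                  // started inside the timing
        }
        Thread.sleep(10000);
        runFlag = false;
        long now = System.currentTimeMillis();

        long totalCalcs = 0;
        for (int i = 0; i < numberOfThreads; i++) {
            totalCalcs += tasks[i].calculations;
        }
        System.out.println(totalCalcs / (now - then));
    }
}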
 
Chris Uppal

I said:
I don't have an explanation (yet?). I rather suspect that at least part
of it is to do with when/how the JIT gets around to doing optimisation.
I'll probably investigate that later.

Me, back again sooner than I'd planned.

First off, I realised that my modified code accidentally left the old
constructor in, and so started a new thread in each constructor /as well/ as
the ones I was using for testing -- stupid. Still, it doesn't seem to have
affected the results significantly.

However, putting a loop around the test loop, to allow the JITer more time to
do its thing, produces the following (all on the same 1.5 GHz WinXP box as
before):

Threads Run1 Run2 Run3 Run4 Run5
1 2757 2906 2910 2919 2907
2 2806 2907 2892 2901 2904
3 2815 2907 2909 2909 2905
4 2826 2902 2908 2903 2913
10 2848 2910 2911 2909 2915

As you'll see, by the third time around the loop the apparent correlation
between the calculation rate and the number of threads had vanished into
the noise. So I think the effect must be purely a question of how early
the JITer kicks in.

-- chris
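
A minimal sketch of the warm-up point (the class name, method name, and
iteration counts are illustrative): run the same timed loop several times
and only trust the later passes, once the JIT has had a chance to compile
the hot code.

public class WarmupSketch {
    // same shape of work as the posted benchmark
    static long work(long iterations) {
        long calculations = 0;
        for (long i = 0; i < iterations; i++) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
        return calculations;
    }

    public static void main(String[] args) {
        for (int pass = 1; pass <= 5; pass++) {
            long then = System.currentTimeMillis();
            long calcs = work(20000000L);
            long now = System.currentTimeMillis();
            // the first pass or two include JIT compilation time;
            // later passes measure the compiled code
            System.out.println("pass " + pass + ": "
                    + (calcs / (now - then)) + " calcs/ms");
        }
    }
}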
 
blmblm

Knute said:
Instead of the answer I want to give (which is bite me) I'll tell you
why that is in there.

Well, I was in too much of a hurry to send that post out, and didn't
phrase the question as diplomatically as I might have. Sorry about
that. The question I wanted to ask was whether there was some subtle
reason for the apparently ugly hack (one that I wasn't getting),
or whether it was just a quick-and-dirty thing. NTTAWWT, in some
contexts, and this is one of them.

Knute said:
I wanted to make sure that I wasn't using up a
bunch of time in the thread creation.

Meaning that you didn't want to start timing until all the threads
were created, right? Yes, that's what I figured.

Knute said:
Turns out it doesn't really make
any significant difference. And I was in a hurry and I know it was an
ugly hack and it was simple(ton) :).

[ snip ]
 
Gordon Beaton

Knute said:
I would be curious to know if anybody else duplicates my results.
And if they have an idea why it would do more calculations on a
single processor machine with more threads.

I don't see that here, these curves are *flat*:

n A B C
1 2983 2313 361
2 2983 2307 356
3 2966 2338 348
4 3008 2332 355
5 2992 2337 351
6 2989 2327 339
7 3002 2329 356
8 2986 2327 358
9 2983 2336 356
10 2976 2334 350
100 2983 2330 358

A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

/gordon
 
Luc The Perverse

Philipp Kayser said:
Hi,


I can reproduce the results:

AMD 64 X2 3800+ XP Pro SP2

Threads Calculations per ms
1 3650
2 6784
3 6780
4 6800
10 7010

It is interesting that I do not get 2 x 3650 = 7300 calculations/ms for
two threads. Maybe there are still some conflicts here due to "false
sharing" between some thread stacks and/or Thread objects. And when there
are more threads, the probability of a cache invalidation is reduced, so
you get more calculations per second.

I will tell you this: if ANYTHING has to go out to main memory, then your
two cores are going to have to share a memory controller, and sharing is
always slower than running alone.

I take it you don't expect this to be happening, but if you are running XP
Pro then Windows will force your process into the background to do
OS-related stuff, even if only briefly. This could easily account for the
slowdown.
 
Roedy Green

As you'll see, by the third time around the loop the apparent correlation
between the calculation rate and the number of threads had vanished into
the noise. So I think the effect must be purely a question of how early
the JITer kicks in.

See http://mindprod.com/jgloss/benchmark.html

I did a warm-up "conformance test", part of whose job was to ensure the
JIT optimisations had been done before I started timing anything.
 
Roedy Green

I will tell you this: if ANYTHING has to go out to main memory, then your
two cores are going to have to share a memory controller, and sharing is
always slower than running alone.

These multicore chips, do they typically share the same SRAM cache, or
are they busily worrying about stale copies in each other's caches?

Do they share a port to the SRAM, or is the SRAM multiported?
 
Chris Smith

Gordon Beaton said:
A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

Not too surprising, though. Clock speed wars these days are
basically marketing nonsense. Performance improvements come mainly from
cache architectures, pipelining strategies, etc. Since average computer
users (and even average high-tech gamers) don't know enough to compare
those things, manufacturers keep up the charade by bumping up the clock
speeds, as well. In truth, the CPU spends most of those blazingly fast
clock cycles waiting on RAM.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Knute Johnson

Gordon said:
I don't see that here, these curves are *flat*:

n A B C
1 2983 2313 361
2 2983 2307 356
3 2966 2338 348
4 3008 2332 355
5 2992 2337 351
6 2989 2327 339
7 3002 2329 356
8 2986 2327 358
9 2983 2336 356
10 2976 2334 350
100 2983 2330 358

A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

/gordon

I was surprised that my 2.8 GHz Xeon running the test with only one
thread wasn't that much faster than the 1.6 GHz P4.
 
Knute Johnson

Philipp said:
Addition to my previous post:

AMD 64 X2 3800+, XP Pro SP2, affinity set to one CPU

Threads Calculations per ms
1 3660
2 3708
3 3717
4 3771
10 3775

There is also an increase here, but not as big as in your test. Maybe
it is because of the massive thread creation: the OS does not hold the
Sleep(10000) to exactly 10 s, but closer to 11 s or so.

Best regards,
Philipp.

Those are both interesting, thanks. That AMD looks like it zips right
along too.
 
Luc The Perverse

Roedy Green said:
These multicore chips, do they typically share the same SRAM cache, or
are they busily worrying about stale copies in each other's caches?

Do they share a port to the SRAM, or is the SRAM multiported?

According to a randomly selected product information page, they appear to
each have completely separate L1 and L2 caches. I think it is truly two
chips on one die, with an integrated memory controller independent of the
two.

Although I will admit that I don't really understand your last sentence.
 
slippymississippi

I think you've got the wrong idea of the benefits of threads.

Most markedly, threads will reduce your memory usage. To put it very
simply, a thread is like a lightweight process, so you have reduced
memory overhead when compared to processes. In other words, it's a lot
less memory intensive to have one application supporting 20 threads
reading input over sockets than configuring 20 applications to read
that data.

As a result of these memory benefits, you get performance benefits as a
byproduct, because the OS has to page swap a lot less often.

I'm coming from a C++ background where I simply created my own threads,
established a mutex for my critical data and/or resources (a queue), and
let fly. So I'm not really fond of Java's screwy implementation of
threads, where every object on the face of the planet has its own
mutex, and Java gives you a lot of rope to hang yourself with while
waiting for the mutex to become available. But one thing I can see
looking at your code is that you're using a lot of thread setup to
calculate a very simple result. The correct analogue would be to perform
the same task with processes (does Java even support fork?).

A thread is a very useful tool. A hammer is a very useful tool, too,
but you wouldn't want to cut a tree down with it. In the same manner,
you want to use threads for what they're designed for: server-type
tasks that read data from multiple inputs or perform other
asynchronous work with less overhead than separate processes.
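
For readers less familiar with the Java locking model referred to above,
a minimal sketch of the per-object monitor (the class and field names are
purely illustrative):

public class MonitorSketch {
    private final Object lock = new Object();   // any object can serve as a mutex
    private int shared;

    public void update() {
        synchronized (lock) {   // acquire the object's intrinsic monitor
            shared++;
        }                       // the monitor is released automatically on exit
    }

    public synchronized void updateViaThis() {
        // equivalent to synchronized (this) { ... }
        shared++;
    }
}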
 
