Multithreading / Scalability

Knute Johnson

Thomas said:
But increment has a read and a separate write. Read the value.
Increment. Write the new value.

Best to use java.util.concurrent.AtomicLong (from 1.5) for this sort of
thing.

http://download.java.net/jdk6/docs/api/java/util/concurrent/atomic/AtomicLong.html#incrementAndGet()


Tom Hawtin

Tom:

Yes, I see what you are saying. I'll rerun those tests with it
synchronized and see what happens. If the threads were in fact being
preempted between the read and the write, then the results would show
lower apparent performance with more threads.
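
For reference, a minimal sketch of the AtomicLong approach Tom suggests;
the class name and iteration count here are illustrative, not taken from
the original program:

import java.util.concurrent.atomic.AtomicLong;

public class AtomicCounterSketch implements Runnable {
    // a single shared counter; incrementAndGet() is an atomic
    // read-modify-write, so no increments are lost between threads
    static final AtomicLong calculations = new AtomicLong();

    public void run() {
        for (int i = 0; i < 1000000; i++) {
            calculations.incrementAndGet();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread a = new Thread(new AtomicCounterSketch());
        Thread b = new Thread(new AtomicCounterSketch());
        a.start();
        b.start();
        a.join();
        b.join();
        // always prints 2000000, without any explicit locking
        System.out.println(calculations.get());
    }
}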
 
Knute Johnson

I tried synchronizing the access to the calculations variable but that
made it run very poorly; the locking kept interrupting the calculations.
So I rewrote it in a different way and discovered some more interesting
things. First, this program is faster even on a single-processor machine
when more threads are used. I don't think that my XP Home box can create
100 threads, but if I watch the dual Xeon machine, Task Manager shows 100
threads.

I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

2.8 GHz dual Xeon XP Pro SP2

Threads Calculations per ms
1 2402
2 4650
3 6141
4 7187
10 7608
100 8224
200 8281

import java.util.concurrent.*;

public class test3 implements Runnable {
    public static volatile boolean runFlag = true;
    public long calculations;

    public test3() {
        // each instance starts its own worker thread immediately
        new Thread(this).start();
    }

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
    }

    public static void main(String[] args) {
        long totalCalcs = 0;
        int numberOfThreads = Integer.parseInt(args[0]);
        test3[] tests = new test3[numberOfThreads];

        long then = System.currentTimeMillis();
        for (int i = 0; i < numberOfThreads; i++) {
            tests[i] = new test3();
        }
        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {
            System.out.println(ie);
        }
        tests[0].runFlag = false;   // runFlag is static, so this stops every worker
        long now = System.currentTimeMillis();

        for (int i = 0; i < numberOfThreads; i++)
            totalCalcs += tests[i].calculations;

        System.out.println(totalCalcs / (now - then));
    }
}
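
For context, a rough sketch of what the synchronized variant Knute
mentions might have looked like (the original synchronized code was not
posted, so the names and structure here are assumptions). With a single
shared counter, every increment has to take the same lock, so the worker
threads spend their time contending for it rather than calculating:

public class SyncCounterSketch implements Runnable {
    public static volatile boolean runFlag = true;
    private static final Object lock = new Object();
    private static long calculations;   // one counter shared by all workers

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            synchronized (lock) {   // every thread serializes on this lock
                ++calculations;
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfThreads = Integer.parseInt(args[0]);
        for (int i = 0; i < numberOfThreads; i++) {
            new Thread(new SyncCounterSketch()).start();
        }
        Thread.sleep(10000);
        runFlag = false;
        synchronized (lock) {
            System.out.println(calculations);
        }
    }
}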
 
Alex Buell

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

1.13 GHz Pentium III, Linux 2.6.15

Threads Calculations per ms
1 2442
2 2433
3 2430
4 2409
10 2288

Is that useful for you?
 
Thomas Hawtin

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

I can't reproduce that result on either a 2.26 GHz Celeron D or a 266 MHz
PII, both running Fedora Core 4. Odd. Perhaps there is another program
eating cycles and Windows allocates more to the program with the most
threads...

Tom Hawtin
 
Philipp Kayser

Hi,

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

I can reproduce the results:

AMD 64 X2 3800+ XP Pro SP2

Threads Calculations per ms
1 3650
2 6784
3 6780
4 6800
10 7010

It is interesting that I do not get 2 x 3650 = 7300 calculations/ms for
two threads. Maybe there are still some conflicts here due to "false
sharing" between some thread stacks and/or Thread objects. And when there
are more threads, the probability of a cache invalidation is reduced, so
you get more calculations per second.

Best regards,
Philipp.
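
A minimal sketch of the false-sharing effect Philipp describes. The
class, the padding fields, and the assumption of 64-byte cache lines are
all illustrative; whether the padding actually helps depends on how the
JVM lays out the objects:

public class FalseSharingSketch {
    // each worker gets its own counter, but two counters allocated
    // back-to-back can land on the same cache line; the padding fields
    // push them apart so one core's writes do not keep invalidating the
    // line the other core is updating
    static class PaddedCounter {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7;   // pad out to roughly one cache line
    }

    public static void main(String[] args) throws InterruptedException {
        final PaddedCounter[] counters = { new PaddedCounter(), new PaddedCounter() };
        Thread[] workers = new Thread[counters.length];
        for (int i = 0; i < workers.length; i++) {
            final PaddedCounter c = counters[i];
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int n = 0; n < 100000000; n++) {
                        c.value++;   // only this thread touches this counter
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < workers.length; i++) {
            workers[i].join();
        }
        System.out.println(counters[0].value + counters[1].value);
    }
}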
 
Philipp Kayser

Addition to my previous post:

AMD 64 X2 3800+, XP Pro SP2, affinity set to one CPU

Threads Calculations per ms
1 3660
2 3708
3 3717
4 3771
10 3775

There is also an increase here, but not as big as in your test. Maybe
it is because of the massive thread creation: the OS does not hold the
Sleep(10000) to exactly 10 s, but closer to 11 s or so.

Best regards,
Philipp.
 
Chris Uppal

Knute said:
I would be curious to know if anybody else duplicates my results. And
if they have an idea why it would do more calculations on a single
processor machine with more threads.

1.6 GHz XP Home SP2

Threads Calculations per ms
1 1155
2 1326
3 1367
4 1401
10 1444

Odd. I find the same thing (I changed your code to create the threads outside
the timed section, but to start() them inside it):

WinXP Pro, sp1. 1.5 GHz uniprocessor. JDK 1.5.0-b64:
1 2794
2 2816
3 2841
4 2834
10 2845

Win2K, sp4. 2.5 GHz uniprocessor. JDK 1.5.0_05-b05:
1 2078
2 2107
3 2104
4 2115
10 2123

On the same Win2K box as above but using VMware to run SuSE 9 + JDK 1.4.2-b28:
1 1715
2 1705
3 1783
4 1815
10 2029

Note that the last set of figures shows that it cannot be an artefact of
Windows scheduling, since Linux processes running inside VMware are not
visible to the Windows scheduler, nor is their memory use, etc.

I don't have an explanation (yet?). I rather suspect that at least part of it
is to do with when/how the JIT gets around to doing optimisation. I'll
probably investigate that later.

-- chris
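
A sketch (not Chris's actual code) of that modification to the posted
test3: the worker threads are constructed before the timed section and
only start()ed once the clock is running. The class name is illustrative:

public class StartInsideSketch implements Runnable {
    public static volatile boolean runFlag = true;
    public long calculations;

    public void run() {
        while (runFlag) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int numberOfThreads = Integer.parseInt(args[0]);
        StartInsideSketch[] tasks = new StartInsideSketch[numberOfThreads];
        Thread[] threads = new Thread[numberOfThreads];
        for (int i = 0; i < numberOfThreads; i++) {
            tasks[i] = new StartInsideSketch();
            threads[i] = new Thread(tasks[i]);   // constructed outside the timing
        }

        long then = System.currentTimeMillis();
        for (int i = 0; i < numberOfThreads; i++) {
            threads[i].start();                  // started inside the timing
        }
        Thread.sleep(10000);
        runFlag = false;
        long now = System.currentTimeMillis();

        long totalCalcs = 0;
        for (int i = 0; i < numberOfThreads; i++) {
            totalCalcs += tasks[i].calculations;
        }
        System.out.println(totalCalcs / (now - then));
    }
}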
 
Chris Uppal

I said:
I don't have an explanation (yet?). I rather suspect that at least part
of it is to do with when/how the JIT gets around to doing optimisation.
I'll probably investigate that later.

Me, back again sooner than I'd planned.

First off, I realised that my modified code accidentally left the old
constructor in, and so started a new thread in each constructor /as well/ as
the ones I was using for testing -- stupid. Still, it doesn't seem to have
affected the results significantly.

However, putting a loop around the test loop, to allow the JITer more time to
do its thing, produces the following (all on the same 1.5 GHz WinXP box as
before):

Threads Run1 Run2 Run3 Run4 Run5
1 2757 2906 2910 2919 2907
2 2806 2907 2892 2901 2904
3 2815 2907 2909 2909 2905
4 2826 2902 2908 2903 2913
10 2848 2910 2911 2909 2915

As you'll see, by the third time around the loop the apparent correlation
between the calculation rate and the number of threads had vanished into
the noise. So I think the effect must be purely a question of how early
the JITer kicks in.

-- chris
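
A minimal sketch of the warm-up point (the class name, method name, and
iteration counts are illustrative): run the same timed loop several times
and only trust the later passes, once the JIT has had a chance to compile
the hot code.

public class WarmupSketch {
    // same shape of work as the posted benchmark
    static long work(long iterations) {
        long calculations = 0;
        for (long i = 0; i < iterations; i++) {
            double d = Math.sqrt(1234.56789);
            double t = Math.tan(d);
            ++calculations;
        }
        return calculations;
    }

    public static void main(String[] args) {
        for (int pass = 1; pass <= 5; pass++) {
            long then = System.currentTimeMillis();
            long calcs = work(20000000L);
            long now = System.currentTimeMillis();
            // the first pass or two include JIT compilation time;
            // later passes measure the compiled code
            System.out.println("pass " + pass + ": "
                    + (calcs / (now - then)) + " calcs/ms");
        }
    }
}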
 
blmblm

Knute said:
Instead of the answer I want to give (which is bite me) I'll tell you
why that is in there.

Well, I was in too much of a hurry to send that post out, and didn't
phrase the question as diplomatically as I might have. Sorry about
that. The question I wanted to ask was whether there was some subtle
reason for the apparently ugly hack (one that I wasn't getting),
or whether it was just a quick-and-dirty thing. NTTAWWT, in some
contexts, and this is one of them.

Knute said:
I wanted to make sure that I wasn't using up a
bunch of time in the thread creation.

Meaning that you didn't want to start timing until all the threads
were created, right? Yes, that's what I figured.

Knute said:
Turns out it doesn't really make
any significant difference. And I was in a hurry and I know it was an
ugly hack and it was simple(ton) :).

[ snip ]
 
Gordon Beaton

Knute said:
I would be curious to know if anybody else duplicates my results.
And if they have an idea why it would do more calculations on a
single processor machine with more threads.

I don't see that here, these curves are *flat*:

n A B C
1 2983 2313 361
2 2983 2307 356
3 2966 2338 348
4 3008 2332 355
5 2992 2337 351
6 2989 2327 339
7 3002 2329 356
8 2986 2327 358
9 2983 2336 356
10 2976 2334 350
100 2983 2330 358

A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

/gordon
 
Luc The Perverse

Philipp Kayser said:
Hi,


I can reproduce the results:

AMD 64 X2 3800+ XP Pro SP2

Threads Calculations per ms
1 3650
2 6784
3 6780
4 6800
10 7010

It is interesting that I do not get 2 x 3650 = 7300 calculations/ms for
two threads. Maybe there are still some conflicts here due to "false
sharing" between some thread stacks and/or Thread objects. And when there
are more threads, the probability of a cache invalidation is reduced, so
you get more calculations per second.

I will tell you this: if ANYTHING has to go out to main memory, then your
two cores are going to have to share a memory controller, and sharing is
always slower than running alone.

I take it you don't expect this to be happening, but if you are running XP
Pro then Windows will force your process into the background to do
OS-related stuff, even if only briefly. This could easily account for the
slowdown.
 
Roedy Green

As you'll see, by the third time around the loop the apparent correlation
between the calculation rate and the number of threads had vanished into
the noise. So I think the effect must be purely a question of how early
the JITer kicks in.

See http://mindprod.com/jgloss/benchmark.html

I did a warm-up "conformance test", part of whose job was to ensure the
JIT optimisations had been done before I started timing anything.
 
Roedy Green

I will tell you this: if ANYTHING has to go out to main memory, then your
two cores are going to have to share a memory controller, and sharing is
always slower than running alone.

These multicore chips, do they typically share the same SRAM cache, or
are they busily worrying about stale copies in each other's caches?

Do they share a port to the SRAM, or is the SRAM multiported?
 
Chris Smith

Gordon Beaton said:
A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

Not too surprising, though. Clock speed wars these days are
basically marketing nonsense. Performance improvements come mainly from
cache architectures, pipelining strategies, etc. Since average computer
users (and even average high-tech gamers) don't know enough to compare
those things, manufacturers keep up the charade by bumping up the clock
speeds, as well. In truth, the CPU spends most of those blazingly fast
clock cycles waiting on RAM.

--
www.designacourse.com
The Easiest Way To Train Anyone... Anywhere.

Chris Smith - Lead Software Developer/Technical Trainer
MindIQ Corporation
 
Knute Johnson

Gordon said:
I don't see that here, these curves are *flat*:

n A B C
1 2983 2313 361
2 2983 2307 356
3 2966 2338 348
4 3008 2332 355
5 2992 2337 351
6 2989 2327 339
7 3002 2329 356
8 2986 2327 358
9 2983 2336 356
10 2976 2334 350
100 2983 2330 358

A: 3.2 GHz Pentium, jdk 1.5.0, Fedora Linux 2
B: 1.8 GHz Opteron, jdk 1.4.2, Fedora Linux 3
C: 140 MHz UltraSparc, Blackdown Java 1.4.1 beta, Aurora Linux 1.0

Interesting to note is that the Pentium manages only 8 times the work,
despite a nearly 23x "speed" advantage over the Ultra 1!

/gordon

I was surprised that my 2.8 GHz Xeon running the test with only one
thread wasn't that much faster than the 1.6 GHz P4.
 
Knute Johnson

Philipp said:
Addition to my previous post:

AMD 64 X2 3800+, XP Pro SP2, affinity set to one CPU

Threads Calculations per ms
1 3660
2 3708
3 3717
4 3771
10 3775

There is also an increase here, but not as big as in your test. Maybe
it is because of the massive thread creation: the OS does not hold the
Sleep(10000) to exactly 10 s, but closer to 11 s or so.

Best regards,
Philipp.

Those are both interesting, thanks. That AMD looks like it zips right
along too.
 
Luc The Perverse

Roedy Green said:
These multicore chips, do they typically share the same SRAM cache, or
are they busily worrying about stale copies in each other's caches?

Do they share a port to the SRAM, or is the SRAM multiported?

According to a randomly selected product information page, they appear to
each have completely separate L1 and L2 caches. I think it is truly two
chips on one die, with an integrated memory controller independent of the
two.

Although I will admit that I don't really understand your last sentence.
 
slippymississippi

I think you've got the wrong idea of the benefits of threads.

Most markedly, threads will reduce your memory usage. To put it very
simply, a thread is like a lightweight process, so you have reduced
memory overhead when compared to processes. In other words, it's a lot
less memory intensive to have one application supporting 20 threads
reading input over sockets than configuring 20 applications to read
that data.

As a result of these memory benefits, you get performance benefits as a
byproduct, because the OS has to page swap a lot less often.

I'm coming from a C++ background where I simply created my own threads,
established a mutex for my critical data and/or resources (a queue), and
let fly. So I'm not really fond of Java's screwy implementation of
threads, where every object on the face of the planet has its own
mutex, and Java gives you a lot of rope to hang yourself with while
waiting for the mutex to become available. But one thing I can see
looking at your code is that you're using a lot of thread setup to
calculate a very simple result. The correct analogue would be to perform
the same task with processes (does Java even support fork?).

A thread is a very useful tool. A hammer is a very useful tool, too,
but you wouldn't want to cut a tree down with it. In the same manner,
you want to use threads for what they're designed for: server-type
tasks that read data from multiple inputs or perform other
asynchronous work with less overhead than separate processes.
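
For readers less familiar with the Java locking model referred to above,
a minimal sketch of the per-object monitor (the class and field names are
purely illustrative):

public class MonitorSketch {
    private final Object lock = new Object();   // any object can serve as a mutex
    private int shared;

    public void update() {
        synchronized (lock) {   // acquire the object's intrinsic monitor
            shared++;
        }                       // the monitor is released automatically on exit
    }

    public synchronized void updateViaThis() {
        // equivalent to synchronized (this) { ... }
        shared++;
    }
}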
 
