Multithreading / Scalability


slippymississippi

Another thing I see in your code that sends up red flags is
"synchronized(this)" on your thread class ... the correct usage for
mutex protection is something like a library card system. You have a
central resource (book) that multiple threads (customers) want to use,
so you have a system of tracking when the resource is checked out
(library card). One person takes the book, the librarian takes the
card from the book. The next customer in line has to wait for the
librarian to return the card to the book and call "next!"
(notify/notifyAll) before the customer can begin reading. Because you
want to limit the number of people waiting for books, you assign the
library card to the smallest atomic entity possible (book) rather than
larger objects (shelf, room, floor, library).

So, given my brief experience with Java, I believe you should be using
synchronized on a very small static singleton bean that encapsulates
your data and does very little else ... or simply enforce
synchronization (because you never know how someone else is going to
abuse your classes) by making the data put/get methods on your bean
synchronized. And yes, unless the data is remarkably atomic, you have
to protect your getters as well as your setters, because there's
nothing stopping the OS from swapping out a thread in the middle of a
data update.

In general, the more atomic the data being protected by the mutex, and
the smaller the amount of code updating that data, the more efficient
your code will be.
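
For instance, a minimal sketch of such a bean (class and method names
invented for illustration; note that even the getter is synchronized):

    public class Counter {
        // A non-volatile long is not guaranteed to be read or written
        // atomically, so even the getter must take the lock.
        private long value;

        public synchronized void increment() {
            value++; // read-modify-write: must be protected
        }

        public synchronized long getValue() {
            return value; // protected too, for the reason above
        }
    }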
 

Roedy Green

Although I will admit that I don't really understand your last sentence.

In the olden days Univac and Burroughs mainframes had multiple CPUs, a
novelty at the time. You could pay extra for multiported "core" (which
might have been iron cores, plated wire or RAM) so that each CPU could
access core without waiting for the others. The feature was called
"multiported memory". I wondered if the equivalent existed today in
the access to SRAM and RAM.

I believe IBM mainframes also had a form of multiported memory so that
I/O channels had a back door into memory that did not block the CPU.
 

blmblm

I think you've got the wrong idea of the benefits of threads.

Most markedly, threads will reduce your memory usage. To put it very
simply, a thread is like a lightweight process, so you have reduced
memory overhead compared to processes. In other words, it's a lot
less memory-intensive to have one application supporting 20 threads
reading input over sockets than to configure 20 applications to read
that data.

As a result of these memory benefits, you get performance benefits as
a byproduct, because the OS has to swap pages a lot less often.

I'm coming from a C++ background where I simply created my own threads,
established a mutex for my critical data and/or resources (queue), and
let fly. So I'm not real fond of Java's screwy implementation of
threads, where every object on the face of the planet has its own
mutex, and Java gives you a lot of rope to hang yourself waiting for
the mutex availability.

Maybe it would make more sense if you regarded it not as a "screwy
implementation of threads" but as an implementation of (some of) the
"monitors" synchronization mechanism first proposed (AFAICT) by Per
Brinch Hansen and C.A.R. Hoare in the 1970s.

One thing Java gives you that thread libraries may not (I'm basing
this on imperfect recollections of using the Pthreads library) is a
reasonable mechanism for doing kinds of synchronization other than
simple mutual exclusion -- e.g., what you would need for a simple
producer/consumer setup (in which one thread produces data for
another to consume, and you would like the consumer thread to wait
until there is data to consume, preferably without busy-waiting).
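
For concreteness, here is a bare-bones version of that
producer/consumer setup using Java's built-in monitors (class and
method names invented for illustration -- a sketch, not production
code):

    import java.util.LinkedList;

    public class SimpleQueue {
        private final LinkedList<Object> items = new LinkedList<Object>();

        public synchronized void put(Object item) {
            items.addLast(item);
            notifyAll(); // wake any consumer waiting for data
        }

        public synchronized Object take() throws InterruptedException {
            while (items.isEmpty()) {
                wait(); // releases the monitor and sleeps -- no busy-waiting
            }
            return items.removeFirst();
        }
    }
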
But one thing I can see looking at your code
is that you're using a lot of thread setup to calculate a very simple
result. The correct analogue is to perform the same task with
processes (does Java even support fork?).

Are you saying that, if the goal is to improve performance by splitting
computation among processors, this is better done with processes
than threads? Why? In some ways, processes communicating by
message passing are a less error-prone way to "parallelize" code
(exploit multiple processors) than threads communicating via shared
memory, and of course if you don't *have* shared memory it's the only
choice. But if you can do either one and want maximum performance,
it seems to me that threads are the way to go. Do you disagree?
A thread is a very useful tool. A hammer is a very useful tool, too,
but you wouldn't want to cut a tree down with it. In the same manner,
you want to use threads for what they're designed for: server-type
processes that read data from multiple inputs or perform other
asynchronous tasks with less overhead than processes.

Or to improve performance if you have more than one processor. (Or
even if you don't, and your code can be split up into threads in a way
that allows one thread to make progress while another thread waits
for something to happen, I/O for example -- i.e., if threads can be
used to "mask latency".)
 

blmblm

Another thing I see in your code that sends up red flags is
"synchronized(this)" on your thread class ...

It's possible that this was being used not so much to guarantee
mutual exclusion as to allow use of wait/notify, which can only
be done from within a synchronized block. (The OP's use of wait
and notify seemed a little confused to me, but I haven't thought
through the problem in enough detail to be sure.)
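
For the record, wait() and notify() may only be called while holding
the object's monitor; otherwise the JVM throws
IllegalMonitorStateException at runtime. A minimal sketch, with an
invented condition flag:

    public class WaitExample {
        private boolean ready = false; // hypothetical condition flag

        public synchronized void awaitReady() throws InterruptedException {
            while (!ready) {
                wait(); // legal: a synchronized method holds this object's monitor
            }
        }

        public synchronized void signalReady() {
            ready = true;
            notifyAll();
        }
        // Calling wait() or notifyAll() from a method that is not
        // synchronized would throw IllegalMonitorStateException.
    }
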
the correct usage for
mutex protection is something like a library card system. You have a
central resource (book) that multiple threads (customers) want to use,
so you have a system of tracking when the resource is checked out
(library card). One person takes the book, the librarian takes the
card from the book. The next customer in line has to wait for the
librarian to return the card to the book and call "next!"
(notify/notifyAll) before the customer can begin reading. Because you
want to limit the number of people waiting for books, you assign the
library card to the smallest atomic entity possible (book) rather than
larger objects (shelf, room, floor, library).

So, given my brief experience with Java, I believe you should be using
synchronized on a very small static singleton bean that encapsulates
your data and does very little else ... or simply enforce
synchronization (because you never know how someone else is going to
abuse your classes) by making the data put/get methods on your bean
synchronized. And yes, unless the data is remarkably atomic, you have
to protect your getters as well as your setters, because there's
nothing stopping the OS from swapping out a thread in the middle of a
data update.

In general, the more atomic the data being protected by the mutex, and
the smaller the amount of code updating that data, the more efficient
your code will be.

Some good points here, but it might be as well to keep in mind
in critiquing the OP's code that it appears to have been a toy
program intended to find out how much performance improvement can
be achieved with threads and multiple processors, so some advice
that would be relevant in other contexts might not apply.

(In fact, I think the OP's program could be rewritten to not use
any explicit synchronization except join() -- create the threads,
start timing, start the threads, wait for them to finish with join(),
combine results, stop timing. If each thread writes results into
a separate element of an array, no data synchronization is needed.
Some caution is needed, though, to avoid "false sharing" of cache lines.)
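
Roughly this structure, say -- a sketch, with placeholder work in each
thread:

    public class TimedWorkers {
        public static void main(String[] args) throws InterruptedException {
            final int n = 4; // worker count, hypothetical
            final long[] results = new long[n]; // one slot per thread
            Thread[] workers = new Thread[n];

            for (int i = 0; i < n; i++) {
                final int slot = i;
                workers[i] = new Thread(new Runnable() {
                    public void run() {
                        long sum = 0;
                        for (long k = 0; k < 100000000L; k++) sum += k; // placeholder
                        results[slot] = sum; // each thread writes only its own slot
                    }
                });
            }

            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) workers[i].start();
            for (int i = 0; i < n; i++) workers[i].join(); // the only synchronization
            long elapsed = System.currentTimeMillis() - start;

            long total = 0;
            for (int i = 0; i < n; i++) total += results[i];
            System.out.println("total=" + total + " in " + elapsed + " ms");
        }
    }

(Adjacent elements of results[] can still land on the same cache line
-- the false-sharing caveat -- so spacing the slots out is one crude
fix.)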
 

Dimitri Maziuk

(e-mail address removed) sez:
Are you saying that, if the goal is to improve performance by splitting
computation among processors, this is better done with processes
than threads? Why?

Threads (as in "lightweight processes") were originally introduced
to avoid the overhead of process creation. Threads are not the
only solution: the Plan 9 folks came up with a low-overhead process
creation mechanism, and Linux picked it up quickly. In a modern OS
with a low-cost process creation mechanism, the benefits of threads
(vs. processes) are small, and the overhead is an extra layer of
complexity.

This is a moot point in Java, of course, since Java only has
threads. There Is No Fork().

.... But if you can do either one and want maximum performance,
it seems to me that threads are the way to go. Do you disagree?

Two main sources of overhead are initial creation and
synchronization. If the threads or processes are fully independent
(no synchronization) and long-running (creation cost amortized
to close to 0 over the runtime), there should be no difference
between threads and processes performance-wise.
Or to improve performance if you have more than one processor. (Or
even if you don't, and your code can be split up into threads in a way
that allows one thread to make progress while another thread waits
for something to happen, I/O for example -- i.e., if threads can be
used to "mask latency".)

Only if you have several execution streams that can run
asynchronously.

Anyway, the real deal is that threads don't economically scale
past 4 CPUs (4-way being the cusp of bang-per-buck curve) and
thus 4-16 threads: e.g. Sun will happily sell you a 24-CPU
cupboard for a cool million bucks, or 24 1-CPU pizza boxen for
about $18,000. Which means that if you're doing it for scalability,
you're way better off going fully distributed (hence, independent
processes) or with some specialized grid/cluster setup (probably
different programming techniques altogether).

Dima
 

blmblm

(e-mail address removed) sez:

Threads (as in "lightweight processes") were originally introduced
to avoid the overhead of process creation. Threads are not the
only solution: the Plan 9 folks came up with a low-overhead process
creation mechanism, and Linux picked it up quickly. In a modern OS
with a low-cost process creation mechanism, the benefits of threads
(vs. processes) are small, and the overhead is an extra layer of
complexity.

Threads introduce an extra layer of complexity? Hm. I'm not sure
what you mean there ....

I can believe that in current operating systems it's almost as fast
to create processes as to create threads. Context switch times
might also need to be taken into account, but maybe there too the
difference is not really significant.

But my understanding is that a key difference between threads and
processes is that the former share an address space and the latter
do not, and this leads to rather different programming models --
communication via message-passing versus communication via shared
variables. Shared variables in some ways seem easier and more
natural, but the need for synchronization means that there are lots
of potential pitfalls. Message passing is less easy/natural but
avoids some of the potential pitfalls.

All of this is of course moot if the threads/processes are totally
independent. I'm coming to this discussion from a background in
parallel computing, though, where the threads/processes are rarely
*completely* independent.
This is a moot point in Java, of course, since Java only has
threads. There Is No Fork().

Um, no, but doesn't the "exec" method in Runtime do something along
those lines? I have zero experience working with this method; I'm
going by the API and by vague recollections of discussions in this
group.
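
For reference, a minimal use would look something like the following;
the command is just an example, and exec() launches a separate OS
program rather than duplicating the JVM the way fork() duplicates a
process:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class ExecDemo {
        public static void main(String[] args) throws Exception {
            // Start a separate OS process and read its output.
            Process p = Runtime.getRuntime().exec("ls -l");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
            System.out.println("exit code: " + p.waitFor());
        }
    }
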
... But if you can do either one and want maximum performance,

Two main sources of overhead are initial creation and
synchronization. If the threads or processes are fully independent
(no synchronization) and long-running (creation cost amortized
to close to 0 over the runtime), there should be no difference
between threads and processes performance-wise.

Probably so. I wasn't really thinking of totally independent
threads/processes.

Also, if you're using the additional threads/processes to mask
latency, then context switch times need to be taken into account.
Maybe there too the difference is less than I think -- threads
would seem to win out, but perhaps not by much.
Only if you have several execution streams that can run
asynchronously.

Sure. I do enough of this multithreading-for-performance (as a
research interest more than for practical reasons) that I'm apt to
forget that not everyone thinks in those terms, maybe.
Anyway, the real deal is that threads don't economically scale
past 4 CPUs (4-way being the cusp of bang-per-buck curve) and
thus 4-16 threads: e.g. Sun will happily sell you a 24-CPU
cupboard for a cool million bucks, or 24 1-CPU pizza boxen for
about $18,000. Which means that if you're doing it for scalability,
you're way better off going fully distributed (hence, independent
processes) or with some specialized grid/cluster setup (probably
different programming techniques altogether).

Traditionally this has been true, yes. I wonder if it will continue
to be true in the next few years -- dual-core chips are not the
rare beasts they were a few years ago, and one hears so much about
multicore chips in general that I wonder whether in a few years
there won't be a lot more 4-way and 8-way systems .... But I'm
not a hardware designer, and it may be that there's still no way
to make shared-memory systems scalable.

In any case, it seems to me that arguments about scalability, while
interesting and important, miss the point that if what you have is
a 2-processor or 4-processor system, a program that exploits the
extra processors is a good thing to have, even if it doesn't scale
to more than 4 processors.

Just some thoughts / my two cents' worth.
 

Roedy Green

I can believe that in current operating systems it's almost as fast
to create processes as to create threads.

nowhere near as fast, especially if you start several threads using
the same classes. The classes are already loaded so there is little to
do other than allocate some RAM for a stack.

With an external process you must load a partly compressed exe file,
and relocate all the addresses in it, and hook up all the DLLs, then
you can think about loading the DLLs.
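
A crude way to see the difference from Java is to time both --
unscientific, and the numbers vary wildly by OS, but illustrative
(the spawned command is just an example):

    public class StartupCost {
        public static void main(String[] args) throws Exception {
            // Start and join a trivial thread...
            long t0 = System.currentTimeMillis();
            Thread t = new Thread(new Runnable() {
                public void run() { /* do nothing */ }
            });
            t.start();
            t.join();
            long threadMs = System.currentTimeMillis() - t0;

            // ...then launch and wait for a trivial external process.
            t0 = System.currentTimeMillis();
            Process p = Runtime.getRuntime().exec(
                    new String[] { "java", "-version" });
            p.waitFor();
            long processMs = System.currentTimeMillis() - t0;

            System.out.println("thread: " + threadMs
                    + " ms, process: " + processMs + " ms");
        }
    }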
 

blmblm

nowhere near as fast, especially if you start several threads using
the same classes. The classes are already loaded so there is little to
do other than allocate some RAM for a stack.

Well, what you're saying here is that *for Java programs* it's
a lot faster to start a new thread than a new process, right?
Which is more relevant in this newsgroup than a general discussion
of processes versus threads -- but in the post you're replying to,
the discussion had (I thought) become more general.
With an external process you must load a partly compressed exe file,
and relocate all the addresses in it, and hook up all the DLLs, then
you can think about loading the DLLs.

The terminology here is somewhat Windows-specific, but .... Okay,
okay, similar ideas (exe files and DLLs) exist in other operating
systems. I'm a little skeptical that everything you describe has
to be done, however:
load a partly compressed exe file,

I was under the impression that "smart" operating systems can allow
multiple processes running the same code to share a copy in memory.
But maybe that only works if the processes do something special?
and relocate all the addresses in it,

Relocate addresses? I thought this was something that had been
necessary in the not-so-good old days, before the advent of dynamic
address translation, but didn't need to be done now? I may be
misunderstanding what you're saying though, or more confused than
I think about how things work under the hood.
and hook up all the DLLs, then
you can think about loading the DLLs.

Which .... If you're starting a new process to execute the same
code as an existing process, that shouldn't be necessary, right?
Shouldn't the relevant DLLs already be loaded? Do I not understand
how DLLs work? (I would have thought that two processes using the
same one could share an in-memory copy.) And if it's executing new
code, then you have the same issues with starting a new thread, no?

I'm not really sure about all of this, but notice that the person
to whom I replied said this:

So it may depend on the operating system?
 

Gordon Beaton

nowhere near as fast, especially if you start several threads using
the same classes. The classes are already loaded so there is little
to do other than allocate some RAM for a stack.

On Linux, the kernel makes no distinction between threads and
processes, so they are identically fast to create and schedule. This
has nothing to do with classes.
With an external process you must load a partly compressed exe file,
and relocate all the addresses in it, and hook up all the DLLs, then
you can think about loading the DLLs.

On Unix (and Linux), common shared libraries (most notably libc) will
already be loaded before the process starts, so they don't need to be
loaded again for every process. Also the kernel normally caches
recently used code, so it is faster to load the same program a second
time.

/gordon
 

Roedy Green

The terminology here is somewhat Windows-specific, but .... Okay,
okay, similar ideas (exe files and DLLs) exist in other operating
systems. I'm a little skeptical that everything you describe has
to be done, however:

IIRC one of the old DEC OSes (RSTS-11?) allowed you to pre-relocate
programs for fast loading. It then only had to load them, with no
further processing.

IIRC the ancient Multics OS just memory-mapped your program into the
address space to start it. Instant on.

The Univac 1100 had hardware relocation, so loading was trivial there
too.

The IBM 360 and Windows made program loading into a great production.
 

Roedy Green

I was under the impression that "smart" operating systems can allow
multiple processes running the same code to share a copy in memory.
But maybe that only works if the processes do something special?

With Windows, DLLs get shared. I don't know about ordinary code; it
would have to load at the same spot in each process's address space to
be shared. Seems to me that with NT they made a partial attempt to
ensure that happens.
 

Roedy Green

Relocate addresses? I thought this was something that had been
necessary in the not-so-good old days, before the advent of dynamic
address translation, but didn't need to be done now? I may be
misunderstanding what you're saying though, or more confused than
I think about how things work under the hood.

Look at the exe header. All the relocation information is still there.
I think what they do is set things up so that IF the program loads at a
certain location, there is less fixup to do.

Even then there are a zillion hooks to DLLs etc. to arrange.
 

Roedy Green

Which .... If you're starting a new process to execute the same
code as an existing process, that shouldn't be necessary, right?
Shouldn't the relevant DLLs already be loaded?

The DLLs may be loaded, but the executable contains only symbolic
links into the DLL; those all have to be resolved and patched.
 

Roedy Green

Do I not understand
how DLLs work? (I would have thought that two processes using the
same one could share an in-memory copy.) And if it's executing new
code, then you have the same issues with starting a new thread, no?

DLLs used to share code and memory. Now they just share code.
You will notice how java.exe is more sprightly if you have recently
used it. That is because its DLLs are still hanging around in RAM and
don't have to be reloaded.
 

Roedy Green

On Linux, the kernel makes no distinction between threads and
processes, so they are identically fast to create and schedule. This
has nothing to do with classes.

Perhaps we are using the terms differently. By thread I presume a
shared address space; by process I presume a new address space, e.g.
via exec.

What do you mean by the terms?
 

Gordon Beaton

Perhaps we are using the terms differently. By thread I presume a
shared address space; by process I presume a new address space, e.g.
via exec.

What do you mean by the terms?

On Linux, the kernel only knows about "tasks". They can share none,
part, or all of their address space with one another. When they don't
share any address space, they look like traditional processes. When
they share all of their address space, they look like threads (even
though each task has its own pid). But the kernel does not treat
them any differently, so it cannot be claimed that one is faster or
cheaper to create than the other.

It is this lack of distinction that previously confused many people
who did not understand why "ps" often showed so many Java
processes on Linux.

/gordon
 

Dimitri Maziuk

(e-mail address removed) sez:
Threads introduce an extra layer of complexity? Hm. I'm not sure
what you mean there ....

Think of it as a graph: instead of a root (kernel) with
multiple leaf nodes (processes), you now have a root (kernel)
with multiple nodes where each node may be a leaf or a root
with multiple leaves (threads) underneath.
But my understanding is that a key difference between threads and
processes is that the former share an address space and the latter
do not, and this leads to rather different programming models --
communication via message-passing versus communication via shared
variables. Shared variables in some ways seem easier and more
natural, but the need for synchronization means that there are lots
of potential pitfalls. Message passing is less easy/natural but
avoids some of the potential pitfalls.

See Gordon's reply: the address space remains shared until you tell
the OS otherwise. IPC via shared memory and semaphores existed
before threads, and you can use pipes and signals between threads.
So in practice your choice of communication mechanism is
independent of whether you use threads or processes.
Um, no, but doesn't the "exec" method in Runtime do something along
those lines? I have zero experience working with this method; I'm
going by the API and by vague recollections of discussions in this
group.

It's more like system() -- it calls out to the OS shell.
Also, if you're using the additional threads/processes to mask
latency, then context switch times need to be taken into account.
Maybe there too the difference is less than I think -- threads
would seem to win out, but perhaps not by much.

Potentially, if their shared memory pages don't need to be
swapped in/out. It really depends on the job mix and the
scheduler, so I wouldn't count on it too much.
Traditionally this has been true, yes. I wonder if it will continue
to be true in the next few years -- dual-core chips are not the
rare beasts they were a few years ago, and one hears so much about
multicore chips in general that I wonder whether in a few years
there won't be a lot more 4-way and 8-way systems .... But I'm
not a hardware designer, and it may be that there's still no way
to make shared-memory systems scalable.

IIRC Sun's specs for that 24-way box say "48 threads". Dunno if
they're dual-core chips (one thread per core)...

Anyway, IBM's PlayStation chip -- the only multicore chip I know
of where "multi" > 2 -- does not have SMP cores; it has SIMD cores.
So threads are not going to work the same way on it; it'll need
a different programming model altogether.
In any case, it seems to me that arguments about scalability, while
interesting and important, miss the point that if what you have is
a 2-processor or 4-processor system, a program that exploits the
extra processors is a good thing to have, even if it doesn't scale
to more than 4 processors.

Yeah, sure, but the devil is in the details. You need some way
to figure out how many threads to create: too few and you're
underutilizing your CPUs, too many and context switch times
eat up all the extra bang. A program that exploits 4 processors
won't run faster on an 8-way box and will thrash a 2-way machine.

Dima
 

Roedy Green

On Linux, the kernel only knows about "tasks". They can share none,
part, or all of their address space with one another. When they don't
share any address space, they look like traditional processes. When
they share all of their address space, they look like threads (even
though each task has its own pid). But the kernel does not treat
them any differently, so it cannot be claimed that one is faster or
cheaper to create than the other.

It is this lack of distinction that previously confused many people
who did not understand why "ps" often showed so many Java
processes on Linux.

I presume it tracks task groups so you can kill all related tasks at
once.

But still it takes a heck of a lot longer to start a task from a raw
executable than from code already in RAM. The tasks may look the same
after they are born, but threads have a much easier birth. Perhaps
Linux conceptually separates loading the task from starting it. If
that is so, then indeed tasks and threads could look identical, even
in the way they are started.

What we should be aiming for is that MOST of the time, to start a
program all you do is map its code into virtual memory and start
executing, taking an immediate page fault. If something happens to
disturb the memory maps, then they should be automatically rebuilt,
perhaps in the background while the system is idle, not built from
scratch on every execution. You could get very close to the idea that
programs never stop, just sleep.

see http://mindprod.com/projects/gespenster.html
http://mindprod.com/projects/suspendedanimation.html
 

Roedy Green

Yeah, sure, but the devil is in the details. You need some way
to figure out how many threads to create: too few and you're
underutilizing your CPUs, too many and context switch times
eat up all the extra bang. A program that exploits 4 processors
won't run faster on an 8-way box and will thrash a 2-way machine.

This sort of tweaking has the additional complication that it depends
on the JVM and the hardware and the dataset you are working on. Every
customer needs it tweaked differently.
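
At least the thread count itself need not be hard-coded; the JVM can
report the CPU count at run time:

    public class WorkerSizing {
        public static void main(String[] args) {
            // One compute-bound worker per CPU is the usual starting
            // point; the right number still depends on the JVM, the
            // hardware and the dataset, as noted above.
            int cpus = Runtime.getRuntime().availableProcessors();
            System.out.println("creating " + cpus + " worker threads");
        }
    }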

See my idea to automate the tweaking process.

http://mindprod.com/jgloss/tweakable.html
 

blmblm

Look at the exe header. All the relocation information is still there.
I think what they do is set things up so that IF the program loads at a
certain location, there is less fixup to do.

Seems like with every process having its own address space it usually
*would* be the case that the program loads at the same location (in
the address space) every time?

(Good suggestion about looking at the format of executable files.
Something else I really must do sometime.)
Even then there are a zillion hooks to DLLs etc. to arrange.

Yeah ....
 
