multi-core software

Xah Lee · Jun 4, 2009

Of interest:

â€¢ Why Must Software Be Rewritten For Multi-Core Processors?
http://xahlee.org/UnixResource_dir/writ/multi-core_software.html

plain text version follows.

--------------------------------------------------
Why Must Software Be Rewritten For Multi-Core Processors?

Xah Lee, 2009-06-04

I had a revelation today, namely, that it is necessary to rewrite
software to use multi-processor in order to benefit from it.

This may sound stupid, but is a revelation to me. For the past decade,
the question has been on my mind, about why should software needs to
be rewritten to take advantage of multi-processors. Because, in my
mind, i thought that software are at some fundamental level just
algorithms, and algorithms, have nothing to do with hardware
implementation aspects such as number of processors. I always felt,
that those talks about the need or difficulty of rewriting software
for multi-processor (or multi-core these days) must be the product of
idiocy of industrial imperative coding monkies. In particular, some
languages such as java, the way they deal with it, seems to me
extremely stupid. e.g. the concept of threads. In my mind, there
should be a layer between the software and the hardware, such as the
operating system, or the compiler, that should be able to
automatically handle or compile the software so that it FULLY use the
multi-processors when present. In short, i thought that a algorithm
really has nothing to do with hardware details.

I never really thought hard about this issue, but today, since i got a
quad-core PC, so i looked into the issue, and thought about it, and i
realized the answer. The gist is that, algorithm, fundamentally means
manipulating some hardware, in fact, algorithm is a step by step
instruction about some form of hardware, even the hardware may be
abstract or virtual. For example, let's say the simplest case of 1+1.
It is a algorithm, but where is the hardware? You may say it's
abstract concept, or it being a mathematical model. What you call 1+1
depends on the context, but in our context, those numbers are the
hardware. To see this, lets say more complex example of listing primes
by sieve. Again, it comes down to â€œwhat is a numberâ€? Here, numbers
can be stones, or arrangement of beads on abacus, it's hardware! As
another example, say sorting. To begin with, you have to have some
something to sort, that's hardware.

Another way to see this is that, for a given computing problem, there
are infinite number of algorithms to achieve the same thing. Some,
will be better ones, requiring less steps, or requiring less storage.
All these are concrete manipulation issues, and the thing being
manipulated, ultimately we have to call it hardware. So, when hardware
changes, say from one cpu to multi-cpu, there's no way for the
algorithm to magically change and adopt the changed hardware. If you
need a algorithm that is catered to the new hardware, you need a new
algorithm.

One might think that there might be algorithm Omega, such that it
takes input of old hardware O, new hardware N, and a algorithm A, and
output a algorithm B, such that B has the same behavior as A, but N+B
performs better than O+A. This is like asking for Strong AI.

One thing we often forgot is that algorithms ultimately translates to
manipulating hardware. In a modern digital computer, that means
software algorithms ultimately becomes machine instructions in CPU,
which manipulate the 1s and 0s in register, or electricity voltage in
transisters.

In a more mundane point of view, a automatic system for software to
work on multi-processors is a problem of breaking a problem into
discrete units (so that they can be computed in parallel). The problem
of finding a algorithm is entirely different from the problem of
breaking a problem into distinct units. The problem of breaking a
problem into distinct units is a entire new branch of mathematics. For
example, let's say factoring. Factoring is a well defined mathematical
problem. There are millions algorithms to do it, each class has
different properties such as number of steps or storage units.
However, if we look at these algorithms from the point of view of
distinct units, it's a new perspective on classification of
algorithms. Software are in general too ill-defined and fuzzy and
complex, the software we use on personal computers such as email,
browsers, games, don't even have mathematical models. They don't even
have mathematical models of their inputs and outputs. To talk about
automatic system of unitizing software, would be more like a AI
fantasy. Roughly, a term that describes this aspect of research is
Parallel computing.

In the Wikipedia article, it talks about types of parallelism: Bit-
level parallelism, Instruction-level parallelism, Data parallelism,
Task parallelism. Then it also discusses hardware aspects classes:
multicore, symmetric multiprocessing, distributed computing, cluster,
grid. The subjects mentioned there more close to this essay are â€œdata
parallelismâ€ and â€œtask parallelismâ€. However, neither are high level
enough as i discussed here. The issue discussed here would be like
â€œalgorithmic parallelismâ€. Indeed, Wikipedia mentioned â€œAutomatic
parallelizationâ€, which is exactly what i'm talking about here. Quote:

Automatic parallelization of a sequential program by a compiler is
the holy grail of parallel computing. Despite decades of work by
compiler researchers, automatic parallelization has had only limited
success.[40]

Mainstream parallel programming languages remain either explicitly
parallel or (at best) partially implicit, in which a programmer gives
the compiler directives for parallelization. A few fully implicit
parallel programming languages exist â€” SISAL, Parallel Haskell, and
(for FPGAs) Mitrion-C.

It says â€œA few fully implicit parallel programming languages existâ€.
If you are a comp lang nutcase, don't get carried away by what those
words might seem to mean.

(Note: Wikipedia has a dedicated article on the subject: Automatic
parallelization)

Xah
âˆ‘ http://xahlee.org/

â˜„

Roedy Green · Jun 4, 2009

• Why Must Software Be Rewritten For Multi-Core Processors?

Threads have been part of Java since Day 1. Using threads complicates
your code, but even with a single core processor, they can improve
performance, particularly if you are doing something like combing
multiple websites.

The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.

You just have to make sure your threads don't interfere with each
other, and Java/the OS, handle exploiting all the CPUs available.

--
Roedy Green Canadian Mind Products
http://mindprod.com

Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80

Kaz Kylheku · Jun 4, 2009

["Followup-To:" header set to comp.lang.lisp.]

Threads have been part of Java since Day 1.

Unfortunately, not sane threads designed by people who actually understand
multithreading.

The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.

You are dreaming if you think that there are any circumstances (other than
circumstances in which performance doesn't matter) in which you don't have to
concern yourself about the difference between a uniprocessor and a 256 CPU
machine.

MRAB · Jun 4, 2009

Kaz said:
["Followup-To:" header set to comp.lang.lisp.]

Threads have been part of Java since Day 1.

Click to expand...

Unfortunately, not sane threads designed by people who actually understand
multithreading.

The nice thing about Java is whether you are on a single core
processor or a 256 CPU machine (We got to run our Athena Integer Java
spreadsheet engine on such a beast), does not concern your code.

Click to expand...

You are dreaming if you think that there are any circumstances (other than
circumstances in which performance doesn't matter) in which you don't have to
concern yourself about the difference between a uniprocessor and a 256 CPU
machine.

If you're interested in parallel programming, have a look at Flow-Based
Programming:

http://www.jpaulmorrison.com/fbp/

Seamus MacRae · Jun 5, 2009

Kaz Kylheku wrote:
[snip]

I see you're no longer confining yourself to the thread from hell, and
now randomly crashing other threads comp.lang.java.programmer to
badmouth Java and crosspost into cll.

Unfortunately, not sane threads designed by people who actually understand
multithreading.

Then let me enlighten you! We have had some nice concurrency support
over here in Java-land since Java 5, and especially Java 6, what with
java.util.concurrent, nonblocking streams in NIO, and
SwingWorker/SwingUtilities.

Kaz Kylheku · Jun 5, 2009

You need to decompose your problem in 256 independent tasks.

It can be trivial for some problems and difficult or perhaps
impossible for some others.

Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.

Scott Burson · Jun 5, 2009

Of interest:

â€¢ Why Must Software Be Rewritten For Multi-Core Processors?
Â http://xahlee.org/UnixResource_dir/writ/multi-core_software.html

plain text version follows.

--------------------------------------------------
Why Must Software Be Rewritten For Multi-Core Processors?

Xah Lee, 2009-06-04

I had a revelation today, namely, that it is necessary to rewrite
software to use multi-processor in order to benefit from it.

This may sound stupid, but is a revelation to me. For the past decade,
the question has been on my mind, about why should software needs to
be rewritten to take advantage of multi-processors. Because, in my
mind, i thought that software are at some fundamental level just
algorithms, and algorithms, have nothing to do with hardware
implementation aspects such as number of processors. I always felt,
that those talks about the need or difficulty of rewriting software
for multi-processor (or multi-core these days) must be the product of
idiocy of industrial imperative coding monkies. In particular, some
languages such as java, the way they deal with it, seems to me
extremely stupid. e.g. the concept of threads. In my mind, there
should be a layer between the software and the hardware, such as the
operating system, or the compiler, that should be able to
automatically handle or compile the software so that it FULLY use the
multi-processors when present. In short, i thought that a algorithm
really has nothing to do with hardware details.

I never really thought hard about this issue, but today, since i got a
quad-core PC, so i looked into the issue, and thought about it, and i
realized the answer. The gist is that, algorithm, fundamentally means
manipulating some hardware, in fact, algorithm is a step by step
instruction about some form of hardware, even the hardware may be
abstract or virtual. For example, let's say the simplest case of 1+1.
It is a algorithm, but where is the hardware? You may say it's
abstract concept, or it being a mathematical model. What you call 1+1
depends on the context, but in our context, those numbers are the
hardware. To see this, lets say more complex example of listing primes
by sieve. Again, it comes down to â€œwhat is a numberâ€? Here, numbers
can be stones, or arrangement of beads on abacus, it's hardware! As
another example, say sorting. To begin with, you have to have some
something to sort, that's hardware.

Another way to see this is that, for a given computing problem, there
are infinite number of algorithms to achieve the same thing. Some,
will be better ones, requiring less steps, or requiring less storage.
All these are concrete manipulation issues, and the thing being
manipulated, ultimately we have to call it hardware. So, when hardware
changes, say from one cpu to multi-cpu, there's no way for the
algorithm to magically change and adopt the changed hardware. If you
need a algorithm that is catered to the new hardware, you need a new
algorithm.

One might think that there might be algorithm Omega, such that it
takes input of old hardware O, new hardware N, and a algorithm A, and
output a algorithm B, such that B has the same behavior as A, but N+B
performs better than O+A. This is like asking for Strong AI.

One thing we often forgot is that algorithms ultimately translates to
manipulating hardware. In a modern digital computer, that means
software algorithms ultimately becomes machine instructions in CPU,
which manipulate the 1s and 0s in register, or electricity voltage in
transisters.

In a more mundane point of view, a automatic system for software to
work on multi-processors is a problem of breaking a problem into
discrete units (so that they can be computed in parallel). The problem
of finding a algorithm is entirely different from the problem of
breaking a problem into distinct units. The problem of breaking a
problem into distinct units is a entire new branch of mathematics. For
example, let's say factoring. Factoring is a well defined mathematical
problem. There are millions algorithms to do it, each class has
different properties such as number of steps or storage units.
However, if we look at these algorithms from the point of view of
distinct units, it's a new perspective on classification of
algorithms. Software are in general too ill-defined and fuzzy and
complex, the software we use on personal computers such as email,
browsers, games, don't even have mathematical models. They don't even
have mathematical models of their inputs and outputs. To talk about
automatic system of unitizing software, would be more like a AI
fantasy. Roughly, a term that describes this aspect of research is
Parallel computing.

In the Wikipedia article, it talks about types of parallelism: Bit-
level parallelism, Instruction-level parallelism, Data parallelism,
Task parallelism. Then it also discusses hardware aspects classes:
multicore, symmetric multiprocessing, distributed computing, cluster,
grid. The subjects mentioned there more close to this essay are â€œdata
parallelismâ€ and â€œtask parallelismâ€. However, neither are high level
enough as i discussed here. The issue discussed here would be like
â€œalgorithmic parallelismâ€. Indeed, Wikipedia mentioned â€œAutomatic
parallelizationâ€, which is exactly what i'm talking about here. Quote:

Â Â Automatic parallelization of a sequential program by a compiler is
the holy grail of parallel computing. Despite decades of work by
compiler researchers, automatic parallelization has had only limited
success.[40]

Â Â Mainstream parallel programming languages remain either explicitly
parallel or (at best) partially implicit, in which a programmer gives
the compiler directives for parallelization. A few fully implicit
parallel programming languages exist â€” SISAL, Parallel Haskell, and
(for FPGAs) Mitrion-C.

It says â€œA few fully implicit parallel programming languages existâ€.
If you are a comp lang nutcase, don't get carried away by what those
words might seem to mean.

(Note: Wikipedia has a dedicated article on the subject: Automatic
parallelization)

Â Xah
âˆ‘http://xahlee.org/

â˜„

Roedy Green · Jun 6, 2009

Even for problems where it appears trivial, there can be hidden
issues, like false cache coherency communication where no actual
sharing is taking place. Or locks that appear to have low contention and
negligible performance impact on ``only'' 8 processors suddenly turn into
bottlenecks. Then there is NUMA. A given address in memory may be
RAM attached to the processor accessing it, or to another processor,
with very different access costs.

Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."
--
Roedy Green Canadian Mind Products
http://mindprod.com

Never discourage anyone... who continually makes progress, no matter how slow.
~ Plato 428 BC died: 348 BC at age: 80

Raymond Wiker · Jun 6, 2009

Roedy Green said:
Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."

Absolutely... not a new observation, either, as it follows
directly from Amdahl's law.

George Neuner · Jun 6, 2009

Could what you are saying be summed up by saying, "The more threads
you have the more important it is to keep your threads independent,
sharing as little data as possible."

And therein lies the problem of leveraging many cores. There is a lot
of potential parallelism in programs (even in Java

that is lost
because it is too fine a grain for threads. Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.

Add to that the fact that programmers have shown themselves, on
average, to be remarkably bad at figuring out what _should_ be done in
parallel - as opposed to what _can_ be done - and you've got a clear
indicator that threads, as we know them, are not scalable except under
a limited set of conditions.

George

Paul Rubin · Jun 7, 2009

George Neuner said:
Even the lightest weight
user space ("green") threads need a few hundred instructions, minimum,
to amortize the cost of context switching.

I thought the definition of green threads was that multiplexing them
doesn't require context switches.

Jeff M. · Jun 7, 2009

I thought the definition of green threads was that multiplexing them
doesn't require context switches.

There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.

Jeff M.

Paul Rubin · Jun 7, 2009

Jeff M. said:
There's always a context switch. It's just whether or not you are
switching in/out a virtual stack and registers for the context or the
hardware stack/registers.

I don't see the hundreds of instructions in that case.

http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&lang=ghc&id=3

shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring. Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another. That simply doesn't leave time
for hundreds of instructions per switch.

Jeff M. · Jun 7, 2009

I don't see the hundreds of instructions in that case.

http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&....

shows GHC doing 50 million lightweight thread switches in 8.47
seconds, passing a token around a thread ring. Almost all of that is
probably spent acquiring and releasing the token's lock as the token
is passed from one thread to another. That simply doesn't leave time
for hundreds of instructions per switch.

Who said there has to be? Sample code below (just to get the point
across):

struct context {
vir_reg pc, sp, bp, ... ;
object* stack;

// ...

context* next;
};

struct vm {
context* active_context;
};

void switch_context(vm* v)
{
// maybe GC v->active_context before switching

v->active_context = v->active_context->next;
}

Also, there isn't "hundreds of instructions" with multiplexing,
either. It's all done in hardware. Take a look at the disassembly for
any application: one that uses native threads on a platform that
supports preemption. You won't see any instructions anywhere in the
program that perform a context switch. If you did that would be
absolutely horrible. Imagine if the compiler did something like this:

while(1)
{
// whatever
}

do_context_switch_here();

That would suck. ;-)

That's not to imply that there isn't a cost; there's always a cost.
The example above just goes to show that for green threads, the cost
[of the switch] can be reduced down to a single pointer assignment.

Jeff M.

Lew · Jun 7, 2009

Scott said:
the nub of the problem is not on the benchmarks. There is something
to be said for the good old daays when you looked up the instruction
timings that you used in a little document for your machine, and could
know the cost of any loop. We are faster now, but part of the cost of
that speed is that timing is a black art.

Those good old days never existed. Those manuals never accounted for things
that affected timing even then, like memory latency or refresh time. SRAM
cache made things worse, since the published timings never mentioned
cache-miss delays. Though memory cache might seem a recent innovation, it's
been around a while. It would be challenging to find any published timing
since the commercialization of computers that would actually tell the cost of
any loop.

Things got worse when chips like the '86 family acquired multiple instructions
for doing loops, still worse when pre-fetch pipelines became deeper and wider,
absolutely Dark Art due to multi-level memory caches becoming universal, and
throw-your-hands-up-and-leave-for-the-corner-bar with multiprocessor NUMA
systems. OSes and high-level languages complicate the matter - you never know
how much time slice you'll get or how your source got compiled or optimized by
run-time.

So the good old days are a matter of degree and self-deception - it was easier
to fool ourselves then that we could at least guess timings proportionately if
not absolutely, but things definitely get more unpredictable over evolution.

Arved Sandstrom · Jun 7, 2009

Jon said:
I see no problem with mutable shared state.

In which case, Jon, you're in a small minority.

AHS

Joshua Cranmer · Jun 7, 2009

Jon said:
No. Most programmers still care about performance and performance means
mutable state.

[ Citation needed ].

Most programmers I've met could care less about performance.

Arved Sandstrom · Jun 7, 2009

Jon said:
No. Most programmers still care about performance and performance means
mutable state.

Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.

AHS

Jeff M. · Jun 7, 2009

Quite apart from performance and mutable state, I believe we were
talking about mutable _shared_ state. And this is something that gets a
_lot_ of people into trouble.

Mutable shared state gets _bad_ (err.. perhaps "inexperienced" would
be a better adjective) programmers - who don't know what they are
doing - in trouble. There are many problem domains that either benefit
greatly from mutable shared states or can't [easily] be done without
them. Unified memory management being an obvious example... there are
many more. Unshared state has its place. Immutable state has its
place. Shared immutable state has its place. Shared mutable place has
its place.

Jeff M.

Lew · Jun 7, 2009

Patricia said:
In my opinion, shared mutable state has a lot of problems. It is also
sometimes the best design for performance reasons.

As Dr. Jon pointed out upthread, one can write decent code with mutable shared
state. It is also true that mutable state presents a lot of problems -
potential problems, ones that can be solved, but not ones that can be solved
thoughtlessly. On the flip side, one can write a tremendous amount of
effective multi-threaded code involving shared mutable state with attention to
a few rules of thumb, like always synchronize access and don't use different
monitors to do so.

Unlike some environments (e.g., database management systems), Java's tools to
manage concurrency are explicit and low level. The programmer's job is to
make sure those tools are used correctly to avoid problems. As long as they
do that, then there is no special problem with shared mutable state.

There is, however, a cost. Certain things must happen slower when you share
mutable state, than when you share immutable state or don't share state.
Certain things must happen when you share mutable state, regardless of speed,
because without them your code doesn't work. For some reason, concurrent
programming is an area often not well understood by a significant percentage
of workaday programmers. When problems do arise, they tend to be
probabilistic in nature and vary widely with system characteristics like
attempted load.

So the meeting ground is, yes, concurrent mutable state can present problems
if not properly managed. Properly managing such is not necessarily a huge
burden, but it must be borne. When done properly, shared mutable state will
not present problems in production.

CfP: 5th International Workshop on Multi-Core Computing Systems (MuCoCoS)	0	Jul 30, 2012
Multi-Core Computing Systems (MuCoCoS) at the SupercomputingConference SC12	0	Jul 30, 2012
CfP: 5th International Workshop on Multi-Core Computing Systems (MuCoC	0	Jul 31, 2012
CfP: 5th International Workshop on Multi-Core Computing Systems (MuCoC	0	Jul 31, 2012
CfP: 5th International Workshop on Multi-Core Computing Systems (MuCoC	0	Jul 30, 2012
Using Code from Existing Software	1	Jan 18, 2020
Software Engineering Bootcamps (worth consider?)	1	Nov 7, 2021
Guy Steele on Parallel Programing	1	Feb 5, 2011

multi-core software

Xah Lee

Roedy Green

Kaz Kylheku

MRAB

Seamus MacRae

Kaz Kylheku

Scott Burson

Roedy Green

Raymond Wiker

George Neuner

Paul Rubin

Jeff M.

Paul Rubin

Jeff M.

Lew

Arved Sandstrom

Joshua Cranmer

Arved Sandstrom

Jeff M.

Lew

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads