multi-core software

Discussion in 'Java' started by Xah Lee, Jun 4, 2009.

  1. Xah Lee

    Xah Lee Guest

    Of interest:

    • Why Must Software Be Rewritten For Multi-Core Processors?
    http://xahlee.org/UnixResource_dir/writ/multi-core_software.html

    plain text version follows.

    --------------------------------------------------
    Why Must Software Be Rewritten For Multi-Core Processors?

    Xah Lee, 2009-06-04

    I had a revelation today: namely, that software must be rewritten to
    use multiple processors in order to benefit from them.

    This may sound stupid, but it was a revelation to me. For the past
    decade, the question has been on my mind: why should software need to
    be rewritten to take advantage of multiple processors? Because, in my
    mind, software is at some fundamental level just algorithms, and
    algorithms have nothing to do with hardware implementation aspects
    such as the number of processors. I always felt that talk about the
    need or difficulty of rewriting software for multi-processor (or
    multi-core, these days) machines must be the product of the idiocy of
    industrial imperative coding monkeys. In particular, the way some
    languages such as Java deal with it seems to me extremely stupid,
    e.g. the concept of threads. In my mind, there should be a layer
    between the software and the hardware, such as the operating system
    or the compiler, that automatically handles or compiles the software
    so that it FULLY uses the multiple processors when present. In short,
    I thought that an algorithm really has nothing to do with hardware
    details.

    I never really thought hard about this issue, but today, since I got
    a quad-core PC, I looked into the issue, thought about it, and
    realized the answer. The gist is that an algorithm fundamentally
    means manipulating some hardware; in fact, an algorithm is a
    step-by-step instruction about some form of hardware, even if the
    hardware is abstract or virtual. For example, take the simplest case:
    1+1. It is an algorithm, but where is the hardware? You may say it's
    an abstract concept, or a mathematical model. What you call 1+1
    depends on the context, but in our context, those numbers are the
    hardware. To see this, take a more complex example: listing primes by
    sieve. Again, it comes down to “what is a number?” Here, numbers can
    be stones, or an arrangement of beads on an abacus; it's hardware! As
    another example, take sorting. To begin with, you have to have
    something to sort; that's hardware.

    Another way to see this is that, for a given computing problem, there
    are an infinite number of algorithms to achieve the same thing. Some
    will be better, requiring fewer steps or less storage. All these are
    concrete manipulation issues, and the thing being manipulated we
    ultimately have to call hardware. So, when the hardware changes, say
    from one CPU to multiple CPUs, there's no way for the algorithm to
    magically change and adapt to the changed hardware. If you need an
    algorithm that is catered to the new hardware, you need a new
    algorithm.

    One might think that there might be an algorithm Omega, such that it
    takes as input an old hardware O, a new hardware N, and an algorithm
    A, and outputs an algorithm B, such that B has the same behavior as
    A, but N+B performs better than O+A. This is like asking for Strong
    AI.

    One thing we often forget is that algorithms ultimately translate to
    manipulating hardware. In a modern digital computer, that means
    software algorithms ultimately become machine instructions in the
    CPU, which manipulate the 1s and 0s in registers, or electrical
    voltages in transistors.

    From a more mundane point of view, an automatic system for software
    to work on multiple processors is a problem of breaking a problem
    into discrete units (so that they can be computed in parallel). The
    problem of finding an algorithm is entirely different from the
    problem of breaking a problem into distinct units. The latter is an
    entire new branch of mathematics. For example, take factoring.
    Factoring is a well-defined mathematical problem. There are millions
    of algorithms to do it, and each class has different properties, such
    as the number of steps or storage units required. However, if we look
    at these algorithms from the point of view of distinct units, it's a
    new perspective on the classification of algorithms. Software is in
    general too ill-defined, fuzzy, and complex; the software we use on
    personal computers, such as email, browsers, and games, doesn't even
    have mathematical models. It doesn't even have mathematical models of
    its inputs and outputs. To talk about an automatic system for
    unitizing software would be more like an AI fantasy. Roughly, a term
    that describes this area of research is parallel computing.

    The Wikipedia article talks about types of parallelism: bit-level
    parallelism, instruction-level parallelism, data parallelism, and
    task parallelism. It also discusses classes of hardware: multicore,
    symmetric multiprocessing, distributed computing, cluster, and grid.
    The subjects mentioned there closest to this essay are “data
    parallelism” and “task parallelism”. However, neither is high-level
    enough in the sense discussed here. The issue discussed here would be
    something like “algorithmic parallelism”. Indeed, Wikipedia mentions
    “automatic parallelization”, which is exactly what I'm talking about
    here. Quote:

    Automatic parallelization of a sequential program by a compiler is
    the holy grail of parallel computing. Despite decades of work by
    compiler researchers, automatic parallelization has had only limited
    success.[40]

    Mainstream parallel programming languages remain either explicitly
    parallel or (at best) partially implicit, in which a programmer gives
    the compiler directives for parallelization. A few fully implicit
    parallel programming languages exist — SISAL, Parallel Haskell, and
    (for FPGAs) Mitrion-C.

    It says “A few fully implicit parallel programming languages exist”.
    If you are a comp lang nutcase, don't get carried away by what those
    words might seem to mean.

    (Note: Wikipedia has a dedicated article on the subject: Automatic
    parallelization)

    Xah
    ∑ http://xahlee.org/

    ☄
    Xah Lee, Jun 4, 2009
    #1

  2. Xah Lee

    Roedy Green Guest

    On Thu, 4 Jun 2009 09:46:44 -0700 (PDT), Xah Lee <>
    wrote, quoted or indirectly quoted someone who said :

    >• Why Must Software Be Rewritten For Multi-Core Processors?


    Threads have been part of Java since Day 1. Using threads complicates
    your code, but even with a single core processor, they can improve
    performance, particularly if you are doing something like combing
    multiple websites.

    The nice thing about Java is whether you are on a single core
    processor or a 256 CPU machine (We got to run our Athena Integer Java
    spreadsheet engine on such a beast), does not concern your code.

    You just have to make sure your threads don't interfere with each
    other, and Java/the OS, handle exploiting all the CPUs available.
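    [Editor's note] To make "don't interfere with each other" concrete,
    here is a minimal sketch (my example, not from this post) of the
    classic hazard: two threads incrementing a shared counter without
    synchronization can lose updates, while an AtomicInteger keeps the
    count exact. Loop counts and names are arbitrary choices.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class Interference {
    static int unsafeCount = 0;                        // plain field: read-modify-write races
    static final AtomicInteger safeCount = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    unsafeCount++;                     // not atomic: may lose updates
                    safeCount.incrementAndGet();       // atomic: never loses updates
                }
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // safeCount is always 200000; unsafeCount is often less on multi-core.
        System.out.println(unsafeCount + " vs " + safeCount.get());
    }
}
```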

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Never discourage anyone... who continually makes progress, no matter how slow.
    ~ Plato 428 BC died: 348 BC at age: 80
    Roedy Green, Jun 4, 2009
    #2

  3. Xah Lee

    Kaz Kylheku Guest

    ["Followup-To:" header set to comp.lang.lisp.]
    On 2009-06-04, Roedy Green <> wrote:
    > On Thu, 4 Jun 2009 09:46:44 -0700 (PDT), Xah Lee <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >>• Why Must Software Be Rewritten For Multi-Core Processors?

    >
    > Threads have been part of Java since Day 1.


    Unfortunately, not sane threads designed by people who actually understand
    multithreading.

    > The nice thing about Java is whether you are on a single core
    > processor or a 256 CPU machine (We got to run our Athena Integer Java
    > spreadsheet engine on such a beast), does not concern your code.


    You are dreaming if you think that there are any circumstances (other than
    circumstances in which performance doesn't matter) in which you don't have to
    concern yourself about the difference between a uniprocessor and a 256 CPU
    machine.
    Kaz Kylheku, Jun 4, 2009
    #3
  4. Kaz Kylheku <> writes:
    > ["Followup-To:" header set to comp.lang.lisp.]


    Expanded to include comp.lang.java.programmer, since I'm pretty sure
    that's where Roedy Green reads.

    > On 2009-06-04, Roedy Green <> wrote:
    >
    > > The nice thing about Java is whether you are on a single core
    > > processor or a 256 CPU machine (We got to run our Athena Integer Java
    > > spreadsheet engine on such a beast), does not concern your code.

    >
    > You are dreaming if you think that there are any circumstances (other than
    > circumstances in which performance doesn't matter) in which you don't have to
    > concern yourself about the difference between a uniprocessor and a 256 CPU
    > machine.


    I concur. You would generally do the program design and use of threads
    differently depending on whether you expected to be using a multi-core
    parallel processing system versus a uniprocessor.

    For one thing, spawning a large number of threads to do computation or
    subdividing one big computation into smaller chunks to do in separate
    threads usually incurs some often significant overhead that only makes
    sense if you can make up for the overhead by having parallel execution
    speed things up.

    For example, suppose that I had four matrix multiplications that I
    needed to do in order to further my computation. I could choose
    either to do them sequentially, or to spawn three additional
    "helper" threads so that each multiplication happens in its own
    thread.

    But spawning the additional threads has its own overhead: thread
    creation, the need to synchronize on the completion of the three
    helpers, and arranging a more complicated means of getting
    information into and out of the helper threads. That overhead would
    be pure waste when running on a uniprocessor, since it buys you
    nothing and in fact imposes the additional cost of having the
    processor switch threads to simulate the multi-threading.

    So that sort of program design would only make sense if you knew that
    you would have multiple processors available to handle the helper
    threads so that the administrative overhead ends up being offset by the
    parallel execution of the four big multiplications.
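    [Editor's note] As an illustrative sketch (mine, not the poster's
    code) of the trade-off described above, here are the four
    multiplications submitted to a fixed thread pool; the pool and
    matrix sizes are arbitrary choices. On a uniprocessor the
    submission and the joins are pure overhead; on a quad-core each
    product can run on its own core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FourProducts {
    // Plain triple-loop multiplication of square matrices.
    static double[][] multiply(double[][] a, double[][] b) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    public static void main(String[] args) throws Exception {
        int n = 64;
        final double[][][] as = new double[4][n][n], bs = new double[4][n][n];
        for (int m = 0; m < 4; m++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    as[m][i][j] = i + j;
                    bs[m][i][j] = (i == j) ? 1.0 : 0.0;  // identity matrix
                }

        // Submit the four independent multiplications; thread creation
        // and the Future.get() joins are the overhead described above.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<double[][]>> results = new ArrayList<Future<double[][]>>();
        for (int m = 0; m < 4; m++) {
            final int mm = m;
            results.add(pool.submit(new Callable<double[][]>() {
                public double[][] call() { return multiply(as[mm], bs[mm]); }
            }));
        }
        for (Future<double[][]> f : results)
            f.get();                                     // wait for completion
        pool.shutdown();
    }
}
```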


    --
    Thomas A. Russ, USC/Information Sciences Institute
    Thomas A. Russ, Jun 4, 2009
    #4
  5. Xah Lee

    Jeff M. Guest

    On Jun 4, 3:35 pm, (Thomas A. Russ) wrote:
    > Kaz Kylheku <> writes:
    > > ["Followup-To:" header set to comp.lang.lisp.]

    >
    > Expanded to include comp.lang.java.programmer, since I'm pretty sure
    > that's where Roedy Green reads.
    >
    > > On 2009-06-04, Roedy Green <> wrote:

    >
    > > > The nice thing about Java is whether you are on a single core
    > > > processor or a 256 CPU machine (We got to run our Athena Integer Java
    > > > spreadsheet engine on such a beast), does not concern your code.

    >
    > > You are dreaming if you think that there are any circumstances (other than
    > > circumstances in which performance doesn't matter) in which you don't have to
    > > concern yourself about the difference between a uniprocessor and a 256 CPU
    > > machine.

    >
    > I concur.  You would generally do the program design and use of threads
    > differently depending on whether you expected to be using a multi-core
    > parallel processing system versus a uniprocessor.


    I also feel the need to weigh in and agree. Anyone who has done any
    kind of serious parallel processing (note: I chose that term
    specifically over "multi-threading") knows this. There are many, many
    aspects of parallel processing that go FAR beyond a Thread class, a
    couple different "Lock" mechanisms, and some functions like start,
    stop, pause, resume, and join.

    The most common assumptions made by people who like to talk about
    multi-threading are that all the processors being utilized are the
    same (e.g. dual core+) and that they all share unified memory (a la
    a single desktop PC). While that certainly does happen, much of the
    time the processors aren't even in the same location, let alone
    identical or sharing memory.

    Another misconception is that you dispatch work to speed up a
    particular operation, when usually that isn't the case. Yes,
    occasionally you happen to have a great problem that yields to a
    map/reduce solution, or you have 1,000,000 data sets to process and
    you simply send 1,000 to each of your 1,000 available hardware
    threads. But that (in my experience) is a rare occurrence.
    Typically, you are just dispatching work to something else so that
    you can continue with other things that are just as important. Going
    faster may just be a fringe benefit.

    To assume that a programmer doesn't need to even think about the
    problem, and that the language/OS will just take care of it for you,
    demonstrates gross inexperience with the problem domain. It would be
    akin to saying "Java has support for OpenGL, so making a 3D game
    must be trivial."

    Jeff M.
    Jeff M., Jun 5, 2009
    #5
  6. In article
    <>,
    "Jeff M." <> wrote:

    > On Jun 4, 3:35 pm, (Thomas A. Russ) wrote:
    > > Kaz Kylheku <> writes:
    > > > ["Followup-To:" header set to comp.lang.lisp.]

    > >
    > > Expanded to include comp.lang.java.programmer, since I'm pretty
    > > sure that's where Roedy Green reads.
    > >
    > > > On 2009-06-04, Roedy Green <> wrote:

    > >
    > > > > The nice thing about Java is whether you are on a single core
    > > > > processor or a 256 CPU machine (We got to run our Athena
    > > > > Integer Java spreadsheet engine on such a beast), does not
    > > > > concern your code.

    > >
    > > > You are dreaming if you think that there are any circumstances
    > > > (other than circumstances in which performance doesn't matter) in
    > > > which you don't have to concern yourself about the difference
    > > > between a uniprocessor and a 256 CPU machine.

    > >
    > > I concur.  You would generally do the program design and use of
    > > threads differently depending on whether you expected to be using a
    > > multi-core parallel processing system versus a uniprocessor.

    >
    > I also feel the need to weigh in and agree. Anyone who has done any
    > kind of serious parallel processing (note: I chose that term
    > specifically over "multi-threading") knows this. There are many, many
    > aspects of parallel processing that go FAR beyond a Thread class, a
    > couple different "Lock" mechanisms, and some functions like start,
    > stop, pause, resume, and join.


    Indeed, the 3rd edition of the Java Language Specification [1]
    revised the threading model, having previously deprecated several of
    the methods mentioned [2]. Around the same time, several
    task-oriented utilities were added to the standard library [3].

    The results, while not ground-breaking, can dramatically improve the
    responsiveness of GUI applications, for example, where rendering is
    often confined to a single thread of execution.

    [1]<http://java.sun.com/docs/books/jls/third_edition/html/j3TOC.html>
    [2]<http://java.sun.com/javase/6/docs/api/java/lang/Thread.html>
    [3]<http://java.sun.com/javase/6/docs/api/java/util/concurrent/package-summary.html>

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
    John B. Matthews, Jun 5, 2009
    #6
  7. Xah Lee

    Roedy Green Guest

    On 04 Jun 2009 13:35:03 -0700, (Thomas A. Russ)
    wrote, quoted or indirectly quoted someone who said :

    >For one thing, spawning a large number of threads to do computation or
    >subdividing one big computation into smaller chunks to do in separate
    >threads usually incurs some often significant overhead that only makes
    >sense if you can make up for the overhead by having parallel execution
    >speed things up.


    Obviously you have to tweak the number of threads to account for how
    many CPUs you have and how much RAM you have (and how slow/wide your
    Internet connection is). I have proposed that it might be possible to
    make apps self-tweaking. See http://mindprod.com/jgloss/tweakable.html
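    [Editor's note] One concrete self-tweak available in Java (a sketch
    of the general idea, not the mindprod proposal itself): size the
    thread pool from the processor count the JVM reports rather than
    hard-coding it. The 4x multiplier for I/O-bound work is an
    illustrative rule of thumb, not a law.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SelfSizedPool {
    public static void main(String[] args) {
        // Ask the JVM how many CPUs it can schedule threads on.
        int cores = Runtime.getRuntime().availableProcessors();

        // Rough sizing: about one thread per core for CPU-bound work;
        // more for I/O-bound work, where threads mostly sit blocked.
        ExecutorService cpuPool = Executors.newFixedThreadPool(cores);
        ExecutorService ioPool  = Executors.newFixedThreadPool(cores * 4);

        System.out.println("sized pools for " + cores + " core(s)");
        cpuPool.shutdown();
        ioPool.shutdown();
    }
}
```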


    Is there anything else you have to change in your code to deal with
    the number of cores available?
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Never discourage anyone... who continually makes progress, no matter how slow.
    ~ Plato 428 BC died: 348 BC at age: 80
    Roedy Green, Jun 5, 2009
    #7
  8. Xah Lee

    Paul Wallich Guest

    Roedy Green wrote:
    > On 04 Jun 2009 13:35:03 -0700, (Thomas A. Russ)
    > wrote, quoted or indirectly quoted someone who said :
    >
    >> For one thing, spawning a large number of threads to do computation or
    >> subdividing one big computation into smaller chunks to do in separate
    >> threads usually incurs some often significant overhead that only makes
    >> sense if you can make up for the overhead by having parallel execution
    >> speed things up.

    >
    > Obviously you have to tweak the number of threads to account for how
    > many CPUs you have and how much RAM you have (and how slow/wide your
    > Internet connection is). I have proposed that it might be possible to
    > make apps self-tweaking. See http://mindprod.com/jgloss/tweakable.html
    >
    > Is there anything else you have to change in your code to deal with
    > the number of cores available?


    Yes. Unless your project is embarrassingly, embarrassingly parallel,
    the optimal communications pattern may change according to the number
    of cores. Even the entire partitioning strategy might change, and not
    only with the number of cores but with the interaction between the
    number of cores and the well- or ill-behaved nature of the particular
    data set.
    Paul Wallich, Jun 5, 2009
    #8
  9. Kaz Kylheku wrote:
    [snip]

    I see you're no longer confining yourself to the thread from hell,
    and are now randomly crashing other threads in
    comp.lang.java.programmer to badmouth Java and crosspost into cll.

    > Unfortunately, not sane threads designed by people who actually understand
    > multithreading.


    Then let me enlighten you! We have had some nice concurrency support
    over here in Java-land since Java 5, and especially Java 6, what with
    java.util.concurrent, nonblocking streams in NIO, and
    SwingWorker/SwingUtilities.
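    [Editor's note] For readers who haven't seen the java.util.concurrent
    style mentioned above, here is a small sketch (my own, with arbitrary
    numbers): the work is split into Callable tasks, and the
    ExecutorService handles the thread management.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentSum {
    // Sum 0..limit-1 by splitting the range into `parts` Callable tasks.
    static long parallelSum(final int limit, int parts) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
        int chunk = limit / parts;
        for (int t = 0; t < parts; t++) {
            final int lo = t * chunk;
            final int hi = (t == parts - 1) ? limit : lo + chunk;
            tasks.add(new Callable<Long>() {
                public Long call() {
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += i;
                    return s;
                }
            });
        }
        long total = 0;
        // invokeAll runs every task and blocks until all are done.
        for (Future<Long> f : pool.invokeAll(tasks)) total += f.get();
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parallelSum(1000, 4)); // prints 499500
    }
}
```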
    Seamus MacRae, Jun 5, 2009
    #9
  10. Xah Lee

    Mark Space Guest

    Seamus MacRae wrote:

    > Then let me enlighten you! We have had some nice concurrency support
    > over here in Java-land since Java 5, and especially Java 6, what with
    > java.util.concurrent, nonblocking streams in NIO, and
    > SwingWorker/SwingUtilities.



    Looks like more good stuff on the way for Java 7 too. None of this is
    set in stone afaik, but it looks like more concurrency utilities and
    NIO2 are in.

    http://tech.puredanger.com/java7/#jsr166

    http://tech.puredanger.com/java7/#jsr203
    Mark Space, Jun 5, 2009
    #10
  11. Xah Lee

    Kaz Kylheku Guest

    On 2009-06-05, Vend <> wrote:
    > On Jun 4, 8:35 pm, Roedy Green <>
    > wrote:
    >> On Thu, 4 Jun 2009 09:46:44 -0700 (PDT), Xah Lee <>
    >> wrote, quoted or indirectly quoted someone who said :
    >>
    >> >• Why Must Software Be Rewritten For Multi-Core Processors?

    >>
    >> Threads have been part of Java since Day 1.  Using threads complicates
    >> your code, but even with a single core processor, they can improve
    >> performance, particularly if you are doing something like combing
    >> multiple websites.
    >>
    >> The nice thing about Java is whether you are on a single core
    >> processor or a 256 CPU machine (We got to run our Athena Integer Java
    >> spreadsheet engine on such a beast), does not concern your code.
    >>
    >> You just have to make sure your threads don't interfere with each
    >> other, and Java/the OS, handle exploiting all the CPUs available.

    >
    > You need to decompose your problem in 256 independent tasks.
    >
    > It can be trivial for some problems and difficult or perhaps
    > impossible for some others.


    Even for problems where it appears trivial, there can be hidden
    issues, like false cache coherency communication where no actual
    sharing is taking place. Or locks that appear to have low contention and
    negligible performance impact on ``only'' 8 processors suddenly turn into
    bottlenecks. Then there is NUMA. A given address in memory may be
    RAM attached to the processor accessing it, or to another processor,
    with very different access costs.
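    [Editor's note] The false-sharing point can be sketched in Java (my
    illustration, not the poster's). Two threads each hammer their own
    counter, with no logical sharing; when the counters sit on the same
    cache line, the cores still fight over the line. The 64-byte line
    size is a common value, not a guarantee, and actual timings vary by
    hardware, so none are promised here.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    // Each thread increments only its own slot; there is no logical
    // sharing, but adjacent slots usually share a cache line, while
    // slots 8 longs (64 bytes) apart usually do not.
    static Thread incrementer(final AtomicLongArray a, final int idx,
                              final long reps) {
        return new Thread(new Runnable() {
            public void run() {
                for (long i = 0; i < reps; i++) a.incrementAndGet(idx);
            }
        });
    }

    static long timeMillis(Thread t1, Thread t2) throws InterruptedException {
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1000000;
    }

    public static void main(String[] args) throws InterruptedException {
        final long reps = 5000000L;
        AtomicLongArray slots = new AtomicLongArray(16);
        System.out.println("same line:      " + timeMillis(
            incrementer(slots, 0, reps), incrementer(slots, 1, reps)) + " ms");
        System.out.println("separate lines: " + timeMillis(
            incrementer(slots, 0, reps), incrementer(slots, 8, reps)) + " ms");
    }
}
```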
    Kaz Kylheku, Jun 5, 2009
    #11
  12. Xah Lee

    Scott Burson Guest

    On Jun 4, 9:46 am, Xah Lee <> wrote:
    > [snip: entire article quoted in full]
    Scott Burson, Jun 5, 2009
    #12
  13. Xah Lee

    Roedy Green Guest

    On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
    <> wrote, quoted or indirectly quoted someone who
    said :

    >Even for problems where it appears trivial, there can be hidden
    >issues, like false cache coherency communication where no actual
    >sharing is taking place. Or locks that appear to have low contention and
    >negligible performance impact on ``only'' 8 processors suddenly turn into
    >bottlenecks. Then there is NUMA. A given address in memory may be
    >RAM attached to the processor accessing it, or to another processor,
    >with very different access costs.


    Could what you are saying be summed up by saying, "The more threads
    you have the more important it is to keep your threads independent,
    sharing as little data as possible"?
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com

    Never discourage anyone... who continually makes progress, no matter how slow.
    ~ Plato 428 BC died: 348 BC at age: 80
    Roedy Green, Jun 6, 2009
    #13
  14. Roedy Green <> writes:

    > On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >
    >>Even for problems where it appears trivial, there can be hidden
    >>issues, like false cache coherency communication where no actual
    >>sharing is taking place. Or locks that appear to have low contention and
    >>negligible performance impact on ``only'' 8 processors suddenly turn into
    >>bottlenecks. Then there is NUMA. A given address in memory may be
    >>RAM attached to the processor accessing it, or to another processor,
    >>with very different access costs.

    >
    > Could what you are saying be summed up by saying, "The more threads
    > you have the more important it is to keep your threads independent,
    > sharing as little data as possible"?


    Absolutely... not a new observation, either, as it follows
    directly from Amdahl's law.
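    [Editor's note] Amdahl's law, for reference: with parallelizable
    fraction p of the work and n processors, the speedup is
    1 / ((1 - p) + p/n). A tiny sketch (the numbers are illustrative):

```java
public class Amdahl {
    // Speedup of a workload whose fraction p can run in parallel
    // across n processors, with the remaining 1-p strictly serial.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even 95% parallel work caps out below 19x on 256 CPUs,
        // which is why the serial part (shared data) dominates.
        System.out.println(speedup(0.95, 256)); // ~18.6
    }
}
```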
    Raymond Wiker, Jun 6, 2009
    #14
  15. On Fri, 05 Jun 2009 16:26:37 -0700, Roedy Green
    <> wrote:

    >On Fri, 5 Jun 2009 18:15:00 +0000 (UTC), Kaz Kylheku
    ><> wrote, quoted or indirectly quoted someone who
    >said :
    >
    >>Even for problems where it appears trivial, there can be hidden
    >>issues, like false cache coherency communication where no actual
    >>sharing is taking place. Or locks that appear to have low contention and
    >>negligible performance impact on ``only'' 8 processors suddenly turn into
    >>bottlenecks. Then there is NUMA. A given address in memory may be
    >>RAM attached to the processor accessing it, or to another processor,
    >>with very different access costs.

    >
    >Could what you are saying be summed up by saying, "The more threads
    >you have the more important it is to keep your threads independent,
    >sharing as little data as possible"?


    And therein lies the problem of leveraging many cores. There is a lot
    of potential parallelism in programs (even in Java :) that is lost
    because it is too fine a grain for threads. Even the lightest weight
    user space ("green") threads need a few hundred instructions, minimum,
    to amortize the cost of context switching.

    Add to that the fact that programmers have shown themselves, on
    average, to be remarkably bad at figuring out what _should_ be done in
    parallel - as opposed to what _can_ be done - and you've got a clear
    indicator that threads, as we know them, are not scalable except under
    a limited set of conditions.

    George
    George Neuner, Jun 6, 2009
    #15
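George's grain-size point can be put in rough numbers. The per-switch cost below is a made-up figure purely for illustration, in the spirit of his "few hundred instructions" estimate:

```python
def useful_fraction(work_instrs, switch_instrs):
    """Fraction of instructions doing real work when every grain of
    work_instrs instructions pays a fixed switch_instrs scheduling cost."""
    return work_instrs / (work_instrs + switch_instrs)

# Assuming ~200 instructions per green-thread switch (illustrative):
print(useful_fraction(50, 200))    # fine-grained: mostly overhead
print(useful_fraction(5000, 200))  # coarse-grained: overhead amortized
```

A 50-instruction grain spends 80% of its time on scheduling; at 5000 instructions the overhead is under 4%. That gap is the parallelism that gets left on the table.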
  16. Xah Lee

    Paul Rubin Guest

    George Neuner <> writes:
    > Even the lightest weight
    > user space ("green") threads need a few hundred instructions, minimum,
    > to amortize the cost of context switching.


    I thought the definition of green threads was that multiplexing them
    doesn't require context switches.
    Paul Rubin, Jun 7, 2009
    #16
  17. Xah Lee

    Jeff M. Guest

    On Jun 6, 9:58 pm, Paul Rubin <http://> wrote:
    > George Neuner <> writes:
    > > Even the lightest weight
    > > user space ("green") threads need a few hundred instructions, minimum,
    > > to amortize the cost of context switching.

    >
    > I thought the definition of green threads was that multiplexing them
    > doesn't require context switches.


    There's always a context switch. It's just whether or not you are
    switching in/out a virtual stack and registers for the context or the
    hardware stack/registers.

    Jeff M.
    Jeff M., Jun 7, 2009
    #17
  18. Xah Lee

    Paul Rubin Guest

    "Jeff M." <> writes:
    > > > Even the lightest weight
    > > > user space ("green") threads need a few hundred instructions, minimum,
    > > > to amortize the cost of context switching....

    > There's always a context switch. It's just whether or not you are
    > switching in/out a virtual stack and registers for the context or the
    > hardware stack/registers.


    I don't see the hundreds of instructions in that case.

    http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&lang=ghc&id=3

    shows GHC doing 50 million lightweight thread switches in 8.47
    seconds, passing a token around a thread ring. Almost all of that is
    probably spent acquiring and releasing the token's lock as the token
    is passed from one thread to another. That simply doesn't leave time
    for hundreds of instructions per switch.
    Paul Rubin, Jun 7, 2009
    #18
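For readers unfamiliar with the benchmark: the thread ring passes a token around N threads, decrementing it on each hop, and reports which thread holds it when it reaches zero. A minimal sketch using Python's native threads and queues (far heavier than the GHC green threads Paul measured, but the same structure):

```python
import threading
import queue

def thread_ring(n_threads, n_passes):
    """Pass a counter around a ring of threads; return the index of the
    thread that receives it when it hits zero."""
    queues = [queue.Queue() for _ in range(n_threads)]
    result = []

    def worker(i):
        while True:
            token = queues[i].get()
            if token == 0:
                result.append(i)
                for q in queues:      # wake everyone so they can exit
                    q.put(None)
                return
            if token is None:
                return
            queues[(i + 1) % n_threads].put(token - 1)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    queues[0].put(n_passes)           # inject the token at thread 0
    for t in threads:
        t.join()
    return result[0]
```

The shootout version uses 503 threads and 50,000,000 passes; this sketch is only meant to show the shape of the workload, where almost all time goes into the hand-off itself.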
  19. Xah Lee

    Jeff M. Guest

    On Jun 7, 1:56 am, Paul Rubin <http://> wrote:
    > "Jeff M." <> writes:
    > > > > Even the lightest weight
    > > > > user space ("green") threads need a few hundred instructions, minimum,
    > > > > to amortize the cost of context switching....

    > > There's always a context switch. It's just whether or not you are
    > > switching in/out a virtual stack and registers for the context or the
    > > hardware stack/registers.

    >
    > I don't see the hundreds of instructions in that case.  
    >
    > http://shootout.alioth.debian.org/u32q/benchmark.php?test=threadring&....
    >
    > shows GHC doing 50 million lightweight thread switches in 8.47
    > seconds, passing a token around a thread ring.  Almost all of that is
    > probably spent acquiring and releasing the token's lock as the token
    > is passed from one thread to another.  That simply doesn't leave time
    > for hundreds of instructions per switch.


    Who said there has to be? Sample code below (just to get the point
    across):

    struct context {
        vir_reg pc, sp, bp, ... ;
        object* stack;

        // ...

        context* next;
    };

    struct vm {
        context* active_context;
    };

    void switch_context(vm* v)
    {
        // maybe GC v->active_context before switching
        v->active_context = v->active_context->next;
    }

    Also, there aren't "hundreds of instructions" with multiplexing,
    either; it's all done in hardware. Take a look at the disassembly of
    any application that uses native threads on a platform that supports
    preemption: you won't see any instructions anywhere in the program
    that perform a context switch. If you did, that would be absolutely
    horrible. Imagine if the compiler did something like this:

    while(1)
    {
        // whatever
    }

    do_context_switch_here();

    That would suck. ;-)

    That's not to imply that there isn't a cost; there's always a cost.
    The example above just goes to show that for green threads, the cost
    [of the switch] can be reduced down to a single pointer assignment.

    Jeff M.
    Jeff M., Jun 7, 2009
    #19
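Jeff's pointer-assignment point can be made runnable with cooperative green threads. Here is a sketch in Python using generators instead of his C-like pseudocode (the names are mine, for illustration): the "context" is just an object, and the "switch" is picking the next one off a queue.

```python
from collections import deque

def green_run(tasks):
    """Round-robin cooperative scheduler: a 'context switch' is just
    taking the next generator off the ready queue -- one pointer move,
    no hardware registers saved or restored."""
    ready = deque(tasks)              # each task yields to switch out
    trace = []
    while ready:
        task = ready.popleft()        # the "switch": select next context
        try:
            trace.append(next(task))  # run the task until it yields
            ready.append(task)        # still runnable: requeue it
        except StopIteration:
            pass                      # task finished; drop it
    return trace

def task(name, steps):
    for i in range(steps):
        yield f"{name}{i}"            # yield == voluntary context switch
```

Running `green_run([task("a", 2), task("b", 2)])` interleaves the two tasks: a0, b0, a1, b1. The switch itself is a `deque` pop, which is Jeff's single-assignment cost; the hundreds of instructions George cites show up only once the scheduler does real bookkeeping (locks, GC, timers).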
  20. Xah Lee

    Lew Guest

    Scott David Daniels wrote:
    > the nub of the problem is not in the benchmarks. There is something
    > to be said for the good old days when you looked up the instruction
    > timings that you used in a little document for your machine, and could
    > know the cost of any loop. We are faster now, but part of the cost of
    > that speed is that timing is a black art.


    Those good old days never existed. Those manuals never accounted for things
    that affected timing even then, like memory latency or refresh time. SRAM
    cache made things worse, since the published timings never mentioned
    cache-miss delays. Though memory cache might seem a recent innovation, it's
    been around a while. It would be challenging to find any published timing
    since the commercialization of computers that would actually tell the cost of
    any loop.

    Things got worse when chips like the '86 family acquired multiple instructions
    for doing loops, still worse when pre-fetch pipelines became deeper and wider,
    absolutely Dark Art due to multi-level memory caches becoming universal, and
    throw-your-hands-up-and-leave-for-the-corner-bar with multiprocessor NUMA
    systems. OSes and high-level languages complicate the matter - you never know
    how much time slice you'll get or how your source got compiled or optimized by
    run-time.

    So the good old days are a matter of degree and self-deception - it was easier
    to fool ourselves then that we could at least guess timings proportionately, if
    not absolutely, but timing has definitely grown more unpredictable as the
    hardware evolved.

    --
    Lew
    Lew, Jun 7, 2009
    #20