Crash Course In Modern Hardware

Discussion in 'Java' started by John B. Matthews, Jan 17, 2010.

  1. Students of Java may enjoy this "Crash Course In Modern Hardware."

    "In this presentation from the JVM Languages Summit 2009, Cliff Click
    discusses the Von Neumann architecture, CISC vs RISC, the rise of
    multicore, Instruction-Level Parallelism (ILP), pipelining, out-of-order
    dispatch, static vs dynamic ILP, performance impact of cache misses,
    memory performance, memory vs CPU caching, examples of memory/CPU cache
    interaction, and tips for improving performance."

    <http://www.infoq.com/presentations/click-crash-course-modern-hardware>

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Jan 17, 2010
    #1

  2. Roedy Green

    On Sun, 17 Jan 2010 00:05:20 -0500, "John B. Matthews"
    <> wrote, quoted or indirectly quoted someone who
    said :

    ><http://www.infoq.com/presentations/click-crash-course-modern-hardware>


    Sun uses a great new presentation technology for this.

    You see the guy talking in a little window, but the slides render as
    HTML (or something similar) on the bulk of your screen.

    I have not figured out how to just scan the slides.

    Usually on a video presentation, you can't read the slides. Here you
    can see them in perfect clarity.

    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent other people from making use of one’s contributions to computer science.
    ~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
     
    Roedy Green, Jan 17, 2010
    #2

  3. On 17-01-2010 00:05, John B. Matthews wrote:
    > Students of Java may enjoy this "Crash Course In Modern Hardware."
    >
    > "In this presentation from the JVM Languages Summit 2009, Cliff Click
    > discusses the Von Neumann architecture, CISC vs RISC, the rise of
    > multicore, Instruction-Level Parallelism (ILP), pipelining, out-of-order
    > dispatch, static vs dynamic ILP, performance impact of cache misses,
    > memory performance, memory vs CPU caching, examples of memory/CPU cache
    > interaction, and tips for improving performance."
    >
    > <http://www.infoq.com/presentations/click-crash-course-modern-hardware>


    Very interesting.

    Arne
     
    Arne Vajhøj, Jan 17, 2010
    #3
  4. Arne Vajhøj

    On 17-01-2010 05:53, Roedy Green wrote:
    > On Sun, 17 Jan 2010 00:05:20 -0500, "John B. Matthews"
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >
    >> <http://www.infoq.com/presentations/click-crash-course-modern-hardware>

    >
    > Sun uses a great new presentation technology for this.
    >
    > You see the guy talking in a little window, but the slides render as
    > HTML (or something similar) on the bulk of your screen.
    >
    > I have not figured out how to just scan the slides.
    >
    > Usually on a video presentation, you can't read the slides. Here you
    > can see them in perfect clarity.


    If you want the slides then you can find them at:

    http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase

    Arne
     
    Arne Vajhøj, Jan 17, 2010
    #4
  5. In article <4b539077$0$275$>,
    Arne Vajhøj <> wrote:

    > If you want the slides then you can find them at:


    http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase

    Thanks! My favorite talking points (pp 68, 69):

    * Dominant operations
      1985: page faults
        Locality is critical
      1995: instructions executed
        Multiplies are expensive, loads are cheap
        Locality not so important
      2005: cache misses
        Multiplies are cheap, loads are expensive!
        Locality is critical again!

    * We need to update our mental performance models as the hardware evolves

    * Unless you profile (deeply) you just don't know
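
    For illustration, a small hypothetical Java sketch (not from the talk) of
    the 2005-era point: both loops below do the same arithmetic, but the
    column-order walk touches memory with poor locality and so spends most of
    its time waiting on cache misses.

        // Hypothetical demo: sum a large 2D array in row-major vs
        // column-major order. Same instruction count, very different
        // cache behaviour on typical hardware.
        public class LocalityDemo {
            static final int N = 4096;
            static final int[][] DATA = new int[N][N];

            static long sumRowMajor() {
                long sum = 0;
                for (int i = 0; i < N; i++)       // walks each row array sequentially,
                    for (int j = 0; j < N; j++)   // so every fetched cache line is fully used
                        sum += DATA[i][j];
                return sum;
            }

            static long sumColumnMajor() {
                long sum = 0;
                for (int j = 0; j < N; j++)       // jumps to a different row array on
                    for (int i = 0; i < N; i++)   // every access: a likely miss per load
                        sum += DATA[i][j];
                return sum;
            }

            public static void main(String[] args) {
                long t0 = System.nanoTime();
                long a = sumRowMajor();
                long t1 = System.nanoTime();
                long b = sumColumnMajor();
                long t2 = System.nanoTime();
                System.out.printf("row-major %d ms, column-major %d ms (sums %d, %d)%n",
                        (t1 - t0) / 1000000, (t2 - t1) / 1000000, a, b);
            }
        }

    On typical hardware the column-major pass is usually several times
    slower, even though it executes the same instructions.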

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Jan 17, 2010
    #5
  6. On 17-01-2010 18:20, John B. Matthews wrote:
    > In article<4b539077$0$275$>,
    > Arne Vajhøj<> wrote:
    >
    >> If you want the slides then you can find them at:

    >
    > http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase
    >
    > Thanks! My favorite talking points (pp 68, 69):
    >
    > * Dominant operations
    > 1985: page faults
    > Locality is critical
    > 1995: instructions executed
    > Multiplies are expensive, loads are cheap
    > Locality not so important
    > 2005: cache misses
    > Multiplies are cheap, loads are expensive!
    > Locality is critical again!
    >
    > * We need to update our mental performance models as the hardware evolves
    >
    > * Unless you profile (deeply) you just don't know


    I think he starts by saying that the slides have been slightly modified
    compared to the Java One version.

    But they look very similar to me.

    Arne
     
    Arne Vajhøj, Jan 17, 2010
    #6
  7. Roedy Green

    On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    <> wrote, quoted or indirectly quoted someone who
    said :

    >* We need to update our mental performance models as the hardware evolves


    I did not realise how important locality had become. A cache miss
    going to RAM costs 200 to 300 clock cycles! This penalty dominates
    everything else.

    This suggests that interpretive code with a tight core might run
    faster than "highly optimised" machine code since you could arrange
    that the core of it was entirely in cache.

    It also suggests FORTH-style coding with tiny methods and extreme
    reusability would give you a speed boost because more of your code could
    fit in cache. We are no longer trying to reduce the number of
    instructions executed. We are trying to fit the entire program into
    cache. Techniques like loop unraveling could be counterproductive
    since they increase the size of the code.
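
    For what it is worth, a minimal hypothetical sketch of that trade-off in
    Java: manual unrolling removes some loop overhead per element, but roughly
    quadruples the amount of body code the I-cache has to hold.

        // Straightforward loop: one compare, one branch and one index update
        // per element.
        static int sum(int[] a) {
            int sum = 0;
            for (int i = 0; i < a.length; i++)
                sum += a[i];
            return sum;
        }

        // Manually unrolled by 4: less loop overhead per element, but about
        // four times as much body code, plus a clean-up loop for the remainder.
        static int sumUnrolled(int[] a) {
            int sum = 0;
            int i = 0;
            for (; i + 3 < a.length; i += 4)
                sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
            for (; i < a.length; i++)   // leftover elements
                sum += a[i];
            return sum;
        }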

    Hyperthreading is a defence. If you have many hardware threads
    running in the same CPU, when one thread blocks to fetch from RAM, the
    other threads can keep going and keep multiple adders, instruction
    decoders etc chugging.


    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent other people from making use of one’s contributions to computer science.
    ~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
     
    Roedy Green, Jan 18, 2010
    #7
  8. Arne Vajhøj

    On 17-01-2010 22:10, Roedy Green wrote:
    > On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    > <> wrote, quoted or indirectly quoted someone who
    > said :
    >> * We need to update our mental performance models as the hardware evolves

    >
    > I did not realise how important locality had become. A cache miss
    > going to RAM costs 200 to 300 clock cycles! This penalty dominates
    > everything else.


    That is what he says.

    Note though that for some problems the cache misses are determined by the data size.

    > This suggests that interpretive code with a tight core might run
    > faster than "highly optimised" machine code since you could arrange
    > that the core of it was entirely in cache.


    Why?

    The data fetched would still be the same.

    And the CPU intensive loop like inner loops seems more
    likely to fit into I cache than the relevant part of the
    interpreter.

    > It also suggests FORTH-style coding with tiny methods and extreme
    > reusability would give you a speed boost because more of your code could
    > fit in cache. We are no longer trying to reduce the number of
    > instructions executed. We are trying to fit the entire program into
    > cache. Techniques like loop unraveling could be counterproductive
    > since they increase the size of the code.


    loop unraveling == loop unrolling?

    With L1 caches in the 128-256KB range, it would take a lot
    of unrolled loops to fill up the I cache.

    Arne
     
    Arne Vajhøj, Jan 18, 2010
    #8
  9. Tom Anderson

    On Sun, 17 Jan 2010, Arne Vajhøj wrote:

    > On 17-01-2010 22:10, Roedy Green wrote:
    >> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    >> <> wrote, quoted or indirectly quoted someone who
    >> said :
    >>> * We need to update our mental performance models as the hardware evolves

    >>
    >> I did not realise how important locality had become. A cache miss
    >> going to RAM costs 200 to 300 clock cycles! This penalty dominates
    >> everything else. This suggests that interpretive code with a tight core
    >> might run faster than "highly optimised" machine code since you could
    >> arrange that the core of it was entirely in cache.

    >
    > Why?
    >
    > The data fetched would still be the same.


    Not if the bytecode was more compact than the native code.

    > And the CPU intensive loop like inner loops seems more likely to fit
    > into I cache than the relevant part of the interpreter.


    If you have a single inner loop, then yes, the machine code will fit in
    the cache, and there's no performance advantage to bytecode. But if you
    have a large code footprint - something like an app server, say - then
    it's quite possible that more of the code will fit in the cache with
    bytecode than with native code.

    tom

    --
    This is the best kind of weird. It can make a corpse laugh back to
    death. -- feedmepaper
     
    Tom Anderson, Jan 18, 2010
    #9
  10. On 18.1.2010 15:39, Tom Anderson wrote:
    > On Sun, 17 Jan 2010, Arne Vajhøj wrote:
    >
    >> On 17-01-2010 22:10, Roedy Green wrote:
    >>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    >>> <> wrote, quoted or indirectly quoted someone who
    >>> said :
    >>>> * We need to update our mental performance models as the hardware
    >>>> evolves
    >>>
    >>> I did not realise how important locality had become. A cache miss
    >>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
    >>> everything else. This suggests that interpretive code with a tight
    >>> core might run faster than "highly optimised" machine code since you
    >>> could arrange that the core of it was entirely in cache.

    >>
    >> Why?
    >>
    >> The data fetched would still be the same.

    >
    > Not if the bytecode was more compact than the native code.
    >
    >> And the CPU intensive loop like inner loops seems more likely to fit
    >> into I cache than the relevant part of the interpreter.

    >
    > If you have a single inner loop, then yes, the machine code will fit in
    > the cache, and there's no performance advantage to bytecode. But if you
    > have a large code footprint - something like an app server, say - then
    > it's quite possible that more of the code will fit in the cache with
    > bytecode than with native code.
    >


    I thought the bytecode is nowadays always converted to native code by
    the JIT. Am I wrong?

    --
    You will have a long and unpleasant discussion with your supervisor.
     
    Donkey Hottie, Jan 18, 2010
    #10
  11. In article <>,
    Donkey Hottie <> wrote:

    > On 18.1.2010 15:39, Tom Anderson wrote:
    > > On Sun, 17 Jan 2010, Arne Vajhøj wrote:
    > >
    > >> On 17-01-2010 22:10, Roedy Green wrote:
    > >>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    > >>> <> wrote, quoted or indirectly quoted
    > >>> someone who said :
    > >>>> * We need to update our mental performance models as the hardware
    > >>>> evolves
    > >>>
    > >>> I did not realise how important locality had become. A cache
    > >>> miss going to RAM costs 200 to 300 clock cycles! This penalty
    > >>> dominates everything else. This suggests that interpretive code
    > >>> with a tight core might run faster than "highly optimised"
    > >>> machine code since you could arrange that the core of it was
    > >>> entirely in cache.
    > >>
    > >> Why?
    > >>
    > >> The data fetched would still be the same.

    > >
    > > Not if the bytecode was more compact than the native code.
    > >
    > >> And the CPU intensive loop like inner loops seems more likely to
    > >> fit into I cache than the relevant part of the interpreter.

    > >
    > > If you have a single inner loop, then yes, the machine code will
    > > fit in the cache, and there's no performance advantage to bytecode.
    > > But if you have a large code footprint - something like an app
    > > server, say - then it's quite possible that more of the code will
    > > fit in the cache with bytecode than with native code.

    >
    > I thought the bytecode is nowadays always converted to native code by
    > the JIT. Am I wrong?


    Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT
    compiler but instead compiles and inline methods that appear [to be]
    the most used in the application."

    <http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>

    [Sorry about the heavy-handed editing.]

    --
    John B. Matthews
    trashgod at gmail dot com
    <http://sites.google.com/site/drjohnbmatthews>
     
    John B. Matthews, Jan 18, 2010
    #11
  12. Lew

    Donkey Hottie wrote:
    >> I thought the bytecode is nowadays always converted to native code by
    >> the JIT. Am I wrong?


    Yes.

    John B. Matthews wrote:
    > Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT
    > compiler but instead compiles and inline methods that appear [to be]
    > the most used in the application."
    >
    > <http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>


    Hotspot runs bytecode altogether, at first (JNI excluded from consideration
    here). Based on actual runtime heuristics, it might convert some parts to
    native code and run the compiled version. As execution progresses, Hotspot
    may revert compiled parts back to interpreted bytecode, depending on runtime
    situations.
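
    One way to watch that happen is HotSpot's -XX:+PrintCompilation flag,
    run against a hypothetical toy program (any hot method will do):

        // HotTest.java - gives HotSpot something hot enough to compile.
        public class HotTest {
            static long work(int n) {
                long sum = 0;
                for (int i = 0; i < n; i++)
                    sum += i * 31L;
                return sum;
            }
            public static void main(String[] args) {
                long total = 0;
                for (int i = 0; i < 100000; i++)
                    total += work(10000);
                System.out.println(total);
            }
        }

        // javac HotTest.java
        // java -XX:+PrintCompilation HotTest
        //
        // HotSpot logs each method as it compiles it (HotTest::work shows up
        // once it is hot); "made not entrant" lines, when they appear, mark
        // previously compiled code being retired.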

    --
    Lew
     
    Lew, Jan 19, 2010
    #12
  13. Roedy Green

    On Sun, 17 Jan 2010 20:56:29 -0800, Peter Duniho
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Profiling is definitely important for performance-critical code. It can
    >uncover lots of important architecture-independent problems. But it has
    >limited value in generalizing solutions for architecture-specific
    >issues. Only if you can restrict your installation to the same hardware
    >you used for profiling can you address those kinds of problems.


    I would have thought by now distributed code would be optimised at the
    customer's machine to suit the specific hardware, not by the
    application, but by the OS using code provided by the CPU maker.
    Presumably you could afford to spend more time in analysis than you
    can on the fly in hardware while the code is running.
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent other people from making use of one’s contributions to computer science.
    ~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
     
    Roedy Green, Jan 19, 2010
    #13
  14. Arne Vajhøj

    On 18-01-2010 08:39, Tom Anderson wrote:
    > On Sun, 17 Jan 2010, Arne Vajhøj wrote:
    >
    >> On 17-01-2010 22:10, Roedy Green wrote:
    >>> On Sun, 17 Jan 2010 18:20:31 -0500, "John B. Matthews"
    >>> <> wrote, quoted or indirectly quoted someone who
    >>> said :
    >>>> * We need to update our mental performance models as the hardware
    >>>> evolves
    >>>
    >>> I did not realise how important locality had become. A cache miss
    >>> going to RAM costs 200 to 300 clock cycles! This penalty dominates
    >>> everything else. This suggests that interpretive code with a tight
    >>> core might run faster than "highly optimised" machine code since you
    >>> could arrange that the core of it was entirely in cache.

    >>
    >> Why?
    >>
    >> The data fetched would still be the same.

    >
    > Not if the bytecode was more compact than the native code.


    When I wrote data I actually meant data.

    >> And the CPU intensive loop like inner loops seems more likely to fit
    >> into I cache than the relevant part of the interpreter.

    >
    > If you have a single inner loop, then yes, the machine code will fit in
    > the cache, and there's no performance advantage to bytecode. But if you
    > have a large code footprint - something like an app server, say - then
    > it's quite possible that more of the code will fit in the cache with
    > bytecode than with native code.


    It is possible.

    Well - it is almost certain that it will be the case for some apps.

    But in most cases I would expect most of the time to be spent
    executing relatively small pieces of code. 80-20 or 90-10 rule.

    Arne
     
    Arne Vajhøj, Jan 19, 2010
    #14
  15. On 18-01-2010 19:54, Lew wrote:
    > Donkey Hottie wrote:
    >>> I thought the bytecode is nowadays always converted to native code by
    >>> the JIT. Am I wrong?

    >
    > Yes.
    >
    > John B. Matthews wrote:
    >> Some, but not all: "The Java Hotspot[VM] does not include a plug-in
    >> JIT compiler but instead compiles and inline methods that appear
    >> [to be] the most used in the application."
    >>
    >> <http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>
    >>

    >
    > Hotspot runs bytecode altogether, at first (JNI excluded from
    > consideration here). Based on actual runtime heuristics, it might
    > convert some parts to native code and run the compiled version. As
    > execution progresses, Hotspot may revert compiled parts back to
    > interpreted bytecode, depending on runtime situations.


    Nothing in any spec prevents it from doing so, but I am skeptical
    about whether any implementations would do so.

    If it has actually spent time JIT compiling, why should it go
    back to interpreting?

    Arne
     
    Arne Vajhøj, Jan 19, 2010
    #15
  16. Arne Vajhøj

    On 17-01-2010 23:56, Peter Duniho wrote:
    > Roedy Green wrote:
    >> [...]
    >> Hyperthreading is a defence. If you have many hardware threads
    >> running in the same CPU, when one thread blocks to fetch from RAM, the
    >> other threads can keep going and keep multiple adders, instruction
    >> decoders etc chugging.

    >
    > Actually, hyperthreading and even, in some architectures, multi-core
    > CPUs can actually make things worse.
    >
    > I've read claims that Intel has improved things with the Nehalem
    > architecture. But the shared-cache design of early hyperthreaded
    > processors could easily cause naïve multi-threading implementations to
    > perform _much_ worse than a single-threaded implementation. That's
    > because having multiple threads all with the same entry point caused
    > those threads to often operate with a stack layout identical to each
    > other, which in turned caused aliasing in the cache.
    >
    > The two threads running simultaneously on the same CPU, sharing a cache,
    > would spend most of their time alternately trashing the other thread's
    > cached stack data and waiting for their own stack data to be brought
    > back in to the cache from system RAM after the other thread trashed it.
    >
    > Hyperthreading is far from a panacea, and I would not call it even a
    > defense. Specifically _because_ of how caching is so critical to
    > performance today, hyperthreading can cause huge performance problems on
    > certain CPUs, and even when it's used properly doesn't produce nearly as
    > big a benefit as actual multiple CPU cores would.


    SMT capability is obviously not as fast as full cores.

    But given that most of the major server CPUs (Xeon, Power and SPARC)
    use the technique, there seems to be agreement that it is a good
    thing.

    Arne
     
    Arne Vajhøj, Jan 19, 2010
    #16
  17. Roedy Green

    On 18 Jan 2010 22:10:36 GMT, Thomas Pornin <> wrote,
    quoted or indirectly quoted someone who said :

    >These games are expensive, not in clock cycles but in RAM: the JIT
    >compiler must use more bytes than what a C compiler would do on the
    >equivalent C source code.


    An alternative is static compilation using Jet.

    See http://mindprod.com/jgloss/jet.html

    I notice though that it does a lot of unraveling, plus a loop versioning
    where several variant loop bodies are created without ifs in them and a
    selection at the top chooses which body to use.

    Ironically, all this work may be slowing things down on the latest
    CPUs.
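
    A hand-written sketch of what that loop versioning amounts to (hypothetical
    code, not Jet's actual output): the test is hoisted out of the loop and
    each version gets a straight-line body, at the cost of extra code size.

        // Naive loop: the branch on 'scale' is taken on every iteration.
        static void apply(int[] a, boolean scale, int factor) {
            for (int i = 0; i < a.length; i++) {
                if (scale)
                    a[i] *= factor;
                else
                    a[i] += factor;
            }
        }

        // Versioned: one test up front selects between two branch-free bodies.
        // Cheaper per iteration, but roughly twice the code for the loop.
        static void applyVersioned(int[] a, boolean scale, int factor) {
            if (scale) {
                for (int i = 0; i < a.length; i++)
                    a[i] *= factor;
            } else {
                for (int i = 0; i < a.length; i++)
                    a[i] += factor;
            }
        }
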
    --
    Roedy Green Canadian Mind Products
    http://mindprod.com
    I decry the current tendency to seek patents on algorithms. There are better ways to earn a living than to prevent other people from making use of one’s contributions to computer science.
    ~ Donald Ervin Knuth (born: 1938-01-10 age: 72)
     
    Roedy Green, Jan 19, 2010
    #17
  18. Lew

    Lew wrote:
    >> Hotspot runs bytecode altogether, at first (JNI excluded from
    >> consideration here). Based on actual runtime heuristics, it might
    >> convert some parts to native code and run the compiled version. As
    >> execution progresses, Hotspot may revert compiled parts back to
    >> interpreted bytecode, depending on runtime situations.


    Arne Vajhøj wrote:
    > Nothing in any spec prevents it from doing so, but I am skeptical
    > about whether any implementations would do so.


    Well, either Sun is a bunch of big, fat liars, or you can set your skepticism
    aside:
    <http://java.sun.com/products/hotspot/whitepaper.html#dynamic>
    "Both the Java HotSpot Client and Server compilers fully support dynamic
    deoptimization."

    >> If it has actually spent time JIT compiling, why should it go
    >> back to interpreting?


    Some of the reasoning is explained in
    <http://java.sun.com/products/hotspot/whitepaper.html#3>

    There's more detail in
    <http://java.sun.com/products/hotspot/docs/general/hs2.html>
    "The Java HotSpot Server VM can revert to using the interpreter whenever
    compiler deoptimizations are called for because of dynamic class loading. When
    a class is loaded dynamically, HotSpot checks to ensure that the inter-class
    dependecies [sic] of inlined methods have not been altered. If any
    dependencies are affected by dynamically loaded class [sic], HotSpot can back
    out affected inlined code, revert to interpreting for a while, and re-optimize
    later based on the new class dependencies."

    One of my favorite experts, Brian Goetz, wrote about this back in 2004:
    <http://www.ibm.com/developerworks/library/j-jtp12214/>
    "[T]he JVM continues profiling, and may recompile the code again later with a
    higher level of optimization if it decides the code path is particularly hot
    or future profiling data suggests opportunities for additional optimization.
    The JVM may recompile the same bytecodes many times in a single application
    execution."

    and later, discussing inlining,
    "... the JVM can figure this out, and will invalidate the generated code that
    is based on the now-invalid assumption and revert to interpretation (or
    recompile the invalidated code path)."

    Despite your skepticism, not only has one (in fact, the) implementation done
    dynamic reversion to interpreted bytecode, but it's been doing so for quite
    some years.
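
    A hypothetical sketch of the scenario those documents describe (the class
    names are made up): while Square is the only Shape ever loaded, HotSpot can
    inline area() at the call site; loading a second implementation later
    invalidates that assumption, so the compiled code may be thrown away and
    the method interpreted again until it is recompiled.

        interface Shape { double area(); }

        class Square implements Shape {
            final double side;
            Square(double side) { this.side = side; }
            public double area() { return side * side; }
        }

        class Circle implements Shape {
            final double r;
            Circle(double r) { this.r = r; }
            public double area() { return Math.PI * r * r; }
        }

        public class DeoptDemo {
            static double total(Shape[] shapes) {
                double sum = 0;
                for (Shape s : shapes)
                    sum += s.area();   // only Square seen so far: inlinable
                return sum;
            }
            public static void main(String[] args) {
                Shape[] squares = new Shape[1000];
                for (int i = 0; i < squares.length; i++)
                    squares[i] = new Square(i);
                for (int i = 0; i < 10000; i++)
                    total(squares);    // warm up on Squares only

                // First use of Circle: the Square-only assumption behind the
                // compiled total() no longer holds, so HotSpot may deoptimize
                // it (visible as "made not entrant" with -XX:+PrintCompilation)
                // and recompile later with both receiver types in the profile.
                Shape[] mixed = { new Square(2), new Circle(3) };
                System.out.println(total(mixed));
            }
        }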

    --
    Lew
     
    Lew, Jan 19, 2010
    #18
  19. On 18-01-2010 21:45, Lew wrote:
    > Lew wrote:
    >>> Hotspot runs bytecode altogether, at first (JNI excluded from
    >>> consideration here). Based on actual runtime heuristics, it might
    >>> convert some parts to native code and run the compiled version. As
    >>> execution progresses, Hotspot may revert compiled parts back to
    >>> interpreted bytecode, depending on runtime situations.

    >
    > Arne Vajhøj wrote:
    >> Nothing in any spec prevents it from doing so, but I am skeptical
    >> about whether any implementations would do so.

    >
    > Well, either Sun is a bunch of big, fat liars, or you can set your
    > skepticism aside:
    > <http://java.sun.com/products/hotspot/whitepaper.html#dynamic>
    > "Both the Java HotSpot Client and Server compilers fully support dynamic
    > deoptimization."
    >
    >> If it has actually spent time JIT compiling, why should it go
    >> back to interpreting?

    >
    > Some of the reasoning is explained in
    > <http://java.sun.com/products/hotspot/whitepaper.html#3>
    >
    > There's more detail in
    > <http://java.sun.com/products/hotspot/docs/general/hs2.html>
    > "The Java HotSpot Server VM can revert to using the interpreter whenever
    > compiler deoptimizations are called for because of dynamic class
    > loading. When a class is loaded dynamically, HotSpot checks to ensure
    > that the inter-class dependecies [sic] of inlined methods have not been
    > altered. If any dependencies are affected by dynamically loaded class
    > [sic], HotSpot can back out affected inlined code, revert to
    > interpreting for a while, and re-optimize later based on the new class
    > dependencies."
    >
    > One of my favorite experts, Brian Goetz, wrote about this back in 2004:
    > <http://www.ibm.com/developerworks/library/j-jtp12214/>
    > "[T]he JVM continues profiling, and may recompile the code again later
    > with a higher level of optimization if it decides the code path is
    > particularly hot or future profiling data suggests opportunities for
    > additional optimization. The JVM may recompile the same bytecodes many
    > times in a single application execution."
    >
    > and later, discussing inlining,
    > "... the JVM can figure this out, and will invalidate the generated code
    > that is based on the now-invalid assumption and revert to interpretation
    > (or recompile the invalidated code path)."
    >
    > Despite your skepticism, not only has one (in fact, the) implementation
    > done dynamic reversion to interpreted bytecode, but it's been doing so
    > for quite some years.


    Then I learned something today. Which is not a bad thing.

    Ensuring correct behavior is of course a very good reason to
    fall back to interpretation.

    Arne
     
    Arne Vajhøj, Jan 19, 2010
    #19
  20. Lew

    Patricia Shanahan wrote:
    > Roedy Green wrote:
    > ...
    >> This suggests that interpretive code with a tight core might run
    >> faster than "highly optimised" machine code since you could arrange
    >> that the core of it was entirely in cache.

    > ...
    >
    > How would you implement an interpreter to avoid executing a totally
    > unpredictable branch for each instruction?


    This apparently rhetorical question leads to some interesting possibilities,
    e.g., the exploitation of latency. There is likely a tension between these
    possibilities and cache locality; however, since cache is a hack, we can expect
    its limits to be less restrictive over time. Latency, OTOH, is likely to
    become a greater and greater issue. Hyperthreading is one technique that
    exploits latency.

    An answer to the question is to load all possible branches into the pipeline
    during the latency (-ies) involved in evaluating the "if" or other actions.
    (There is no such thing as a "totally unpredictable branch", as all branches
    can be predicted.) If the conclusion of the branch evaluation finds all, or
    at least the most likely, options already loaded up, the system can simply
    discard the unused branches. This technique goes by various names; I believe
    one is "speculative execution".

    The avoidance itself is subject to definition. Do we avoid any possibility
    whatsoever of an unpredicted branch? Or do we do what CPUs already do, and
    reduce the likelihood of such a branch? Either one could be called "avoidance".

    I think Hotspot itself embodies various answers to the question. It inlines
    and compiles to native code based on run-time profiles. It undoes those
    optimizations if the assumptions behind them later fail. It optimizes the
    more likely branches.

    I don't think it's possible to keep all branches of all code, tight code or
    not, always in a limited RAM space, such as the 32KB Level 1 cache mentioned
    upthread, or even the 8MB Level 1 cache of the not-distant future. We can
    continue the existing trend of keeping most of what we mostly need mostly in
    the cache most of the time, moving "most" asymptotically toward unity.

    --
    Lew
     
    Lew, Jan 19, 2010
    #20