Crash Course In Modern Hardware


John B. Matthews

Students of Java may enjoy this "Crash Course In Modern Hardware."

"In this presentation from the JVM Languages Summit 2009, Cliff Click
discusses the Von Neumann architecture, CISC vs RISC, the rise of
multicore, Instruction-Level Parallelism (ILP), pipelining, out-of-order
dispatch, static vs dynamic ILP, performance impact of cache misses,
memory performance, memory vs CPU caching, examples of memory/CPU cache
interaction, and tips for improving performance."

<http://www.infoq.com/presentations/click-crash-course-modern-hardware>
 

Arne Vajhøj

John B. Matthews said:
Students of Java may enjoy this "Crash Course In Modern Hardware."
[...]
<http://www.infoq.com/presentations/click-crash-course-modern-hardware>

Very interesting.

Arne
 

Arne Vajhøj

Sun uses a great new presentation technology for this.

You see the guy talking in a little window, but the slides render as
HTML (or something similar) on the bulk of your screen.

I have not figured out how to just scan the slides.

Usually on a video presentation, you can't read the slides. Here you
can see them in perfect clarity.

If you want the slides then you can find them at:

http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase

Arne
 

John B. Matthews

Arne Vajhøj said:
If you want the slides then you can find them at:

http://developers.sun.com/learning/javaoneonline/j1sessn.jsp?sessn=TS-5496&yr=2009&track=javase

Thanks! My favorite talking points (pp 68, 69):

* Dominant operations
    1985: page faults
        Locality is critical
    1995: instructions executed
        Multiplies are expensive, loads are cheap
        Locality not so important
    2005: cache misses
        Multiplies are cheap, loads are expensive!
        Locality is critical again!

* We need to update our mental performance models as the hardware evolves

* Unless you profile (deeply) you just don't know
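
The 2005 point is easy to feel on an ordinary desktop. Here is a throwaway
sketch (the class name and sizes are mine, and the timing is deliberately
crude): the two loops do identical arithmetic over the same array, but the
second one strides across rows, so nearly every load misses the cache.

public class LocalityDemo {
    static final int N = 4096;
    static final int[][] table = new int[N][N];

    public static void main(String[] args) {
        for (int pass = 0; pass < 3; pass++) {    // a few passes so the JIT settles
            long t0 = System.nanoTime();
            long rowSum = 0;
            for (int i = 0; i < N; i++)           // row-major: walks memory sequentially
                for (int j = 0; j < N; j++)
                    rowSum += table[i][j];
            long t1 = System.nanoTime();
            long colSum = 0;
            for (int j = 0; j < N; j++)           // column-major: touches a different 16 KB row on every load
                for (int i = 0; i < N; i++)
                    colSum += table[i][j];
            long t2 = System.nanoTime();
            System.out.println("row-major " + (t1 - t0) / 1000000 + " ms, column-major "
                    + (t2 - t1) / 1000000 + " ms (" + rowSum + "/" + colSum + ")");
        }
    }
}

Same instruction count either way; the only difference is which loads hit
the cache.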
 

Arne Vajhøj

John B. Matthews said:
Thanks! My favorite talking points (pp 68, 69):
[...]

I think he starts by saying that the slides have been slightly modified
compared to the JavaOne version.

But they look very similar to me.

Arne
 

Roedy Green

* We need to update our mental performance models as the hardware evolves

I did not realise how important locality had become. A cache miss
going to RAM costs 200 to 300 clock cycles! This penalty dominates
everything else.

This suggests that interpretive code with a tight core might run
faster than "highly optimised" machine code since you could arrange
that the core of it was entirely in cache.

It also suggests FORTH-style coding with tiny methods and extreme
reusability would give you a speed boost because more of your code could
fit in cache. We are no longer trying to reduce the number of
instructions executed. We are trying to fit the entire program into
cache. Techniques like loop unraveling could be counterproductive
since they increase the size of the code.
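
For what it's worth, here is roughly what that trade-off looks like at the
source level (method names are just for illustration). The unrolled version
executes fewer compare-and-branch instructions per element, but its body is
about four times bigger, and that is exactly the code that has to compete
for I-cache space:

// Rolled: small body, one compare-and-branch per element.
static long sum(int[] a) {
    long s = 0;
    for (int i = 0; i < a.length; i++) {
        s += a[i];
    }
    return s;
}

// Hand-unrolled by four: fewer branches, roughly four times as much body code.
static long sumUnrolled4(int[] a) {
    long s = 0;
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    }
    for (; i < a.length; i++) {       // leftover elements
        s += a[i];
    }
    return s;
}

Whether the second one is actually faster depends entirely on whether the
extra code still fits in cache alongside everything else.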

Hyperthreading is a defence. If you have many hardware threads
running in the same CPU, then when one thread blocks to fetch from RAM,
the other threads can keep going and keep the multiple adders,
instruction decoders, etc. chugging.
 

Arne Vajhøj

Roedy Green said:
I did not realise how important locality had become. A cache miss
going to RAM costs 200 to 300 clock cycles! This penalty dominates
everything else.

That is what he says.

Note though that for some problems cache misses are driven by the size
of the data.

Roedy Green said:
This suggests that interpretive code with a tight core might run
faster than "highly optimised" machine code since you could arrange
that the core of it was entirely in cache.

Why?

The data fetched would still be the same.

And CPU-intensive loops like inner loops seem more
likely to fit into the I-cache than the relevant part of the
interpreter.

Roedy Green said:
It also suggests FORTH-style coding with tiny methods and extreme
reusability would give you a speed boost because more of your code could
fit in cache. We are no longer trying to reduce the number of
instructions executed. We are trying to fit the entire program into
cache. Techniques like loop unraveling could be counterproductive
since they increase the size of the code.

loop unraveling == loop unrolling?

With L1 caches in the 128-256 KB range, it takes a lot
of unrolled loops to fill up the I-cache.

Arne
 

Tom Anderson

Arne Vajhøj said:
Why?

The data fetched would still be the same.

Not if the bytecode was more compact than the native code.

Arne Vajhøj said:
And CPU-intensive loops like inner loops seem more likely to fit
into the I-cache than the relevant part of the interpreter.

If you have a single inner loop, then yes, the machine code will fit in
the cache, and there's no performance advantage to bytecode. But if you
have a large code footprint - something like an app server, say - then
it's quite possible that more of the code will fit in the cache with
bytecode than with native code.

tom
 

Donkey Hottie

Tom Anderson said:
Not if the bytecode was more compact than the native code.

If you have a single inner loop, then yes, the machine code will fit in
the cache, and there's no performance advantage to bytecode. But if you
have a large code footprint - something like an app server, say - then
it's quite possible that more of the code will fit in the cache with
bytecode than with native code.

I thought the bytecode is nowadays always converted to native code by
the JIT. Am I wrong?
 

Lew

Donkey Hottie said:
I thought the bytecode is nowadays always converted to native code by
the JIT. Am I wrong?

Yes.
Some, but not all: "The Java Hotspot[VM] does not include a plug-in JIT
compiler but instead compiles and inline methods that appear [to be]
the most used in the application."

<http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>


Hotspot runs everything as bytecode at first (JNI excluded from consideration
here). Based on actual runtime heuristics, it might convert some parts to
native code and run the compiled version. As execution progresses, Hotspot
may revert compiled parts back to interpreted bytecode, depending on runtime
conditions.
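
You can watch this happen yourself. A throwaway example (the class and
method names are mine); run it with -XX:+PrintCompilation and the hot
method shows up in the compilation log, while the cold one never does:

// java -XX:+PrintCompilation HotMethodDemo
public class HotMethodDemo {
    static int hot(int x) {          // called millions of times -> shows up in the log
        return x * 31 + 7;
    }

    static int cold(int x) {         // called once -> stays interpreted
        return x - 1;
    }

    public static void main(String[] args) {
        int acc = cold(42);
        for (int i = 0; i < 5000000; i++) {
            acc += hot(i);
        }
        System.out.println(acc);
    }
}

You will typically also see the loop in main itself compiled via
on-stack replacement.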
 

Roedy Green

Profiling is definitely important for performance-critical code. It can
uncover lots of important architecture-independent problems. But it has
limited value in generalizing solutions for architecture-specific
issues. Only if you can restrict your installation to the same hardware
you used for profiling can you address those kinds of problems.

I would have thought by now distributed code would be optimised at the
customer's machine to suit the specific hardware, not by the
application, but by the OS using code provided by the CPU maker.
Presumably you could afford to spend more time on analysis there than
the hardware can spend on the fly while the code is running.
 

Arne Vajhøj

Tom Anderson said:
Not if the bytecode was more compact than the native code.

When I wrote data I actually meant data.

Tom Anderson said:
If you have a single inner loop, then yes, the machine code will fit in
the cache, and there's no performance advantage to bytecode. But if you
have a large code footprint - something like an app server, say - then
it's quite possible that more of the code will fit in the cache with
bytecode than with native code.

It is possible.

Well - it is almost certain that it will be the case for some apps.

But in most cases I would expect most of the time to be spent
executing relatively small pieces of code. The 80-20 or 90-10 rule.

Arne
 

Arne Vajhøj

Lew said:
Yes.
Some, but not all: "The Java Hotspot[VM] does not include a plug-in
JIT compiler but instead compiles and inline methods that appear
[to be] the most used in the application."

<http://java.sun.com/developer/onlineTraining/Programming/JDCBook/perf2.html>


Hotspot runs everything as bytecode at first (JNI excluded from
consideration here). Based on actual runtime heuristics, it might
convert some parts to native code and run the compiled version. As
execution progresses, Hotspot may revert compiled parts back to
interpreted bytecode, depending on runtime conditions.


Nothing in any spec prevents it from doing so, but I am skeptical
about whether any implementations would do so.

If it actually has spent time JIT compiling, why should it go
back to interpreting?

Arne
 

Arne Vajhøj

Roedy said:
[...]
Hyperthreading is a defence. If you have many hardware threads
running in the same CPU, when one thread blocks to fetch from RAM, the
other threads can keep going and keep multiple adders, instruction
decoders etc chugging.

Actually, hyperthreading and even, in some architectures, multi-core
CPUs can make things worse.

I've read claims that Intel has improved things with the Nehalem
architecture. But the shared-cache design of early hyperthreaded
processors could easily cause naïve multi-threading implementations to
perform _much_ worse than a single-threaded implementation. That's
because having multiple threads all with the same entry point caused
those threads to often operate with a stack layout identical to each
other, which in turn caused aliasing in the cache.

The two threads running simultaneously on the same CPU, sharing a cache,
would spend most of their time alternately trashing the other thread's
cached stack data and waiting for their own stack data to be brought
back in to the cache from system RAM after the other thread trashed it.

Hyperthreading is far from a panacea, and I would not call it even a
defense. Specifically _because_ of how caching is so critical to
performance today, hyperthreading can cause huge performance problems on
certain CPUs, and even when it's used properly doesn't produce nearly as
big a benefit as actual multiple CPU cores would.

SMT capability is obviously not as fast as full cores.

But given that most of the major server CPUs (Xeon, Power and SPARC)
use the technique, there seems to be agreement that it is a good
thing.
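
That said, the cache-line trashing described above is easy to get a feel
for from plain Java with a little false-sharing sketch (a related effect;
the class name and iteration counts are just made up for illustration).
Two threads hammer two counters: when the counters sit next to each other
they share a cache line and the threads keep invalidating it for each
other; move them apart and the same work usually runs much faster.

import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {
    static final int ITERATIONS = 50000000;

    static long run(final AtomicLongArray counters, final int a, final int b)
            throws InterruptedException {
        Thread t1 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(a);
            }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < ITERATIONS; i++) counters.incrementAndGet(b);
            }
        });
        long t0 = System.nanoTime();
        t1.start(); t2.start();
        t1.join();  t2.join();
        return (System.nanoTime() - t0) / 1000000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(64);
        // slots 0 and 1 are 8 bytes apart (same 64-byte line); 0 and 32 are 256 bytes apart
        System.out.println("adjacent slots: " + run(counters, 0, 1) + " ms");
        System.out.println("padded slots:   " + run(counters, 0, 32) + " ms");
    }
}

Numbers vary a lot between machines, but the point stands: two threads can
slow each other down without sharing any data at all, just by sharing a
cache line.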

Arne
 

Roedy Green

These games are expensive, not in clock cycles but in RAM: the JIT
compiler must use more bytes than a C compiler would for the
equivalent C source code.

An alternative is static compilation using Jet.

See http://mindprod.com/jgloss/jet.html

I notice though that it does a lot of unraveling and loop versioning,
where several variant loop bodies are created without ifs in them,
with a selection at the top to choose which body to use.

Ironically, all this work may be slowing things down on the latest
CPUs.
 

Lew

Arne Vajhøj said:
Nothing in any spec prevents it from doing so, but I am skeptical
about whether any implementations would do so.

Well, either Sun is a bunch of big, fat liars, or you can set your skepticism
aside:
<http://java.sun.com/products/hotspot/whitepaper.html#dynamic>
"Both the Java HotSpot Client and Server compilers fully support dynamic
deoptimization."

Arne Vajhøj said:
If it actually has spent time JIT compiling, why should it go
back to interpreting?

Some of the reasoning is explained in
<http://java.sun.com/products/hotspot/whitepaper.html#3>

There's more detail in
<http://java.sun.com/products/hotspot/docs/general/hs2.html>
"The Java HotSpot Server VM can revert to using the interpreter whenever
compiler deoptimizations are called for because of dynamic class loading. When
a class is loaded dynamically, HotSpot checks to ensure that the inter-class
dependecies [sic] of inlined methods have not been altered. If any
dependencies are affected by dynamically loaded class [sic], HotSpot can back
out affected inlined code, revert to interpreting for a while, and re-optimize
later based on the new class dependencies."

One of my favorite experts, Brian Goetz, wrote about this back in 2004:
<http://www.ibm.com/developerworks/library/j-jtp12214/>
"[T]he JVM continues profiling, and may recompile the code again later with a
higher level of optimization if it decides the code path is particularly hot
or future profiling data suggests opportunities for additional optimization.
The JVM may recompile the same bytecodes many times in a single application
execution."

and later, discussing inlining,
"... the JVM can figure this out, and will invalidate the generated code that
is based on the now-invalid assumption and revert to interpretation (or
recompile the invalidated code path)."

Despite your skepticism, not only has one (in fact, the) implementation done
dynamic reversion to interpreted bytecode, but it's been doing so for quite
some years.
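
If anyone wants to see it with their own eyes, here is a small sketch (the
names are mine). Run it with -XX:+PrintCompilation: during the first loop
the call in work() has only ever seen One, so HotSpot optimizes accordingly;
instantiating Two loads a second implementor and invalidates that
assumption, and the log typically shows the compiled code being "made not
entrant" before it is recompiled.

// java -XX:+PrintCompilation DeoptDemo
public class DeoptDemo {
    interface Thing { int value(); }
    static class One implements Thing { public int value() { return 1; } }
    static class Two implements Thing { public int value() { return 2; } }

    static int work(Thing t) {
        return t.value() + 1;        // optimized while One is the only receiver ever seen
    }

    public static void main(String[] args) {
        Thing one = new One();
        long sum = 0;
        for (int i = 0; i < 5000000; i++) {   // warm up with a single implementation
            sum += work(one);
        }
        Thing two = new Two();                // loaded lazily, right here
        for (int i = 0; i < 5000000; i++) {   // the old assumption no longer holds
            sum += work(two);
        }
        System.out.println(sum);
    }
}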
 

Arne Vajhøj

Lew said:
[...]
Despite your skepticism, not only has one (in fact, the) implementation
done dynamic reversion to interpreted bytecode, but it's been doing so
for quite some years.

Then I learned something today. Which is not a bad thing.

Ensuring correct behavior is of course a very good reason to
fall back to interpretation.

Arne
 

Lew

Patricia said:
Roedy Green wrote:
...
...

How would you implement an interpreter to avoid executing a totally
unpredictable branch for each instruction?

This apparently rhetorical question leads to some interesting possibilities,
e.g., the exploitation of latency. There is likely a tension between these
possibilities and cache locality; however, since cache is a hack, we can
expect its limits to become less restrictive over time. Latency, OTOH, is
likely to become a greater and greater issue. Hyperthreading is one
technique that exploits latency.

An answer to the question is to load all possible branches into the pipeline
during the latency (or latencies) involved in evaluating the "if" or other
actions. (There is no such thing as a "totally unpredictable branch", as all
branches can be predicted.) If the conclusion of the branch evaluation finds
all, or at least all the most likely, options already loaded up, the system
can simply discard the unused branches. This technique goes by various names;
I believe one is "speculative execution".

The avoidance itself is subject to definition. Do we avoid any possibility
whatsoever of an unpredicted branch? Or do we do what CPUs already do, and
reduce the likelihood of such a branch? Either one could be called "avoidance".

I think Hotspot itself embodies various answers to the question. It inlines
and compiles to native code based on run-time profiles. It undoes those
optimizations if the assumptions behind them later fail. It optimizes the
more likely branches.

I don't think it's possible to keep all branches of all code, tight code or
not, always in a limited RAM space, such as the 32KB Level 1 cache mentioned
upthread, or even the 8MB Level 1 cache of the not-distant future. We can
continue the existing trend of keeping most of what we mostly need mostly in
the cache most of the time, moving "most" asymptotically toward unity.
 
