Crash Course In Modern Hardware


Roedy Green

If it actually has spent time JIT compiling, why should it go
back to interpreting?

Let us say you dynamically load a class that overrides methods that
the JIT had provisionally treated as final and had inlined.
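
Something like this, say (class names invented just to make the scenario
concrete; the point is that Circle is not loaded until the Class.forName
call, so until then the JIT only knows about one implementation):

interface Shape { double area(); }

class Square implements Shape { public double area() { return 4.0; } }

class Circle implements Shape { public double area() { return 3.14; } }

public class DeoptDemo {
    // While Square is the only loaded Shape, HotSpot may treat area() as
    // effectively final and inline it into this loop.
    static double sum(Shape s, int n) {
        double total = 0;
        for (int i = 0; i < n; i++) total += s.area();
        return total;
    }

    public static void main(String[] args) throws Exception {
        sum(new Square(), 50_000_000);        // hot enough to get compiled
        // Loading a second implementation invalidates the inlining assumption;
        // the compiled sum() has to be thrown away before a Circle reaches it.
        Shape late = (Shape) Class.forName("Circle")
                                  .getDeclaredConstructor().newInstance();
        sum(late, 50_000_000);
    }
}

(Running this with -XX:+PrintCompilation should show the first compiled
sum() being reported as "made not entrant" when Circle arrives.)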

It has to do some pretty fancy footwork. It has to UN-inline all that
code, turn it back into byte code, then rejit it.

The problem has been solved, but it seems to me to be intractable.
There is no simple correspondence between machine code and byte code.
Data could be cached in registers. I am blown away that it works at
all, much less works reliably.

You'd think the one saving grace is that the points where you have to rejit
always occur at a call boundary. But there is no such guarantee for the
other threads.

I'd love to see a webinar on how they pulled this off. Perhaps the JIT
machine code is quite constrained to make this possible.
 

Roedy Green

How would you implement an interpreter to avoid executing a totally
unpredictable branch for each instruction?

In my Forth interpreter, I arranged things so that branches fell
through on the usual case.
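
The branch I mean is the dispatch at the top of the classic interpreter
loop. A toy sketch (opcodes and names invented, nothing like a real JVM):

public class TinyInterp {
    // opcodes: 0 = PUSH <imm>, 1 = ADD, 2 = HALT
    static int run(byte[] code, int[] stack) {
        int pc = 0, sp = 0;
        while (true) {
            switch (code[pc++]) {          // one shared, data-dependent branch
                case 0: stack[sp++] = code[pc++]; break;
                case 1: stack[sp - 2] += stack[sp - 1]; sp--; break;
                case 2: return stack[sp - 1];
                default: throw new IllegalStateException("bad opcode");
            }
        }
    }

    public static void main(String[] args) {
        // PUSH 2, PUSH 3, ADD, HALT  ->  prints 5
        System.out.println(run(new byte[]{0, 2, 0, 3, 1, 2}, new int[8]));
    }
}

Every opcode funnels through that single switch, so the branch predictor
gets no per-opcode history. Interpreters written in C often use "threaded
code" -- a computed goto at the end of each handler -- so each opcode's
dispatch gets its own history entry.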

Think in terms of FORTH chips, that have the interpreter in hardware.
They can do things like maintain branch history, and overlap RET on
any instruction.

A Java Byte code machine with most of the interpreter in hardware
might be a better architecture since the code is so much more compact.
 

Roedy Green

L1 cache is more in the 32-64KB range. Basically 32 KB for a Core2
Intel, 64 KB for the AMD equivalent. That's for code; you have the
same amount in data.

This suggests the CPU makers should make simpler CPUs, and turn the
real estate over to a bigger cache, or focus all the smarts in the CPU
on carrying on while a load is stalled.

Perhaps it is a marketing problem. People don't realise the value
of extra static RAM cache. They go for incremental GHz numbers.
 

Lew

Roedy said:
It has to do some pretty fancy footwork. It has to UN-inline all that
code, turn it back into byte code, then rejit it.
...
I'd love to see a webinar on how they pulled this off. Perhaps the JIT
machine code is quite constrained to make this possible.

They never lose the bytecode, so they don't have to "turn it back" at all;
it's already there. The clever thing they did was let it happen in mid-stroke
on the stack, that is, even in the midst of a loop it can jump back to the
interpreted version or to a newly-compiled version. If the latter, the new
compilation occurs in the background from the never-erased and therefore
never-needs-to-be-reconstructed bytecode just like the original one did. The
same quick-jump trick moves execution from the interpreted or
previously-compiled version to the new one.
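
You can actually watch the mid-loop switch happen. A minimal sketch (my
own toy example): main() starts out interpreted, the loop gets hot, and
HotSpot swaps in compiled code while the frame is still on the stack --
on-stack replacement.

public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        for (int i = 0; i < 200_000_000; i++) {
            sum += i;                  // hot loop, trivially compilable
        }
        System.out.println(sum);       // keep sum live so the loop isn't dead code
    }
}

Run it with -XX:+PrintCompilation and, at least in the HotSpot builds I've
looked at, the on-stack-replacement compilations show up flagged with a
'%' next to the method name.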

The links I provided upthread touch on this feature.
 

Dacong Yan

Lew said:
Nothing in any spec prevents it from doing so, but I am skeptical
about whether any implementations would do so.

Well, either Sun is a bunch of big, fat liars, or you can set your
skepticism aside:
<http://java.sun.com/products/hotspot/whitepaper.html#dynamic>
"Both the Java HotSpot Client and Server compilers fully support dynamic
deoptimization."
If it actually has spent time JIT compiling, why should it go
back to interpreting?

Some of the reasoning is explained in
<http://java.sun.com/products/hotspot/whitepaper.html#3>

There's more detail in
<http://java.sun.com/products/hotspot/docs/general/hs2.html>
"The Java HotSpot Server VM can revert to using the interpreter whenever
compiler deoptimizations are called for because of dynamic class
loading. When a class is loaded dynamically, HotSpot checks to ensure
that the inter-class dependecies [sic] of inlined methods have not been
altered. If any dependencies are affected by dynamically loaded class
[sic], HotSpot can back out affected inlined code, revert to
interpreting for a while, and re-optimize later based on the new class
dependencies."

One of my favorite experts, Brian Goetz, wrote about this back in 2004:
<http://www.ibm.com/developerworks/library/j-jtp12214/>
"[T]he JVM continues profiling, and may recompile the code again later
with a higher level of optimization if it decides the code path is
particularly hot or future profiling data suggests opportunities for
additional optimization. The JVM may recompile the same bytecodes many
times in a single application execution."

and later, discussing inlining,
"... the JVM can figure this out, and will invalidate the generated code
that is based on the now-invalid assumption and revert to interpretation
(or recompile the invalidated code path)."

Despite your skepticism, not only has one (in fact, the) implementation
done dynamic reversion to interpreted bytecode, but it's been doing so
for quite some years.

The Brian Goetz article is quite good. I really learned something from
it. Thanks, Lew! And btw, besides Brian Goetz, what other experts do
you have in mind who can help people understand JIT compiling?

Tony
 

Martin Gregorie

This suggests the CPU makers should make simpler CPUs, and turn the real
estate over to a bigger cache, or focus all the smarts in the CPU on
carrying on while a load is stalled.
Sounds like a return to RISC to me. Time to revisit the Motorola 88000
chipset, or at least its cache handling?
 

Lew

Peter said:
And that's only theoretically possible. I've never heard any
suggestions that Java actually does include architecture-specific
optimizations, either in the JVM itself, or as part of the optimizer in
the JIT [HotSpot?] compiler.

HotSpot most definitely does do architecture-specific optimizations.
<http://java.sun.com/products/hotspot/whitepaper.html#optimizations>
"System-specific runtime routines generated at VM startup time"

<http://java.sun.com/products/hotspo...tspot_v1.4.1/Java_HSpot_WP_v1.4.1_1002_4.html>
"The [Java HotSpot Server] compiler is highly portable, relying on a machine
description file to describe all aspects of the target hardware."

Things that differ between architectures include register allocation.
 

Tom Anderson

When I wrote data I actually meant data.

Doh! Sorry, Arne, i completely failed to understand there. You're quite
right, of course. And i would imagine that in most applications, reads of
data far outweigh reads of code (once you account for the caches). I would
be very interested to see numbers for that across different kinds of
program, though.
It is possible.

Well - it is almost certain that it will be the case for some apps.

But in most cases I would expect most of the time being spent
on executing relatively small pieces of code. 80-20 or 90-10 rule.

Right. And if your cache can hold 20% of the bytecode but not 20% of the
machine code, it's a win.

tom
 

Martin Gregorie

Doh! Sorry, Arne, i completely failed to understand there. You're quite
right, of course. And i would imagine that in most applications, reads
of data far outweigh reads of code (once you account for the caches). I
would be very interested to see numbers for that across different kinds
of program, though.
It depends what you mean by 'read'.
If you look at the instruction flow into the CPU, i.e. out of any caches
and into the CPU proper, the instruction flow is considerably larger than
the data flow in almost any architecture.

For the optimised code:

int a, b, c;

if (a > 0) then
c = a + b;
else
c = -1;

Assembler examples are:

ICL 1900 (24 bit f/l instructions. 24 bit words) c = a + b

LDA 7 A # Read A
LDN 6 0 # load literal zero
BLE 7 L1 # Jump if A <= 0
ADD 7 B # Read and add B
BRN L2
L1 LDN 7 -1 # Set result to -1
L2 STO 7 C # Save the result

Instructions: 7 read
Data: 2 words read, 1 word written
Ratio I:D 7:2

68020 (32 bit MPU and addresses, v/l instructions, 16 bit words)

MOVE.W A,D1 # Read A 5 bytes [1]
BMI.S L1 # Jump if negative 2 bytes
BEQ.S L1 # Jump if zero 2 bytes
ADD.W B,D1 # Read and add B 5 bytes
BRA.S L2 # 2 bytes
L1 MOVE.W #-1,D1 # Set result to -1 5 bytes
L2 MOVE.W D1,C # Save the result 5 bytes

Instructions: 26 bytes read
Data: 4 bytes read, 2 bytes written
Ratio I:D 26:6

[1] I won't swear to these MOVE and ADD instruction lengths (my handbook
doesn't give them and my 68020 isn't running at present), but even if I'm
wrong and they're only 3 bytes, the ratio is still 18:6.

You don't have to throw in much in the way of overflow checking, address
arithmetic, etc to increase the Instruction:Data ratio quite considerably.

Both my examples are of processors with a decently sized register set but
I don't think entirely stack-oriented machines would do much better.
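
For comparison, here is roughly what javac gives for the same snippet as a
static method with a, b and c in locals 0, 1 and 2 (I won't swear to the
exact offsets, but the instruction lengths are fixed by the JVM spec):

 0: iload_0       # push a           1 byte
 1: ifle 11       # jump if a <= 0   3 bytes
 4: iload_0       # push a again     1 byte
 5: iload_1       # push b           1 byte
 6: iadd          # a + b            1 byte
 7: istore_2      # c = a + b        1 byte
 8: goto 13       #                  3 bytes
11: iconst_m1     # push -1          1 byte
12: istore_2      # c = -1           1 byte

Instructions: 13 bytes read
Data: 3 local-variable reads, 1 write -- if you count the locals and
operand stack as data at all; in a hardware bytecode machine they would
presumably sit in a register file or stack cache.

So the instruction stream is about half the 68020's 26 bytes, which is the
compactness Roedy is after, but whether the I:D ratio improves depends
entirely on where you draw the line around "data".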

The ICL 2900 had the most sophisticated architecture I've seen (entirely
stack-based, descriptors for all but primitive data types, software-
controlled register length) and averaged 3 instructions per COBOL
sentence vs. the 6+ per sentence of the 1900, but its instruction flow
through the OCP (Order Code Processor) was higher than its data flow and
the hardware was optimised to reflect that fact.

If anybody knows of hardware where the data flow is larger than the
instruction flow and can provide an equivalent example I'd be fascinated
to see it.
 

Arne Vajhøj

Doh! Sorry, Arne, i completely failed to understand there. You're quite
right, of course. And i would imagine that in most applications, reads
of data far outweigh reads of code (once you account for the caches). I
would be very interested to see numbers for that across different kinds
of program, though.


Right. And if your cache can hold 20% of the bytecode but not 20% of the
machine code, it's a win.

Yep.

Arne
 

Arne Vajhøj

It depends what you mean by 'read'.
If you look at the instruction flow into the CPU, i.e. out of any caches
and into the CPU proper, the instruction flow is considerably larger than
the data flow in almost any architecture.

Yes.

But the flow from main memory and L3 which are the real slow
ones should have a good chance of reading more data than code.

Arne
 

Martin Gregorie

Yes.

But the flow from main memory and L3 which are the real slow ones should
have a good chance of reading more data than code.
Agreed, but optimising that is a property of the cache rather than
anything else. The use of Intel-style multi-level caching doesn't affect
the argument.

In a multi-CPU system cache management can become a nightmare: consider
the situation where CPU A gets a cache miss that triggers a cache read
from main memory for a piece of data that has been read and modified by
CPU B, but whose cache has not yet been flushed. This is a common
situation with copy-back caching. In general copy-back caching is faster
than write-through though at the cost of added complexity: all caches
must sniff the address bus and be capable of grabbing it for an immediate
write-back if another CPU is asking for modified data which it holds but
hasn't yet written.
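
You can feel that coherency traffic from Java, as it happens. A rough
sketch (it assumes the two fields land on the same cache line, which you
can't actually guarantee from Java, but with adjacent longs it is very
likely):

public class FalseSharingDemo {
    static class Counters {
        volatile long a;    // adjacent fields -- almost certainly one cache line
        volatile long b;
    }

    public static void main(String[] args) throws InterruptedException {
        final Counters c = new Counters();
        final long ITERS = 200_000_000L;
        // Each thread touches only its own field, yet the shared line
        // ping-pongs between the two cores' caches on every write.
        Thread t1 = new Thread(() -> { for (long i = 0; i < ITERS; i++) c.a++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < ITERS; i++) c.b++; });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println((System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

Putting half a cache line of unused long fields between a and b typically
makes the same run several times faster, which is the snooping and
write-back traffic made visible.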
 

amit1310

Hi,

The slides are no longer available at this link nor at Azul's website. I did spend a lot of time yesterday looking for them.

Do any of you have this copy on your laptops/workstations?
Please share if you have it.

Regards,
Amit Malhotra.
 

Fredrik Jonson

The slides are no longer available at this link nor at Azul's website.

This isn't it?

This Is Not Your Father's Von Neumann Machine;
How Modern Architecture Impacts Your Java Apps
Presentation by Cliff Click, Jr. of Azul Systems and Brian Goetz
of Sun Microsystems.
Originally presented at JavaOne 2009

http://www.azulsystems.com/teal-azul/about_us/presentations/hardware-crash
http://www.azulsystems.com/sites/www.azulsystems.com/2009_J1_HardwareCrashCourse.pdf

And the talk, as I know it, is this one:

A Crash Course in Modern Hardware
I did spend a lot of time yesterday looking for them.

You must improve your Google-fu, m'kay?

Thanks though for the reminder of an excellent presentation.
 
