"The Unreasonable Effectiveness of C" by Damien Katz

BGB

With my interpreter, running a combined set of benchmarks, I get a time of
51 seconds using these label pointers (for opcode dispatch). Using a plain
switch it was 62 seconds. With a simple loop calling function pointers
it was 75 seconds (because most inlining ability is lost).

in my case (using MSVC), the function pointers are faster than the
switch. part of the issue seems to be that MSVC emits switches with a
long lead-in sequence, whereas GCC tends to use a shorter lead-in
sequence for switches.

the main selling point for the function-pointers (vs switch) though is
that I can dispatch more directly to the logic I need to execute,
effectively being able to shave a lot more logic out of the execution path.

sadly, MSVC doesn't have label pointers or computed goto.

So there's not much in it, and still a long way from native code. (And
when I tested on an ARM processor, there seemed little difference
between indirect gotos, and switch.)

yep.


(Plugging in an actual ASM module to do the dispatch, but initially just
calling the C handlers, it was 82 seconds! So the same time as the
function-dispatch method, plus the extra overheads of the more complex
ASM threaded call-chain (plus saving/restoring registers). However, it
took just a few hours to write enough inline functions in the ASM module
to beat the best 51-second C timing, but only by 10%, so not worth the
trouble or the loss of flexibility.)

yep.

in the cases where I have used ASM, it has more often been because the
task couldn't readily be accomplished in C.

in other cases, it is more because I can generate a specialized
instruction sequence at run-time.

For fib(40) I get 18 seconds using the function pointer loop, and 9.5
seconds using label pointers. I've achieved about 5 seconds in the past,
but there were too many complications. (Native C was 0.8 to 1.4 seconds.)

I am getting considerably worse times here (for fib), but a lot of this
seems to be that my interpreter's logic for pushing/popping call frames
is a bit expensive (fixing this would require reworking how call-frames
are handled, which would be a bit of a hassle). as-is, pushing/popping
call frames tends to become the main time sink in the fib test.

script: 3 minutes, 19 seconds.
native: 0.75s.

so, yeah, 250x slower than native, performance is sort of in the toilet
on this one...


I get much better speeds at the selection-sort test though...
(more along the lines of around 4 minutes vs 15 seconds).

in this case, most of the time is spent dispatching to opcodes and
executing opcode-specific logic, mostly as there aren't really any
calls/returns to worry about.


the main issue seems to be that the current call-frames work under a
save/restore-state system (copy a bunch of variables off into a frame,
set up the variables for the new frame, and copy them back on return).
it doesn't help matters that there is currently a lot of state to
save/restore (due to a lot of "potentially-used" VM features).

in other interpreters, I have gotten generally better results here by
using the current frame for much of the "active" state, so pushing a new
frame consists mostly of creating a new blank/empty frame structure.

this part may be fixed up eventually.

I've given up trying to get anywhere near the speed of native code
(which is not going to happen with the sort of approaches I use);
instead I have in mind an intermediate language to do that!

yeah.

I am mostly working on trying to make the plain C interpreter less slow.

JITs and code-generators are not beyond my abilities, but they are much
more involved to implement and maintain, and also generally
target-specific, whereas the interpreter can be an "always available"
feature.


in the past, I have already written JITs which were
performance-competitive with natively-compiled C.

sadly, keeping a JIT in good working order is a bit more work,
especially if one's VM is a moving target.

this is also another part of why I have been moving in the direction I
have been, mostly to lower the burden by allowing (optional) use of more
piecewise code-generators, such that holes in the JIT can more easily
fall back to invoking parts of the interpreter.


so, my next JIT will probably aim a bit lower, being probably more a
mixture of direct threaded code and directly generating machine-code
sequences for certain operations, and operating at the level of
individual traces (rather than compiling function-at-a-time, as my past
JITs have tended to do).

That's the thing; for real programs, differences in speed are smaller
(while a few things are faster than C for a given amount of effort,
thanks, for example, to the use of counted strings).

yeah...

my scripting language is mostly used for light-duty work, and mostly
calling C functions for the rest.

so, practically, its inability to match natively-compiled C,
performance-wise, hasn't been a major issue, though it has shown up a
few times as more logic in my 3D engine ends up being written in script
code.


then, again, I guess it isn't too bad:
execution speeds for my interpreter are, in many cases, still generally
faster than the native hardware speeds I was dealing with back in the 90s...
 
