Improving efficiency of a bytecode interpreter

Mark R. Bannister

I'm developing a new programming language, and at the core of this
language is a bytecode interpreter (yes, I've invented the bytecode
too, as well as assembly code and an assembler ...)

It's all written in C.

My question is, can anyone think of a way of making engine.c (see link
below) smaller and more efficient? This is the heart of the
interpreter, it's a big switch with switches inside it, a bit like
this pseudo-code:

switch (opcode)
{
case for each bytecode instruction with similar arguments:
    process arguments
    perform command
}
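
In concrete (but heavily simplified) C the shape is something like this;
the tiny instruction set is invented for illustration, the real engine.c
is far bigger:

#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_SUB, OP_PRINT, OP_HALT };

static void run(const int *pc)
{
    int stack[64], *sp = stack;

    for (;;) {
        int opcode = *pc++;
        switch (opcode) {
        case OP_PUSH:                   /* one inline argument */
            *sp++ = *pc++;
            break;
        case OP_ADD: case OP_SUB: {     /* similar arguments ... */
            int b = *--sp, a = *--sp;   /* ... processed the same way */
            *sp++ = (opcode == OP_ADD) ? a + b : a - b;
            break;
        }
        case OP_PRINT:
            printf("%d\n", sp[-1]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void)
{
    static const int prog[] =
        { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                          /* prints 5 */
    return 0;
}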

What troubles me is that a lot of the code for processing arguments is
very similar, but because different instructions take different
arguments I can't readily process them up-front. Take a look at the
code itself and if you have any useful suggestions please send them my
way.

http://prose.cvs.sourceforge.net/viewvc/prose/prose/src/prose/engine.c?view=markup

Thanks,
Mark Bannister.
(cambridge at sourceforge dot net)
 
Mark R. Bannister

So your first
step should be to identify logical groupings of code that you can
move off into their own functions. Don't worry for now about
refactoring for commonality - just refactor for function size!
<snip>

Thanks Richard, yes I have considered this. However, what is the
overhead of adding a function? I obviously want my bytecode to
execute as efficiently as possible. If every single bytecode
instruction launches a function to perform its role, it's not going to
be so quick, is it? That's why it's one big monolithic function. The other
problem with adding functions is that all kinds of variables needed to keep
the engine running would then have to be passed to each function. I could
encapsulate these in one big structure and pass a single pointer, but then
I'm making more data available to each function than it really needs, which
goes against my nature (usually I only want to pass the data I know the
function will need).

So, functions aside, are there any other suggestions from anyone
else? Speed is of the essence; I don't want to do anything that will
slow it down (execution speed before style, Richard).
 
jacob navia

Mark R. Bannister wrote:
I'm developing a new programming language, and at the core of this
language is a bytecode interpreter (yes, I've invented the bytecode
too, as well as assembly code and an assembler ...)

<snip>

I think the first step is to build an array of 256 function pointers (one
per byte value). For each position in the array, write a function that does
the work of the corresponding byte code.

Then, instead of a switch you write:

(functionTable[bytecode])();

Then, to avoid the overhead of returning just to make another call, write
as the last statement of each function:

(functionTable[bytecode])();

That will jump to the next function. Problem is, that is not a jump, but a
call. Eventually, you could run out of stack space. You can only get away
with this scheme if you get a compiler that allows taking the addresses of
labels and putting them in an array. GCC does that.

Another way is to write the core in assembler. I did that for my bytecode
interpreter for Lisp, and the speed was way beyond what C can ever achieve.
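
For illustration, a minimal sketch of the labels-as-values technique (a
GCC/Clang extension; the tiny instruction set is invented):

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *pc)
{
    /* &&label takes the address of a label (GCC extension). */
    static void *dispatch[] = { &&push1, &&add, &&print, &&halt };
    int stack[64], *sp = stack;

    goto *dispatch[*pc++];              /* jump to the first handler */

push1:  *sp++ = 1;                      goto *dispatch[*pc++];
add:    --sp; sp[-1] += sp[0];          goto *dispatch[*pc++];
print:  printf("%d\n", sp[-1]);         goto *dispatch[*pc++];
halt:   return;
}

int main(void)
{
    static const unsigned char prog[] =
        { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);                          /* prints 2 */
    return 0;
}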
 
James Kuyper

Mark said:
<snip>

Thanks Richard, yes I have considered this. However, what is the
overhead of adding a function? I obviously want my bytecode to
execute as efficiently as possible.

If you've got a single function that is almost 10,000 lines long, you've
almost certainly got lots of design errors and other bugs to work out.
Simplify the code by breaking it into smaller pieces, each of which can
be properly understood. Only after you have a coherent overall design up
and working correctly should you worry about details of efficiency.

If every single bytecode instruction launches a function to perform its
role, it's not going to be so quick, is it? That's why it's one big
monolithic function. The other problem with adding functions is that all
kinds of variables needed to keep the engine running would then have to be
passed to each function. I could encapsulate these in one big structure and
pass a single pointer, but then I'm making more data available to each
function than it really needs, which goes against my nature (usually I
only want to pass the data I know the function will need).

That's why that's the wrong thing to do. Create many different data
structures, each of them collecting together functionally related pieces
of data. Then any given function will generally need only a few of those
structures, and most functions will make use of a fair fraction of all
of the members of any structure they do use.
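
For instance (a sketch; these names are invented, not taken from engine.c):

/* Functionally-grouped state: each handler receives only the structures
   it actually touches. */
typedef struct {
    const unsigned char *pc;     /* program pointer */
    const unsigned char *end;
} CodeStream;

typedef struct {
    int *sp;                     /* operand stack */
    int *limit;
} OperandStack;

/* An arithmetic instruction needs only the stack... */
static void do_add(OperandStack *os)
{
    --os->sp;
    os->sp[-1] += os->sp[0];
}

/* ...while a relative branch needs only the code stream. */
static void do_jump(CodeStream *cs, int offset)
{
    cs->pc += offset;
}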
So, functions aside, are there any other suggestions from anyone
else? Speed is of the essence, I don't want to do anything that will
slow it down (execution speed before style, Richard).

That's exactly the wrong attitude: good programming style leads to
elegant design and accurate code, and makes the code easier to maintain.
Don't worry about efficiency until you've got other, more important
matters such as correctness and clarity dealt with.
 
BGB / cr88192

So your first
step should be to identify logical groupings of code that you can
move off into their own functions. Don't worry for now about
refactoring for commonality - just refactor for function size!
<snip>

<--
Thanks Richard, yes I have considered this. However, what is the
overhead of adding a function? I obviously want my bytecode to
execute as efficiently as possible. If every single bytecode
instruction launches a function to perform its role, it's not going to
be so quick is it? That's why one big function monolith. The other
problem with adding functions is it then requires all kinds of
different variables required to keep the engine running to be passed
to the function. I could encapsulate these in one big structure and
pass a single pointer, but then I'm making more data available to each
function than it really needs, this goes against my nature (usually I
only want to pass the data I know the function will need).

So, functions aside, are there any other suggestions from anyone
else? Speed is of the essence, I don't want to do anything that will
slow it down (execution speed before style, Richard).
-->

with this code, you will not lose much.

noting how damn near every case is a full block (with variables, if
statements, loops, and such...), the relative cost of adding a function is
trivial.

it can be noted that, on average, if/for/while/... are fairly expensive vs
other operations (and your code uses them endlessly).


putting state in a big structure:
I, and probably most others I know of, also do this...


other ideas:
don't do high-level opcode decoding for each opcode, or if you really must
have high-level opcode decoding, maybe put it in its own code. this is what
I do for x86 machine code, where 1 piece of code is responsible for decoding
opcodes, and another piece of code for executing them.

so, in this case, I decode opcodes, and then run them through a switch based
on argument configuration (Reg, RM, Imm, RegRM, RMReg, RegImm, RMImm,
...), and then through another big switch (per-opcode number).

typically, each of these has its own functions, so the argument
configuration switch calls the function for handling the particular
configuration, which does a big switch for handling the various opcodes.


another route is to make each opcode have fixed-form arguments (sort of like
in the JVM's JBC, MSIL, ...). where, if one has different arguments, they
also have different opcode names and numbers.
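
for illustration, that might look like (names invented, loosely
JVM-flavoured):

/* invented example: one opcode number per argument shape, so the decoder
   never has to discover an instruction's argument layout at runtime. */
enum {
    OP_ICONST,    /* 1 operand: 32-bit immediate         */
    OP_ILOAD,     /* 1 operand: local-variable slot (u8) */
    OP_ISTORE,    /* 1 operand: local-variable slot (u8) */
    OP_IADD,      /* 0 operands: both args on the stack  */
    OP_GOTO       /* 1 operand: signed 16-bit offset     */
};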


another thing:
if possible, try to make operations fairly close to atomic.
if it is an atomic operation, then maybe it goes in the switch;
if it is not (as in, involves a big ugly chunk of logic code), then maybe it
is better served by a function.


FWIW, I handled pretty much the entire integer subset of x86 in around 2
kloc (for the per-instruction code), with about 700 lines going just into
dealing with REP/REPE/REPZ opcode forms (I give them a lot of special
treatment...).

most of the rest is simple operations, although a lot of this is function calls
(given x86 has fairly complicated register-handling logic, and the great
terror known as 'eflags' which is awkward to simulate effectively...), so a
lot of code for dealing with registers and eflags is done elsewhere (as well
as the code for managing the virtual address space, which is essentially
independent of the interpreter, mostly so that they don't step on each
other, ...).

it is then about 1.5 kloc for the code to simulate the x87 FPU (including
conversion code, 80-bit float operations being done primarily via integer
math, ...).

granted, there is a 1.2 kloc chunk of code dedicated primarily to
getting/setting registers (various types, sizes, sign vs zero extension,
....).

....


and, in all this, I realize that x86 is very difficult to interpret
efficiently, vs nearly any other bytecode.

most of my bytecodes could be interpreted almost entirely with a big dumb
switch and maybe 1 or 2 statements for each case...


or such...
 
BGB / cr88192

Richard Heathfield said:

Almost 10,000 lines of C, which means you'll struggle to find anyone
prepared to do an in-depth analysis for you for free. This is even
more true because it's awkward to extract your C from your HTML so
that it can be loaded easily into a decent editor.

It doesn't help that your code is written in a rather archaic style.
Nor is it particularly helpful to have your macro definitions
scattered throughout the code.

FFS this code is terrible, IMHO...

this is also another example of code which would probably break my compiler,
oh well...

Firstly, despite the 10,000 lines, "smaller" isn't really your problem
(in terms of line count, that is). Your problem is not so much size
as complexity. Many project teams mandate a limit of 1000 lines per
source file, which is perhaps a little restrictive, but you have a
single *function* that is almost ten times that long! So your first
step should be to identify logical groupings of code that you can
move off into their own functions. Don't worry for now about
refactoring for commonality - just refactor for function size! Once
you have your functions down to a reasonable size (and have re-tested
to identify bugs introduced in the process), you can start to work on
efficiency. A profiler is a lot more useful to you if you have many
small functions rather than one colossal monolith.

agreed...

when each switch case includes its own block, functions are advised.
typically (in C at least), function call overhead is only really noticed
when the functions are trivially simple, as in:

int foo()
{ return(bar()); }

even then, it is small...

and, apparently, it is a bit larger on Intel than AMD CPUs, although it
seems Intel CPUs are faster per-clock for unbranching code: in my test, a
2.10 GHz Intel CPU did slightly better on average than a 2.31 GHz AMD CPU,
except for short functions, such as mutex locking and unlocking, ... which
apparently manage to 'pwn' the Intel...


but, if one's code has, say, variables and if statements, ... then a
function call is likely to make little or no real difference (and may
actually improve performance for reasons internal to the compiler...).

or such...
 
Mark R. Bannister

another thing:
if possible, try to make operations fairly close to atomic.
if it is an atomic operation, then maybe it goes in the switch;
if it is not (as in, involves a big ugly chunk of logic code), then maybe it
is better served by a function.

I think you've hit the nail on the head. Thank you everyone for your
comments, they have been helpful and given me something to think
about. I know engine.c is horrible at the moment. It didn't start
off that way, it's just grown slowly over the years.

I will try to group my engine variables into a few more structures
(ultimately they need to be grouped anyway so I can easily save and
restore the current execution thread). Any operations that are not
atomic will go into functions, which I will get the compiler to inline
if possible.

Hey ho, more heavy shifting work to do, and just when I thought I was
on the verge of the next exciting breakthrough ..... :)
 
Seebs

So, functions aside, are there any other suggestions from anyone
else? Speed is of the essence, I don't want to do anything that will
slow it down (execution speed before style, Richard).

Actually, he's probably right.

Here's the thing. Functions MIGHT have overhead. But they also might help
the compiler make good optimization decisions. I wouldn't bet on the code
slowing down AT ALL from a conversion to functions. It might actually be
faster.

As an example: Imagine that you currently have a giant switch table. And you
replace it with a table of function pointers. There have existed compilers
such that, when you do this, and write:

(*func[opcode])(...);

that your performance will INCREASE. Dramatically.

Because some compilers have tended to implement switch statements as a huge
string of if/else if/else if/else if, in practice. :)
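
A sketch of what such a table might look like (handler names invented; a
real engine would pass its state in, as discussed elsethread):

typedef void (*OpHandler)(void);

static void op_nop(void)  { /* do nothing */ }
static void op_halt(void) { /* set a stop flag, etc. */ }

/* C99 designated initializers; unlisted entries stay NULL, so every
   opcode must be given a handler before dispatching. */
static OpHandler func[256] = {
    [0x00] = op_nop,
    [0x01] = op_halt,
    /* ... one handler per opcode ... */
};

/* dispatch is then just:  (*func[opcode])();  */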

The thing is... No, you're wrong. Not execution speed before style.

The order is:
1. Clarity.
2. Good profiling results.
3. NOW you can try to optimize.

Please review Henry Spencer's 10 Commandments for C Programmers. In
particular, the stuff about premature optimization. You have committed
some EXTREMELY premature optimization here, and it's going to make your
life hell.

-s
 
BGB / cr88192

pete said:
If you are able to replace a giant switch
with an array of function pointers,
then that may or may not be a speed increase.

replacing a giant switch with an array of function pointers:
no, this will probably not improve speed.

since an array of pointers is also very position sensitive, it may also be
much more difficult to work with (if manually produced), whereas one can
fairly easily work with a switch (adding/removing items, ...).
internally, compilers tend to handle switches as an array of pointers
anyways, although it is typically to specific statements (usually a big glob
of statements placed after the dispatch logic for the switch AFAIK).


however, if each switch item is simply a function call, then it is possible
there might be a slight speed increase (as well as reducing the
machine-code size, due to not having a whole bunch of function-call related
instruction sequences), but the speed increase is not likely to be
significant (from the POV of the CPU, both sets of instruction sequences
would likely be fairly similar either way).

(well, except in my compiler, which is kind of lame in that it does not use
jump tables, as I never got around to this...).

 
Mark R. Bannister

As an example:  Imagine that you currently have a giant switch table.  And you
replace it with a table of function pointers.  There have existed compilers
such that, when you do this, and write:

(*func[opcode])(...);

that your performance will INCREASE.  Dramatically.

If I did this, I'd have to pass all the same variables to every
function, wouldn't I? I have a large number of variables used by the
processing engine, which different instructions may or may not
manipulate. How could I implement a table of function pointers, but
still have the flexibility to send only the right arguments to the
right functions?
 
BGB / cr88192

James Kuyper said:
If you've got a single function that is almost 10,000 lines long, you've
almost certainly got lots of design errors and other bugs to work out.
Simplify the code by breaking it into smaller pieces, each of which can be
properly understood. Only after you have a coherent overall design up and
working correctly should you worry about details of efficiency.

it was said elsethread that this was produced over a period of years,
meaning that likely it is at least acceptably stable in this respect...

That's why that's the wrong thing to do. Create many different data
structures, each of them collecting together functionally related pieces
of data. Then any given function will generally need only a few of those
structures, and most functions will make use of a fair fraction of all of
the members of any structure they do use.

my usual approach is to have a central "context" which may or may not
contain sub-structures.
in my JBC interpreter, the 'frame' structure was generally passed instead,
since oddly the frame tended to be a more central structure than the main
context...

That's exactly the wrong attitude: good programming style leads to elegant
design and accurate code, and makes the code easier to maintain. Don't
worry about efficiency until you've got other, more important matters such
as correctness and clarity dealt with.


more so, smaller and cleaner code is usually much easier to optimize.

since, although a function call is not free, it is much easier to apply
fixes and optimizations to cleanly centralized chunks of code than to go and
find and fix all of these repetitions.

I will qualify something here:
IMO this applies primarily to the small scale, and in the case of very large
masses of code (by this I mean projects > maybe 200 kloc or so), it may be
better to actually use a copy-paste-specialize approach, since
modularization and isolation becomes a much more effective strategy for
"distant" regions of code than does centralization (sharing code in this
case makes a mess of modularization).


or, stated another way:
centralize and integrate at the small scale (< 25-50 kloc);
modularize and isolate at larger scales.

similarly, centralization improves flexibility at the small scale (by
reducing redundant code, ...);
whereas, centralization hurts flexibility at larger scales (by essentially
"crystallizing" the codebase);
....

however, on a much larger scale, and in a different way, this pattern seems
to repeat (me considering more "massive" projects, such as the GNU and Linux
world-space, ...). although, this is more in terms of disjoint projects and
inter-project dependencies.


or, at least, this is my personal experience...
 
Mark R. Bannister

"James Kuyper" <[email protected]> wrote in message
it was said elsethread that this was produced over a period of years,
meaning that likely it is at least acceptably stable in this respect...

I've been working on the project since 2002, although engine.c, the
subject of this thread, began life in Oct 2006. Yes, it's very stable.

If I were to functionise the code for each instruction, how clever is
the compiler when it comes to inlining functions? What I mean is, say
I have a variable called program_ptr. This is, by no surprise, the
program pointer. Each instruction needs to read the data at the
program pointer and advance the pointer, because each instruction may
have a different number of arguments (some instructions just keep
reading arguments until there are no more - you could literally pass
hundreds of arguments to some instructions if so required).

Currently program_ptr is a register. In the current scenario, it
remains as such. When functionised, of course, it can no longer be a
register as I need to pass it to a function. But what happens when
that function is inlined by an optimiser? If I've gone to the bother
of wrapping the program_ptr up into a structure to pass it to a
function that has been inlined, am I not wasting time wrapping up
something that only then needs to be immediately unwrapped? Or are
compilers clever enough to miss out the wrapping/unwrapping stage when
inlining? I suppose I'll only know for sure by trying it and then
disassembling the result....
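
In sketch form, the experiment would be something like this (names
invented; engine.c would need far more state in the structure):

typedef struct {
    const unsigned char *program_ptr;
    /* ... the rest of the engine state ... */
} Engine;

/* If the compiler inlines this, does program_ptr stay in a register
   across the call site?  Only the disassembly will say for sure. */
static inline unsigned char next_byte(Engine *eng)
{
    return *eng->program_ptr++;     /* read at program pointer, advance */
}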

There again I could take the approach I've taken for ages, which is
"who cares it's 9,942 lines today, it works, leave it alone".
Especially when one has a 3-month old baby and less & less time for
hobbyist programming :)
 
BGB / cr88192

As an example: Imagine that you currently have a giant switch table. And you
replace it with a table of function pointers. There have existed compilers
such that, when you do this, and write:

(*func[opcode])(...);

that your performance will INCREASE. Dramatically.

<--
If I did this, I'd have to pass all the same variables to every
function, wouldn't I? I have a large number of variables used by the
processing engine, which different instructions may or may not
manipulate. How could I implement a table of function pointers, but
still have the flexibility to send only the right arguments to the
right functions?
-->

in my interpreter, I split it up like this:
there is the 'Context' which holds the general state (actually, it is
composed partly of references to other structures, which hold more specific
state for specific subsystems, such as the MMU, ...);
then there is 'DecodeOp', which basically holds info relevant to the current
opcode (its opcode number, arguments, ...).

so, if you want to get, say, EAX (or maybe R11W, CR4, ...), then this info
is grabbed from the context.
if you want to know, say, which register is in 'Reg', or where RM points,
then this is figured from the DecodeOp (note: I have a special function to
figure out the value of RM at a given moment).

(it is worth noting as a technical impact of this: because I am only using a
single value for a field which can handle both registers and memory
operands, internally I ended up memory-mapping the registers, although not
as byte-addressable memory...).

actually, I ended up reserving the whole 3GB-4GB region for "internal" uses,
so:
0x00000000-0x7FFFFFFF: reserved for a simulated process;
0x80000000-0xBFFFFFFF: reserved for shared memory / ... (maybe, could be
made general-use);
0xC0000000-0xFFFFFFFF: reserved for internal use (current largest use:
swizzling references between disjoint address spaces, handles, ...).

0x100000000-... : only available for x86-64 simulation (yet, the prior
mapping remains in this case).


note: all of this is not strictly authentic (AKA: different from a real
CPU), however, the point of this interpreter is for userspace code, not as
an emulator (I can then safely diverge on points which would not affect
userspace code...).
 
BGB / cr88192

jacob navia said:
Mark R. Bannister wrote:
<snip>

I think the first step is to build an array of 256 function pointers (one
per byte value). For each position in the array, write a function that does
the work of the corresponding byte code.

Then, instead of a switch you write:

(functionTable[bytecode])();

Then, to avoid the overhead of returning just to make another call, write
as the last statement of each function:

(functionTable[bytecode])();

That will jump to the next function. Problem is, that is not a jump, but a
call. Eventually, you could run out of stack space. You can only get away
with this scheme if you get a compiler that allows taking the addresses of
labels and putting them in an array. GCC does that.

Another way is to write the core in assembler. I did that for my bytecode
interpreter for Lisp, and the speed was way beyond what C can ever achieve.

yep.

this is also an advantage of having a dynamic assembler available, as then
one can procedurally generate fine-tuned interpreter logic.

however, I did not do this for my x86 interpreter, choosing instead to write
it in C to have a working version first (ASM is all or nothing, and if one
messes up somewhere in terms of overall design/... well then, there is a bit
of a mess...).
 
bartc

Mark R. Bannister said:
I'm developing a new programming language, and at the core of this
language is a bytecode interpreter (yes, I've invented the bytecode
too, as well as assembly code and an assembler ...)

<snip>

The main structural problems have already been addressed; the code is so
inefficient at present that offloading the execution of each main switch
case to a function is not a significant overhead. (And you should use enums
for opcodes, not hex constants.)

You haven't said what performance speedups you're looking for. How fast
is it on a scale of C to Python at present? (Assuming your language does
similar things.)

For speed, you want to make as few runtime decisions as possible. From what
I can see, after each opcode is decoded, you have a series of if statements
and while loops to determine what to do next. And a call to a macro
PROGRAM_NEXT_ARG, which looks like a mess of code and which, if it doesn't
simplify down, is going to be a killer.

Does your bytecode have multiple operands (number not known until runtime)
and of a category (not type; that's a separate matter) not known until
runtime either? That will be slow to execute. Dedicated opcodes for each
combination are faster.

If your bytecode does use dynamic stuff like this, you need to be able to
determine things using a simple int value, to index a switch/array/function
array, not use random logic.

There are many other things, but at present the code seems very inefficient;
you want to completely minimise the number of lines executed per opcode
(unless it is a slow opcode anyway, such as implementing a=b where b is some
huge data).

When things get a bit tighter, you might want to look at changing how
opcodes are stored in the bytecode. Replace the index codes (00, 01, etc),
with the actual addresses of functions that deal with that opcode. Then the
main loop might look like this:

typedef void (**FFF)(void);   /* pcptr points at a stored handler address */

do {
    (**(FFF)pcptr)();
} while (1);

if you can find a neat way of exiting; otherwise it's more like: do ...
while (!stopped)

(At present I use asm code for the main dispatch code in my interpreters. To
execute the equivalent of your noop opcode takes just two x86 instructions
in overhead. In C code it would be an empty function with no parameters or
locals.)
 
bartc

Mark said:
As an example: Imagine that you currently have a giant switch table.
And you replace it with a table of function pointers. There have
existed compilers such that, when you do this, and write:

(*func[opcode])(...);

that your performance will INCREASE. Dramatically.

If I did this, I'd have to pass all the same variables to every
function, wouldn't I? I have a large number of variables used by the
processing engine, which different instructions may or may not
manipulate. How could I implement a table of function pointers, but
still have the flexibility to send only the right arguments to the
right functions?

Well, you just use globals, although that is frowned upon in this group (but
then so are 10000-line functions..).

Parameters only slow things down anyway.

One problem with switch (opcode) is that there will be code generated to
ensure opcode is in range (there may or may not be a way of turning this
off). And of course it needs to be in a dispatch loop.

One approach I tried with gcc was to make use of its label addresses, then
this line:

goto *pcptr;

would jump to the next handler for this opcode. That handler would then need
to step pcptr to the next instruction (my bytecode interleaved 'opcodes' and
operands; opcodes were replaced by handler addresses). It could then also
execute goto *pcptr itself, and the dispatch loop disappears!

To structure the code, I put each opcode handler into a function (with no
parameters or automatic locals), and tried to goto that function; however,
that didn't work, since gcc did some weird things with function prologs.

In the end I had to go outside of C to achieve this: fast threaded code and
encapsulating handlers inside 'functions' even though the functions are
never called and never returned from!
 
Gene

I'm developing a new programming language, and at the core of this
language is a bytecode interpreter (yes, I've invented the bytecode
too, as well as assembly code and an assembler ...)

<snip>

Optimizations at this level are always extremely dependent on the
compiler/processor/cache combination. A good compiler goes through a
lot of trouble to make switch statements efficient. For most
processors, a switch with many cases having similar values will be
translated to a jump table. This allows the dispatch to occur with
(usually) a couple of instructions. It's hard to do better with
dispatching, even hand-coding in assembly language. You might get a
tiny improvement by making sure your opcodes are in a contiguous
stretch starting at zero.

You should handle the common argument processing with functions
declared static (local) in the same file. Such functions are nicely
inlinable by nearly all modern compilers. Then it's easy to test with
inlining compiler flags (like -finline-functions in gcc) set both on
and off. Now and then, you will actually get speedier execution with
the functions _not_ inlined for code cache reasons. If you open code
all the argument processing, you don't get to try this.
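
For example (a sketch; the 16-bit little-endian operand encoding here is
invented):

#include <stdint.h>

/* Common argument processing in a file-local (static) function, which
   nearly all modern compilers can inline -- or not, per compiler flags. */
static int16_t read_i16(const uint8_t **pc)
{
    uint16_t v = (uint16_t)((*pc)[0] | ((*pc)[1] << 8));
    *pc += 2;
    return (int16_t)v;   /* assumes two's-complement conversion */
}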

You'll note that performance-critical systems will generally have an
"arch" directory containing a gazillion versions of inner loop code --
sometimes in assembly language -- customized for each operating
environment.

One other suggestion is to look at open source bytecode interpreters
for ideas. From past reading, Perl and Lua seem to be well-regarded
for speed, but I have no data to support this. You could look at one
of the "shoot out" sites for comparisons.
 
Mark R. Bannister

Well, you just use globals, although that is frowned upon in this group (but
then so are 10000-line functions..).

Can't use globals, because ultimately the engine will be multi-threaded.

So many people are offering so many useful suggestions, thank you.
You've all given me lots of ideas. Some of them are not feasible,
some of them I'll try. I'll see how I get on.
 
bartc

Mark said:
Can't use globals, because ultimately the engine will be multi-threaded.

In that case then perhaps just one parameter pointing to a complete
environment for that activation of the interpreter (I don't understand
threads).

In C terms, that probably means putting everything (that is specific to the
activation) into a giant struct. In practice that can mean a very slight
performance hit from having this extra pointer access.
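
Something like this, say (a sketch; the field names are invented):

/* All per-activation state behind one pointer, so each thread can run
   its own interpreter instance safely. */
typedef struct Interp {
    const unsigned char *pc;        /* program pointer   */
    int                 *sp;        /* operand stack top */
    int                  stack[256];
    int                  running;
} Interp;

static void engine(Interp *in)      /* the one extra parameter */
{
    while (in->running) {
        /* fetch *in->pc++, dispatch, ... */
        in->running = 0;            /* placeholder so the sketch halts */
    }
}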
 
Eric Sosman

James said:
If you've got a single function that is almost 10,000 lines long, you've
almost certainly got lots of design errors and other bugs to work out.
Simplify the code by breaking it into smaller pieces, each of which can
be properly understood. Only after you have a coherent overall design up
and working correctly should you worry about details of efficiency.

It's somewhat a matter of opinion, but I'll venture to
disagree with both James Kuyper and Richard Heathfield in this
instance. Ordinarily, I'd agree that a ten thousand-line function,
even a five thousand-line function, is too big and almost sure to
be too snarled. But for a function of the kind Mark Bannister
describes, I think the objection based on size largely (sorry)
disappears: Total size is no longer a good measure of complexity.

What Mark has is a function body that looks something like

{
    preamble();
    while (something) {
        switch (doodad) {
        case 0: ...
        case 1: ...
        ...
        case 999: ...
        default: abort();
        }
    }
    postamble();
}

... and with a structure like this the "size" to consider is
roughly the total size of preamble(), the gnarliest `case', and
postamble(). One thousand `case's of about ten lines each,
and there you have your ten thousand-line function, but the
conceptual burden is light.

I've come across (and written) big `switch' constructs like
this in three contexts: Lexical scanners, interpreters, and
state machines. (There are certainly others, but these three
seem most susceptible to `case' proliferation.) In none of these
situations does the total number of `case's significantly affect
the conceptual complexity -- state machines can get hairy, but
most of the hair grows on the state-to-state transitions, not on
the relatively bald bulk of the function containing them all.

Size doesn't *always* matter, lads.

That's exactly the wrong attitude: good programming style leads to
elegant design and accurate code, and makes the code easier to maintain.
Don't worry about efficiency until you've got other, more important
matters such as correctness and clarity dealt with.

"Elegant" is a word sometimes applicable, but more often
used as a synonym for "The Devil's in the details, and I shun
the Devil." There's an old joke about an engineer, a physicist,
and a mathematician asked to solve an aerodynamic drag problem:

The engineer proposes to build a detailed full-scale model
of the automobile, put it in a wind tunnel, and measure the
drag directly. It'll take several weeks, but it's The Way.

The physicist pooh-poohs the engineer's naïveté and says
he can write a huge system of nonlinear partial differential
equations for the drag and solve them on a supercomputer. It'll
take a solid week to run the solution (two, counting the re-run
after the bug is fixed), but the answer will come out Sooner.

The mathematician smiles indulgently at his colleagues'
folly, and says "Consider a spherical Ferrari ..."

The mathematician's approach was elegant. IMHO, we can
do without that sort of elegance.
 
