Tiny VM in C

Discussion in 'C Programming' started by tekk.nolagi, Feb 11, 2014.

  1. I also posted my Debi disassembler ported to C/C++ using the CodeLite IDE
    by Eran Ifrah. I haven't tested it in a few years, but so far as I can
    remember it worked correctly on all instructions I supported. You can see
    the opcode map there as well, for the one-byte, and two-byte instructions.

    See stand-alone DebiX86Disassembler:
    https://github.com/RickCHodgin/libsf/tree/master/unsorted

    The format it generates is similar to that which you see in the Debi
    debugger video (IIRC).

    Best regards,
    Rick C. Hodgin
     
    Rick C. Hodgin, Feb 13, 2014
    #41
    1. Advertisements

  2. tekk.nolagi

    BartC Guest

    I'm developing an interpreter for a stack-based bytecode language, which
    uses dynamic types. It's a simple interpreter with no JIT, but manages about
    4-5x slower than un-optimised C code. This is for simple numeric benchmarks;
    with anything more elaborate, the difference seems to narrow (but I haven't
    yet got anything significant to test can be directly compared).

    However, to get this I needed to use optimised C in the interpreter plus
    some non-portable features (ASM for example), otherwise it might be nearer
    10x slower in standard (but still optimised) C.

    Of course for real applications, full of string processing, file-handling
    and library calls, the difference between executing this bytecode and even
    the optimised C equivalent, is generally not significant (it takes 150 msec
    or so to compile the largest source file I have; C could do it in 0 msec,
    but I wouldn't really notice the difference!)
     
    BartC, Feb 13, 2014
    #42
    1. Advertisements

  3. Way too late now, but in the days of IA32 processors with 36 bit
    physical addressing, but only a 32 bit MMU, it would not have taken
    much extra to allow good use for the 36 bits.

    What was missing from the 80286 and later was a segment descriptor
    cache, which would have sped up access to multiple segments.
    Large model isn't all that hard to use, especially with 32 bit
    offsets, and could have allowed many more years of 32 bit processors.
    Even now, with the popularity of 64 bits, systems with more than
    64GB RAM are extrememly rare.

    But the value of "64" on the outside is much more than the actual
    need for 64 bit addressing in selling chips.

    -- glen
     
    glen herrmannsfeldt, Feb 13, 2014
    #43
  4. tekk.nolagi

    BGB Guest

    my faster plain interpreters are typically still around 10x slower than
    native, but this is generally with a plain C interpreter (usually, I am
    doing an interpreter for the portable backend, with a JIT on certain
    targets to make things a little faster).


    most of my recent JITs tend to produce a mix of threaded code (function
    calls) and short segments of direct machine code (mostly using
    prefabricated ASM code globs), hence, "naive".

    while fancier JITs are possible, they are a lot more complicated and a
    lot more painful to maintain and debug (especially if multiple targets
    are involved).


    a lot of my benchmarks had been things like running sorting algorithms
    and similar (like sorting largish arrays of numbers, ...).

    as well as some numbers of micro-benchmarks, ... (calling functions or
    methods in a loop, ...).


    for things like array-sorting and similar, it is about 2x-3x slower than
    C, though there are other factors which could influence things (32-bit
    raw pointers and no bounds-checks vs 64-bit array references and
    bounds-checked arrays, ...), in addition to things like extra operations
    related to variable load/store and stack-management stuff.

    cross-language function calls are also a bit expensive due to various
    factors (mostly involving argument marshaling and so on).


    but, anyways, consider you have something like: "y=m*x+b;"

    in a stack-based IL, this might look like:
    LLOAD_I m
    LLOAD_I x
    MUL_I
    LLOAD_I b
    ADD_I
    LSTORE_I y

    whereas a register IR could do:
    MUL_I t0, m, x
    ADD_I y, t0, b

    and, comparatively, a naive JIT could produce fewer instructions for the
    register case, ...


    for example, naive stack-JIT output:
    mov eax, [ebp-12] ;LLOAD_I
    mov [esp+4], eax
    mov eax, [ebp-16] ;LLOAD_I
    mov [esp+0], eax
    mov eax, [esp+4] ;MUL_I
    mov ecx, [esp+0]
    imul eax, ecx
    mov [esp+4], eax
    mov eax, [ebp-8] ;LLOAD_I
    mov [esp+0], eax
    mov eax, [esp+4] ;ADD_I
    mov ecx, [esp+0]
    add eax, ecx
    mov [esp+4], eax
    mov eax, [esp+4] ;LSTORE_I
    mov [ebp-20], eax

    vs, naive register-JIT output:
    mov eax, [ebp-12] ;MUL_I
    mov ecx, [ebp-16]
    imul eax, ecx
    mov [ebp-24], eax
    mov eax, [ebp-24] ;ADD_I
    mov ecx, [ebp-8]
    add eax, ecx
    mov [ebp-20], eax


    these differences largely disappear if the JIT is smart enough to use a
    register allocator and peephole optimizer, but assumed here is a JIT
    that is effectively too stupid to use these.


    though, granted, a person can be "clever" and squeeze a little more
    speed out of the stack-machine, say, via adding magic like:
    LL2_MUL_I m, x ;load and multiply m and x, pushing result
    LL_ADD_I b ;add b to top-of-stack
    LSTORE_I y ;store into y

    which can make things a little faster, mostly at the cost of adding
    large numbers of special cases.

    this works, but makes things a lot more hairy.

    yeah.

    it depends on what one is doing.


    performance isn't really a huge issue at present for using it mostly for
    things like game logic tasks (enemy AIs, ...) and some UI stuff, and
    things like 2D and 3D animation tasks, ...

    mostly this consists of function calls, messing with objects and arrays,
    some amount of vector math, ...

    the present form of the language more-or-less resembles ActionScript3
    (mostly similar syntax and semantics, ...).


    its current backend is a (mostly) statically-typed stack-machine (it
    started out dynamically-typed, but static types were retrofitted onto it
    afterwards).

    a few efforts had been considered to move the main script language/VM
    over to a register-based fully-statically-typed backend, but not enough
    effort has been made on this front to actually go somewhere with this.



    a place where speed started looking to be an issue though was for
    possibly using script-code for video processing, where things have to be
    pretty fast when manipulating pixels not to drive the performance into
    the ground.

    so, there was some effort to try to design/implement a backend that
    could hopefully be C-competitive for this use-case (built around a
    vector-oriented register IR).

    however, this fizzled out some, and most of this sort of code remains as
    plain C (generally compiled with optimizations, as well as
    micro-optimized, and using SSE intrinsics, ...), mostly as my stuff
    still tends to fall a bit behind C here, and even then C has a hard-time
    keeping up (such as running chroma-key filters and compositing multiple
    video-streams and re-encoding to another output stream, where doing this
    CPU-side tends to quickly go sub-real-time, ...).

    for real-time, mostly it ends up boiling down mostly to shoving
    everything over to the GPU and doing much of the rest of the "real work"
    via fragment shaders (IOW: moderately fast codecs and a lot of PBOs and
    similar, as per-pixel filtering / image scaling / blending / ... is just
    so much faster on the GPU).

    it started looking a bit like there was little hope of making the
    script-code performance competitive within any sane level of effort.


    this basically means a lot of performance-sensitive stuff (image/video
    processing, A/V codecs, 3D rendering, ...) mostly still needs to remain
    as C (with most real-time per-pixel stuff and effects being done on the
    GPU).

    nevermind being like:
    var fragShader:string="""
    uniform sample2D texFoo;
    uniform sample2D texBar;
    ...
    """;

    but this isn't quite the same thing...


    or such...
     
    BGB, Feb 14, 2014
    #44
  5. tekk.nolagi

    BartC Guest

    In this simple case; with more complex expressions, function calls (where
    arguments have to be stacked anyway) the stack model will take everything in
    its stride and is very simple to generate code for.

    But I have a problem anyway because of the way my execution model works: I
    only use dynamic, tagged types, I manipulate all data by moving 16-byte
    descriptors around, and don't use a garbage collector. This makes a register
    model impractical (where to store the register contents? they will end up
    being written to memory anyway, it might as well be a stack).

    Actually, this seems hopelessly inefficient on the face of it (much better
    just to manipulate pointers), but I could never make a pointer model work
    (without complex GC or reference counting or such). The results however
    still largely outperform other dynamic languages that don't use acceleration
    or advanced tracing techniques or whatever.
    I've tried lots of those approaches (type-hinted source, some type
    inference, a thousand bytecodes, then a few generic bytecodes that are
    mapped at loadtime to whatever specific ones are implemented...). But it
    always gets very complex, very hairy, and there seemed an upper limit
    (always about 2x it seemed) on how much faster it made things. I just
    accepted that this was a dynamic language and would be a magnitude slower
    than native code, for benchmarks anyway. Clean source code and a simple
    interpreter is worth something too...
    I'd like to have a go at creating an interpreter for a static language, just
    to see what I can do with it. It will need static types but dynamic
    everything else (otherwise C will do). (But today I've just finished my new
    front-end to C - writing C with my own syntax - that will have to do for a
    while before working on yet another language!)
     
    BartC, Feb 14, 2014
    #45
  6. (snip, I wrote)
    I thought there was an NT Alpha, though I never had one.

    I did have a trial version of Win2K for IA64.
    Well, MS did support PAE, but that takes a lot of extra
    work for the program.
    I suppose, but a cache would have helped more. It isn't so
    easy for a compiler to figure out which order segments
    are loaded, and so assign the segment registers.
    But a few bits would have lasted a long time.

    -- glen
     
    glen herrmannsfeldt, Feb 14, 2014
    #46
  7. I wrote my first "real" program in this assembly code: sum the numbers from 1 to 100. I'm going to keep writing simple programs to figure out what I might need.
     
    Maxwell Bernstein, Feb 14, 2014
    #47
  8. Awesome. Keep up the good work. :)

    Best regards,
    Rick C. Hodgin
     
    Rick C. Hodgin, Feb 14, 2014
    #48
  9. (big snip regarding addressing of Intel processors)
    Hmm. FS and GS came late enough that I never knew any program
    to use them. I was running OS/2 1.0 in, I believe 7MB of RAM
    in early 1990. That was when most people were running DOS in 640K,
    maybe with a virtual disk in extended memory.
    For the 80286, with a 16 bit data bus, it had to load the selector,
    plus the whole descriptor, for each one. Since you pretty much need
    to keep DS on the actual data segment, that meant loading ES for
    every access. Still, much better than you could do without it.
    But much more fun in protected mode running OS/2.

    (snip, regarding more than 32 address bits)
    The Watcom compilers would generate large model code with 32 bit
    offsets. The only OS that I know even thought about supporting
    it was OS/2, but I don't know that they did. At the time I
    don't know that any boards supported more than 4GB.
    -- glen
     
    glen herrmannsfeldt, Feb 14, 2014
    #49
  10. tekk.nolagi

    BGB Guest


    my case is a little more complex:
    there are raw/unboxed types, such as raw integers or floats or doubles;
    there are boxed types, typically a pointer to some other type (a value
    or structure, ...);
    there are raw pointers and box-pointers (a pointer to a structure
    representing a bounds-checked pointer);
    there are also "variant" types (used for dynamically-typed values),
    which in this VM, are implemented as 64-bit tagged references (*).


    *: tagging scheme described here:
    http://cr88192.dyndns.org:8080/wiki/index.php/Tagged_references


    basic type information is encoded in the bytecode, however, the VM
    backend is allowed a little freedom as to how they are represented in
    terms of more concrete types.

    in the current JIT, generally everything still gets a 64-bit spot, and
    will be converted to/from tagged references as needed (a 32-bit item
    will only use the low 32-bits of the spot).


    as for object lifetime:
    currently I am mostly using a mix of GC and manual memory-management
    (you can leave it up to the GC, or "delete" objects early);
    there are also "struct" which have C# like semantics, and "value
    classes" (basically, classes with more C++ like behavior, essentially
    structs with inheritance and methods).

    the JVM also has a few nifty tricks here (local vs global objects, ...),
    but I haven't gotten around to ripping them off.

    I don't currently use reference counting, as personally I have generally
    found it to be "more trouble than it is worth".

    I did so, and largely just lived with a lot of this sort of hair.

    this lasted until I started making some more significant changes to the
    type-system, at which point a lot of these operations either got dropped
    or partly repurposed (via prefixes).

    but, yeah, there are still a lot of these sorts of special-case
    operations in the current stack-based VM.

    it is also a bit mix/match between whether the opcode is type-specific,
    or encodes the type via a prefix (generally, type-specific opcodes are
    used for more common operations, with prefixes for less common
    operations and edge-cases).


    the cost of the stack can also be partially reduced by internally
    mapping stack-elements to fixed-position spots (basically, if one
    imposes similar constraints onto the stack as in the JVM, *2, it makes
    some things a little more straightforward and allows optimizing away a
    lot of internal stack-position adjustments and similar).

    this can also be used to map stack-machines onto a register-machine
    backend, which has been considered a few times (there has been some
    internal debate about, if moving to a fully register machine backend,
    whether this would mean also moving to a new bytecode, or just map the
    existing stack-based bytecode onto a register-machine as an intermediate
    step).


    *2: along all control paths leading to a particular label / jump-target,
    it is necessary for the stack to have the same layout (same number of
    elements with the same types).

    as can be noted, the actual high-level language in question here can use
    either static or dynamic types.

    more the static typing is done mostly by a combination of some explicit
    types, and also via type-inference, however dynamic types still remain
    available as well.

    for example, if a variable is declared without a type given, it is up to
    the compiler whether it will exist as a dynamically-typed variable, or
    be treated more like the C++ "auto" type (and converted into a
    statically-typed variable of a given type).


    it is mostly that if you take a language like JavaScript or similar, and
    run it purely with dynamic type-checks, the raw performance will be
    pretty bad. with type-inference, a lot of the types can be nailed down,
    and the performance can be made a lot better (without otherwise
    constraining the language semantics).


    however there are limits:
    cases where the type-inference logic can't figure out the types (and so
    would unnecessarily use a dynamically-typed variable);
    cases where an otherwise obvious type-error exists, where a
    statically-typed language would be like "WTF are you doing here?!" but a
    dynamically typed language would just accept it and blow up at runtime
    (bad, compile-time type-errors are better here);
    ....

    so, static type declarations can help here.

    for example:
    var s:string;
    ....
    s=42;
    the compiler can then safely blow up with a type error.


    well, also the compiler will warn about implicit conversions to smaller
    numeric types:
    var i:int, s:short;
    ....
    s=i; //warning
    s=i as short; //OK


    I guess to clarify:
    what I meant by "fully statically typed backend" isn't so much about
    fully eliminating dynamic types (from the language or VM), but rather
    about statically determining all the types (meaning all cases of variant
    types are statically known as well).

    in the current VM it is a bit more hit or miss, so the VM is like "oh
    crap, don't know the types here, better use variant." this kind of sucks
    as it leaves a lot of big holes in the type-system.
     
    BGB, Feb 15, 2014
    #50
  11. (snip, I wrote)
    The one I was working on did Successive OverRelaxation (SOR) for
    a solution to the Poisson equation in 2D. It does do a lot of load
    and store in arrays in large model, so it might be every third.

    I did sometimes look at the generated code, but there wasn't so much
    that I could do about it.

    -- glen
     
    glen herrmannsfeldt, Feb 18, 2014
    #51
  12. I am starting to represent commands (instruction+arguments) as unions in
    C. That seems at first glance to be a pretty good solution. Is there
    anything that I am missing?
     
    Maxwell Bernstein, Feb 19, 2014
    #52
  13. Sounds fine to me. You might also want to look at bit-fields. They are
    really only a space-saving measure, but the thread title suggests the
    size might matter.
     
    Ben Bacarisse, Feb 19, 2014
    #53
  14. Sounds fine to me. You might also want to look at bit-fields. They are
    I shall take a look at bit fields. The thread title was more of a
    reference to the size of the code than the memory footprint.
     
    Maxwell Bernstein, Feb 19, 2014
    #54
  15. Ah, well bit-fields probably just introduce unnecessary costs and
    complexity. Worth knowing about them, but probably not helpful here.
     
    Ben Bacarisse, Feb 20, 2014
    #55
  16. I shall take a look at bit fields. The thread title was more of a
    Aha, okay. I'll look into it for reference. As for the VM in general:
    any thoughts?
     
    Maxwell Bernstein, Feb 20, 2014
    #56
  17. tekk.nolagi

    BartC Guest

    What is it a solution for?
     
    BartC, Feb 20, 2014
    #57
  18. tekk.nolagi

    BartC Guest

    OK, I've just looked at your project. I assume you're referring to this:

    typedef union carp_argument {
    unsigned int i;
    carp_register r;
    long long ll;
    char s[CARP_NAME_LENGTH];
    } carp_argument;

    typedef struct carp_command {
    carp_instruction instr;
    carp_argument args[CARP_NUM_ARGS];
    } carp_command;

    That's fine if it's what you want to do. Provided you realise that the size
    of the carp_argument union will be that of the largest item, in this case
    the 32-byte name 's'.

    And since you have 3 of these in carp_command, that each instruction will be
    about 100 bytes long.

    I know I said, since originally each command I think was packed into a
    16-bit value, that it wasn't necessarily to pack things that tightly, you
    could make life a bit easier. But this is going to the other extreme!

    But also, this is now unlike any real machine. I'm not clear what 's' is
    for, but references to strings are usually handled by a pointer to a region
    of memory that contains the string. (And if 's' is meant to be the name of a
    variable, then in a real machine, names aren't used at all, they are symbols
    in the source code (of assembler etc) that are replaced in the machine code
    by addresses or offsets of the area in memory where the variable resides.)

    (The rest of the code seems to be nicely written, however - and this is
    purely my own preference - I had trouble dealing with the myriad little
    headers. Fewer headers, for example one containing all the constants, enums,
    and types that described how the virtual machine code is represented, would
    have made navigation a bit simpler.)
     
    BartC, Feb 20, 2014
    #58
  19. It is a solution for not worrying about having multiple fields in the
    command struct for arguments. Previously I had args and string_args.
     
    Maxwell Bernstein, Feb 20, 2014
    #59
  20. I see what you mean; things are rather large.
    How would you recommend I deal with pointers? How would that be different?
    Ah, thank you. I'll look into the header issue.
     
    Maxwell Bernstein, Feb 20, 2014
    #60
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.