Tiny VM in C

Discussion in 'C Programming' started by tekk.nolagi, Feb 11, 2014.

  1. tekk.nolagi

    BartC Guest

    For strings and names, you can just use char *s instead of char s[...]. Then
    the size of that pointer will be (most likely with a 32-bit compiler) 4
    bytes. The name will then be stored elsewhere. But the initialisation of the
    string can be the same. (With unions, it's a bit uncertain how the thing is
    initialised anyway; I think it uses the type of the first member of the
    union.)

    But being a VM, it can implement strings and names as it likes, including
    using, instead, an index into a table of names (a table of char* in
    reality). Then you don't really need 's', but can just use 'i'. A char* will
    do however.
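
    A minimal sketch of the name-table idea, assuming a small fixed-size
    table. The names `name_table`, `intern_name`, `name_of` and `MAX_NAMES`
    are hypothetical and do not come from the thread; an operand would then
    hold only the int index.

```c
#include <stddef.h>
#include <string.h>

#define MAX_NAMES 64  /* assumed capacity, for illustration only */

/* Table of names; an operand stores an index into it rather than the text. */
static const char *name_table[MAX_NAMES];
static int name_count = 0;

/* Return the index of `name`, adding it if it is not present yet.
   Returns -1 when the table is full. The table keeps the pointer, not a
   copy, so the string must outlive the table (string literals are fine). */
int intern_name(const char *name)
{
    for (int i = 0; i < name_count; i++)
        if (strcmp(name_table[i], name) == 0)
            return i;
    if (name_count == MAX_NAMES)
        return -1;
    name_table[name_count] = name;
    return name_count++;
}

/* Map an index back to its name, or NULL if the index is out of range. */
const char *name_of(int index)
{
    return (index >= 0 && index < name_count) ? name_table[index] : NULL;
}
```

    A linear search is enough for a handful of names; a hash table would be
    the usual replacement once the table grows.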

    (Since this doesn't appear to relate to a real machine any more, I'll
    briefly describe a VM I use, which implements the byte-code of a language.
    The bytecode is represented as a linear array of int values, and might look
    something like this:

    C C X C C X Y C X Y Z ....

    Where C is a command (an opcode), and X, Y and Z are the first, second and
    third operands. So 'instructions' are variable length, but each part fits
    into a 32-bit int value.

    Each operand can be one of several kinds: when it is a 32-bit int, then it
    is directly stored in X, Y or Z. Otherwise it will be an index into a table,
    or sometimes even a pointer direct to a variable for example.

    To deal with the different interpretations of X, Y or Z, C casts are used
    as necessary (rather than unions; but if using pointers, you need to be
    sure they will fit into an int! So indices are the best bet.)

    The handler of each C opcode will of course need to know what type each
    operand is; also how many operands there are, to allow it to properly step
    the PC to the next bytecode. This is common also in real processors where
    instructions are of varying length.)
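
    The layout described above can be sketched roughly as follows. This is
    not BartC's VM: the opcodes (`OP_HALT`, `OP_INC`, `OP_LOADI`, `OP_ADD`),
    the register file and the function `run` are assumptions made up for
    illustration. Each handler knows its own operand count and steps the PC
    past them.

```c
#include <stdint.h>

/* Hypothetical opcodes; each handler knows how many operands follow. */
enum { OP_HALT, OP_INC, OP_LOADI, OP_ADD };

#define NUM_REGS 4

/* Run `code` until OP_HALT and return the value left in register 0.
   Operands are register numbers or immediate 32-bit values. Register
   numbers are not validated, to keep the sketch short. */
int32_t run(const int32_t *code)
{
    int32_t reg[NUM_REGS] = {0};
    int pc = 0;

    for (;;) {
        switch (code[pc]) {
        case OP_HALT:                       /* C       */
            return reg[0];
        case OP_INC:                        /* C X     */
            reg[code[pc + 1]] += 1;
            pc += 2;
            break;
        case OP_LOADI:                      /* C X Y   */
            reg[code[pc + 1]] = code[pc + 2];
            pc += 3;
            break;
        case OP_ADD:                        /* C X Y Z */
            reg[code[pc + 1]] = reg[code[pc + 2]] + reg[code[pc + 3]];
            pc += 4;
            break;
        default:                            /* unknown opcode: stop */
            return -1;
        }
    }
}
```

    Calling `run` on the array { OP_LOADI, 1, 40, OP_LOADI, 2, 1, OP_INC, 2,
    OP_ADD, 0, 1, 2, OP_HALT } walks four instructions of different lengths
    before halting.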
     
    BartC, Feb 20, 2014
    #61

  2. Looks like I can assign the string to the pointer *after the fact*, but
    not assign it inline with {curly brackets}.
     
    Maxwell Bernstein, Feb 20, 2014
    #62

  3. [Quoting BartC: "For strings and names, you can just use char *s instead
    of char s[...]."]

    When I am initializing the union, I get the following warning
    (attached), and then a segfault when I run the code.

    λ chaos tidbits → gcc pointer_union.c
    pointer_union.c:10:19: warning: incompatible pointer to integer conversion
    initializing 'int' with an expression of type 'char [6]'
    [-Wint-conversion]
    union Data d = {"hello"};
                    ^~~~~~~
    1 warning generated.

    I have a hash table... hmm.
     
    Maxwell Bernstein, Feb 20, 2014
    #63
  4. tekk.nolagi

    James Kuyper Guest

    char *p = "This is initialization";
    p = "This is assignment.";

    Curly brackets come in when initializing aggregate types (arrays or
    structures):

    char array1[] = {'H', 'e', 'l', 'l', 'o', ' ', '\0'};

    But for character types it's equivalent, and simpler, to use strings:

    char array2[] = "world!\n";
     
    James Kuyper, Feb 20, 2014
    #64
  5. tekk.nolagi

    BartC Guest

    [Quoting BartC: "(With unions, it's a bit uncertain how the ..."]

    This was what I meant. It's looking for an int value (the first member of
    the union). If you still have your original union, then it might store some
    letters from "hello", without a zero terminator, which may well cause the
    segfault.

    A char* here would be easier to initialise; but you might have to put an
    (int) cast in front of it. However, this will fail on a system where a char*
    is wider than an int. It gets messy! Another way is to have the long long
    member first, then cast to that.

    Unions don't work well when you want to initialise certain fields.
     
    BartC, Feb 20, 2014
    #65
  6. tekk.nolagi

    James Kuyper Guest

    There's no uncertainty: the provided initializer is used to initialize
    the first member of the union.
    It would appear that the first member of "union Data" has the type
    'int'. Since that is not an array of character type, the string gets
    converted to a pointer to the first element of the array, and there's no
    implicit conversion from that pointer to int (though you could put one
    in explicitly by using a cast, if that were what you wanted to do).

    In C90, you should put the field you want to initialize most often at
    the beginning of the union. To initialize any other member of the union,
    you'd have to do so by assignment, not by initialization.

    In C99, designated initializers were added. Among other things, these
    allow you to initialize members of a union other than the first one:

    union Data d = {.greeting = "hello"};

    If greeting is a pointer, this is equivalent to

    union Data d;
    d.greeting = "hello";

    If greeting is an array of char, it's more accurately equivalent to

    union Data d;
    strncpy(d.greeting, "hello", sizeof d.greeting);
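
    A short sketch of both cases. The thread never shows the definition of
    union Data, so the two unions below (`union Data` with a pointer member,
    `union Text` with an array member) and the helper `greetings_match` are
    assumptions for illustration only.

```c
#include <string.h>

union Data {
    int i;
    const char *greeting;    /* pointer member */
};

union Text {
    int i;
    char greeting[8];        /* array member */
};

/* C99 designated initializers select a member other than the first. */
static union Data d = { .greeting = "hello" };
static union Text t = { .greeting = "hello" };

/* The pointer member refers to the literal; the array member holds a copy
   of the characters, followed by zero bytes up to the end of the array. */
int greetings_match(void)
{
    return strcmp(d.greeting, t.greeting) == 0;
}
```

    With the int member first, a plain { "hello" } would still target `i`,
    which is what produced the warning earlier in the thread.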
     
    James Kuyper, Feb 20, 2014
    #66
  7. tekk.nolagi

    BartC Guest

    So in the first few bytes of the char array, it gets a pointer to the string
    (instead of the first few letters)?

    Suppose the char array was the first member; how easy would it be to
    initialise the other members through casts?
     
    BartC, Feb 20, 2014
    #67
  8. tekk.nolagi

    James Kuyper Guest


    If the char array you're referring to is another member of same union,
    and sufficiently long, and if 'int' is large enough to store the
    complete representation of a pointer, and if the conversion from pointer
    to int is defined to be representation-conserving, then what you say
    would be true. The first two conditions are matters under your control.
    The second two are under the implementation's control; and it's probably
    not uncommon for both to be true - but the standard imposes no such
    requirement.

    I certainly wouldn't recommend writing code to rely upon those
    assumptions. If you wanted to do something like that, put either an
    intptr_t member or an actual pointer member in the union, and use it
    directly.
    Without designated initializers, you can't initialize anything other
    than the first member. However, by initialization of that member with a
    carefully chosen character string, you can guarantee the value that
    would be read if any of the other members were accessed - but this
    requires use of implementation-specific information about how that
    member is represented, and the resulting code would not be portable
    anywhere where that information didn't apply.

    For instance, assuming CHAR_BIT == 8 and sizeof(unsigned long)==4,

    union {
        char c4[4];
        unsigned long ul;
    } data = {"\001\002\003\004"};

    is likely to give data.ul a value of either 0x01020304 or 0x04030201,
    though there have been popular machines where other values might be
    seen: 0x02010403 and 0x03040102 being two of the most popular alternatives.

    I would NOT recommend this approach, due to being both obscure and
    non-portable. However, I've known a fair number of people who would like
    the idea a lot better than I do.
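
    The same experiment can be written so that it reports which layout the
    implementation uses. This sketch swaps unsigned long for uint32_t (so the
    4-byte assumption holds on 64-bit systems too) and assumes CHAR_BIT == 8;
    `union Bytes` and `byte_order` are hypothetical names.

```c
#include <stdint.h>

union Bytes {
    unsigned char c4[4];
    uint32_t u32;
};

/* Returns 1 if the least significant byte is stored first (little endian),
   0 if the most significant byte is stored first (big endian), and -1 for
   any other layout. Reading a member other than the one last written
   reinterprets the stored bytes, so the result is implementation-specific. */
int byte_order(void)
{
    union Bytes data = { { 1, 2, 3, 4 } };   /* initializes c4, the first member */

    if (data.u32 == UINT32_C(0x04030201))
        return 1;
    if (data.u32 == UINT32_C(0x01020304))
        return 0;
    return -1;
}
```

    As in the original example, this is a demonstration of the representation
    issue rather than something to build portable code on.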
     
    James Kuyper, Feb 20, 2014
    #68
  9. A union initializer that doesn't name a member defaults to the first
    member, which apparently is of type int in your case.

    As of C99, you can use a designated initializer:

    union Data d = { .foo = "hello" };
     
    Keith Thompson, Feb 20, 2014
    #69
  10. Sorry, I meant initialize.
    I have it in a union; that is giving me trouble.
     
    Maxwell Bernstein, Feb 20, 2014
    #70
  11. [Quoting BartC: "This was what I meant. It's looking for an int value
    (the first member of ..."]

    Ah, I see. I do have a char* but it is freaking out :-/ Darn.
     
    Maxwell Bernstein, Feb 20, 2014
    #71
  12. [Quoting James Kuyper: "In C99, designated initializers were added.
    Among other things, these ..."]

    Ah, so is there no easy way to "not care" about the type on initialization?
     
    Maxwell Bernstein, Feb 20, 2014
    #72
  13. [Quoting James Kuyper: "It would appear that the first member of
    "union Data" has the type ..."]

    That would be interesting.
     
    Maxwell Bernstein, Feb 20, 2014
    #73
  14. tekk.nolagi

    BartC Guest

    The C99 method given at the top will work; the other examples are what it is
    equivalent to, not what you have to code.

    In your case, it would be {.s="Hello"}, which isn't too bad.

    Alternatively, you could separate the string argument from the other three,
    which can still be in a union. It will add four bytes or so to an argument,
    but then you were earlier using 32 bytes for one.

    The argument becomes roughly struct {union {i, r, ll}; s}, and its
    initialisation might be: {0, "Hello"} for a string (and {N, NULL} for
    numeric). The other three all being int types of some kind, initialising any
    of them with an int constant is simpler (but I'd put ll first, to avoid
    problems with setting the top half of ll).

    But there are many different ways of doing this. I understand the problem is
    finding a painless way of writing code sequences by hand, using C data
    initialisation.
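
    One possible concrete form of that struct, as a sketch. The type name
    `struct arg`, the union member name `num` and the helper `arg_is_string`
    are assumptions; only the member names i, r, ll and s come from the
    thread.

```c
#include <stddef.h>

/* Hypothetical argument layout: the numeric members share a union and the
   string sits beside it. `ll` comes first so that a plain integer
   initializer sets the whole long long value. */
struct arg {
    union {
        long long ll;
        int i;
        int r;
    } num;
    const char *s;
};

/* Positional initializers are enough here; no designators are needed. */
static const struct arg args[] = {
    { {42}, NULL    },   /* numeric argument */
    { {0},  "Hello" },   /* string argument  */
};

/* In this sketch an argument counts as a string whenever `s` is non-NULL. */
int arg_is_string(const struct arg *a)
{
    return a->s != NULL;
}
```

    The cost is one extra pointer per argument, in exchange for initializers
    that never need a cast or a member prefix.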
     
    BartC, Feb 21, 2014
    #74
  15. [Please don't delete attribution lines when you post a followup.]

    Using casts to initialize other union members would almost certainly be
    a bad idea.

    Suppose you have a union like:

    union u {
        int n;
        char *s;
    };

    You can write:

    union u obj = { 42 };

    and it will initialize obj.n to 42, because n is the first member.
    You can't use the same syntax to initialize obj.s.

    I think BartC is suggesting something like:

    union u obj = { (int)"hello" };

    That *might* appear to work. It takes a char* pointer value, converts
    it to int, and stores it in the int member of the union. Can you then
    access the char* member of the same union and expect it to point to the
    string "hello"? The C standard certainly doesn't guarantee that.

    A cast converts a value from one type to another; it doesn't just
    reinterpret the representation. Conversions between one pointer type and
    another, or between a pointer and an integer of the same size,
    *commonly* do just that, but you shouldn't depend on it.

    A clearer example: if a union contains an integer and a floating-point
    value, then a cast *certainly* doesn't just reinterpret the
    representation; it converts a number from one representation to another.

    There's no need to use a cast like that. If your compiler supports
    designated initializers (standard since C99), you can directly
    initialize any member you want. If not, you can just assign to that
    member.
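
    The integer and floating-point case can be made concrete with a small
    sketch. The function names `convert` and `reinterpret` are hypothetical,
    and the second function assumes a 32-bit float, which is common but not
    guaranteed by the standard.

```c
#include <stdint.h>
#include <string.h>

/* C11 static assertion: the byte copy below needs equal sizes. */
_Static_assert(sizeof(float) == sizeof(int32_t), "float must be 32 bits");

/* A cast converts the value: the integer 1 becomes the float 1.0f. */
float convert(int32_t n)
{
    return (float)n;
}

/* Copying the bytes keeps the bit pattern and changes only the type, which
   is what reading a different union member amounts to. */
float reinterpret(int32_t n)
{
    float f;
    memcpy(&f, &n, sizeof f);
    return f;
}
```

    On an implementation with IEEE 754 floats, reinterpreting the integer 1
    gives a tiny nonzero value rather than 1.0f, which is why the two
    operations must not be confused.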
     
    Keith Thompson, Feb 21, 2014
    #75
  16. What's the best and most painless way to go about this? I'd like to
    have the simplest code possible.
     
    Maxwell Bernstein, Feb 21, 2014
    #76
  17. tekk.nolagi

    BartC Guest

    The approach using the C99 method seems reasonable enough (and you seem to
    have adopted this method in your latest code).

    But if you're going to be writing a lot of code for this VM, you might want
    to start looking at some sort of assembly language for it. Then the code
    will change from looking like this:

    {CARP_INSTR_LOADI, {{.r=CARP_REG0}, {.ll=0}}}, // loadi r0 0

    to just this:

    loadi r0,0

    Clearly much shorter and simpler, and you don't need the comment! A language
    will also take care of labels (I think at present you have to count
    instructions and insert the index, which makes mods much harder).

    The trouble is, an assembler is a *lot* of work, probably bigger than your
    project at the moment. So forgetting that for the time being, you can at
    least shorten some of the names; the above line can be:

    {CI_LOADI, {{.r=REG0}, {.ll=0}}}, // loadi r0 0

    (I'm not familiar with how the different arguments are used, but I might
    lose the distinction between .i and .r at least, maybe also .ll; just have a
    single long long integer argument (as well as the char*), and have this
    first in the union, so that most lines don't need a prefix, and this example
    becomes:

    {CI_LOADI, {{REG0}, {0}}},

    This is now clear enough that you can dispense with the comment! A shame
    about the inner {,}, that's because these are still unions.)

    For labels, it's not clear what you do now, but I might introduce a new
    command to define a label:

    {CI_LABEL, {{.s="loop"}}}, // loop:

    and then you can use this label as:

    {CI_JUMP, {{.s="loop"}}}, // jump loop

    Some simple pre-processing can then convert label references such as "loop",
    to the index of the CI_LABEL command. (And perhaps can also remove the
    CI_LABEL, which is otherwise a NOP, although this is tricky as all indices
    will change too).

    (I've used strings for the labels, but you can just use integers too, and
    the pre-processing is simpler:

    {CI_LABEL, {{100}}}, // L100:
    {CI_JUMP, {{100}}}, // jump L100 )
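
    A rough sketch of that pre-processing pass for integer labels. The
    instruction layout here is deliberately simplified to one opcode and one
    integer argument; `struct instr`, `find_label` and `resolve_labels` are
    hypothetical names, and only CI_LABEL and CI_JUMP come from the post.

```c
#include <stddef.h>

/* Hypothetical, simplified instruction: one opcode, one integer argument. */
enum { CI_NOP, CI_LABEL, CI_JUMP };

struct instr {
    int op;
    long long arg;
};

/* Find the index of the CI_LABEL instruction carrying `label`,
   or -1 if the program defines no such label. */
static long long find_label(const struct instr *code, size_t n, long long label)
{
    for (size_t i = 0; i < n; i++)
        if (code[i].op == CI_LABEL && code[i].arg == label)
            return (long long)i;
    return -1;
}

/* Rewrite every CI_JUMP so its argument is an instruction index instead of
   a label number. CI_LABEL entries stay in place and act as NOPs, so no
   indices shift. Returns 0 on success, -1 on an undefined label. */
int resolve_labels(struct instr *code, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (code[i].op != CI_JUMP)
            continue;
        long long target = find_label(code, n, code[i].arg);
        if (target < 0)
            return -1;
        code[i].arg = target;
    }
    return 0;
}
```

    Leaving the CI_LABEL entries in the array avoids the renumbering problem
    mentioned above, at the cost of executing a NOP at each label.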
     
    BartC, Feb 21, 2014
    #77
  18. An alternative is to use the pre-processor. You can use the ## token
    joining operator to write this:

    #define I2(op,rn,v) \
    {CARP_INSTR_##op, {{.r=CARP_REG##rn}, {.ll=(v)}}}

    I2(LOADI, 0, 0)

    You'd need variations on this theme depending on what sort of
    instruction is being generated. Probably good enough for simple
    testing.
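
    For the macro to compile on its own, it needs surrounding definitions.
    The enums, `union carp_arg` and `struct carp_instr` below are guesses
    based on the initializers quoted in the thread, not the real project's
    declarations; only the I2 macro itself is taken from the post.

```c
/* Hypothetical definitions, inferred from the initializers in the thread. */
enum { CARP_INSTR_HALT, CARP_INSTR_LOADI };
enum { CARP_REG0, CARP_REG1 };

union carp_arg {
    int i;
    int r;
    long long ll;
    const char *s;
};

struct carp_instr {
    int op;
    union carp_arg args[3];
};

/* Two-operand form: `op` and `rn` are pasted onto the prefixes with ##. */
#define I2(op, rn, v) \
    {CARP_INSTR_##op, {{.r = CARP_REG##rn}, {.ll = (v)}}}

static const struct carp_instr program[] = {
    I2(LOADI, 0, 7),    /* loadi r0 7  */
    I2(LOADI, 1, 35),   /* loadi r1 35 */
};
```

    Instructions with a different operand pattern would each need their own
    macro variant, as the post notes.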

    <snip>
     
    Ben Bacarisse, Feb 21, 2014
    #78
  19. [Quoting BartC: "The approach using the C99 method seems reasonable
    enough (and you seem to ..."]

    Yeah, I took your suggestion.

    Indeed. I plan on writing a lexer & parser soon. The comments are for
    other people at the moment, or future me.

    At the moment, I just use the array index.

    Yes and no; I'd like to keep the prefixes so that the namespace is
    decently clean. I'm not bothered by the length so much. The difference
    between .r and .i is that .i is used explicitly for integer operations.
    I have removed the only use and field in the union. I use the .r to
    differentiate between "register" and "value". As far as I can tell, it
    does not take up any more space in the union.

    I do like this idea, though my first thoughts are:
    a) why not just have a hash table?
    b) why not just use numbers?

    Aha.
     
    Maxwell Bernstein, Feb 21, 2014
    #79
  20. [Quoting Ben Bacarisse: "An alternative is to use the pre-processor.
    You can use the ## token ..."]

    Well damn, I didn't think of that.
     
    Maxwell Bernstein, Feb 21, 2014
    #80