Good Representation for an Abstract Syntax Tree

Discussion in 'C Programming' started by johan.tibell@gmail.com, Jul 18, 2006.

  1. Guest

    I'm in the process of writing an interpreter for lambda calculus (i.e.
    a small functional programming language) in C. I've previously written
    one in Haskell so I understand at least some of the concepts involved
    in writing an interpreter and I know how to define a grammar and use
    lex/yacc to parse it. What I don't know is how to represent the
    abstract syntax tree (AST) of my language in C. In Haskell I used
    what's called an algebraic data type like so:

    data Exp = Lam Id Exp
    | Var Id
    | App Exp Exp
    | Lit Int

    Legend:
    Exp - expression (i.e. one of Lam, Var, App or Lit)
    Lam - lambda abstraction (i.e. a function)
    App - function application
    Lit - integer literal
    Id - variable name (i.e. a string)

    How would I represent such a structure in C, perhaps with a union with
    some sort of tag? I guess I would use a class hierarchy in an
    object-oriented language but I really want to stay in C since I
    consider this an educational experience for me (i.e. the goal is not to
    write an interpreter as quickly as possible but rather learn how to do
    it in an efficient way in C).

    Once I have an AST I can begin applying transformations such as closure
    and CPS (continuation-passing style) conversions to it and I think I
    can achieve that on my own but right now finding a good AST
    representation is holding me back.

    P.S. For those who are interested my plan is to write an all C
    interpreter for a strict functional language and use the Boehm garbage
    collector.

    Thanks in advance,

    Johan Tibell
     
    , Jul 18, 2006
    #1
    1. Advertising

  2. Simon Biber Guest

    wrote:
    > I'm in the process of writing an interpreter for lambda calculus (i.e.
    > a small functional programming language) in C. I've previously written
    > one in Haskell so I understand at least some of the concepts involved
    > in writing an interpreter and I know how to define a grammar and use
    > lex/yacc to parse it. What I don't know is how to represent the
    > abstract syntax tree (AST) of my language in C. In Haskell I used
    > what's called an algebraic data type like so:
    >
    > data Exp = Lam Id Exp
    > | Var Id
    > | App Exp Exp
    > | Lit Int
    >
    > Legend:
    > Exp - expression (i.e. one of Lam, Var, App or Lit)
    > Lam - lambda abstraction (i.e. a function)
    > App - function application
    > Lit - integer literal
    > Id - variable name (i.e. a string)
    >
    > How would I represent such a structure in C, perhaps with a union with
    > some sort of tag? I guess I would use a class hierarchy in an
    > object-oriented language but I really want to stay in C since I
    > consider this an educational experience for me (i.e. the goal is not to
    > write an interpreter as quickly as possible but rather learn how to do
    > it in an efficient way in C).


    Yes, a series of struct within a union would be reasonably efficient.
    Here is one way to lay them out:

    struct Exp;

    struct Lam
    {
    char *id;
    struct Exp *exp;
    };

    struct Var
    {
    char *id;
    };

    struct App
    {
    struct Exp *exp1;
    struct Exp *exp2;
    };

    struct Lit
    {
    int value;
    };

    enum Type
    {
    LAM,
    VAR,
    APP,
    LIT
    };

    struct Exp
    {
    enum Type type;
    union {
    struct Lam *lam;
    struct Var *var;
    struct App *app;
    struct Lit *lit;
    } form;
    };

    For those types of expression that only contain a single data member
    (Var only contains a string and Lit only contains an integer), you may
    choose to place their data directly into the union rather than adding a
    level of indirection.

    --
    Simon.
     
    Simon Biber, Jul 18, 2006
    #2
    1. Advertising

  3. Chris Dollin Guest

    wrote:

    > I'm in the process of writing an interpreter for lambda calculus (i.e.
    > a small functional programming language) in C. I've previously written
    > one in Haskell so I understand at least some of the concepts involved
    > in writing an interpreter and I know how to define a grammar and use
    > lex/yacc to parse it. What I don't know is how to represent the
    > abstract syntax tree (AST) of my language in C. In Haskell I used
    > what's called an algebraic data type like so:
    >
    > data Exp = Lam Id Exp
    > | Var Id
    > | App Exp Exp
    > | Lit Int
    >
    > Legend:
    > Exp - expression (i.e. one of Lam, Var, App or Lit)
    > Lam - lambda abstraction (i.e. a function)
    > App - function application
    > Lit - integer literal
    > Id - variable name (i.e. a string)
    >
    > How would I represent such a structure in C, perhaps with a union with
    > some sort of tag?


    I would (and, for a different language, have) do it in two layers.

    I'd have a bunch of functions expressing the expression structure, eg,

    isLambda(Exp), lambdaId(Exp), lambdaBody(Exp)
    isVar(Exp), varId(Exp)
    isApp(Exp), appFun(Exp), appArg(Exp)
    isLit(Exp), litInt(Exp)

    In the C file that defines those functions, I'd have whatever representation
    I choose. It might be

    struct ExpStruct
    {
    enum { LAM, VAR, APP, LIT } type;
    union
    {
    struct LamStruct lam;
    struct VarStruct var;
    struct AppStruct app;
    struct LitStruct lit; /* or perhaps `int lit` */
    }
    };

    (And this /wouldn't/ be in the public header file, which would have
    the minimum needed to give the function signatures. As a first
    cut, it would say things like:

    typedef struct ExpStruct *Exp;

    While some people argue against typedefing struct pointers in this
    way, it allows you, if necessary, to /completely/ change your
    representation later, eg:

    typedef int Exp;

    and have expressions encoded as integers, which might be indexes
    into an array of expression objects or immediate values or
    whatever. The access-via-functions means that the code doesn't
    care.
    )

    If this representation turned out to be naff, I'd pick a different one,
    but all being well, the rest of the code would be protected from the
    representation change by the functional API.

    If performance of the API turned out to be a problem, I might be able to
    judiciously inline or macroise the API.

    [For the language I did, I had a fiendish representation that played
    not-everso-portable games with pointers-as-integers so that I could
    reduce the space taken up by nodes, since there was some assumption
    that the code would be running in smallish devices with "sane" pointer
    semantics. Don't do that unless you /have/ to, and I bet you won't.]

    --
    Chris "seeker" Dollin
    "Reaching out for mirrors hidden in the web." - Renaissance, /Running Hard/
     
    Chris Dollin, Jul 18, 2006
    #3
  4. Chris Torek Guest

    In article <e9ip53$e3s$>
    Chris Dollin <> wrote:
    > struct ExpStruct
    > {
    > enum { LAM, VAR, APP, LIT } type;
    > union
    > {
    > struct LamStruct lam;
    > struct VarStruct var;
    > struct AppStruct app;
    > struct LitStruct lit; /* or perhaps `int lit` */
    > }
    > };


    Note that the union-member of a "struct ExpStruct" needs a name.
    For instance, the second-to-last line above should perhaps read
    "} un;".

    I occasionally wish that C had un-named union-members of outer
    "struct"s. So I sometimes fake them, using code like this:

    struct exp {
    enum { LAM, VAR, APP, LIT } exp_type;
    union {
    struct lam un_lam;
    struct var un_var;
    struct app un_app;
    struct lit un_lit;
    } exp_un;
    };
    #define exp_lam exp_un.un_lam
    #define exp_var exp_un.un_var
    #define exp_app exp_un.un_app
    #define exp_lit exp_un.un_lit

    Then, given a "struct exp *ep" pointing to a valid "exp", I can
    write:

    switch (ep->exp_type) {
    case LAM:
    do something with ep->exp_lam;
    ...
    case VAR:
    do something with ep->exp_var;
    ...
    ...
    }

    The "#define"s make these expand to ep->exp_un.un_lam, ep->exp_un.un_var,
    and so on.

    Without the intermediate "#define"s (and using the names in Chris
    Dollin's example code, with the small correction), this becomes:

    switch (ep->type) {
    case LAM:
    do something with ep->un.lam;
    ...
    ...
    }

    You have to name the union member, then write the name of the
    union member.

    Some non-C languages (including Plan 9 C) do have a way to insert
    anonymous items that "bubble out" to an outer enclosing struct,
    but neither C89 nor C99 support this.
    --
    In-Real-Life: Chris Torek, Wind River Systems
    Salt Lake City, UT, USA (40°39.22'N, 111°50.29'W) +1 801 277 2603
    email: forget about it http://web.torek.net/torek/index.html
    Reading email is like searching for food in the garbage, thanks to spammers.
     
    Chris Torek, Jul 18, 2006
    #4
  5. jacob navia Guest

    Chris Torek wrote:
    > In article <e9ip53$e3s$>
    > Chris Dollin <> wrote:
    >
    >> struct ExpStruct
    >> {
    >> enum { LAM, VAR, APP, LIT } type;
    >> union
    >> {
    >> struct LamStruct lam;
    >> struct VarStruct var;
    >> struct AppStruct app;
    >> struct LitStruct lit; /* or perhaps `int lit` */
    >> }
    >> };

    >
    >
    > Note that the union-member of a "struct ExpStruct" needs a name.
    > For instance, the second-to-last line above should perhaps read
    > "} un;".
    >
    > I occasionally wish that C had un-named union-members of outer
    > "struct"s. So I sometimes fake them, using code like this:
    >
    > struct exp {
    > enum { LAM, VAR, APP, LIT } exp_type;
    > union {
    > struct lam un_lam;
    > struct var un_var;
    > struct app un_app;
    > struct lit un_lit;
    > } exp_un;
    > };
    > #define exp_lam exp_un.un_lam
    > #define exp_var exp_un.un_var
    > #define exp_app exp_un.un_app
    > #define exp_lit exp_un.un_lit
    >
    > Then, given a "struct exp *ep" pointing to a valid "exp", I can
    > write:
    >
    > switch (ep->exp_type) {
    > case LAM:
    > do something with ep->exp_lam;
    > ...
    > case VAR:
    > do something with ep->exp_var;
    > ...
    > ...
    > }
    >
    > The "#define"s make these expand to ep->exp_un.un_lam, ep->exp_un.un_var,
    > and so on.
    >
    > Without the intermediate "#define"s (and using the names in Chris
    > Dollin's example code, with the small correction), this becomes:
    >
    > switch (ep->type) {
    > case LAM:
    > do something with ep->un.lam;
    > ...
    > ...
    > }
    >
    > You have to name the union member, then write the name of the
    > union member.
    >
    > Some non-C languages (including Plan 9 C) do have a way to insert
    > anonymous items that "bubble out" to an outer enclosing struct,
    > but neither C89 nor C99 support this.


    lcc-win32 supports this.
     
    jacob navia, Jul 18, 2006
    #5
  6. Chris Dollin Guest

    Chris Torek wrote:

    > In article <e9ip53$e3s$>
    > Chris Dollin <> wrote:
    >> struct ExpStruct
    >> {
    >> enum { LAM, VAR, APP, LIT } type;
    >> union
    >> {
    >> struct LamStruct lam;
    >> struct VarStruct var;
    >> struct AppStruct app;
    >> struct LitStruct lit; /* or perhaps `int lit` */
    >> }
    >> };

    >
    > Note that the union-member of a "struct ExpStruct" needs a name.
    > For instance, the second-to-last line above should perhaps read
    > "} un;".


    ARGH. Thanks, Chris. Sorry, Johan.

    It was a typo. Well, a braino.

    --
    Chris "Brain-O, Bray-ay-ay-ayn-O, daylight come & ..." Dollin
    "Who are you? What do you want?" /Babylon 5/
     
    Chris Dollin, Jul 19, 2006
    #6
  7. Richard Bos Guest

    jacob navia <> wrote:

    > Chris Torek wrote:


    [ SNIP! over 60 lines ]

    > > Some non-C languages (including Plan 9 C) do have a way to insert
    > > anonymous items that "bubble out" to an outer enclosing struct,
    > > but neither C89 nor C99 support this.

    >
    > <spam> supports this.


    You just quoted _seventy lines_ just to add a oneliner that says that
    <spammy spammy spam> is non-conforming (which, btw, we already know all
    too well)? Grief, have you so few customers that you need such tactics?

    Richard
     
    Richard Bos, Jul 19, 2006
    #7
  8. jacob navia said:

    > Chris Torek wrote:

    <snip>
    >>
    >> Some non-C languages (including Plan 9 C) do have a way to insert
    >> anonymous items that "bubble out" to an outer enclosing struct,
    >> but neither C89 nor C99 support this.

    >
    > lcc-win32 supports this.


    That might be relevant in an lcc newsgroup, but what matters in comp.lang.c
    is that the construct is non-standard and non-portable, and should thus be
    avoided by those who need to write portable code.

    Please stop confusing "Navia's implementation" with "the real world of
    portable programming".

    --
    Richard Heathfield
    "Usenet is a strange place" - dmr 29/7/1999
    http://www.cpax.org.uk
    email: rjh at above domain (but drop the www, obviously)
     
    Richard Heathfield, Jul 19, 2006
    #8
  9. Johan Tibell Guest

    Simon Biber wrote:
    > Yes, a series of struct within a union would be reasonably efficient.
    > Here is one way to lay them out:
    >
    > struct Exp;
    >
    > struct Lam
    > {
    > char *id;
    > struct Exp *exp;
    > };
    >
    > struct Var
    > {
    > char *id;
    > };
    >
    > struct App
    > {
    > struct Exp *exp1;
    > struct Exp *exp2;
    > };
    >
    > struct Lit
    > {
    > int value;
    > };
    >
    > enum Type
    > {
    > LAM,
    > VAR,
    > APP,
    > LIT
    > };
    >
    > struct Exp
    > {
    > enum Type type;
    > union {
    > struct Lam *lam;
    > struct Var *var;
    > struct App *app;
    > struct Lit *lit;
    > } form;
    > };
    >
    > For those types of expression that only contain a single data member
    > (Var only contains a string and Lit only contains an integer), you may
    > choose to place their data directly into the union rather than adding a
    > level of indirection.
    >
    > --
    > Simon.


    When dynamically allocating memory for such a struct using malloc will
    sizeof(Exp) be enough to get sufficient memory allocated for the
    biggest union member? Will this work:

    struct Exp *exp;
    exp = malloc(sizeof(struct Exp));
    if (!exp)
    fprintf(stderr, "out of memory\n");

    exp->form.lam = malloc(sizeof(struct Lam));
    if (!exp->form.lam)
    fprintf(stderr, "out of memory\n");

    (Also, is it good style?)
     
    Johan Tibell, Jul 20, 2006
    #9
  10. "Johan Tibell" <> writes:
    > Simon Biber wrote:

    [...]
    >> struct Exp
    >> {
    >> enum Type type;
    >> union {
    >> struct Lam *lam;
    >> struct Var *var;
    >> struct App *app;
    >> struct Lit *lit;
    >> } form;
    >> };

    [...]
    >
    > When dynamically allocating memory for such a struct using malloc will
    > sizeof(Exp) be enough to get sufficient memory allocated for the
    > biggest union member?


    Yes. sizeof for a union yields at least the size of its largest
    member. (It's typically the size of its largest member rounded up to
    the alignment of its most strictly aligned member.)

    > Will this work:
    >
    > struct Exp *exp;
    > exp = malloc(sizeof(struct Exp));
    > if (!exp)
    > fprintf(stderr, "out of memory\n");
    >
    > exp->form.lam = malloc(sizeof(struct Lam));
    > if (!exp->form.lam)
    > fprintf(stderr, "out of memory\n");
    >
    > (Also, is it good style?)


    Yes, and mostly.

    Rather than
    exp = malloc(sizeof(struct Exp));
    ...
    exp->form.lam = malloc(sizeof(struct Lam));
    I'd write:
    exp = malloc(sizeof *exp);
    ...
    exp->form.lam = malloc(sizeof *(exp->form.lam));

    (The parentheses on the second malloc aren't strictly necessary, but I
    find it easier to read with them.)

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
    We must do something. This is something. Therefore, we must do this.
     
    Keith Thompson, Jul 20, 2006
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. DaKoadMunky
    Replies:
    4
    Views:
    552
    Lee Weiner
    Apr 20, 2004
  2. Stub

    B tree, B+ tree and B* tree

    Stub, Nov 12, 2003, in forum: C Programming
    Replies:
    3
    Views:
    10,126
  3. Melkor Ainur
    Replies:
    6
    Views:
    441
    Uli Kusterer
    Jan 2, 2004
  4. Lonnie Princehouse

    Abstract Syntax Tree conundrum

    Lonnie Princehouse, Jan 2, 2004, in forum: Python
    Replies:
    2
    Views:
    300
    Lonnie Princehouse
    Jan 4, 2004
  5. Saeed Amrollahi
    Replies:
    1
    Views:
    319
    Alf P. Steinbach
    Nov 20, 2007
Loading...

Share This Page