The design of LLVM

BGB

Please don't shoot me for posting something that's not 100% on-topic, but
I found this article on the design of LLVM to be interesting:

http://www.drdobbs.com/architecture-and-design/240001128

there are good points and bad points to LLVM's design.

as mentioned in the article, at least it is not some monolithic
executable. I am less fond of LLVM in that it drifts "further than
desired" from a more traditional compiler architecture (namely, in that
LLVM IR does pretty much everything).


in my case (for my VM project), I have more of a hybrid strategy:
most components are written as libraries, and modularity is stressed;
however, a lot more of the "traditional" stages are represented (for
example, there is an "assembler" library), and the process does more
clearly follow a stage-based model (though which stages are involved
depends a lot more on which parts are used).

( note that my project is far less mature than LLVM, and I am the sole
developer and probably also the sole user, ... this is more for "general
information", and not meant to imply that anyone should use it at
present vs LLVM. )


but, for example, the "assembler" takes textual ASM and currently
produces COFF objects (in memory buffers); the linker then takes these
COFF objects and links them into the running process (there is
currently no in-project static linker).

(there is partial ELF support, but it wasn't a huge priority, and isn't
currently used).

although there is a binary interface for the assembler, for the most
part it is used via its textual interface, where ASM code is fed in as
text and assembled from there. the assembler also optionally supports a
macro preprocessor. the syntax is generally similar to that used by NASM.
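
for a rough idea of what using the textual interface could look like,
here is a sketch in C. the basm_* names and signatures below are made
up for illustration (not the project's actual API), and the ASM
fragment assumes the Win64 calling convention:

#include <stdio.h>
#include <stddef.h>

/* hypothetical interface: assemble NASM-like text into an in-memory
 * COFF object, link it into the process, and look up a symbol */
void *basm_assemble_text(const char *asm_text, size_t *size_out);
int   basm_link_object(void *coff_buf, size_t size);
void *basm_get_symbol(const char *name);

int main(void)
{
    /* Win64: the first two integer args arrive in ecx and edx */
    const char *src =
        "add3:\n"
        "mov eax, ecx\n"
        "add eax, edx\n"
        "add eax, 3\n"
        "ret\n";

    size_t size;
    void *obj = basm_assemble_text(src, &size);
    if (!obj || basm_link_object(obj, size) < 0)
        return 1;

    int (*add3)(int, int) = (int (*)(int, int))basm_get_symbol("add3");
    printf("%d\n", add3(4, 5));   /* prints 12 */
    return 0;
}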


note that beyond very high-level similarities (between my project and
LLVM), what is done and how it is done differ considerably (I had
looked at LLVM before, but both the architecture and the goals turned
out to be quite different).

for example, high performance wasn't really a major goal, so much as
keeping things fairly dynamic (and having fairly fast
loading-from-source and eval). staying dynamic and having fast
recompile times have generally been a higher priority than raw
execution speed.

likewise, many parts of the back-end compilation process are done
incrementally.


note that the primary use case at present is dynamically
loaded/compiled script languages; in particular, my own scripting
language, which uses a syntax similar to ActionScript3 and largely
implements the ECMA-262 spec, albeit with a number of semantic
differences from ECMA-262 (it also borrows some features from Java,
C#, C, and C++).


typically, I don't care as much about static compilation, as both GCC
and MSVC do this fairly well (at present, pretty much all native C and
C++ code is compiled by general-purpose static compilers).

the script language doesn't "replace" C and C++; rather, there are
things that native languages do well and things they don't (dynamically
loading code from a textual source-code format being a big one...).


the assembler is multi-target, currently using mostly shared code
(between the x86, x86-64, ARM, and Thumb targets), with different
listing tables being used (x86 and x86-64 use a shared listing,
whereas ARM and Thumb each use their own ISA listings). note that
things like the register set are also specified in the listings.

note that the listings are text files which are then converted into C
code via a specialized tool.
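
the listing format itself isn't shown here, but the general technique
(text listings compiled into C tables by a tool) might look vaguely
like this; the field layout and the listing line in the comment are
made up:

#include <string.h>
#include <stddef.h>

/* listing line (hypothetical):   add  rm32,reg32   01 /r   */
typedef struct {
    const char *mnemonic;   /* instruction name */
    const char *operands;   /* operand pattern */
    const char *encoding;   /* opcode bytes / encoding rules */
} OpcodeEntry;

/* what the tool's generated C output could look like */
static const OpcodeEntry x86_ops[] = {
    { "add", "rm32,reg32",  "01 /r"   },
    { "add", "reg32,rm32",  "03 /r"   },
    { "mov", "reg32,imm32", "B8+r id" },
    { "ret", "",            "C3"      },
};

const OpcodeEntry *find_op(const char *mnem, const char *ops)
{
    for (size_t i = 0; i < sizeof(x86_ops)/sizeof(x86_ops[0]); i++)
        if (!strcmp(x86_ops[i].mnemonic, mnem) &&
            !strcmp(x86_ops[i].operands, ops))
            return &x86_ops[i];
    return NULL;
}

part of the appeal of table-driven assemblers is that adding a target
becomes mostly a matter of writing a new listing rather than new
assembler code.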

extensions can be added to the assembler and linker mostly by
registering callbacks (hell, it works). an example of this is
registering handlers for link-time code generation tasks and similar.
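
a minimal sketch of the callback idea (the names and signatures are
hypothetical, not the project's actual interface):

#include <stddef.h>

/* handler tries to resolve a symbol at link time; returns the
 * address to bind, or NULL to let the next handler try */
typedef void *(*LinkHandler)(const char *symbol, void *userdata);

#define MAX_HANDLERS 64
static struct { LinkHandler fn; void *ud; } handlers[MAX_HANDLERS];
static int num_handlers;

void link_register_handler(LinkHandler fn, void *ud)
{
    if (num_handlers < MAX_HANDLERS) {
        handlers[num_handlers].fn = fn;
        handlers[num_handlers].ud = ud;
        num_handlers++;
    }
}

/* linker side: consult the handlers for an unresolved symbol */
void *link_resolve_via_handlers(const char *symbol)
{
    for (int i = 0; i < num_handlers; i++) {
        void *p = handlers[i].fn(symbol, handlers[i].ud);
        if (p) return p;
    }
    return NULL;
}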

the design could be a little cleaner in some ways, but for now it works.



a current inconvenience though is that (for historical reasons) there
is a split between two different systems for representing ASTs:
one is a system originally based on XML and DOM, where ASTs are
currently specified using a specialized form of DOM-like nodes;
the other AST system is loosely descended from Scheme (and built from
lists of cons-cells).

this issue currently creates some compatibility problems between the
parts (parts using S-Exps and parts using XML can't be used together).

sadly, neither is "clearly better", as both have merits and drawbacks:
XML is more flexible, but tends to eat a lot more memory;
S-Expressions use less memory and can perform faster in many cases, and
are usually easier to work with, but are much less flexible (it is not
usually possible to extend or tweak an S-Expression based structure
without having to also alter any code which depends on it, which is a
problem).
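
to make the tradeoff a little more concrete, here is roughly what the
two representations look like as C structs (illustrative layouts, not
my actual ones):

/* S-Exp side: a cons cell is just two tagged pointers; compact,
 * but consumers depend on the positional layout of each form */
typedef struct Cons {
    struct Cons *car, *cdr;
} Cons;

/* DOM-like side: tag name, named attributes, and child links;
 * new attributes can be added without breaking old consumers,
 * but each node is much heavier */
typedef struct Attr {
    char *name, *value;
    struct Attr *next;
} Attr;

typedef struct Node {
    char *tag;                 /* element name, e.g. "if", "while" */
    Attr *attrs;               /* named attributes */
    struct Node *first_child;
    struct Node *next_sibling;
} Node;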


another problem with XML is that it is unreasonably verbose in textual
form, but I have devised an alternate syntax to help with this problem,
which I am calling "ZEML".

for example:
<foo x="3">
    <bar name="y">
        <baz/>
    </bar>
    <bar name="z">
        <baz/>
        <baz/>
    </bar>
</foo>

could be written in ZEML as:
<foo x=3 <bar name="y" <baz>> <bar name="z" <baz> <baz>>>

both may map to the same representation internally (so either
printer/parser may be used with the ASTs), albeit ZEML does
differentiate between strings and numbers. the syntax imposes a number
of other differences from XML: there is no free-floating text,
C-style escapes are used, and the character/token rules differ.

there is also a binary serialization (for both XML and ZEML), itself
vaguely similar to WBXML (a byte-based serialized version of the DOM
tree), but it supports a few more features and is a little more compact.
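
the general idea of a WBXML-style serialization is just a pre-order
walk emitting tag/attribute/end markers; the tag values below are made
up (the real format differs):

#include <stdio.h>
#include <string.h>

enum { T_END = 0x00, T_ELEM = 0x01, T_ATTR = 0x02, T_INT = 0x03 };

static void emit_byte(FILE *f, int b) { fputc(b, f); }
static void emit_str(FILE *f, const char *s)
    { fwrite(s, 1, strlen(s) + 1, f); }   /* NUL-terminated */

/* serialize the equivalent of: <foo x=3 <baz>> */
static void emit_example(FILE *f)
{
    emit_byte(f, T_ELEM); emit_str(f, "foo");
    emit_byte(f, T_ATTR); emit_str(f, "x");
    emit_byte(f, T_INT);  emit_byte(f, 3);   /* small-int payload */
    emit_byte(f, T_ELEM); emit_str(f, "baz");
    emit_byte(f, T_END);                     /* end of baz */
    emit_byte(f, T_END);                     /* end of foo */
}

int main(void) { emit_example(stdout); return 0; }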


there is also a bytecode format, where typically ASTs are compiled to
bytecode before being fed to the back-ends, such as an interpreter or JIT.

currently, the interpreter backend converts from bytecode into
"indirect threaded code", which is what is actually executed (threaded
code is simpler and more portable than a JIT, albeit considerably
slower, but is faster and more flexible than a "giant-switch" based
interpreter loop).
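
for reference, here is a tiny portable approximation of threaded code
in plain C: each handler returns the next operation, so dispatch is a
pointer chase rather than a big switch (real implementations often use
computed gotos or an extra level of indirection, and the actual VM's
structures differ):

#include <stdio.h>

typedef struct VMContext VMContext;
typedef struct Op Op;
/* each handler executes one operation and returns the next op */
typedef Op *(*OpFn)(VMContext *ctx, Op *op);

struct Op {
    OpFn fn;       /* handler to call */
    int  imm;      /* immediate operand, if any */
    Op  *next;     /* next op in the translated code */
};

struct VMContext {
    int stack[256];
    int sp;
};

static Op *op_pushi(VMContext *c, Op *op)
{ c->stack[c->sp++] = op->imm; return op->next; }

static Op *op_add(VMContext *c, Op *op)
{ c->sp--; c->stack[c->sp-1] += c->stack[c->sp]; return op->next; }

static Op *op_end(VMContext *c, Op *op)
{ (void)c; (void)op; return NULL; }

int main(void)
{
    /* "push 2; push 3; add" translated into linked handler records */
    Op end   = { op_end,   0, NULL };
    Op add   = { op_add,   0, &end };
    Op push3 = { op_pushi, 3, &add };
    Op push2 = { op_pushi, 2, &push3 };

    VMContext ctx = { {0}, 0 };
    /* dispatch loop: each handler hands back the next op directly */
    for (Op *op = &push2; op; )
        op = op->fn(&ctx, op);

    printf("%d\n", ctx.stack[0]);   /* prints 5 */
    return 0;
}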

my JITs have tended to compile bytecode into more traditional functions,
often using a more conventional ABI (such as cdecl, Win64, or
SysV/AMD64). in other cases, a specialized ABI has been used internally
(and any cross-language calls use thunks).

( sadly, I don't presently have a working JIT, and getting the JIT back
into working order hasn't really been a high priority. )


note that my bytecode formats are (like the JVM, .NET, and AVM2)
based on a high-level stack-machine architecture.

currently, the VM uses 5 stacks (roughly sketched in C below):
the value stack (used for intermediate values, passing arguments, ...);
the mark stack (used to mark the start of argument lists);
the control-flow stack (used for function/method calls/returns);
the local/lexical environment stack (used for both local and lexical
variables; may be either a stack or a list depending on whether or not
the function uses environment capture);
the dynamic environment stack (used for dynamic environment bindings).
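
roughly, as a C struct (field names and types are made up, just to
make the shape concrete):

typedef struct Value Value;       /* tagged/dynamic value */
typedef struct Frame Frame;       /* call/return state record */
typedef struct Env   Env;         /* local/lexical binding frame */
typedef struct DynBind DynBind;   /* dynamic-binding save record */

typedef struct VM {
    Value   *valstack;    /* intermediate values, call arguments */
    int      vsp;
    int     *markstack;   /* saved vsp values marking argument lists */
    int      msp;
    Frame   *ctlstack;    /* function/method calls and returns */
    int      csp;
    Env     *envstack;    /* locals/lexicals; spilled to a heap list
                             when the function captures its environment */
    DynBind *dynstack;    /* dynamic environment bindings */
    int      dsp;
} VM;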

currently, most opcodes indicate type via prefixes:
no prefix indicates dynamically-typed instructions;
a type-prefix indicates the use of a static type.

note: it is not safe to mix static and dynamic types on data values or
variables, since the types may use different physical representations,
hence requiring explicit type-conversion operations (to avoid crashing
or breaking the VM).
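
the representation issue can be sketched like so (illustrative
layouts, not the VM's actual ones):

#include <stdint.h>

/* dynamically-typed slot: tag plus payload */
typedef struct {
    int tag;                  /* e.g. TAG_FIXNUM, TAG_FLONUM, TAG_OBJ */
    union { int64_t i; double d; void *obj; } u;
} DynValue;

/* statically-typed slot: a raw untagged machine value */
typedef int64_t StaticInt;

/* a dynamic consumer can't safely read a raw int64 as a DynValue,
 * hence explicit conversion ops at the boundary */
DynValue int_to_dyn(StaticInt v)
{
    DynValue dv;
    dv.tag = 1;               /* hypothetical TAG_FIXNUM */
    dv.u.i = v;
    return dv;
}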


I personally find stack machines a bit easier to generate code for than
something like TAC or SSA, since in the case of a stack machine,
figuring out how to map things to temporaries or registers, and how to
deal with interactions between variable state and control flow, can be
left to the JIT back-end or similar.
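
as a toy example: emitting stack code from an expression tree is just
a post-order walk, with no temporaries or registers to assign
(mnemonics made up):

#include <stdio.h>

typedef struct Expr {
    char op;                    /* '+', '*', or 0 for a leaf */
    int  value;                 /* leaf constant */
    struct Expr *lhs, *rhs;
} Expr;

static void emit(Expr *e)
{
    if (!e->op) {               /* leaf: just push it */
        printf("push %d\n", e->value);
        return;
    }
    emit(e->lhs);               /* operands land on the stack... */
    emit(e->rhs);
    printf(e->op == '+' ? "add\n" : "mul\n");  /* ...op consumes them */
}

int main(void)
{
    /* 2 + 3*4  =>  push 2; push 3; push 4; mul; add */
    Expr n2 = {0, 2, 0, 0}, n3 = {0, 3, 0, 0}, n4 = {0, 4, 0, 0};
    Expr mul = {'*', 0, &n3, &n4};
    Expr add = {'+', 0, &n2, &mul};
    emit(&add);
    return 0;
}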

I have started moving a lot more type-handling into the front-end
though; previously, for the interpreter, I wasn't really bothering
with this (type-flow analysis is a pain, and is itself a fairly
large/complex part of the JIT).

it was a bit easier to start adding a bit more of this logic to the
front-end compiler, during the AST->Bytecode compilation stage. it
also gave a better opportunity to emit type-check error messages.

note that this would still leave things like assigning temporaries and
register allocation to the JIT backend (as well as probably performing
more in-depth type analysis), but the interpreter does not need to
bother with any of this (and may still benefit from statically known types).

note that type-checking works a little differently in ECMAScript
style languages (JavaScript, ...) than it does in C: very often types
will not be tagged explicitly, and a variable may hold different types
depending on the execution path, requiring the use of type-inference
algorithms.
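
for example, given something like "var x; if(cond) x=1; else x=1.5;",
the types from the two paths have to be merged; a toy sketch of the
join operation such an algorithm needs (the lattice here is made up):

/* possible inferred types for a variable at a merge point */
typedef enum {
    TY_UNKNOWN,     /* not yet inferred */
    TY_INT,
    TY_DOUBLE,
    TY_STRING,
    TY_VARIANT      /* top: must fall back to dynamic typing */
} InfType;

InfType type_join(InfType a, InfType b)
{
    if (a == TY_UNKNOWN) return b;
    if (b == TY_UNKNOWN) return a;
    if (a == b) return a;
    /* int and double can widen to double; anything else goes dynamic */
    if ((a == TY_INT && b == TY_DOUBLE) ||
        (a == TY_DOUBLE && b == TY_INT))
        return TY_DOUBLE;
    return TY_VARIANT;
}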


or such...
 
