[SUMMARY] Bytecode Compiler (#100)

Discussion in 'Ruby' started by Ruby Quiz, Nov 9, 2006.

  1. Ruby Quiz

    Ruby Quiz Guest

    I'm very glad this quiz came along, because I honestly had no idea what
    "bytecode" meant before I read this quiz. I think it was good for dumbing the
    idea down for the masses (or just me, if everyone else already knew this stuff).
    Not all of us port entire virtual machines, as the quiz creator does.

    There were a couple of approaches to this week's quiz. One strategy was to
    convert the expression into valid Ruby that would emit the bytecode and let Ruby
    do the parsing. This works for this parsing problem because Ruby's expression
    rules are comparable to those specified by the quiz. I think we can break down
    one of those solutions to better understand what was involved with this problem.

    Here's a chunk of code from Cameron Pope's solution:

    module CreateExpressions
    def +(other) Expr.new:)add, self, other) end
    def -(other) Expr.new:)sub, self, other) end
    def *(other) Expr.new:)mul, self, other) end
    def /(other) Expr.new:)div, self, other) end
    def %(other) Expr.new:)mod, self, other) end
    def **(other) Expr.new:)pow, self, other) end
    end

    You can see a set of operator definitions here, matching those needed by the
    quiz. Instead of these methods doing what they normally do (math), the versions
    we see here just create and return an Expr object. Let's have a look at that
    definition now:

    class Expr
    include CreateExpressions
    OPCODES = {:add => 0x0a, :sub => 0x0b, :mul => 0x0c, :pow => 0x0d,
    :div => 0x0e, :mod => 0x0f}

    def initialize(op, a, b)
    @op = op
    @first = a
    @second = b
    end

    def to_s
    "(#{@op.to_s} #{@first.to_s} #{@second.to_s})"
    end

    def emit
    @first.emit << @second.emit << OPCODES[@op]
    end
    end

    Ignoring the to_s() method, which is just for debugging, this object is trivial.
    It stores an operator and operands, and can emit() bytecode. To do that, it
    forwards the emit() call to both operands, which may be nested Expr objects (say
    from parenthetical expressions) or something else we will discuss in just a
    moment. emit() also looks up the operator's bytecode and appends that to the
    results of the operand emit()s.

    Note that this class include()s all of the operator definitions we saw earlier.
    This means you can add, multiply, etc. Expr objects, which yields a new Expr
    with the original Epxrs nested in the operands.

    That covers expressions, but what about plain old numbers? For that, there is
    another class:

    class Const
    include CreateExpressions
    OPCODES = {2 => 0x01, 4 => 0x02}

    def initialize(i)
    @value = i
    end

    def to_s
    @value
    end

    def emit
    case @value
    when (-32768..32767): bytes = [@value].pack("n").unpack("C*")
    else bytes = [@value].pack("N").unpack("C*")
    end
    bytes.insert 0, OPCODES[bytes.size]
    end
    end

    We have pretty much the same pattern here, save that emit() converts the number
    to the bytecode for the properly sized constant plus the packed bytes. Again we
    see the arithmetic operators include()ed, so these too can be combined with
    other Const and Expr objects.

    Now we need a means to turn the normal numbers of the expression into Const
    objects:

    class Fixnum
    def to_const
    Const.new(self)
    end
    end

    Yep, that will do just that.

    All that's left is to put it to use:

    class Compiler
    def self.compile(expr)
    self.mangle(expr).emit.flatten
    end

    def self.explain(expr)
    self.mangle(expr).to_s
    end

    private
    def self.mangle(expr)
    eval(expr.gsub(/\d+/) {|s| "#{s}.to_const()"})
    end
    end

    The mangle() method is the heart of this solution. A simple Regexp is used to
    tack a to_const() call on to each number of the expression and a hand-off is
    made to eval() to build up the indicated combination of Const and Expr objects.
    Once mangle()d, the result is emit()ted and flatten()ed into the final bytecode
    Array in compile().

    That's easy enough to understand and it works just fine in this instance.
    However, what if the rules differed from Ruby's and you couldn't lean on Ruby's
    parser? In that case, you would have to roll your own parser, which some
    decided to do.

    Luckily there is already a popular algorithm for unrolling infix arithmetic
    expressions into the RPN order required by bytecode spec:

    http://en.wikipedia.org/wiki/Shunting_yard_algorithm

    Several people employed this strategy, including Daniel Martin:

    # This is a solution to Ruby Quiz #100
    #
    # It's basically just a shunting algorithm, but with a twist
    # since it needs to distinguish between a "-" that's part of
    # a number and a "-" that's an operator. To do that, I use
    # a state machine while parsing to remember if I need next
    # an operator or an integer.

    require 'strscan'
    class Compiler
    # A small class made so that I can use case ... when
    # with a StringScanner
    class Token < Regexp
    def initialize(re)
    super(re)
    end
    # Using is_a? instead of respond_to? isn't very duck-typey,
    # but unfortunately String#scan and StringScanner#scan mean
    # completely different things.
    def ===(s)
    if (s.is_a?(StringScanner))
    s.scan(self)
    else
    super(s)
    end
    end
    end

    # ...

    This first class of Daniel's solution is really just a Regexp with a custom case
    equals method. This sets up an elegant syntax for the solution proper we will
    encounter in a bit.

    # ...

    # The tokens I need
    WSPACE = Token.new(/\s+/)
    LPAREN = Token.new(/\(/)
    RPAREN = Token.new(/\)/)
    OP = Token.new(/\*\*|[+*%\/-]/)
    NEG = Token.new(/-/)
    INT = Token.new(/\d+/)

    OpValMap = {'+' => 0x0a, '-' => 0x0b, '*' => 0x0c,
    '**' => 0x0d, '/' => 0x0e, '%' => 0x0f}

    # ...

    Here you see the class used to build the Tokens for lexxing the expression
    content. These are very basic regular expressions.

    Here are the interface methods required by the quiz solution:

    # ...

    def initialize(instring)
    @scanner = StringScanner.new(instring)
    @opstack = Array.new
    @outarr = Array.new
    end

    def compile()
    state = :state_int
    while state != :state_end
    case @scanner
    when WSPACE
    next
    else
    state = send(state)
    raise "Syntax error at index #{@scanner.pos}" if ! state
    end
    end
    while ! @opstack.empty?
    op = @opstack.pop
    raise "Mismatched parens" if LPAREN === op
    @outarr << OpValMap[op]
    end
    @outarr
    end

    # Class method as required by the test harness
    def self.compile(instring)
    new(instring).compile
    end

    # ...

    Notice that initialize() sets up the stack and output Arrays needed by the
    Shunting Yard algorithm. The compile() method wraps a state machine we will
    examine the workings of shortly. You can see that is discards whitespace as
    needed, forwards to state methods, pop()s the final operators as the algorithm
    requires, and returns the resulting output Array. The class compile() is just a
    wrapper over the other two methods.

    The heart of what is going on here is hidden in the state methods. From
    compile() we saw that the initial state for the machine is :state_int, so let's
    begin with that method:

    # ...

    private

    # ...

    # state where we're expecting an integer or left paren
    def state_int
    case @scanner
    when LPAREN
    @opstack << @scanner.matched
    :state_int
    when INT
    integer(@scanner.matched.to_i)
    :state_op
    when NEG
    :state_neg
    end
    end

    # ...

    Here we begin to see the magic of the Token class. Matching a Token advances
    the StringScanner index and matched() can then be used to grab the element.

    The LPAREN when clause is right out of the algorithm description, and the INT
    clause is pretty obviously if you know that integer() handles the constant
    conversion. The NEG clause is needed to distinguish an unary minus from the
    binary operator. Note that each case returns the next expected state for the
    machine.

    Here's the integer() helper we have seen before:

    # ...

    # Handle an integer
    def integer(i)
    if (i <= 32767 and i >= -32768)
    @outarr << 0x01
    @outarr.push(*(.pack("n").unpack("C*")))
    else
    @outarr << 0x02
    @outarr.push(*(.pack("N").unpack("C*")))
    end
    end

    # ...

    The only difference here is that constants are pushed onto the output Array as
    required by the algorithm.

    Let's return to the state methods:

    # ...

    # Expecting an operator or right paren
    def state_op
    case @scanner
    when RPAREN
    while not LPAREN === @opstack[-1]
    raise "Mismatched parens" if @opstack.empty?
    @outarr << OpValMap[@opstack.pop]
    end
    @opstack.pop
    :state_op
    when OP
    op = @scanner.matched
    while is_lower(@opstack[-1], op)
    @outarr << OpValMap[@opstack.pop]
    end
    @opstack << op
    :state_int
    else
    # I would handle this with an EOS token, but unfortunately
    # StringScanner is broken w.r.t. @scanner.scan(/$/)
    :state_end if @scanner.eos?
    end
    end

    # ...

    Again, the RPAREN clause is right out of the algorithm description. The OP
    clause is as well and you can see that it handles the precedence check via the
    is_lower() helper method. The else clause gives us our exit state when the
    expression has been exhausted.

    Here's the is_lower() helper:

    # ...

    # Define the precedence order
    # One thing to note is that for an operator a,
    # is_lower(a,a) being true will make that operator
    # left-associative, while is_lower(a,a) being false
    # makes that operator right-associative. Note that
    # we want ** to be right associative, but all other
    # operators to be left associative.
    def is_lower(op_on_stack, op_in_hand)
    case op_on_stack
    when nil, LPAREN; false
    when /\*\*|[*\/%]/; op_in_hand =~ /^.$/
    when /[+-]/; op_in_hand =~ /[+-]/
    end
    end

    # ...

    The comment surely explains this better than I can, but the point of this method
    is to resolve whether or not the operator we just matched is lower in precedence
    than the operator on the stack. For example, in the last line, when we have a
    plus or minus on the stack only another plus or minus would trigger the true
    result.

    Here's the final state:

    # ...

    # The state where we've seen a minus and are expecting
    # the rest of the integer
    def state_neg
    case @scanner
    when INT
    integer(-(@scanner.matched.to_i))
    :state_op
    end
    end

    # ...
    end

    This just reads the constant following a negation operator and ensures that it
    is negated.

    Those are two different approaches that pass the quiz tests. It's probably
    slightly easier to lean on Ruby in cases like this where you know you can get
    away with it. If you can't though, the custom parser isn't too much more work
    as Daniel shows.

    My thanks to all who taught me many things I didn't know about bytecode
    compilation through their solutions and to Ross Bamford for the educational
    quiz.

    Tomorrow we will play with VCRs and I'm told that dates me...
    Ruby Quiz, Nov 9, 2006
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Andreas Klemt
    Replies:
    2
    Views:
    411
    Marina
    Jul 28, 2003
  2. Not4u
    Replies:
    9
    Views:
    1,042
    Not4u
    Feb 27, 2004
  3. Ruby Quiz

    [QUIZ] Bytecode Compiler (#100)

    Ruby Quiz, Nov 3, 2006, in forum: Ruby
    Replies:
    19
    Views:
    426
    James Edward Gray II
    Nov 9, 2006
  4. fred
    Replies:
    3
    Views:
    278
    Zifud
    Mar 17, 2005
  5. Replies:
    5
    Views:
    884
Loading...

Share This Page