[SUMMARY] Learning Tic-Tac-Toe (#11)

Discussion in 'Ruby' started by Ruby Quiz, Dec 16, 2004.

  1. Ruby Quiz

    Ruby Quiz Guest

    In an interesting contrast, this quiz generated a lot of good discussion, but
    only two solutions. I believe that may be because the problem turned out to be
    more complicated than I intended. I know I personally ran into a few
    complications and didn't have a chance to finish my own solution.

    Again, the discussion was excellent and you should probably skim the thread, if
    you weren't following it this weekend. Multiple Tic-Tac-Toe servers (written in
    Ruby, of course) were posted, so programs could play against each other

    I posted a message about how Tic-Tac-Toe positions can be transformed by
    rotation and "mirroring" to other seemly different layouts that can be handled
    the same.

    I also posted a "tictactoe.rb" library that makes most interaction with the game
    trivially simple.

    Finally, a few of us posted some notes about the problems we ran into. I agree
    with Hans Fugal, who said that you can learn almost as much from those.

    The two solutions posted are similar. Basically, they learn to avoid their
    mistakes over time. They accomplish this by "scoring" the moves they made at
    each position in a game, based on whether they won or lost. Eventually, this
    knowledge allows them to select mainly strong moves, simply by remembering how
    they've done in the past, in the same position.

    I'll show Brian Schroeder's code here, but both were interesting to examine.
    Brian's solution contained eight files of Ruby code, the embedded documentation
    for said files, charts, a write-up of the process, and was all wrapped up in a
    handy web page, so you can dig as deep as you like into what he's done. (And he
    once asked where *I* find the time!) For the purposes of this summary, I'll
    stick to his learning code.

    Here it is:

    class Learning < BasicInterface
    attr_accessor :random_prob
    attr_reader :player

    def initialize
    @state_values = Hash.new(0)
    @state_transitions = {}
    @random_prob = 0.05

    def new_game(player)
    @player = player
    @states_visited = []

    def choose_move(game)
    moves = game.moves
    if [email protected]_transitions[game.state_id] or rand < random_prob
    move = moves[rand(moves.length)]
    move_id = @state_transitions[game.state_id].max{ |(ma,sa),(mb,sb)|
    @state_values[sa] <=> @state_values[sb]
    move = moves.select{|m| m.move_id == move_id}[0]

    def inform_of_move(before, after, move)
    @states_visited << before.state_id << after.state_id
    (@state_transitions[before.state_id] ||= {})[move.move_id] =

    if after.final?
    winner = after.winner
    if winner
    value = winner == self.player ? 100.0 : -1000.0
    value = 0.0

    factor = 1.0
    while state = @states_visited.pop
    @state_values[state] = (1.0 - factor) * @state_values[state] +
    factor * value
    factor *= 0.5

    The initialize() method sets up Brian's @state_values and @state_transitions,
    which constitute the AI's brain.

    @state_values will hold scores for the positions the AI has won or lost with
    before. @state_transitions holds a "map" of how to get from position to
    position. When these are filled in, the AI will have "learned" what positions
    are desirable and how it can reach them.

    Knowing this, choose_move() is easy to breakdown. It checks to see if it knows
    anything about the moves from the current position. If it does, it selects the
    highest score it can find for itself (else branch). If it doesn't, it goes with
    a random choice from all available moves (if branch).

    The final piece of the puzzle is inform_of_move(). This method remembers all
    moves made in the game, mainly. When it sees a final position, it scores all
    those moves based on whether it won or lost. The scoring scale is slanted, to
    encourage the AI to avoid losses.

    That's the heart of Brian's solution. The rest of the code is interface,
    client/server, a perfect minimax player, and Tic-Tac-Toe details.

    For the curious, this quiz was inspired by the research of Donald Michie. In
    1961 he built a "machine" that learned to play perfect Tic-Tac-Toe against
    humans, using matchboxes and beads. He called the machine MENACE (Matchbox
    Educable Naughts And Crosses Engine).

    304 matchboxes where labeled with images of Tic-Tac-Toe positions and filled
    with colored beads representing possible moves. At each move, a bead would be
    rattled out of the proper box to determine a move. When MENACE would win, more
    beads of the colors played would be added to each position box. When it would
    lose, the beads were left out to discourage these moves.

    Michie claimed that he trained MENACE in 220 games, being beaten by it eight out
    of the final ten games.

    Thanks to those who tried this one, successful or not. Sorry it turned out to
    be a bit involved.

    Tomorrow, "How Ruby Can Help You Beat Your Grandmother In Scrabble" is the
    topic. If your grandmother is anything like mine, you'll appreciate all the
    help you can get!
    Ruby Quiz, Dec 16, 2004
    1. Advertisements

  2. Thanks for the writeup james. I have to correct this paragraph here, as the
    program would not work if it worked the way you describe it. It is important to
    see, that one should differentiate between exploration and exploitation
    behaviour. Exploration means, that the player tries out new moves to learn more
    about this game, exploitation means usage of the learned knowledge. Your
    description suggest, that once the game know how to make a move, it makes it.
    If it would exhibit this behaviour, it would never learn how to play different
    than the first game. The only thing would be, that it would learn that it plays
    badly (set the score of the states to -INFINITY after an infinit number of

    In fact, there is an adjustable chance, that the player pics a random move,
    even though it know a move. This is the exploitation factor, that is adjusted
    with the badly named random_prob attribute.


    Brian Schröder, Dec 16, 2004
    1. Advertisements

  3. You're correct, of course. I was trying not to over-complicate my
    explanation, but that is a pretty important detail I omitted. Thanks
    for keeping me honest.

    James Edward Gray II
    James Edward Gray II, Dec 16, 2004
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.