[SUMMARY] Learning Tic-Tac-Toe (#11)

R

Ruby Quiz

In an interesting contrast, this quiz generated a lot of good discussion, but
only two solutions. I believe that may be because the problem turned out to be
more complicated than I intended. I know I personally ran into a few
complications and didn't have a chance to finish my own solution.

Again, the discussion was excellent and you should probably skim the thread, if
you weren't following it this weekend. Multiple Tic-Tac-Toe servers (written in
Ruby, of course) were posted, so programs could play against each other
remotely.

I posted a message about how Tic-Tac-Toe positions can be transformed by
rotation and "mirroring" to other seemly different layouts that can be handled
the same.

I also posted a "tictactoe.rb" library that makes most interaction with the game
trivially simple.

Finally, a few of us posted some notes about the problems we ran into. I agree
with Hans Fugal, who said that you can learn almost as much from those.

The two solutions posted are similar. Basically, they learn to avoid their
mistakes over time. They accomplish this by "scoring" the moves they made at
each position in a game, based on whether they won or lost. Eventually, this
knowledge allows them to select mainly strong moves, simply by remembering how
they've done in the past, in the same position.

I'll show Brian Schroeder's code here, but both were interesting to examine.
Brian's solution contained eight files of Ruby code, the embedded documentation
for said files, charts, a write-up of the process, and was all wrapped up in a
handy web page, so you can dig as deep as you like into what he's done. (And he
once asked where *I* find the time!) For the purposes of this summary, I'll
stick to his learning code.

Here it is:

class Learning < BasicInterface
attr_accessor :random_prob
attr_reader :player

def initialize
@state_values = Hash.new(0)
@state_transitions = {}
@random_prob = 0.05
end

def new_game(player)
@player = player
@states_visited = []
end

def choose_move(game)
moves = game.moves
if !@state_transitions[game.state_id] or rand < random_prob
move = moves[rand(moves.length)]
else
move_id = @state_transitions[game.state_id].max{ |(ma,sa),(mb,sb)|
@state_values[sa] <=> @state_values[sb]
}[0]
move = moves.select{|m| m.move_id == move_id}[0]
end
move
end

def inform_of_move(before, after, move)
@states_visited << before.state_id << after.state_id
(@state_transitions[before.state_id] ||= {})[move.move_id] =
after.state_id

if after.final?
winner = after.winner
if winner
value = winner == self.player ? 100.0 : -1000.0
else
value = 0.0
end

factor = 1.0
while state = @states_visited.pop
@state_values[state] = (1.0 - factor) * @state_values[state] +
factor * value
factor *= 0.5
end
end
end
end

The initialize() method sets up Brian's @state_values and @state_transitions,
which constitute the AI's brain.

@state_values will hold scores for the positions the AI has won or lost with
before. @state_transitions holds a "map" of how to get from position to
position. When these are filled in, the AI will have "learned" what positions
are desirable and how it can reach them.

Knowing this, choose_move() is easy to breakdown. It checks to see if it knows
anything about the moves from the current position. If it does, it selects the
highest score it can find for itself (else branch). If it doesn't, it goes with
a random choice from all available moves (if branch).

The final piece of the puzzle is inform_of_move(). This method remembers all
moves made in the game, mainly. When it sees a final position, it scores all
those moves based on whether it won or lost. The scoring scale is slanted, to
encourage the AI to avoid losses.

That's the heart of Brian's solution. The rest of the code is interface,
client/server, a perfect minimax player, and Tic-Tac-Toe details.

For the curious, this quiz was inspired by the research of Donald Michie. In
1961 he built a "machine" that learned to play perfect Tic-Tac-Toe against
humans, using matchboxes and beads. He called the machine MENACE (Matchbox
Educable Naughts And Crosses Engine).

304 matchboxes where labeled with images of Tic-Tac-Toe positions and filled
with colored beads representing possible moves. At each move, a bead would be
rattled out of the proper box to determine a move. When MENACE would win, more
beads of the colors played would be added to each position box. When it would
lose, the beads were left out to discourage these moves.

Michie claimed that he trained MENACE in 220 games, being beaten by it eight out
of the final ten games.

Thanks to those who tried this one, successful or not. Sorry it turned out to
be a bit involved.

Tomorrow, "How Ruby Can Help You Beat Your Grandmother In Scrabble" is the
topic. If your grandmother is anything like mine, you'll appreciate all the
help you can get!
 
B

Brian Schröder

In an interesting contrast, this quiz generated a lot of good discussion, but
only two solutions. I believe that may be because the problem turned out to
be more complicated than I intended. I know I personally ran into a few
complications and didn't have a chance to finish my own solution.

Again, the discussion was excellent and you should probably skim the thread,
if you weren't following it this weekend. Multiple Tic-Tac-Toe servers
(written in Ruby, of course) were posted, so programs could play against each
other remotely.

I posted a message about how Tic-Tac-Toe positions can be transformed by
rotation and "mirroring" to other seemly different layouts that can be
handled the same.

I also posted a "tictactoe.rb" library that makes most interaction with the
game trivially simple.

Finally, a few of us posted some notes about the problems we ran into. I
agree with Hans Fugal, who said that you can learn almost as much from those.

The two solutions posted are similar. Basically, they learn to avoid their
mistakes over time. They accomplish this by "scoring" the moves they made at
each position in a game, based on whether they won or lost. Eventually, this
knowledge allows them to select mainly strong moves, simply by remembering
how they've done in the past, in the same position.

I'll show Brian Schroeder's code here, but both were interesting to examine.
Brian's solution contained eight files of Ruby code, the embedded
documentation for said files, charts, a write-up of the process, and was all
wrapped up in a handy web page, so you can dig as deep as you like into what
he's done. (And he once asked where *I* find the time!) For the purposes of
this summary, I'll stick to his learning code.

Here it is:

class Learning < BasicInterface
attr_accessor :random_prob
attr_reader :player

def initialize
@state_values = Hash.new(0)
@state_transitions = {}
@random_prob = 0.05
end

def new_game(player)
@player = player
@states_visited = []
end

def choose_move(game)
moves = game.moves
if !@state_transitions[game.state_id] or rand < random_prob
move = moves[rand(moves.length)]
else
move_id = @state_transitions[game.state_id].max{ |(ma,sa),(mb,sb)|
@state_values[sa] <=> @state_values[sb]
}[0]
move = moves.select{|m| m.move_id == move_id}[0]
end
move
end

def inform_of_move(before, after, move)
@states_visited << before.state_id << after.state_id
(@state_transitions[before.state_id] ||= {})[move.move_id] =
after.state_id

if after.final?
winner = after.winner
if winner
value = winner == self.player ? 100.0 : -1000.0
else
value = 0.0
end

factor = 1.0
while state = @states_visited.pop
@state_values[state] = (1.0 - factor) * @state_values[state] +
factor * value
factor *= 0.5
end
end
end
end

The initialize() method sets up Brian's @state_values and @state_transitions,
which constitute the AI's brain.

@state_values will hold scores for the positions the AI has won or lost with
before. @state_transitions holds a "map" of how to get from position to
position. When these are filled in, the AI will have "learned" what
positions are desirable and how it can reach them.

Knowing this, choose_move() is easy to breakdown. It checks to see if it
knows anything about the moves from the current position. If it does, it
selects the highest score it can find for itself (else branch). If it
doesn't, it goes with a random choice from all available moves (if branch).

Thanks for the writeup james. I have to correct this paragraph here, as the
program would not work if it worked the way you describe it. It is important to
see, that one should differentiate between exploration and exploitation
behaviour. Exploration means, that the player tries out new moves to learn more
about this game, exploitation means usage of the learned knowledge. Your
description suggest, that once the game know how to make a move, it makes it.
If it would exhibit this behaviour, it would never learn how to play different
than the first game. The only thing would be, that it would learn that it plays
badly (set the score of the states to -INFINITY after an infinit number of
games).

In fact, there is an adjustable chance, that the player pics a random move,
even though it know a move. This is the exploitation factor, that is adjusted
with the badly named random_prob attribute.

Regards,

Brian
 
J

James Edward Gray II

Thanks for the writeup james. I have to correct this paragraph here,
as the
program would not work if it worked the way you describe it. It is
important to
see, that one should differentiate between exploration and exploitation
behaviour. Exploration means, that the player tries out new moves to
learn more
about this game, exploitation means usage of the learned knowledge.
Your
description suggest, that once the game know how to make a move, it
makes it.
If it would exhibit this behaviour, it would never learn how to play
different
than the first game. The only thing would be, that it would learn that
it plays
badly (set the score of the states to -INFINITY after an infinit
number of
games).

In fact, there is an adjustable chance, that the player pics a random
move,
even though it know a move. This is the exploitation factor, that is
adjusted
with the badly named random_prob attribute.

You're correct, of course. I was trying not to over-complicate my
explanation, but that is a pretty important detail I omitted. Thanks
for keeping me honest.

James Edward Gray II
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,772
Messages
2,569,593
Members
45,108
Latest member
AlbertEste
Top