Machine Learning
Othello Project Report
Monty Kahan
Columbia University
NY, NY 10027
mk463@columbia.edu
Introduction
The Problem(s)
I have been given a Genetic Programming system with a small number of "Othello"
primitives that can potentially produce a GP function tree that plays
Othello well against a Random Player. The system uses a set of primitives
that hopefully describes the Othello board in an accurate and complete
manner. Given the board state described by that set of primitives, the GP
is then asked to evaluate the n possible moves from that state. Since it
would be difficult to describe every board state, our choice of primitives
is a guess as to what set can remain small and yet describe the board in
such a manner that our GP player can calculate what the next move should
be. This set of primitives may be too small to do the job. Furthermore, at
present the GP system evaluates the players in its population by having
them play against a random player. Real-life experience tells us that in
order to improve at a game we should play a broad variety of players,
ideally experts. We have been supplied with Edgar, who is an expert, but
playing against him alone would mean producing expert anti-Edgar Othello
players and missing the broad variety. Playing against random players
alone would be the equivalent of playing chess against only novices: we
would become slightly better, but we would miss the experience of grand
masters and of seeing their strategies in action.
Finally, our end evaluation of our GP players is to see what percentage
of games they win against random players. It would be interesting to see
more information in our evaluation than just the percentage of wins.
Taking the analogy of chess, we know that grand masters playing against
novice players should be able to win within 15 moves or so; this gives us
another method of evaluation which lets us see just how good our grand
master is. We want to do the equivalent in Othello.
The Solution(s)
Since the GP's search space is large and its learning method is very random,
it would be useful to see if the addition of two primitives, chosen by
me, would help the system along. Specifically, I added a primitive for
how many pieces are near edges, and another primitive for the length of the
longest consecutive chain along a row, column, or diagonal on the board.
Since this was done for both White and Black, a total of four primitives
were added to the system. Near-edge squares, like near-corner squares, are
bad for the player, but not as bad as near corners; experience tells us a
near-edge piece can sometimes even be useful. The second primitive works
on the idea that if there is a long string of your opponent's pieces, a
good move is to place your piece somewhere in the middle of that string,
making it possible later to put your pieces on either side of the string
and convert the entire string to your color. Placing your piece at the end
of the string would mean that on the next move your opponent can simply
place his or her piece after yours, making your move fruitless.
Playing against randomized players alone, or playing against Edgar
alone, is lacking as good training for our GP player. What we would like to
do is use a combination of Edgar and randomized players for our GPs to
measure up against. So whereas previously each player in the population
played 5 games against a random player, we will now randomly choose the
opponent (either Edgar or the Random player) for each of the five games in
the performance measure. Assuming Edgar plays much better than the Random
player, those individuals that happen to play more of their games against
Edgar will tend to do worse and will probably receive a poorer performance
measure than the others. This will introduce a lot more noise into the
fitness measure.
Since we would like more insight into how well our GP players do, beyond
the percentage of wins, we would like to evaluate how well they win or
lose against their opponents. This is a bit tricky, since we are not sure
this is in fact a good measure. For example, suppose the best strategy for
winning Othello were to deliberately fall behind and then, with the final
6 moves or so, come out ahead with only a 5-point lead. While the lead
would be only five points, if this strategy could consistently beat any
player no matter how good, then that strategy would not evaluate as well
as it really should if we used point spreads as well as percentage of wins
in our evaluation. However, assuming this is not the case, and that the
best strategies are those in which our player annihilates its opponent by
a huge margin, then the players with the best point spread will indeed be
the ones that consistently win the most games.
Approach
New Primitives
I. Near Edges
In order to give the GP system an extra boost, two new primitives were added
in the hope that an even more complete and detailed representation of
the board would help the system create better Othello players. The first
primitive represents near edges. OthelloBoard.java was modified to
calculate num_white_near_edges and num_black_near_edges, and near_edges
(black and white) was added to the node set of the GP system and to the
evaluate functions of the GP players. The hope for this primitive follows
the strategy that edges control the board, and thus it is beneficial for
your opponents to control the squares right near the edge, since you then
have a chance of getting your pieces on the edge. This primitive also
works well with the near_corners primitive. In general, experience says
the most important squares are the corners (having at least two of them is
very important), so controlling the squares near the corners is very bad.
However, experience also says that the diagonal "near corner" square is
worse than the horizontal or vertical "near corner" squares. In
combination with the "near edge" primitive, which experience also says is
a bad position to control, the diagonal square is both "near corner" and
"near edge", while the horizontal and vertical "near corners" are merely
"near corner". Near edges is not always bad; sometimes experience shows it
is okay and even good to control, but we can let the GP system decide
whether using the primitive helps produce a better player.
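As an illustration, here is a minimal sketch of how such a near-edge count could be computed on an 8x8 board. The board encoding, the class and field names, and the exact definition of a "near edge" square (the second ring of squares, excluding the edge itself) are my assumptions for illustration, not the actual OthelloBoard.java code.

// Hypothetical sketch: counting "near edge" discs on an 8x8 Othello board.
// Board representation and the EMPTY/BLACK/WHITE constants are assumptions,
// not the actual OthelloBoard.java fields.
public class NearEdgeCount {
    static final int EMPTY = 0, BLACK = 1, WHITE = 2;

    // A square is treated as "near edge" if it lies on the second ring
    // (row or column 1 or 6) but not on the outer edge itself.
    static boolean isNearEdge(int r, int c) {
        boolean onEdge = (r == 0 || r == 7 || c == 0 || c == 7);
        boolean onSecondRing = (r == 1 || r == 6 || c == 1 || c == 6);
        return onSecondRing && !onEdge;
    }

    // Returns {num_black_near_edges, num_white_near_edges} for the given board.
    static int[] countNearEdges(int[][] board) {
        int black = 0, white = 0;
        for (int r = 0; r < 8; r++)
            for (int c = 0; c < 8; c++)
                if (isNearEdge(r, c)) {
                    if (board[r][c] == BLACK) black++;
                    else if (board[r][c] == WHITE) white++;
                }
        return new int[] { black, white };
    }
}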
II. Long Consecutive Chains
Once again I will be using experience to create a primitive that will
hopefully give the system a strategic piece of knowledge about the game
board. Here I assumed that it is always best to try to cut long chains of
your opponent's pieces. Also, when determining how to control the edge, it
is always best to put the edge pieces together in a chain rather than
leave holes in the middle, since your opponent can then place a piece in
that hole, and by the time you reach the "near corner" position you have
effectively given your opponent the corner. However, I decided for
simplicity's sake that we would only concern ourselves with the longest
chain on the board for both white and black. So OthelloBoard now updates
num_white_long and num_black_long primitives by going through the board
column by column, row by row, and diagonal by diagonal (the diagonals are
split into two sets: those that go from top right to bottom left and those
that go from top left to bottom right), finding the longest chain of each
color. The primitive was then added to the node set and the evaluate
functions in the GP system and GP player classes.
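Below is a minimal sketch of this longest-chain computation, scanning every row, column, and both diagonal directions. The board encoding and method names are assumptions for illustration, not the actual OthelloBoard code.

// Hypothetical sketch of the longest-chain computation: scan every row,
// column, and both diagonal directions, keeping the longest run of discs
// of the given color. The board encoding is assumed, not the real code.
public class LongestChain {
    static final int EMPTY = 0, BLACK = 1, WHITE = 2;

    // Longest run of `color` along the line starting at (r, c), stepping by (dr, dc).
    static int longestOnLine(int[][] b, int r, int c, int dr, int dc, int color) {
        int best = 0, run = 0;
        while (r >= 0 && r < 8 && c >= 0 && c < 8) {
            run = (b[r][c] == color) ? run + 1 : 0;
            best = Math.max(best, run);
            r += dr;
            c += dc;
        }
        return best;
    }

    // num_black_long / num_white_long: longest chain of `color` anywhere on the board.
    static int longestChain(int[][] b, int color) {
        int best = 0;
        for (int i = 0; i < 8; i++) {
            best = Math.max(best, longestOnLine(b, i, 0, 0, 1, color));  // row i
            best = Math.max(best, longestOnLine(b, 0, i, 1, 0, color));  // column i
            // top-left to bottom-right diagonals (start on top row and left column)
            best = Math.max(best, longestOnLine(b, 0, i, 1, 1, color));
            best = Math.max(best, longestOnLine(b, i, 0, 1, 1, color));
            // top-right to bottom-left diagonals (start on top row and right column)
            best = Math.max(best, longestOnLine(b, 0, i, 1, -1, color));
            best = Math.max(best, longestOnLine(b, i, 7, 1, -1, color));
        }
        return best;
    }
}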
Variety Training
In order for the system to measure the performance of the players in the
population against a wide variety of opponents, we want our GP players to
play both random players and Edgar for the fitness measure. In doing so,
we have to remember that Edgar is deterministic and so is our GP player,
so we have to randomize Edgar a little. To do that, we had the evalMove
function return a random number for any GameTime less than sixteen
(meaning the first 8 moves of Edgar in any game would be random). We
further decided that the choice of whether to play against Edgar or a
Random Player was itself a random choice: a number was chosen between 0
and 1, with less than .5 meaning Random and more meaning Edgar. This meant
the fitness measure could be slightly skewed for players who happened to
play more Edgar games than Random games (assuming that Edgar is better
than a random player). However, it also meant the system may be a bit more
random in selecting the best GP players for the next generation. (Note:
originally a new EdgarPlayer object was created for every game, but that
meant a file read for every object, which caused the system to get too
many file errors. Now all Edgar games are played with the same Edgar
object.)
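A rough sketch of this scheme follows. The class and method names (playGame, randomizedEdgarEvalMove, and so on) are illustrative stand-ins for the real GP system's classes; only the 0.5 threshold, the five fitness games, and the "random evaluation for GameTime less than sixteen" rule come from the description above.

// Hypothetical sketch of the variety-training fitness loop described above.
import java.util.Random;

public class VarietyTraining {
    static Random rng = new Random();

    // Inside a randomized Edgar: for any gameTime below 16 (i.e. Edgar's first
    // 8 moves), return a random evaluation instead of the deterministic one.
    static double randomizedEdgarEvalMove(double deterministicScore, int gameTime) {
        return (gameTime < 16) ? rng.nextDouble() : deterministicScore;
    }

    // For each of the 5 fitness games, pick the opponent at random:
    // below 0.5 play the Random player, otherwise play (randomized) Edgar.
    static int playFitnessGames(Object gpPlayer, Object edgar, Object randomPlayer) {
        int wins = 0;
        for (int game = 0; game < 5; game++) {
            Object opponent = (rng.nextDouble() < 0.5) ? randomPlayer : edgar;
            if (playGame(gpPlayer, opponent)) wins++;   // assumed to return true on a GP win
        }
        return wins;
    }

    static boolean playGame(Object gp, Object opponent) {
        // Placeholder: the real system plays a full Othello game here.
        return rng.nextBoolean();
    }
}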
Evaluating Using Score Totals
We are also interested in seeing how well or how badly our GP players win
or lose, respectively. Here we modify OthelloCache to have our GP players
play 50 games against Edgar (randomized, of course, since otherwise we
could play just 1 game) and 50 games against Random Players, outputting
the total scores as well as how many games they win or lose out of the
total games played. We can then see whether the percentages roughly match
up. If our GP is losing close to 100% or winning 100% of the time, we can
use the total score percentages to see whether it is losing or winning all
its games by a hairbreadth, or whether the losses and wins have large
point spreads. Using Random players as well as Edgar for our evaluation
ensures that we have not developed some special expert anti-Edgar player
who would do poorly playing against anyone else.
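The following is a minimal sketch of what such an evaluation loop might look like: 50 games against one opponent, reporting both the win count and the share of total pieces. The GameResult type and the playGame driver are hypothetical placeholders, not the actual OthelloCache code.

// Hypothetical sketch of the modified evaluation: 50 games per opponent,
// reporting wins and the percentage of all pieces held by the GP player.
public class ScoreEvaluation {
    static class GameResult {            // pieces each side holds at game end
        int gpPieces, opponentPieces;
    }

    static void evaluate(Object gpPlayer, Object opponent, String label) {
        int wins = 0, gpTotal = 0, oppTotal = 0;
        for (int game = 0; game < 50; game++) {
            GameResult r = playGame(gpPlayer, opponent); // assumed game driver
            gpTotal += r.gpPieces;
            oppTotal += r.opponentPieces;
            if (r.gpPieces > r.opponentPieces) wins++;
        }
        double scorePct = 100.0 * gpTotal / (gpTotal + oppTotal);
        System.out.println(label + ": won " + wins + "/50 games, "
                + String.format("%.1f", scorePct) + "% of total pieces");
    }

    static GameResult playGame(Object gp, Object opponent) {
        // Placeholder for the real Othello game loop; an 8x8 board holds 64 discs.
        GameResult r = new GameResult();
        r.gpPieces = 32;
        r.opponentPieces = 32;
        return r;
    }
}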
Results
Introduction
The results are somewhat varied, and there are two issues readers should
keep in mind as they view them.
- Training Took Forever.
The addition of the primitives "black_long" and "white_long", which compute
the longest chain of black or white pieces, added 256 additional steps to
the UpdateBoard function (checking row-wise, column-wise,
diagonal-left-wise, and diagonal-right-wise). Training took even longer in
the last experiment, where we trained against Edgar, and since our
training kept getting interrupted, we have uneven sets of experiment data.
The most data comes from Part I, adding new primitives; the second most
from Part 0 (Baseline, no additions; the only reason we have fewer GPs to
test from this part, even though it took less time to compute, is that it
got interrupted too often); and the least from Part II, training against
Edgar. Part III was set up so that running the previous two parts would
automatically give the evaluation in terms of total score as well as just
the win percentage.
- Different Fitness Measures.
Since in Part II we trained against Edgar, and not just a random player,
the GPs produced may have a higher (worse) fitness measure even though
they are better players overall. We have to rely on the end evaluation,
where the players were tested in 50 games against Random players and 50
against Randomized Edgar.
Baseline
The Baseline had no additions. There were four runs in total, and the
generations of all the runs added up to 15. Because of the paucity of
data, I chose to select those best-of-generation players which did well in
training, and from among these I selected those that did well against a
set of test examples for presentation here (a biased measure, but one I
had to resort to, because the runs took so long and very rarely ever beat
the terminating fitness measure). When choosing which GPs to present here,
we ignored all GPs which were obvious, such as "black" or "-white", or
anything that had "black" or "-white" as the only important terms, even if
they were the best player in a run (e.g. Run #2's best player).
Out of the three presented here, two players stand out as having
especially good scores (relatively) against either Random Players or
Edgar. (All percentages are over 50 games played.)
Run | Gen | Best Player in Run? | GP Function | vs Random Players (Win % / Total Score %*) | vs Randomized Edgar (Win % / Total Score %*)
1 | 2 | No | - black_edges + 10 / black_near_corners black | 74% / 62% | 32% / 40%
3 | 4 | Yes | + * + black_near_corners white - black_corners 10 / / / white white_corners + black white_corners * 10 white_corners | 84% / 61% | 26% / 42%
4 | 1 | No | + * black_edges black_corners black | 80% / 63% | 16% / 37%
* - AVG percentage of board occupied by black pieces at end of game.
It is interesting to see that players which have a higher win percentage
against Edgar have a much lower percentage against Random Players, and
vice versa. What seems good for one seems bad for the other, or else some
players are just overspecialized for beating Edgar. We do "okay" against
Random Players, winning as much as 84% of the time. However well we do
against Random Players, though, seems to be roughly how Edgar does against
us: as we are to Random players, so is Edgar to us. We win 75% to 85% of
the time against Random Players; Edgar wins 68% to 85% of the time against
us. The Total Score Percentage is roughly the same: against Random Players
we take about 60% of the points and win most of the time with that, while
Edgar takes about 60% of the points against us and wins most of the time
with that.
New Primitives
New Primitives had only 2 runs, though they ran for much longer. The first
had 12 generations before terminating because of a system shut-down. The
second run went for 18 generations before it was interrupted, so it never
hit terminal fitness. In general we would expect that, since we only added
two new primitives and nothing else, we would do as well as the Baseline,
since the system could simply ignore the new primitives and get GPs as
good as the Baseline's. However, we see that it was only able to hit
terminal fitness once, while the Baseline hit it four times with far fewer
generations. For this I have no explanation. Presented here are only those
GP players produced which contain the new primitives. GP players which
seemed to do well yet did not contain the new primitives were also
produced, though they are not presented here. (Frankly, none of them
seemed to do that well, especially those that finished best in their run.)
Run | Gen | Best Player in Run? | GP Function | vs Random Players (Win % / Total Score %*) | vs Randomized Edgar (Win % / Total Score %*)
1 | 8 | No | + / black white_corners white_near_edges | 66% / 58% | 18% / 40%
2 | 4 | No | / * black_edges black_long white_edges | 68% / 61% | 24% / 37%
2 | 8 | No | + black_long black_edges | 72% / 61% | 10% / 34%
* - AVG percentage of board occupied by black pieces at end of game.
Frankly, the performance is lackluster. None of the GP players produced do
anywhere near as well as the Baseline, which, as I stated earlier, is
strange. And even though some of the players have interesting algorithms
when we think about what strategy the GP player is following, none of them
work that well. Especially strange is that one player (best in its
generation, like all these players) has a very high win percentage against
random players (relative to the rest of the players) using the new
primitive black_long (Player 3 in the table), and yet wins only 10% of its
games against Edgar, which is appalling. Even the GP Total Score
Percentage, which usually stays around 40% for the worst strategy in every
experiment, here drops to 34%. What is really weird is that I understand
the strategy this GP suggests; I think I use that strategy when I play
(which may explain why I lose), and yet it does so badly. (The player is
black_edges + black_long, which suggests that when you are trying to get
edge pieces, it is advisable to clump the edge pieces together. A very
valid strategy, in my opinion.)
Variety Training
Variety Training took an incredibly long time, so we were only able to get
one run, with nine generations. This run did terminate with a GP player
with a fitness of 40. Subjectively, looking at the GP players produced, I
was impressed. I am aware that what I am saying here is both subjective
and inexact, but the GP players produced by the first two methods included
a lot of players that were just asking to be wiped out by any intelligent
player. Here almost all our players were combinations of good things to
have when playing: black, black_edges, black_corners. All of them seemed
to embody good strategies. Many of them were slight alterations of one
another; seemingly the system found something good that remained good
under modification, which could mean it was good over a larger set of test
cases. And the GPs produced did the best overall, at least in what I
tested in my evaluation over 50 games. I present two here.
Run | Gen | Best Player in Run? | GP Function | vs Random Players (Win % / Total Score %*) | vs Randomized Edgar (Win % / Total Score %*)
1 | 3 | No | + + white_near_edges black black_corners | 90% / 66% | 34% / 44%
1 | 7 | No | + + white_near_edges black black | 74% / 61% | 16% / 39%
* - AVG percentage of board occupied by black pieces at end of game.
Run 1 Generation 3 has the highest evaluation scores of any GP player
produced, and in every category: against Random Players and against Edgar,
in Win Percentage and in Total Score Percentage. And it was generated in
only three generations. In addition, Run 1 Generation 7 has very high test
scores (at least compared to New Primitives). While training took longer,
I believe the GPs selected were probably better players on the whole
because of the experience of having trained against both expert and random
players. The fact that we did not hit Terminal Fitness was expected, now
that we were training against Edgar and expecting to lose more often.
New Evaluation Method
The data from using the Total Score Percentage has already been presented.
Frankly, the Score Percentage seems to roughly follow a 60-40 split, with
60% going to the better strategy, so as an evaluation tool it does not
work. That does not mean the experiment is a failure. What is important to
realize is that total scores are used as a fitness measure throughout the
learning process, but the results here seem to say that in general it is
more important to consider how many times one has won, not by how much. If
we had the time, we should be training using 50 games, not 5 games, per
generation for each individual, and using Win Percentage rather than Total
Score Percentage as our fitness measure. No matter how good or bad you
are, it seems that you win or lose by roughly the same number of points.
Conclusions
Most of what I believe the data has "told" me I have already noted in the
Results section; I would just like to sum it up here. Once again, I cannot
explain how the Baseline did so much better than New Primitives. The only
thing I can think of is that my new primitives do not accomplish anything
and that they confused the GP system with useless information. They take a
long time to compute and may not really encapsulate anything which the GP
can use to evaluate the board well, and the GP spends time trying to
create players with these new primitives, making the system that much
slower at creating a good Othello player. New Primitives did not do well,
quite poorly in fact: it did beat the random player, but against Edgar one
"good" GP player won only 10% of the time. Variety Training seemed to help
a lot; we did beat the Baseline, and the GP players produced looked on the
whole strategically sound and promising. The New Evaluation Method did not
seem to give any new insights, except that we should be looking for a new
fitness measure, and that if we had the time we would play more games in
training and use the win percentage.
Last modified Nov 18, 1997 by
Monty Kahan mk463@columbia.edu