There are several ways of pitting individuals against one another to determine relative fitness. One possibility is to have each individual play a fixed number of games against a random sampling of the population, with fitness derived from the individual's performance over those games. An alternative is to play a single-elimination tournament among the individuals in the entire population. The latter method has the advantage of requiring fewer games, thereby reducing training time, but it potentially introduces more noise into the learning process (in each generation, half of the population is judged solely on the result of a single game).
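The game-count trade-off between the two schemes can be made concrete with a small sketch (the population size and games-per-individual figures here are illustrative assumptions, not the experiments' settings):

```python
# Games needed per generation under the two fitness-evaluation schemes.
# The specific numbers below are illustrative assumptions.

def games_random_sampling(pop_size: int, games_per_individual: int) -> int:
    """Each individual plays a fixed number of games against sampled opponents."""
    return pop_size * games_per_individual

def games_single_elimination(pop_size: int) -> int:
    """A single-elimination tournament eliminates exactly one player per game,
    so a field of n players is decided in n - 1 games."""
    return pop_size - 1

print(games_random_sampling(128, 5))   # 640
print(games_single_elimination(128))   # 127
```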
This assignment examines the use of a single-elimination tournament as a competitive fitness measure in a GA system for generating Othello players. The tournament fitness measure is evaluated both in terms of its effects on learning time and on the generalization success of the resulting individuals. The experiments also address the problem of measuring progress when a competitive fitness measure is employed.
The other baseline case used the same setup as the first, but with fitness measured by playing against Edgar rather than the random player. Since both Edgar and the individuals in the population play deterministically, it was not possible to play a series of games to evaluate each individual (an attempt was made to randomize the GA players by selecting a move at random when there were several choices with the same value, but this degraded the performance of the GA players, even against the random player!). The population size was halved to 64 due to time constraints. The termination fitness for this experiment was set to 20.0, which meant that the resulting player was required to beat Edgar by a reasonable margin.
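The tie-breaking randomization attempted above might look something like the following sketch (the `moves` list and `value_of` evaluator are hypothetical stand-ins for the GA player's move interface):

```python
import random

def pick_move(moves, value_of):
    """Pick a highest-valued move, breaking ties at random.

    `moves` is the list of legal moves and `value_of` is the player's
    board evaluator (both hypothetical names, assumed for illustration).
    """
    best = max(value_of(m) for m in moves)
    ties = [m for m in moves if value_of(m) == best]
    return random.choice(ties)
```

With a deterministic evaluator, this makes otherwise-identical players play different games, at the cost (observed above) of sometimes choosing among equally-valued moves that are not equally good in practice.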
The first two competitive fitness experiments replaced the fixed-set evaluation of the baseline testcases with a single-elimination tournament, which was used to derive the fitness of each individual in the population. In each round of the tournament, if an individual had lost in the previous round, the individual was "awarded" a score of 64 (the worst possible) in order to indicate that it was out of the tournament. This produced the intended effect of favoring those individuals who advanced farther in the tournament. It also meant that all of the losers in a particular round were penalized equally, with their fitnesses differentiated by their performance in earlier rounds. In the experiments conducted, the population size was set to a power of two to avoid byes, but the tournament-playing code was written to handle byes by awarding the receiver of the bye a score equal to its average score in the previous rounds of the tournament (so that the bye neither directly helped nor directly harmed the player's score).
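A minimal sketch of this tournament fitness scheme, assuming a lower-is-better 0-to-64 score scale with 64 as the worst possible; `play_game` is a hypothetical hook returning the two players' scores, and the 32.0 default for a first-round bye is an assumption (the scale's midpoint):

```python
import math
import random

def tournament_fitness(players, play_game):
    """Derive per-player fitness from a single-elimination tournament (sketch).

    `play_game(a, b)` is a hypothetical hook returning (score_a, score_b),
    lower is better, 64 is the worst possible. A bye is awarded the player's
    average score over earlier rounds (32.0 if it has none), so it neither
    helps nor hurts; eliminated players are "awarded" 64 for every round
    they miss, so losers in a round are penalized equally and differentiated
    only by their earlier-round performance.
    """
    n_rounds = math.ceil(math.log2(len(players)))
    scores = {p: [] for p in players}
    alive = list(players)
    random.shuffle(alive)
    while len(alive) > 1:
        survivors = []
        if len(alive) % 2 == 1:                    # odd field: hand out a bye
            bye = alive.pop()
            prev = scores[bye]
            scores[bye].append(sum(prev) / len(prev) if prev else 32.0)
            survivors.append(bye)
        for a, b in zip(alive[0::2], alive[1::2]):
            sa, sb = play_game(a, b)
            scores[a].append(sa)
            scores[b].append(sb)
            survivors.append(a if sa < sb else b)  # lower score advances
        alive = survivors
    # One plausible aggregation: average over all rounds, padding missed
    # rounds with the worst score (64).
    return {p: (sum(s) + 64 * (n_rounds - len(s))) / n_rounds
            for p, s in scores.items()}
```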
In the first competitive fitness experiment, the progress of the GA was tracked by measuring only the tournament winner against the random player. The experiment was run with identical parameters to the first baseline. Again, the resulting player was required to beat the random player in five consecutive games without allowing the random player any pieces at the end of any of the five games.
A similar experiment was carried out using performance against Edgar as the measure for determining termination fitness. This experiment used the same parameters as the second baseline.
These first two competitive fitness experiments used only the single-elimination tournament to guide evolutionary progress. This meant that there was no direct evolutionary pressure on the individuals to improve with respect to the measure used to determine success (performance against either the random player or Edgar). As a result, the population might fail to make measurable progress towards a successful outcome. A final experiment combined the single-elimination tournament with fitness measured against Edgar, in an attempt to see if applying such evolutionary pressure would help the learning process. For this experiment, each individual's tournament result was weighted evenly with its score in a game played against Edgar and the sum of the two was used as the measure of fitness. The experiment was conducted using the same parameters as the second baseline, but with the termination fitness for the experiment set at a level that would require both good performance against Edgar and good performance in the tournament.
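Since both components can be read on the same lower-is-better 0-to-64 scale (an assumption carried over from the tournament scoring), the even weighting amounts to a simple sum; a minimal sketch:

```python
def hybrid_fitness(tournament_score: float, edgar_score: float) -> float:
    """Weight the tournament result evenly with the score of a single game
    against Edgar. Both components are assumed to share the same
    lower-is-better 0-to-64 scale, so even weighting is their sum."""
    return tournament_score + edgar_score

def reached_termination(fitness: float, threshold: float) -> bool:
    """Lower-is-better termination check (the threshold level is chosen so
    that both components must be good for the sum to clear it)."""
    return fitness <= threshold

# e.g. a strong tournament run (average round score 8.0) combined with a
# narrow loss to Edgar (score 33.0):
combined = hybrid_fitness(8.0, 33.0)
```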
The test runs for the baseline testcases were compared with the competitive fitness test runs to measure the effect of competitive fitness on both the learning rate and the time required. Because the execution time varied due to the presence of other processes, this comparison was based upon the number of generations and games played.
Finally, the resulting players from the baseline testcases were evaluated against the players generated using competitive fitness measures in order to determine the players' generalization success.
(additional details of the modifications to the original Othello GA system are provided here)
In terms of learning time, although the baseline case required fewer generations, it played 640 games per generation (128 individuals each playing five games), while the competitive fitness experiment played only 132 games per generation (127 for the single-elimination tournament plus five games for the tournament winner against the random player). Thus the average number of games played per run was only 950.4 in the competitive case versus 1280.0 for the baseline. These results are summarized below.
Testcase      Games/Generation   Avg. # of Generations   Avg. # of Games Played
--------------------------------------------------------------------------------
Baseline            640                  2.0                  1280.0
Tournament          132                  7.2                   950.4

In order to measure generalization success, the resulting players from the baseline were pitted against the resulting players from the competitive fitness runs. Two games were played between each pair of players, with each side playing both as black and white. In addition, the players were also evaluated against the random player and against Edgar. The results are shown below.
                          Games won vs.
Testcase (# players)      Other           Random (%)   Edgar (%)
------------------------------------------------------------------
Baseline (6 players)      47 (39.17%)        100%          0%
Tournament (10 players)   73 (60.83%)        100%          0%

Both sets of players fared well against the random player. This was expected, since performance against the random player was used precisely as the measure of fitness in the baseline case and also as the test for the termination condition in the tournament case. Surprisingly, neither set of players performed well against Edgar, with both sets losing all games. More importantly, in head-to-head competition, the players generated using competitive fitness won 60.83% of the games. This supports the notion that competitive fitness exposes the population to a wider range of strategies, resulting in more generalizable solutions.
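The pairwise evaluation used for these head-to-head comparisons can be sketched as follows (`play_game` is a hypothetical hook taking the black and white players and returning the winner, or None for a tie):

```python
from itertools import product

def head_to_head(group_a, group_b, play_game):
    """Count games won by each group when every cross-group pair plays
    twice, once with each side as black and once as white.

    `play_game(black, white)` is a hypothetical hook returning the
    winning player object, or None for a tie.
    """
    wins_a = wins_b = 0
    for a, b in product(group_a, group_b):
        for black, white in ((a, b), (b, a)):   # swap colors for game two
            winner = play_game(black, white)
            if winner == a:
                wins_a += 1
            elif winner == b:
                wins_b += 1
    return wins_a, wins_b
```

Playing each pairing once per color removes the first-move advantage as a confounding factor in the win counts.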
For the second competitive fitness experiment (where only tournament result was used to determine fitness), the GA generated the players shown here. The GA required an average of 45.4 generations to reach a successful solution, taking into account two bad runs which were terminated in failure after the maximum number of generations had been reached. With 64 games played in each generation (63 for the tournament over a population size of 64 individuals plus one game to evaluate the best individual against Edgar), this meant on average 2905.6 games played per successful run, many more than the number required for the baseline.
The final competitive fitness experiment resulted in the players depicted here. On average, 20.7 generations were required to reach a successful solution, accounting for one bad run which failed to achieve a good solution in the maximum number of generations. Since each generation required a total of 127 games (63 for the tournament and 64 to evaluate each individual against Edgar), the average number of games played per good run was 2628.9. This is still many more than the number of games required in the baseline case, but fewer than the number required when only tournament result is used to determine fitness. In addition, the average number of generations required by this hybrid fitness case was less than half of the number required in the pure tournament case. This lends support to the idea that incorporating direct evolutionary pressure reduces the number of generations required to arrive at a solution.
The preceding results are illustrated below. The line labeled Tournament refers to the second competitive fitness experiment (using only tournament result to evaluate fitness), while the Hybrid line refers to the final experiment (employing both tournament result and performance against Edgar as the fitness measure).
Testcase      Games/Generation   Avg. # of Generations   Avg. # of Games Played
--------------------------------------------------------------------------------
Baseline             64                  6.0                   384.0
Tournament           64                 45.4                  2905.6
Hybrid              127                 20.7                  2628.9

As in the previous case, the resulting players from the second baseline and the final two competitive fitness runs were pitted against one another for the purpose of determining generalization success. Again, each pair of individuals played two games with each side playing as both black and white. The players were also evaluated against the random player and against Edgar. The results are shown below.
Baseline vs. Pure Tournament

                          Games won vs.
Testcase (# players)      Other        Random (%)   Edgar (%)
---------------------------------------------------------------
Baseline (10 players)     21 (21%)        60%         100%
Tournament (5 players)    79 (79%)        60%          80%
Baseline vs. Hybrid Fitness

                          Games won vs.
Testcase (# players)      Other        Random (%)   Edgar (%)
---------------------------------------------------------------
Baseline (10 players)     64 (32%)        60%         100%
Hybrid (10 players)      136 (68%)        80%          80%
Pure Tournament vs. Hybrid Fitness

                          Games won vs.
Testcase (# players)      Other        Random (%)   Edgar (%)
---------------------------------------------------------------
Tournament (5 players)    52 (52%)        60%          80%
Hybrid (10 players)       48 (48%)        80%          80%

All three sets of players performed well against Edgar. In the pure tournament case, the single loss to Edgar occurred because the individual with the best fitness in the population was not the tournament winner (this could have occurred if the tournament winner narrowly beat its opponent in the last round of the tournament, with that opponent having a better score average for the previous rounds). Although the implementation measured the tournament winner against Edgar and determined that the termination fitness had been reached, the individual with the best fitness was selected by the GA as the resulting player. In the hybrid case, of the two games that were not victories against Edgar, one was a tie and the other was a narrow loss. In both cases, the combined fitness of the individual was still lower than the termination fitness, indicating that the player performed very well in the tournament half of the fitness measurement.
As expected, in head-to-head competition against the baseline testcase, both the pure tournament and hybrid competitive fitness players outperformed the players generated without using competitive fitness. In the pure tournament case, the resulting players won 79% of the time, while the hybrid players claimed 68% of the games against the baseline players. When measured against each other, the players generated using pure tournament and hybrid cases were nearly evenly matched, with the slight edge going to the pure tournament players. On the other hand, the hybrid players performed slightly better against the random player.