My new testing scheme
Posted: Tue Nov 20, 2007 4:26 am
I'll start out with a little boring story; those not interested can just skip to the last couple of paragraphs.
Up until fairly recently, my chess engine's development was plagued by the lack of any real testing regimen. I have only recently learned of the need for games, and lots of them. Before that, I had relied mostly on test suites and my own observations to judge improvements, which is of course ridiculous.
I started playing test matches, but I still felt something was missing. I didn't have the time to keep up testing schedules, let alone keep track of game results in my head. I tried to alleviate this last problem by creating a game folder that saved every game played by the various versions.
But in the past few weeks, I ran into a problem. I had a version that did pretty well in games, and I was pretty satisfied with it. I then made a bunch of new versions, with various bugfixes and improvements, that I never checked for strength. When I tested the latest version, it seemed to be much weaker than the best so far. So I started looking back. Since I hadn't made a new version for every change, I had lost the exact version that did well, though I still have an approximation. So I've been going through all of the changes, testing and testing... and I still haven't quite found the sweet spot.
This whole fiasco has made me keenly aware of two things I badly needed:
1) A way to keep my machine running as much as possible, to maximize the number of results I get.
2) A way to keep track of the changes between versions, including a way to keep track of their results.
In thinking about the first problem, I had envisioned an automated testing scheme that read a list of versions and tested each one. Every version could be tested without me having to wait for a certain version to finish a tournament. All I would have to do is add the version to a file. But this still leaves the problem of distributing the games among versions. Sometimes only a few tests are needed before a version can be scrapped, and obviously some older versions should be tested if there are no new versions to test.
As for the second problem, I have always maintained my versions by hand, so I thought about using a version control system. I have used CVS in the past for other purposes and found it to be a pain. Some of the others sounded nice, but none of them quite fit my purposes: I have a basically linear version progression, with no branches, but I also need some way to keep track of the version number, with the engine itself being aware of that number.
And then, in a brilliant flash, it came to me. Why not integrate the solutions to both problems into my own versioning/testing system? The testing side rests on one key insight: I have been doing some research in Go, so I am familiar with the UCT algorithm, which is an extension of UCB. I realized that UCB could be applied quite well to distributing the testing games. The versions that do best are tested more often, to increase confidence; versions that suck can be abandoned; and I can bias the selection towards new versions so they get a good amount of testing quickly. Instead of using the probability of winning, I could just use Elo (with the help of bayeselo), after rescaling it to a suitable range.
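To make that concrete, here is a rough sketch of what the selection step could look like. This is only an illustration of the idea, not finished code: the field names, the logistic mapping from Elo onto a 0-1 reward, and the exploration constant are all assumptions that would need tuning.

```python
import math

# Sketch: pick the next version to test using a UCB1-style rule.
# 'versions' is assumed to be a list of dicts like
#   {"name": "v37", "games": 12, "elo": 35.0}
# where 'elo' comes from bayeselo and 'games' is how many test games
# that version has played so far. Names and fields are illustrative.

EXPLORATION = 50.0   # exploration constant (a guess, to be tuned)
ELO_SCALE = 400.0    # scale for squashing Elo into roughly [0, 1]

def ucb_score(version, total_games):
    # Untested versions get an infinite score, so new versions are tried first.
    if version["games"] == 0:
        return float("inf")
    # Map relative Elo onto a 0..1 "reward", like a win probability.
    reward = 1.0 / (1.0 + 10.0 ** (-version["elo"] / ELO_SCALE))
    bonus = math.sqrt(math.log(total_games) / version["games"])
    return reward + (EXPLORATION / ELO_SCALE) * bonus

def pick_version(versions):
    total = sum(v["games"] for v in versions) or 1
    return max(versions, key=lambda v: ucb_score(v, total))
```

The effect is exactly what I want: strong versions keep accumulating games (tightening their error bars), weak ones starve, and brand-new versions jump to the front of the queue until they have a meaningful sample.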
So I lumped these all together into a sort-of-cohesive whole. I have a program to create a new version: it increments the version number (writing it to a central file and modifying the makefile) and creates a new directory. I can run it after every change that affects playing strength (and has been bug tested). Then I have a testing program that reads all of the versions and automatically tests them according to the aforementioned UCB rule. This program runs constantly in the background, reading the version list produced by the versioning program, picking a suitable version for testing, and running it against any number of opponents with XBoard. The net result is UCB + versioning + bayeselo + XBoard, and a much simpler, streamlined way of testing changes. A rough sketch of the versioning helper is below.
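Here is roughly what the versioning helper could look like. The file names (version.txt, Makefile, a versions/ directory, a -DVERSION define) are illustrative assumptions, not necessarily how my actual setup is laid out.

```python
#!/usr/bin/env python3
# Sketch of the "create a new version" helper described above.
# All paths and the -DVERSION define are assumptions for illustration.
import os
import re
import shutil

def bump_version(root="."):
    version_file = os.path.join(root, "version.txt")
    # Read the current version number and increment it.
    with open(version_file) as f:
        number = int(f.read().strip())
    number += 1
    with open(version_file, "w") as f:
        f.write(f"{number}\n")

    # Patch the makefile so the engine is built knowing its own version,
    # e.g. via a -DVERSION=<n> compiler define.
    makefile = os.path.join(root, "Makefile")
    with open(makefile) as f:
        text = f.read()
    text = re.sub(r"-DVERSION=\d+", f"-DVERSION={number}", text)
    with open(makefile, "w") as f:
        f.write(text)

    # Snapshot the source into a fresh per-version directory,
    # which the testing program later picks up from its version list.
    dest = os.path.join(root, "versions", f"v{number}")
    shutil.copytree(os.path.join(root, "src"), dest)
    return number

if __name__ == "__main__":
    print("created version", bump_version())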
What do you think?