My new testing scheme

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

My new testing scheme

Post by Zach Wegner »

I'll start out with a little boring story; those not interested can just read the last couple of paragraphs.

Up until fairly recently, my chess engine's development has been plagued by the lack of any real testing regimen. I have only recently learned of the need for games, and lots of them. Before that, I had mostly used test suites and my own observations to judge improvements, which is of course ridiculous.

I started playing test matches, but I still felt something was missing. I didn't have the time to keep up testing schedules, let alone keep track of game results in my head. I tried to alleviate the latter problem by creating a game folder that saved every game played by the various versions.

But in the past few weeks, I ran into a problem. I had a version that did pretty well in games, and so I was pretty satisfied. I then made a bunch of new versions, with various bugfixes and improvements, whose strength I didn't check. When I tested the latest version, its strength seemed to be much lower than the best so far. So I started looking back. Since I hadn't made a new version for every change, I seem to have lost the exact version that did well, though I still have an approximation. So I've been going through all of the changes, testing and testing... and I still haven't quite found the sweet spot yet.

This whole fiasco has made me keenly aware of two big things I was missing:
1) A way to keep my machine running as much as possible, to maximize the amount of results I get.
2) A way to keep track of changes between the various versions, including a way to keep track of the results.

In thinking about the first problem, I had envisioned an automated testing scheme that read a list of versions and tested each one. Every version could be tested without me having to wait for a certain version to finish a tournament. All I would have to do is add the version to a file. But this still leaves the problem of distributing the games among versions. Sometimes only a few tests are needed before a version can be scrapped, and obviously some older versions should be tested if there are no new versions to test.

As for the second problem, I have always maintained my versions by hand, so I thought about using a version control system. I have used CVS in the past for other purposes and found it to be a pain. Some of the others sounded nice, but none of them quite fit my purposes. I have a basically linear version progression, no branches. But I also needed some way to keep track of the version number, with the engine itself being aware of that number.

And then, in a brilliant flash, it came to me: why don't I integrate the solutions to both of these problems into my own versioning/testing system? The testing program rests on one key insight: I have been doing some research in Go, and have become familiar with the UCT algorithm, which builds on UCB. I realized that UCB could be applied quite well to the distribution of testing games. The versions that do the best are tested more often, to increase the confidence. Versions that suck can be abandoned. I can also apply a bias towards new versions so that they get a good amount of testing quickly. Instead of using the probability of winning, I can just use Elo (with the help of bayeselo), after modifying it to a suitable distribution.
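Roughly, the selection rule looks something like the Python sketch below; the constants, the new-version bonus, and the 50-game cutoff are placeholders for illustration, not the real tuning:

import math

# Hypothetical per-version statistics, e.g. taken from bayeselo output:
# (version name, estimated Elo, games played so far)
versions = [
    ("v40", 2410, 380),
    ("v41", 2425, 120),
    ("v42", 2405,  15),   # new version, barely tested yet
]

EXPLORATION = 40.0   # assumed exploration constant, in Elo-like units
NEW_BONUS   = 25.0   # assumed bonus for versions with very few games

def ucb_score(elo, games, total_games):
    """Higher score means this version should get the next test game."""
    exploration = EXPLORATION * math.sqrt(math.log(total_games) / games)
    bonus = NEW_BONUS if games < 50 else 0.0
    return elo + exploration + bonus

def pick_next_version(versions):
    total = sum(games for _, _, games in versions)
    return max(versions, key=lambda v: ucb_score(v[1], v[2], total))

print(pick_next_version(versions)[0])   # the version to test next

Strong versions keep getting games because their Elo dominates, while rarely tested versions get picked through the exploration term, which is exactly the exploration/exploitation trade-off UCB is designed for.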

So I lumped these all together into a sort-of-cohesive whole. I have a program to create a new version, which increments the version number (writing it to a central file, as well as modifying the makefile) and creates a new directory. I can then run it after every change that affects playing strength (and that has been bug tested). Then I have a testing program, which reads all of the versions and automatically tests them according to the aforementioned UCB algorithm. This program runs constantly in the background, reading the version list produced by the versioning program. A suitable version is selected for testing and then run against any number of opponents with XBoard. The net result is UCB + versioning + bayeselo + XBoard, and a much simpler, streamlined way of testing changes.
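In rough outline, the version-creation half could look like the sketch below; the file names, directory layout, and the makefile patch are simplified assumptions, not the exact program:

import os
import re
import shutil

VERSION_FILE = "version.txt"          # assumed central version file

def bump_version():
    """Read the central version number, increment it, copy the source
    into a new per-version directory, and patch the makefile so the
    engine knows its own version number."""
    with open(VERSION_FILE) as f:
        version = int(f.read().strip()) + 1
    with open(VERSION_FILE, "w") as f:
        f.write("%d\n" % version)

    # Copy the current source tree into its own directory.
    new_dir = "versions/v%03d" % version
    shutil.copytree("src", new_dir)

    # Assumes the makefile passes -DVERSION=<n> to the compiler.
    makefile = os.path.join(new_dir, "Makefile")
    with open(makefile) as f:
        text = f.read()
    with open(makefile, "w") as f:
        f.write(re.sub(r"-DVERSION=\d+", "-DVERSION=%d" % version, text))

    return version

if __name__ == "__main__":
    print("created version", bump_version())

The testing program then just loops forever: pick a version with the UCB rule, run a short XBoard match against the opponents, feed the results to bayeselo, and update the statistics.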

What do you think?
friedeks

Re: My new testing scheme

Post by friedeks »

You might have a look at Subversion as your versioning system. It should fit your needs quite well, as creating a new version is nothing but "making a copy".
As this is just a kind of "virtual copy", Subversion copies are very fast and need very little storage.
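For example, tagging each new engine version is a single cheap copy on the server; a small helper along these lines could slot into the versioning script (the repository URL here is made up):

import subprocess

REPO = "file:///home/zach/svnrepo/engine"   # hypothetical repository URL

def tag_version(version):
    """Create a cheap server-side copy of trunk as a version tag."""
    subprocess.check_call([
        "svn", "copy",
        "%s/trunk" % REPO,
        "%s/tags/v%d" % (REPO, version),
        "-m", "Tag version %d" % version,
    ])

tag_version(42)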
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: My new testing scheme

Post by Michael Sherwin »

Because of the random factor that Bob has been trying to explain, if one does not play the 2500+ games that are required to measure a small change in a version, then any testing done will be ad hoc. "Hitting the sweet spot" most likely means that most random factors went the new version's way this time, and that it may only seem better but in fact may be worse, a lot worse. My method of testing is flawed, but seems to work over time. I try to hit the sweet spot, but I try to hit it better than I have ever hit it before. This requires test after test against the same few engines in the hope of setting a new record, even if that means accepting, on only a few games, a better but randomly influenced result.
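A rough back-of-the-envelope calculation shows why it takes thousands of games; the 2-sigma criterion is arbitrary and draws are ignored, which only makes the estimate more pessimistic:

import math

def elo_to_score(elo_diff):
    """Expected score for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, sigmas=2.0):
    """Games needed before an advantage of elo_diff Elo sticks out
    by `sigmas` standard errors of the measured score."""
    p = elo_to_score(elo_diff)
    sd = math.sqrt(p * (1.0 - p))   # per-game score deviation, no draws
    margin = p - 0.5
    return math.ceil((sigmas * sd / margin) ** 2)

for diff in (5, 10, 20):
    print("%2d Elo -> about %5d games" % (diff, games_needed(diff)))

A 20 Elo improvement needs on the order of a thousand games before it stands out, and a 5 Elo one needs many times more, so a short match that looks good may easily be the random factor talking.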
Ratta

Re: My new testing scheme

Post by Ratta »

Or as a revision system you may also consider git, which is distributed and much easier to set up than Subversion, and which I consider much superior.
Regards!
YL84

Re: My new testing scheme

Post by YL84 »

Hi,
the classical test suites are not a good way to test an engine's real strength (it is easy to tune the engine on them and end up with a weaker engine in real games).
An idea could be to build your own test suites (to test some positional terms: king safety, passed pawns, ...) based on positions you want your engine to play well. The "improvements" made to your engine should never make it do worse on these positional tests. It is something I apply, though I have never developed a really accurate file for every kind of positional play.
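For example, such a file could be a handful of EPD-style records plus a small checker. A minimal Python sketch; the positions are placeholders and the hook into the engine is left abstract:

# Hypothetical positional suite, one EPD-style record per line:
#   <FEN fields> bm <expected move>; id "<name>";
# The positions below are placeholders; replace them with positions
# you actually care about (king safety, passed pawns, ...).
SUITE = """\
8/5pk1/6p1/8/5PP1/8/8/6K1 w - - bm f5; id "pawn break";
2r3k1/8/8/8/8/8/6PP/3R2K1 w - - bm Rd8; id "rook activity";
"""

def parse_epd(line):
    """Split an EPD record into the position and the expected move."""
    fields = line.split()
    position = " ".join(fields[:4])
    opcodes = " ".join(fields[4:])
    best = opcodes.split("bm")[1].split(";")[0].strip()
    return position, best

def run_suite(best_move_fn):
    """best_move_fn(position) is whatever hook asks your engine for its
    move in that position; it is assumed here, not a real API."""
    records = [line for line in SUITE.splitlines() if line.strip()]
    passed = 0
    for record in records:
        position, expected = parse_epd(record)
        if best_move_fn(position) == expected:
            passed += 1
    print("%d/%d positional tests passed" % (passed, len(records)))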

The best is to test in games, but I agree that thousands of games are out of reach for me as well :cry:

Now that I know it is possible that a new version is not better than the previous one, I do not release that new version.
Yves
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: My new testing scheme

Post by jdart »

For a while I used CVS. But I find it easier to just have a script that zips all the program code and test results together and saves it by date. So I build a version, zip it, run tests/matches, update the zip with the results. If I determine that it sucks, I can revert to an older zipped version easily. If I want to see what changed across two versions, I just unzip them into separate directories and run "diff".
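Roughly, such a script might look like the sketch below; the directory names and the date-stamped naming scheme are just an illustration:

import datetime
import os
import zipfile

def snapshot(src_dirs=("src", "tests"), out_dir="snapshots"):
    """Zip the source tree and test results into a date-stamped archive."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.date.today().strftime("%Y%m%d")
    name = os.path.join(out_dir, "engine-%s.zip" % stamp)
    with zipfile.ZipFile(name, "w", zipfile.ZIP_DEFLATED) as archive:
        for top in src_dirs:
            for root, _, files in os.walk(top):
                for f in files:
                    archive.write(os.path.join(root, f))
    return name

print("wrote", snapshot())

Reverting is then just unzipping an older archive, and "diff -r" between two unzipped snapshots shows what changed.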

CVS or SVN would be useful if you had a team working on the same code base but with one developer it is a bit of overkill, IMO.
hristo

Re: My new testing scheme

Post by hristo »

jdart wrote:For a while I used CVS. But I find it easier to just have a script that zips all the program code and test results together and saves it by date. So I build a version, zip it, run tests/matches, update the zip with the results. If I determine that it sucks, I can revert to an older zipped version easily. If I want to see what changed across two versions, I just unzip them into separate directories and run "diff".

CVS or SVN would be useful if you had a team working on the same code base but with one developer it is a bit of overkill, IMO.
The system you describe would work, but in my experience it is suboptimal, even for a single developer. The main issue with revision control systems is that they require proper procedure to be useful... and SVN or CVS trump the script-zip approach by a very wide margin as soon as reasonable procedures are in place, IMO.

For instance here is what I do (using SVN) in a single developer mode:
1) Set a goal -- something simple -- and go about implementing it.
2) Commit the changes to SVN
3) Go to #1

If I want to try something new, I create a branch from the root tree and follow the above procedure on that branch. In some cases I have had three different versions (branches) going at the same time, each with its own revision history. Using SVN makes it easier to identify the set of changes that causes a particular behavior of the application.

Regards,
Hristo
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: My new testing scheme

Post by jdart »

I very, very seldom make two unrelated changes at once or two in parallel. If I did this and found change A was good and change B was good (or in fact any other combination), it doesn't tell me whether A + B will be good or not - I'd still need to test that combination. If I had a big rack of servers for testing I could run A, B and A+B in parallel but I am not that well equipped.
hristo

Re: My new testing scheme

Post by hristo »

jdart wrote:I very, very seldom make two unrelated changes at once or two in parallel. If I did this and found change A was good and change B was good (or in fact any other combination), it doesn't tell me whether A + B will be good or not - I'd still need to test that combination. If I had a big rack of servers for testing I could run A, B and A+B in parallel but I am not that well equipped.
Whatever works for you is what is important. :-)
I am not, in any way, attempting to persuade you to use SVN(CVS).

One of the issues that often arises is granularity (or "code bombs"), even with a single branch and a single developer. If I commit (in a single action) a large amount of changes over many files, then it becomes nearly impossible to trace back the cause of a particular behavior. Because of this, I prefer to "commit often", even if the changes don't seem important -- if I were using 'script-zip' this would be a nuisance, IMO. SVN plus "many commits" makes it easier to identify the piece(s) of code that misbehave at some later time. The 'script-zip' method often leads to "code bombs" (too many changes at once) that give me headaches. :-)

Again, your usage pattern might be such that it works great for you. I'm just reflecting upon my own experience.

Regards,
Hristo
Ross Boyd
Posts: 114
Joined: Wed Mar 08, 2006 9:52 pm
Location: Wollongong, Australia

Re: My new testing scheme

Post by Ross Boyd »

jdart wrote:For a while I used CVS. But I find it easier to just have a script that zips all the program code and test results together and saves it by date. So I build a version, zip it, run tests/matches, update the zip with the results. If I determine that it sucks, I can revert to an older zipped version easily. If I want to see what changed across two versions, I just unzip them into separate directories and run "diff".
That's more or less exactly what I do... the editor I use (TSE, a descendant of Qedit) has a very powerful macro language which allows me to automate building a 'versioned' zip file whenever I hit the compile hotkey.
If the test results of a new version are not good I can always backtrack to a previous best version in about a minute flat - and try something else.

I don't know of a better way to go about it....

Absolutely the worst scenario is not having the ability to retreat completely when a promising idea turns out bad.... and that happens very frequently.

Ross