Test suites for specific engine features

Aaron Becker
Posts: 292
Joined: Tue Jul 07, 2009 4:56 am

Test suites for specific engine features

Post by Aaron Becker »

I'm currently in the process of writing a new chess engine, and to make things go as smoothly as possible I'm trying to test each feature thoroughly before moving on. In some areas this is quite simple: for move generation, perft testing easily uncovered my castling and en passant bugs, and now I'm confident that I'm generating moves correctly.
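
For anyone else at this stage, a bare-bones perft routine looks something like the sketch below; Board, Move, generate_legal_moves, make_move and unmake_move are just stand-in names for whatever your own move generator provides.

Code:
// Bare-bones perft sketch. Board, Move, generate_legal_moves, make_move and
// unmake_move are stand-ins for whatever the engine actually provides.
#include <cstdint>

std::uint64_t perft(Board& board, int depth) {
    if (depth == 0)
        return 1;                          // count leaf nodes only
    std::uint64_t nodes = 0;
    for (const Move& m : board.generate_legal_moves()) {
        board.make_move(m);
        nodes += perft(board, depth - 1);  // recurse into the child position
        board.unmake_move(m);
    }
    return nodes;
}

Comparing the totals against published figures (for example, perft(6) from the initial position is 119,060,324) is what catches the castling, en passant and promotion corner cases.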

I'm not aware of any comparable testing strategy for the search. I've made sure things are generally working by testing on a collection of "mate in x" puzzles, which uncovered some bugs in my iterative deepening and PV collection code, but it hardly feels like a thorough test. I just added null moves, and I'm really not sure how to test those. The test suites I've tried have limited usefulness at this point, because my engine will always fail a large number of positions, and when it fails I can't tell whether that's a bug or just the fact that I'm using a material-only evaluation and don't have a transposition table yet.

So, I'm looking for a collection of tests that are targeted toward specific engine features. I recently saw a thread here where a simple king and pawn endgame position was used to find transposition table bugs, which is exactly the kind of thing I'm looking for (and I'll definitely be using it when I finally add a transposition table).

Obviously differences between engines will make it impossible to provide tests that work identically in all cases. However, I think it should be possible to specify a test procedure that's broadly useful. In the case of null moves, I'm imagining something like: "Search this position to a depth of 10 with null moves off, then search again with them turned on. You should get the same score, but the second search should be much faster."
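
The harness for that could be only a few lines. In the sketch below, Engine, set_position, set_null_move, search and nodes_searched are hypothetical names for whatever interface the engine exposes, and the equal-score check can legitimately fail in zugzwang-heavy positions:

Code:
// Hypothetical A/B harness for the null-move procedure described above.
// Engine, set_position, set_null_move, search and nodes_searched are
// placeholder names; the equal-score assertion can fail in zugzwang positions.
#include <cassert>
#include <cstdio>
#include <string>

void null_move_check(Engine& engine, const std::string& fen, int depth) {
    engine.set_position(fen);
    engine.set_null_move(false);
    int score_off  = engine.search(depth);
    long nodes_off = engine.nodes_searched();

    engine.set_position(fen);
    engine.set_null_move(true);
    int score_on  = engine.search(depth);
    long nodes_on = engine.nodes_searched();

    std::printf("score %d / %d, nodes %ld / %ld\n",
                score_off, score_on, nodes_off, nodes_on);
    assert(score_on == score_off);   // same result expected at the root
    assert(nodes_on < nodes_off);    // but reached with far fewer nodes
}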

I'm still quite a novice in the chess programming world, so I'm not sure if such a test suite exists, or if it's even feasible to make one. So, if you were writing a new engine, how would you test it as it evolved?
gingell

Re: Test suites for specific engine features

Post by gingell »

I don't know of any test suite that tests specific features in the way you describe. The impact of implementing different search techniques varies a lot from one engine to the next, so I would be surprised if such a suite were possible.

I started my engine in January and the strategy I've evolved toward is the following:

I build with and without the change and run an automated tournament under xboard. It's very important to add some random noise to the evaluation function; otherwise you can end up playing a lot of near-identical games, which biases your results. I collect the output in a .pgn file and then use the free utility "bayeselo" to estimate the strength difference between the two versions. If my rating drops or stays the same, I know I need to rethink or debug.

It is also very useful to test against an engine slightly stronger and one slightly weaker than your own. Testing against yourself is potentially a bit bogus since both sides of the match might be making the same error systematically.

One of the questions I had early on was how many test games I need to play to get meaningful results. You need enough games for the measured difference to be statistically significant, and bayeselo does exactly that for you by putting plus-or-minus error bounds on the rating it produces.
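
If you want a rough feel for those error bars before reaching for bayeselo, a crude normal approximation gets you in the right neighbourhood. The snippet below is just that back-of-the-envelope version with made-up win/draw/loss counts; it is not what bayeselo actually computes.

Code:
// Back-of-the-envelope Elo difference with a 95% confidence interval from a
// match score. Crude normal approximation with made-up example numbers; not
// bayeselo's Bayesian model, and treating draws like coin flips makes the
// interval slightly conservative.
#include <cmath>
#include <cstdio>

double elo(double score) {
    return 400.0 * std::log10(score / (1.0 - score));
}

int main() {
    double wins = 230, draws = 340, losses = 180;      // example match result
    double games = wins + draws + losses;
    double s  = (wins + 0.5 * draws) / games;          // score fraction
    double se = std::sqrt(s * (1.0 - s) / games);      // standard error of s
    std::printf("Elo %+.1f, 95%% interval %+.1f to %+.1f over %.0f games\n",
                elo(s), elo(s - 1.96 * se), elo(s + 1.96 * se), games);
    return 0;
}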

I put up some general thoughts on diminishing returns in testing on my blog early on. You might find it interesting:

https://sourceforge.net/apps/wordpress/ ... h-testing/
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Test suites for specific engine features

Post by Dann Corbit »

There are a few test suites here:
http://cap.connx.com/EPD/
There are different themes there, but they're not well categorized.
A lot of the files are mate collections and tactical suites, but there are other interesting things as well (quiescent nightmares, positions with huge numbers of legal moves, null-move tests, pawn tests, etc.).

Here are some themed tests:
http://computerchessblogger.googlepages.com/sts
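
Those collections are EPD files (one position per line followed by opcodes such as bm for the best move), so a very small driver is enough to run through them. The sketch below is one way to do it; Engine, set_position and search_bestmove are placeholders, only the simple single-bm layout is handled, and it assumes the engine reports moves in the same notation the suite uses.

Code:
// Minimal EPD suite runner sketch. Engine, set_position and search_bestmove
// are hypothetical placeholders; only lines of the simple form
// "<position fields> bm <move>; ..." are handled, other opcodes are ignored,
// and the move comparison assumes both sides use the same notation.
#include <fstream>
#include <iostream>
#include <string>

int run_epd_suite(Engine& engine, const std::string& path, int depth) {
    std::ifstream in(path);
    std::string line;
    int solved = 0, total = 0;
    while (std::getline(in, line)) {
        std::string::size_type bm = line.find(" bm ");
        if (bm == std::string::npos)
            continue;                              // no best-move opcode
        std::string fen  = line.substr(0, bm);     // the four EPD position fields
        std::string best = line.substr(bm + 4);
        best = best.substr(0, best.find(';'));     // strip trailing opcodes
        engine.set_position(fen + " 0 1");         // EPD omits the move counters
        std::string found = engine.search_bestmove(depth);
        ++total;
        if (found == best)
            ++solved;
        else
            std::cout << "missed " << fen << "  wanted " << best
                      << "  got " << found << "\n";
    }
    std::cout << solved << " / " << total << " solved\n";
    return solved;
}
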
Aaron Becker
Posts: 292
Joined: Tue Jul 07, 2009 4:56 am

Re: Test suites for specific engine features

Post by Aaron Becker »

Dann, thanks for the link to such a massive collection of EPDs. I'm sure I'll find tons of useful stuff in there once I start sorting through it. The themed tests might come in handy once I have a more sophisticated eval as well.

Matthew, I found your blog quite interesting. I think once my engine is complete enough to play competitively your observations will be very helpful. Is it common practice to introduce randomness to your evaluation during test tournaments? It seems like an unreliable way of getting variation compared to starting the games from a variety of even positions.
JVMerlino
Posts: 1357
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Re: Test suites for specific engine features

Post by JVMerlino »

Ron Murawski (Horizon) has given me a great deal of insight into his testing methodologies. In particular, he said that he is very much against the idea of testing one version of your engine against another.

I hope he won't mind me quoting him:
Testing a current version against an older version is a no-no. The old version might have a total blindness about something and the new version will look too good against it (way outside the Elo ratings when played head-to-head), but if you play each version against common opponents you have a reliable gauge of each engine's Elo strength. There are times when new-vs-old will give totally wrong results compared against common opponent results. When this happens you will swear to yourself that you will never, ever do it again. Admittedly this does not happen very often, but when it does it is extremely painful trying to figure out what happened. I once lost 2 months of continuous testing (about 8,000 wasted games) due to exactly this situation.
Clearly the voice of experience. My engine is still quite young, now only barely scratching at a 1700-1800 rating. So it is still the case that a large majority of what I implement gives a clear benefit.

So my testing involves games against a small set of opponents (all hopefully within 100 Elo), a few test suites, and a handful of positions from previous Myrddin games.

You can find the test suites that I use at Myrddin's website (also hosted by Ron!):

http://computer-chess.org/doku.php?id=c ... ddin:index

jm
gingell

Re: Test suites for specific engine features

Post by gingell »

Aaron Becker wrote: Is it common practice to introduce randomness to your evaluation during test tournaments? It seems like an unreliable way of getting variation compared to starting the games from a variety of even positions.
I'm not sure - I haven't implemented an opening book, so for me it was just a very easy option. The idea is that if there are a few moves the engine is basically indifferent between, a couple of points of randomness will swing it to a different choice in each game.
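
For what it's worth, the whole thing is only a couple of lines; something like the sketch below, where Board and evaluate() are placeholders for your own board type and evaluation:

Code:
// Tiny evaluation-noise sketch for test games: a couple of centipawns of
// randomness so successive games do not repeat move for move. Board and
// evaluate() are placeholders for the engine's own types and eval.
#include <random>

int evaluate_with_noise(const Board& board) {
    static std::mt19937 rng{std::random_device{}()};
    static std::uniform_int_distribution<int> noise(-2, 2);  // +/- 2 centipawns
    return evaluate(board) + noise(rng);
}

You would probably want to switch it off (or set the range to zero) outside of test games so debugging runs stay deterministic.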