building a good test suite to predict the rating of engines

Uri Blass · Post by **Uri Blass** » Sun Jun 10, 2012 2:00 pm

The idea is that programmers can use the test suite to improve their engine by tuning for better results in the suite rather than in games, because testing with games takes more time, and in 99% of cases a better result in the test suite also means a better result in games (assuming we are talking about evaluation or search changes, not time management changes).

I wonder if there are programmers who are working on building this type of test suite, including the scoring system.

Note that I do not think existing tactical test suites are good for this purpose, but that does not mean it is impossible to build a test suite that is. Once you have built it, you can save testing time by not playing games against other programs.

Adam Hair · Post by **Adam Hair** » Sun Jun 10, 2012 2:44 pm

I have studied this a bit. I think it is possible to use test positions to measure the improvement of an engine reliably. It is no good for comparing engines to each other: I have used a large test suite, and while there is definitely a positive correlation between engine strength and test suite score, the standard deviation was 150 Elo, and this was for a test suite of 1500+ positions. But, given a large and diverse suite of positions covering all game phases, improvements in score should definitely translate to an increase in Elo.

I do not think it would be hard for a programmer to create such a test suite. Several of the top engines could be used as oracles. Use them to determine the best moves (moves they all, or the majority, agree are best when given a reasonable amount of time to think) for a large (10,000+ ?) suite of positions comprising all game phases. If the engine in question already agrees with the oracles on more than 50% of the positions, then maybe some positions should be replaced. Perhaps there should also be a minimum share of positions that the engine agrees on, so that a regression can still be caught (25% ?). Given the suite of positions and a particular amount of thinking time for each position (balanced between economy of time spent testing and reasonable depth), one could test search and evaluation changes.

My outline may have some holes in its reasoning. But, the problem with using other test suites was that not enough positions were used to judge improvements. While there is no minimum number of positions that can ensure that all improvements in score will translate to improvements in strength, 10,000 positions from all game phases most likely is a practical minimum.
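The oracle idea outlined above could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the move strings, and the data layout are made up for the example, and the 25% and 50% thresholds are simply the tentative figures from the post.

```python
from collections import Counter


def consensus_move(oracle_moves):
    """Return the move chosen by a strict majority of the oracles, or None."""
    if not oracle_moves:
        return None
    move, votes = Counter(oracle_moves).most_common(1)[0]
    return move if votes * 2 > len(oracle_moves) else None


def build_suite(candidates):
    """Keep only positions where the oracles agree on a best move.

    candidates maps a position id to the list of moves the oracles picked.
    """
    suite = {}
    for pos_id, moves in candidates.items():
        best = consensus_move(moves)
        if best is not None:
            suite[pos_id] = best
    return suite


def agreement_rate(suite, engine_moves):
    """Fraction of suite positions where the tested engine plays the consensus move."""
    if not suite:
        return 0.0
    hits = sum(1 for pos_id, best in suite.items() if engine_moves.get(pos_id) == best)
    return hits / len(suite)


def suite_is_informative(rate, low=0.25, high=0.5):
    """Hypothetical check: the engine should agree on neither too few nor too many positions."""
    return low <= rate <= high


candidates = {
    "p1": ["e2e4", "e2e4", "d2d4"],  # majority agrees
    "p2": ["g1f3", "c2c4", "d2d4"],  # no majority, dropped
    "p3": ["e1g1", "e1g1", "e1g1"],  # unanimous
}
suite = build_suite(candidates)
rate = agreement_rate(suite, {"p1": "e2e4", "p3": "a2a3"})
print(suite, rate, suite_is_informative(rate))
```

Positions without a majority are dropped rather than scored, which matches the suggestion to keep only moves the oracles agree on.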

Dan Honeycutt · Post by **Dan Honeycutt** » Sun Jun 10, 2012 5:08 pm

Uri Blass wrote: The idea is that programmers can use the test suite to improve their engine by tuning for better results in the suite rather than in games, because testing with games takes more time, and in 99% of cases a better result in the test suite also means a better result in games (assuming we are talking about evaluation or search changes, not time management changes).

I wonder if there are programmers who are working on building this type of test suite, including the scoring system.

Note that I do not think existing tactical test suites are good for this purpose, but that does not mean it is impossible to build a test suite that is. Once you have built it, you can save testing time by not playing games against other programs.

I've always heard - and it makes sense to me intuitively - that you can test search improvements with test suites but evaluation improvements are better tested with games.

Best
Dan H.

kbhearn · Post by **kbhearn** » Sun Jun 10, 2012 9:53 pm

I think more conditions would need to be added. Things like time to solve would need to be incorporated into your score, not just % solved (you could increase your % solved with new knowledge, but decrease your overall strength due to slowdowns for instance). Also you'd want to human-verify the analysis of the test positions which would be a massive amount of work. The test positions would need to be reasonably distributed over likelihood of occurrence which would dramatically increase the total position count, since you'd probably want many positions on a single theme so as not to promote catching a single case of it, and some of those positions should probably promote the contrary.

For example, you might want, say, a hundred positions on trapped pieces of each type. Then you want, say, another hundred involving lines where pieces of each type look potentially trapped but aren't really, so that the knowledge is reasonably represented. But having 800 test positions on trapped pieces means you're going to need many more themes (at least hundreds, maybe thousands) so as not to overemphasize the importance of correctly evaluating potentially trapped pieces.
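The point about folding time to solve into the score, rather than counting only the percentage solved, could look something like the sketch below. The linear decay formula is an assumption chosen for illustration, not something proposed in the thread, and the function name is hypothetical.

```python
def time_weighted_score(results, time_limit):
    """Average credit per position, in the range 0..1.

    results is a list of (solved, seconds) pairs. A solved position earns
    1 - seconds / time_limit, so faster solutions earn more credit.
    Unsolved positions, or solutions at or past the limit, earn nothing.
    """
    if not results:
        return 0.0
    total = 0.0
    for solved, seconds in results:
        if solved and seconds < time_limit:
            total += 1.0 - seconds / time_limit
    return total / len(results)


# Engine with less knowledge but faster search: solves 2 of 4, quickly.
fast = [(True, 1.0), (True, 1.0), (False, 10.0), (False, 10.0)]
# Engine with more knowledge but a slowdown: solves 3 of 4, slowly.
slow = [(True, 8.0), (True, 8.0), (True, 8.0), (False, 10.0)]

print(time_weighted_score(fast, 10.0), time_weighted_score(slow, 10.0))
```

With this weighting the second engine solves more positions but still scores lower, which is the kind of slowdown a plain percentage would hide.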

Uri Blass · Post by **Uri Blass** » Mon Jun 11, 2012 7:17 am

Dan Honeycutt wrote:
Uri Blass wrote: The idea is that programmers can use the test suite to improve their engine by tuning for better results in the suite rather than in games, because testing with games takes more time, and in 99% of cases a better result in the test suite also means a better result in games (assuming we are talking about evaluation or search changes, not time management changes).

I wonder if there are programmers who are working on building this type of test suite, including the scoring system.

Note that I do not think existing tactical test suites are good for this purpose, but that does not mean it is impossible to build a test suite that is. Once you have built it, you can save testing time by not playing games against other programs.

I've always heard - and it makes sense to me intuitively - that you can test search improvements with test suites but evaluation improvements are better tested with games.

Best
Dan H.

I see no reason not to test evaluation improvements with a test suite.

If an evaluation improvement causes the program to play better moves in some positions, then these positions can be part of the test suite.

There should be no problem if no single move is better than the others: the scoring system can give the same points to moves that are equal, rather than having only success and failure for every position.

You can have 10 points for move A, 10 points for move B, 9 points for move C, which is slightly inferior, and 5 points for move D, which is a positional mistake, with only tactical mistakes getting 0 points.
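A graded scoring system of this kind is easy to sketch. The code below is a minimal illustration, assuming each position carries a table of move-to-points and that any move missing from the table counts as a mistake worth 0 points. The names and the data layout are hypothetical.

```python
def score_position(move_points, played_move):
    """Points for the move the engine played; unlisted moves score 0."""
    return move_points.get(played_move, 0)


def suite_percentage(suite, engine_moves):
    """Points earned as a percentage of the maximum available over the suite."""
    earned = sum(
        score_position(points, engine_moves.get(pos_id))
        for pos_id, points in suite.items()
    )
    possible = sum(max(points.values()) for points in suite.values())
    return 100.0 * earned / possible if possible else 0.0


suite = {
    "pos1": {"A": 10, "B": 10, "C": 9, "D": 5},  # the example from the post
    "pos2": {"X": 10},                            # only one acceptable move
}
engine_moves = {"pos1": "C", "pos2": "Y"}

print(score_position(suite["pos1"], "C"))     # 9
print(suite_percentage(suite, engine_moves))  # 45.0
```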

Uri Blass · Post by **Uri Blass** » Mon Jun 11, 2012 7:23 am

Note that we already have rating lists of different engines, and the way to develop the right test suite is to start with something and add positions that make the test suite rating list more similar to the real rating list.

If the test suite is good at predicting the ratings of existing engines (after we tune it to do so by adding the right positions), then we can hope that it is also going to be good at predicting the ratings of future engines, or at testing whether a change in the evaluation is an improvement or not.
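One way to make "more similar to the real rating list" concrete is to measure the correlation between suite scores and published ratings, and to keep a candidate position only if it raises that correlation. The sketch below uses Pearson correlation as an assumed similarity measure; the thread does not specify one, and all names and numbers are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)


def improves_fit(suite_scores, candidate_scores, ratings):
    """True if adding the candidate position's points raises the correlation.

    All three lists hold one entry per engine, in the same engine order.
    """
    before = pearson(suite_scores, ratings)
    combined = [s + c for s, c in zip(suite_scores, candidate_scores)]
    return pearson(combined, ratings) > before


ratings = [2400, 2600, 2800]   # made-up rating list entries
suite_scores = [50, 40, 60]    # current suite misorders the first two engines
candidate = [0, 10, 10]        # points each engine earns on a new position

print(pearson(suite_scores, ratings))
print(improves_fit(suite_scores, candidate, ratings))
```

A rank correlation could be swapped in if only the ordering of the engines matters rather than the size of the rating gaps.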

Don · Post by **Don** » Mon Jun 11, 2012 7:06 pm

Uri Blass wrote: The idea is that programmers can use the test suite to improve their engine by tuning for better results in the suite rather than in games, because testing with games takes more time, and in 99% of cases a better result in the test suite also means a better result in games (assuming we are talking about evaluation or search changes, not time management changes).

I wonder if there are programmers who are working on building this type of test suite, including the scoring system.

Note that I do not think existing tactical test suites are good for this purpose, but that does not mean it is impossible to build a test suite that is. Once you have built it, you can save testing time by not playing games against other programs.

We have such a set. I think it is quite good at measuring whether a given program has improved and it correlates highly with general strength too, but not perfectly.

A good set for this is still subject to the same error margins you see in playing matches, so there are no real shortcuts to take. And unlike real games, certain aspects of a chess program, such as time control algorithms, cannot be tested like this. Each program has its own sense of good and bad, and these tests don't measure how well a program is able to push its own agenda, a key factor in being strong. Nevertheless, I think it's possible for such a set to be pretty useful. Our set has about 150,000 positions and we don't make much use of it, but when we make a small improvement it almost always checks out on the set (when we use it) as being slightly better too.
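To get a feel for the error margins involved, here is a rough sketch that treats each position as an independent pass/fail trial and uses the normal approximation to the binomial. Both are simplifying assumptions made for the example (positions in a real set may well be correlated), and the function name is hypothetical.

```python
import math


def solved_fraction_margin(solved, total, z=1.96):
    """Return (fraction solved, approximate 95% margin of error).

    Assumes independent pass/fail positions; z=1.96 is the usual
    normal-approximation multiplier for a 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    standard_error = math.sqrt(p * (1.0 - p) / total)
    return p, z * standard_error


for total in (1500, 150000):
    p, margin = solved_fraction_margin(total // 2, total)
    print(total, p, round(100 * margin, 2))
```

Under these assumptions a 50% score carries a margin of roughly 2.5 percentage points on 1,500 positions and roughly 0.25 points on 150,000, so detecting a small improvement needs a very large set, much as it needs many games.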

JVMerlino · Post by **JVMerlino** » Mon Jun 11, 2012 8:50 pm

I assume you are aware of the Strategic Test Suite (STS)?

https://sites.google.com/site/strategictestsuite/

I'm not saying that it satisfactorily does what you are trying to achieve, as I believe it is mostly intended to test evaluation improvements. But I have used it with some success for that purpose.

jm

JBNielsen · Post by **JBNielsen** » Sun Aug 05, 2012 12:56 am

On my old homepage I wrote this:

From 1986 to 1991 I constructed a test based on more than 100 chess positions.
The goal was to be able to estimate the rating of the programs tested.
And use the test for improving the development of chess programs.

Many of the positions are too easy for the computers of today.

Perhaps someone makes a better test in the future...

The link is:
http://www.jens-musik.dk/skak.htm

Here you can find more details: the positions, the weights and how they were adjusted, the results, a detailed article about this test, etc.

Richard Allbert · Post by **Richard Allbert** » Sun Aug 05, 2012 8:43 am

Hi John,

Agreed - this seems to fit the purpose. I've been redoing Jabba's eval from scratch in recent weeks, and for that I created a built-in tuning system that uses the STS suite.

So far it has worked well.

The only problem with the STS suite is that you find you score over 50% with just some piece-square tables and nothing else.

What I would love to have is an .epd of endgame positions that does a similar thing.

Regards

Richard
