building a good test suite to predict the rating of engines

Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Post by Uri Blass »

The idea is that programmers can use the test suite to improve their engines by tuning for better results in the test suite instead of better results in games. Testing with games takes more time, and in 99% of cases a better result in the test suite also means a better result in games (assuming we are talking about evaluation or search changes, not time management changes).

I wonder if there are programmers who are working on building this type of test suite, including the scoring system.

Note that I do not think existing tactical test suites are good for this purpose, but that does not mean it is impossible to build a test suite that is. Once you have built it, you can save testing time by not playing games against other programs.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Post by Adam Hair »

I have studied this a bit. I think it is possible to use test positions to measure the improvement of an engine reliably. It is no good for comparing engines to each other: I have used a large test suite (1500+ positions), and while there is definitely a positive correlation between engine strength and test suite score, the standard deviation was 150 Elo. But, given a large and diverse suite of positions covering all game phases, improvements in score should definitely translate to an increase in Elo.

I do not think it would be hard for a programmer to create such a test suite. Several of the top engines could be used as oracles. Use them to determine the best moves (moves they all, or the majority, agree are best when given a reasonable amount of time to think) for a large (10,000+ ?) suite of positions comprising all game phases. If the engine in question already finds the agreed best move in more than 50% of the positions, then maybe some positions should be replaced. Perhaps there should also be a minimum share of positions that the engine gets right, so that a regression can be caught (25% ?). Given the suite of positions and a particular amount of thinking time for each position (balanced between economy of time spent testing and reasonable depth), one could test search and evaluation changes.

My outline may have some holes in its reasoning. But, the problem with using other test suites was that not enough positions were used to judge improvements. While there is no minimum number of positions that can ensure that all improvements in score will translate to improvements in strength, 10,000 positions from all game phases most likely is a practical minimum.
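
The oracle-based outline above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the position labels and the move lists are hypothetical, and collecting the oracle moves (running several strong engines on each position) is assumed to have happened already.

```python
from collections import Counter


def consensus_move(oracle_moves, min_share=0.5):
    """Return the move picked by more than min_share of the oracles, else None."""
    if not oracle_moves:
        return None
    move, votes = Counter(oracle_moves).most_common(1)[0]
    return move if votes / len(oracle_moves) > min_share else None


def build_suite(candidates, min_share=0.5):
    """candidates maps a position (e.g. a FEN string) to the oracles' best moves.

    Only positions where the oracles reach a majority are kept.
    """
    suite = {}
    for position, moves in candidates.items():
        best = consensus_move(moves, min_share)
        if best is not None:
            suite[position] = best
    return suite


def agreement_rate(suite, engine_moves):
    """Fraction of suite positions where the tested engine plays the consensus move."""
    if not suite:
        return 0.0
    hits = sum(1 for pos, best in suite.items() if engine_moves.get(pos) == best)
    return hits / len(suite)


def in_target_band(rate, low=0.25, high=0.50):
    """True if the engine's agreement rate lies in the 25%..50% band suggested above."""
    return low <= rate <= high


# Made-up data: three positions, three oracle engines each.
candidates = {
    "position 1": ["e4", "e4", "d4"],      # majority for e4, kept
    "position 2": ["Nf3", "c4", "d4"],     # no majority, dropped
    "position 3": ["O-O", "O-O", "O-O"],   # unanimous, kept
}
suite = build_suite(candidates)
rate = agreement_rate(suite, {"position 1": "e4", "position 3": "Kh1"})
print(suite, rate, in_target_band(rate))
```

The band check is just one way to read the 50% and 25% figures: a suite the engine already mostly solves leaves little room to show improvement, and one it almost never solves leaves little room to show a regression.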
Dan Honeycutt
Posts: 5258
Joined: Mon Feb 27, 2006 4:31 pm
Location: Atlanta, Georgia

Post by Dan Honeycutt »

Uri Blass wrote:The idea is that the test suite can be used by programmers to improve their engine when they simply tune for better results in the test suite and do not tune for better result in games because testing for games take more time and in 99% of the cases better results in the test suite means also better result in games(assuming that we do not talk about time management changes but about evaluation or search changes).

I wonder if there are programmers who work about building this type of test suite including the scoring system.

Note that I do not think that existing tactical test suites are good for this purpose but it does not mean that it is impossible to build a test suite that is good for this purpose and after you build it you can save testing time by not playing games against other programs.
I've always heard - and it makes sense to me intuitively - that you can test search improvements with test suites but evaluation improvements are better tested with games.

Best
Dan H.
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Post by kbhearn »

I think more conditions would need to be added. Things like time to solve would need to be incorporated into your score, not just % solved (you could increase your % solved with new knowledge, but decrease your overall strength due to slowdowns for instance). Also you'd want to human-verify the analysis of the test positions which would be a massive amount of work. The test positions would need to be reasonably distributed over likelihood of occurrence which would dramatically increase the total position count, since you'd probably want many positions on a single theme so as not to promote catching a single case of it, and some of those positions should probably promote the contrary.

I.e., you might want, say, a hundred positions on trapped pieces of each type. Then you want, say, another hundred involving lines where pieces of each type look potentially trapped but aren't really trapped, to reasonably represent the knowledge. But having 800 test positions on trapped pieces means you're going to need many more themes (at least hundreds, maybe thousands) so as not to overemphasize the importance of correctly evaluating potentially trapped pieces.
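
One way the time-to-solve idea from this post might be folded into the score is sketched below. The linear decay and the `floor` parameter are assumptions chosen for illustration, not a scheme anyone in this thread has proposed or tested.

```python
def position_score(solve_time, time_limit, floor=0.5):
    """Credit for one position.

    1.0 for an instant solve, falling linearly to `floor` at the time limit.
    An unsolved position (solve_time is None or beyond the limit) scores 0.0.
    """
    if solve_time is None or solve_time > time_limit:
        return 0.0
    return 1.0 - (1.0 - floor) * (solve_time / time_limit)


def suite_score(solve_times, time_limit, floor=0.5):
    """Mean time-weighted credit over all positions in the suite."""
    if not solve_times:
        return 0.0
    total = sum(position_score(t, time_limit, floor) for t in solve_times)
    return total / len(solve_times)


# Made-up example with a 10 second limit per position.
fast_version = [1, 1, 1, None]   # solves 3 of 4, each within 1 second
slow_version = [9, 9, 9, 9]      # solves 4 of 4, but only just in time

print(suite_score(fast_version, 10), suite_score(slow_version, 10))
```

In this made-up case the slower version has the higher percentage solved but the lower time-weighted score, which is the kind of slowdown a plain solved-count would hide.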
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Post by Uri Blass »

Dan Honeycutt wrote:
Uri Blass wrote:The idea is that the test suite can be used by programmers to improve their engine when they simply tune for better results in the test suite and do not tune for better result in games because testing for games take more time and in 99% of the cases better results in the test suite means also better result in games(assuming that we do not talk about time management changes but about evaluation or search changes).

I wonder if there are programmers who work about building this type of test suite including the scoring system.

Note that I do not think that existing tactical test suites are good for this purpose but it does not mean that it is impossible to build a test suite that is good for this purpose and after you build it you can save testing time by not playing games against other programs.
I've always heard - and it makes sense to me intuitively - that you can test search improvements with test suites but evaluation improvements are better tested with games.

Best
Dan H.
I see no reason not to test evaluation improvements with a test suite.

If an evaluation improvement causes the program to play better moves in some positions, then these positions can be part of the test suite.

There should be no problem if no single move is better than the others: the scoring system can give the same points to moves that are equally good, instead of recording only success or failure for every position.

You can have 10 points for move A, 10 points for move B, 9 points for move C, which is slightly inferior, and 5 points for move D, which is a positional mistake, while only tactical mistakes get 0 points.
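
The graded scoring described above might look like this in code. It is a minimal sketch: the position labels and move names are placeholders, and treating every unlisted move as worth 0 points is an assumption that stands in for "tactical mistake".

```python
def score_move(move_points, played):
    """Points for the move played; any move not listed counts as 0."""
    return move_points.get(played, 0)


def suite_percentage(suite, engine_moves):
    """Points earned as a percentage of the maximum available over the suite."""
    earned = sum(score_move(points, engine_moves.get(pos))
                 for pos, points in suite.items())
    possible = sum(max(points.values()) for points in suite.values())
    return 100.0 * earned / possible if possible else 0.0


# Placeholder suite using the point values from the post.
suite = {
    "position 1": {"A": 10, "B": 10, "C": 9, "D": 5},
    "position 2": {"A": 10, "B": 10, "C": 9, "D": 5},
}

# The engine plays B in the first position and D in the second.
print(suite_percentage(suite, {"position 1": "B", "position 2": "D"}))
```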
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Post by Uri Blass »

Note that we already have rating lists of different engines, and the way to develop the right test suite is to start with something and then add positions that make the test suite rating list more similar to the real rating list.

If the test suite is good at predicting the rating of existing engines (after we tune it to do so by adding the right positions), then we can hope that it is also going to be good at predicting the rating of future engines, or at testing whether a change in the evaluation is an improvement.
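
How "similar" the suite ranking is to the real rating list needs some measure. One possible choice, assumed here purely for illustration, is the Spearman rank correlation between suite scores and list ratings. The ratings and scores in the example are made up.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up numbers for four engines: rating list Elo and suite score in percent.
list_ratings = [2800, 2700, 2600, 2500]
suite_scores = [81, 75, 77, 60]
print(spearman(list_ratings, suite_scores))
```

A candidate position could then be kept only if adding it does not lower this number. Note that a rank correlation says nothing about the size of rating gaps, only about ordering.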
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Post by Don »

Uri Blass wrote:The idea is that the test suite can be used by programmers to improve their engine when they simply tune for better results in the test suite and do not tune for better result in games because testing for games take more time and in 99% of the cases better results in the test suite means also better result in games(assuming that we do not talk about time management changes but about evaluation or search changes).

I wonder if there are programmers who work about building this type of test suite including the scoring system.

Note that I do not think that existing tactical test suites are good for this purpose but it does not mean that it is impossible to build a test suite that is good for this purpose and after you build it you can save testing time by not playing games against other programs.
We have such a set. I think it is quite good at measuring whether a given program has improved and it correlates highly with general strength too, but not perfectly.

A good set for this is still subject to the same error margins you see in playing matches, so there are no real shortcuts to take. And unlike real games, certain aspects of a chess program, such as time control algorithms, cannot be tested like this. Each program has its own sense of good and bad, and these tests don't measure how well a program is able to push its own agenda, a key factor in being strong. Nevertheless, I think it's possible for such a set to be pretty useful. Our set has about 150,000 positions and we don't make much use of it, but when we make a small improvement it almost always checks out on the set (when we use it) as being slightly better too.
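
To get a feel for the error margins mentioned above, a rough sketch is to treat each position as an independent pass/fail trial and use the usual binomial standard error. That independence is a simplifying assumption (positions on the same theme are likely correlated), so the result is only an order-of-magnitude guide.

```python
import math


def score_margin(solved, total, z=1.96):
    """Approximate 95% margin of error on the solved fraction.

    Uses the normal approximation to the binomial: z * sqrt(p * (1 - p) / n).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = solved / total
    return z * math.sqrt(p * (1.0 - p) / total)


# Margin at a 50% solve rate for a 1,500 and a 150,000 position suite.
small = score_margin(750, 1500)
large = score_margin(75000, 150000)
print(round(100 * small, 2), round(100 * large, 2))
```

Under these assumptions a hundred times more positions narrows the margin by a factor of ten, which is the same square-root behaviour seen with game counts in matches.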
JVMerlino
Posts: 1357
Joined: Wed Mar 08, 2006 10:15 pm
Location: San Francisco, California

Post by JVMerlino »

I assume you are aware of the Strategic Test Suite (STS)?

https://sites.google.com/site/strategictestsuite/

I'm not saying that it satisfactorily does what you are trying to achieve, as I believe it is mostly intended to test evaluation improvements. But I have used it with some success for that purpose.

jm
JBNielsen
Posts: 267
Joined: Thu Jul 07, 2011 10:31 pm
Location: Denmark

Post by JBNielsen »

On my old home page I wrote this:

From 1986 to 1991 I constructed a test based on more than 100 chess positions.
The goal was to be able to estimate the rating of the programs tested.
And use the test for improving the development of chess programs.

Many of the positions are too easy for the computers of today.

Perhaps someone makes a better test in the future...

The link is:
http://www.jens-musik.dk/skak.htm

There you can find more details: the positions, the weights and how they were adjusted, the results, a detailed article about this test, etc.
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Post by Richard Allbert »

Hi John,

Agreed - this seems to fit the purpose. I've been redoing Jabba's eval from scratch in recent weeks, and for that I created a built-in tuning system that uses the STS suite.

So far it has worked well.

The only problem with the STS suite is that you find you score over 50% with just some piece-square tables and nothing else.

What I would love to have is an .epd of endgame positions that does a similar thing.

Regards

Richard