The problem with testing engines with "all the openings" available is the number of games you'd need to complete a single test. With short books it might be doable, but those leave many variations, with their unique positions, out of the picture. The only thing you can be sure of when accepting a patch under current testing conditions is that, on average, it's good for the tested lines. That's not a bad thing in itself, unless the selected lines aren't representative of chess openings "in general". In that case you'll need a large number of games to bring the error bars down, and even then you won't be able to tell a marginally good patch from a marginally bad one.
cdani wrote: there are different types of changes that tend to be good for all the openings in general
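To put rough numbers on the "big number of games" point, here is a minimal sketch of how the Elo error bar shrinks with the number of games. It assumes independent games, a logistic Elo model, an evenly matched pair, and a 60% draw ratio; the function names, the draw ratio and the 95% `z` value are illustrative assumptions, not taken from any particular testing framework.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def score_from_elo(elo: float) -> float:
    """Inverse of elo_from_score: expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def score_variance(draw_ratio: float) -> float:
    """Per-game variance of the score (win=1, draw=0.5, loss=0) for an even match.

    Wins and losses each deviate by 0.5 from the mean of 0.5; draws do not deviate.
    """
    return (1.0 - draw_ratio) * 0.25


def elo_error_margin(games: int, draw_ratio: float = 0.6, z: float = 1.96) -> float:
    """Approximate half-width, in Elo, of the confidence interval after `games` games."""
    std_err = math.sqrt(score_variance(draw_ratio) / games)
    return elo_from_score(0.5 + z * std_err)


def games_needed(target_elo: float, draw_ratio: float = 0.6, z: float = 1.96) -> int:
    """Smallest number of games whose error margin is at most `target_elo`."""
    score_margin = score_from_elo(target_elo) - 0.5
    return math.ceil(score_variance(draw_ratio) * (z / score_margin) ** 2)


if __name__ == "__main__":
    for n in (1_000, 4_000, 16_000, 64_000):
        print(f"{n:>6} games -> +/- {elo_error_margin(n):.1f} Elo")
    print("games for +/- 2 Elo:", games_needed(2.0))
```

The margin falls only with the square root of the game count, so halving the error bar costs four times as many games. That is why separating a marginally good patch from a marginally bad one gets expensive so quickly.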
Some musings about search
- Posts: 1535
- Joined: Sun Oct 25, 2009 2:30 am
Re: Some musings about search
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Some musings about search
Yes. I (and most people) live with what can be achieved under our current working conditions. It works, more or less. Anyway, I must improve it a lot, for sure.
Ozymandias wrote: The problem with testing engines with "all the openings" available is the number of games you'd need to complete a single test. With short books it might be doable, but those leave many variations, with their unique positions, out of the picture. The only thing you can be sure of when accepting a patch under current testing conditions is that, on average, it's good for the tested lines. That's not a bad thing in itself, unless the selected lines aren't representative of chess openings "in general". In that case you'll need a large number of games to bring the error bars down, and even then you won't be able to tell a marginally good patch from a marginally bad one.
cdani wrote: there are different types of changes that tend to be good for all the openings in general
Daniel José - http://www.andscacs.com
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Some musings about search
The problem I have with it is the 1 vs 1. If you test against something other than yourself, the number of games climbs. And if you want to know whether you are making progress, you need to measure against several opponents, not just one.
Laskos wrote: SPRT is not necessarily used in self-testing; self-testing helps because it's easier to set hypotheses when the engines are very close in strength. A huge number of games is needed only if the differences are small, and SPRT will save roughly a factor of 2 in the number of games, if not more, even compared to a rigorously applied standard deviation stopping rule, given that the SPRT hypotheses are reasonable. A loosely applied standard deviation stopping rule (or LOS or p-value) can ruin the testing framework. SPRT has well-defined Type I and II errors: one sets the hypotheses and does pretty much nothing more.
bob wrote: However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that normal testing with a huge number of games doesn't suffer from.
Laskos wrote: Yes, Type I and II errors are controlled. Also, if you use an LOS stopping rule (basically a standard deviation or p-value stopping rule) with subjective criteria for stopping, it accumulates critical Type I error. One can stop whenever LOS passes a certain threshold, but 3 standard deviations (LOS of 99.86%) must be used as the threshold; the usual 2 standard deviations (LOS 97.7%) are too few, as errors accumulate rapidly.
Rebel wrote: Is SPRT superior to LOS?
Laskos wrote: That's why the SPRT framework is important. Or, if that's too cumbersome, keep 3 standard deviations at the stop of your choosing. Not 2, at least 3.
With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, provided that you set H0 and H1 reasonably.
I have had some luck with it when tuning a "thing". But I have had contradictory results when adding features that the original did not have.
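For readers who want to see what the SPRT (0.05, 0.05) and the 2 vs 3 standard deviation figures quoted above amount to, here is a minimal sketch. The stopping bounds are Wald's standard ones. The log-likelihood ratio uses a normal approximation on the mean score, which is an assumption made for brevity and not necessarily what any given testing framework computes. The H0/H1 values of 0 and 5 Elo and all function names are hypothetical.

```python
import math


def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> tuple[float, float]:
    """Wald's lower/upper stopping bounds on the log-likelihood ratio."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def expected_score(elo: float) -> float:
    """Expected score for an Elo difference under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Normal approximation on the mean score; an illustrative assumption.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean  # per-game score variance
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins: int, draws: int, losses: int,
                  elo0: float = 0.0, elo1: float = 5.0,
                  alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept H0', 'accept H1' or 'continue' for the current results."""
    lower, upper = sprt_bounds(alpha, beta)
    value = llr(wins, draws, losses, elo0, elo1)
    if value >= upper:
        return "accept H1"
    if value <= lower:
        return "accept H0"
    return "continue"


def los_from_sigmas(z: float) -> float:
    """One-sided likelihood of superiority for a lead of z standard deviations."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


if __name__ == "__main__":
    print("SPRT bounds:", sprt_bounds())
    print("LOS at 2 sigma:", los_from_sigmas(2.0))
    print("LOS at 3 sigma:", los_from_sigmas(3.0))
    print(sprt_decision(wins=30_000, draws=50_000, losses=20_000))
```

The test is meant to be checked after every game (or batch) and stopped the first time a bound is crossed. The Type I and II error rates are fixed by `alpha` and `beta` up front, so there is no room for a subjective stopping decision.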