Some musings about search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ozymandias
Posts: 1535
Joined: Sun Oct 25, 2009 2:30 am

Re: Some musings about search

Post by Ozymandias »

cdani wrote:there are different types of changes that tend to be good for all the openings in general
The problem with testing engines on "all the openings" is the number of games you'd need to complete a single test. With short books it might be doable, but those leave many variations, with their unique positions, out of the picture. The only thing you can be sure of when accepting a patch under current testing conditions is that, on average, it's good for the tested lines. That's not a bad thing in itself, unless the selected lines aren't representative of chess openings "in general". In that case you'd need a large number of games to bring the error bars down, and even then you won't be able to tell a marginally good patch from a marginally bad one.
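To put rough numbers on "bring the error bars down", here is a minimal sketch, not anyone's actual test setup. It assumes the usual logistic Elo model, a draw ratio of 60% (a made-up figure, adjust for your own conditions) and a 2-sigma error bar; the function names are only illustrative.

```python
import math

def expected_score(elo_diff):
    """Logistic Elo model: expected score for a given rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_ratio=0.6, z=2.0):
    """Rough number of games before a z-sigma error bar shrinks to elo_diff.

    draw_ratio is an assumed fraction of drawn games. Near equal strength
    the per-game score variance is approximately (1 - draw_ratio) / 4.
    """
    score_edge = expected_score(elo_diff) - 0.5
    per_game_sd = math.sqrt((1.0 - draw_ratio) / 4.0)
    return math.ceil((z * per_game_sd / score_edge) ** 2)

if __name__ == "__main__":
    for elo in (20, 10, 5, 2, 1):
        print(elo, "Elo ->", games_needed(elo), "games")
```

The point to take away is the scaling: halving the Elo difference you want to resolve roughly quadruples the number of games, which is why marginal patches are so expensive to judge.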
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Some musings about search

Post by cdani »

Ozymandias wrote:
cdani wrote:there are different types of changes that tend to be good for all the openings in general
The problem with testing engines on "all the openings" is the number of games you'd need to complete a single test. With short books it might be doable, but those leave many variations, with their unique positions, out of the picture. The only thing you can be sure of when accepting a patch under current testing conditions is that, on average, it's good for the tested lines. That's not a bad thing in itself, unless the selected lines aren't representative of chess openings "in general". In that case you'd need a large number of games to bring the error bars down, and even then you won't be able to tell a marginally good patch from a marginally bad one.
Yes. I (like most people) live with what can be achieved under my current working conditions. It works, more or less. Anyway, I must improve it a lot, for sure :-)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Some musings about search

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
Rebel wrote:
Laskos wrote:That's why the SPRT framework is important. Or, if that is too cumbersome, require 3 standard deviations at whatever stopping point you choose. Not 2, at least 3.
Is SPRT superior to LOS?
Yes, Type I and II errors are controlled. Also, if you use a LOS stopping rule (basically a standard-deviation or p-value stopping rule) with subjective criteria for stopping, Type I error accumulates to a critical level. One can stop whenever LOS passes a certain threshold, but the threshold must be 3 standard deviations (LOS of about 99.87%); the usual 2 standard deviations (LOS of 97.7%) are too few, and errors accumulate rapidly.
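A small simulation can illustrate the accumulation of Type I error. This is only a sketch under stated assumptions: LOS is computed with the common normal approximation on decisive games, both engines are exactly equal, and the win/loss probabilities, check interval and match length are made-up parameters, not anything from this thread.

```python
import math
import random

def los(wins, losses):
    """Likelihood of superiority from decisive games (normal approximation)."""
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def false_positive_rate(threshold, max_games=2000, trials=300,
                        check_every=10, p_win=0.2, p_loss=0.2, seed=1):
    """Fraction of equal-strength matches in which LOS ever reaches threshold.

    The match is stopped the first time LOS >= threshold at a check point,
    which is the "stop as soon as it looks good" rule being criticised.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        wins = losses = 0
        for game in range(1, max_games + 1):
            r = rng.random()
            if r < p_win:
                wins += 1
            elif r < p_win + p_loss:
                losses += 1
            # otherwise the game is a draw and does not affect LOS
            if game % check_every == 0 and los(wins, losses) >= threshold:
                hits += 1
                break
    return hits / trials

if __name__ == "__main__":
    print("2-sigma rule:", false_positive_rate(0.977))
    print("3-sigma rule:", false_positive_rate(0.9986))
```

Since the two simulated engines are identical, every stop is a false positive. Under these assumptions the 2-sigma rule should fire far more often than the nominal 2.3%, while the 3-sigma rule stays much closer to being usable.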

With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, provided you set H0 and H1 reasonably.
However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that ordinary testing with a huge number of games doesn't suffer from.
SPRT is not necessarily tied to self-testing; self-testing helps because hypotheses are easier to set when the engines are very close in strength. A huge number of games is needed only if the differences are small, and SPRT will save roughly a factor of 2 in the number of games, if not more, even compared to a rigorously applied standard-deviation stopping rule, provided the SPRT hypotheses are reasonable. A loosely applied standard-deviation stopping rule (or LOS or p-value) can ruin the testing framework. SPRT has well-defined Type I and II errors: one sets the hypotheses and does pretty much nothing more.
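For readers who have not seen it spelled out, an SPRT stopping rule can be sketched roughly as follows. This is a simplified normal-approximation version, not the exact code of any testing framework; the hypotheses H0 = 0 Elo and H1 = 5 Elo are assumed values chosen only for illustration, and alpha = beta = 0.05 matches the SPRT (0.05, 0.05) mentioned above.

```python
import math

def elo_to_score(elo):
    """Logistic Elo model: expected score for a given rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Uses a Gaussian approximation with the per-game variance estimated
    from the observed win/draw/loss counts.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean
    if var <= 0.0:
        return 0.0  # degenerate sample, no information yet
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' (Wald's bounds)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"

if __name__ == "__main__":
    print(sprt_decision(600, 1000, 400))
    print(sprt_decision(10, 20, 10))
```

The decision function would be called after every game (or batch of games). The error rates are fixed by alpha and beta up front, so checking often does not inflate them the way a loosely applied LOS threshold does.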
The problem I have with it is the 1 vs 1. If you test against something other than yourself, the number of games climbs. And if you want to know whether you are making progress, you need to measure against several opponents, not just one.

I have had some luck with it when tuning a "thing". But I have had contradictory results when adding features that the original did not have.