Rebel wrote:
> After a search change I used to get a first impression by manually going through 40-50 positions before starting a self-play match. And sometimes that first glimpse looked so good it was reasonable to assume the change was an improvement. And then, after playing 10,000 games at 40/15s, it turned out not to be, say a 48% result as an example.
> Then, stubbornly, I couldn't believe it (not a bad attitude in computer chess) and played the match at 40/30s and even 40/1m.
> And I have never seen such a change become an improvement after doubling or quadrupling the time control.
> Is this a global experience?

Yes.
Some musings about search
- lucasart (Posts: 3232, Joined: Mon May 31, 2010 1:29 pm)
Re: Some musings about search
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
- Rebel (Posts: 6995, Joined: Thu Aug 18, 2011 12:04 pm)
Re: Some musings about search
Laskos wrote:
> That's why the SPRT framework is important. Or, if too cumbersome, keep 3 standard deviations at the stop of your choosing. Not 2, at least 3.

Is SPRT superior to LOS?
- Kai Laskos (Posts: 10948, Joined: Wed Jul 26, 2006 10:21 pm)
Re: Some musings about search
Rebel wrote:
> Is SPRT superior to LOS?

Yes, Type I and II errors are controlled. Also, if you use an LOS stopping rule (basically a standard-deviation or p-value stopping rule) with subjective criteria for stopping, it accumulates a critical Type I error. One can stop whenever LOS passes a certain threshold, but then 3 standard deviations (LOS of 99.86%) must be used as the threshold; the usual 2 standard deviations (LOS of 97.7%) are too few and accumulate errors rapidly.
With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, given that you set H0 and H1 reasonably.
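For reference, the LOS figures above come from the normal approximation LOS = Phi((W - L) / sqrt(W + L)), and the error accumulation from repeated peeking can be shown with a small Monte Carlo sketch (function names and parameters are mine, for illustration only): simulate matches between two equal engines, check LOS every 100 games, and stop at the first crossing of the threshold.

```python
import math
import random

def los(wins, losses):
    # Likelihood of superiority from decisive games:
    # LOS = Phi((W - L) / sqrt(W + L)), Phi the standard normal CDF.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf(
        (wins - losses) / math.sqrt(2.0 * (wins + losses))))

def peeking_type1_rate(threshold, n_trials=1000, n_games=2000,
                       peek_every=100, seed=1):
    # Simulate matches between two truly equal engines (no draws, for
    # simplicity) and stop at the first peek where LOS crosses the
    # threshold.  The returned rate is the accumulated Type I error.
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        wins = losses = 0
        for game in range(1, n_games + 1):
            if rng.random() < 0.5:
                wins += 1
            else:
                losses += 1
            if game % peek_every == 0 and los(wins, losses) >= threshold:
                false_positives += 1
                break
    return false_positives / n_trials
```

In runs of this sketch, the 2-standard-deviation threshold (LOS 97.7%) fires on equal engines several times more often than its nominal 2.3%, while the 3-standard-deviation threshold (LOS 99.86%) stays far lower, which is Kai's point about 2 sigma accumulating errors rapidly.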
- bob (Posts: 20943, Joined: Mon Feb 27, 2006 7:30 pm, Location: Birmingham, AL)
Re: Some musings about search
Laskos wrote:
> Yes, Type I and II errors are controlled. [...] With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, given that you set H0 and H1 reasonably.

However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that testing with a normal huge number of games doesn't suffer from.
- Kai Laskos (Posts: 10948, Joined: Wed Jul 26, 2006 10:21 pm)
Re: Some musings about search
bob wrote:
> However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that testing with a normal huge number of games doesn't suffer from.

SPRT is not necessarily used in self-testing; self-testing just helps because it is easier to set the hypotheses, as the engines are very close in strength. A huge number of games is needed only when the differences are small, and SPRT will save roughly a factor of 2 in the number of games, if not more, even compared to a rigorously applied standard-deviation stopping rule, given that the SPRT hypotheses are reasonable. A loosely applied standard-deviation (or LOS, or p-value) stopping rule can ruin the testing framework. SPRT has well-defined Type I and II errors: one sets the hypotheses and does pretty much nothing more.
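The SPRT (0.05, 0.05) rule discussed above can be sketched as follows. This is a simplified illustration using the common normal approximation to the log-likelihood ratio (the approach popularized by fishtest-style frameworks); the helper names and the crude draw handling are mine:

```python
import math

def elo_to_score(elo):
    # Expected score for a given Elo advantage (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    # Normal-approximation log-likelihood ratio of H1 (elo = elo1)
    # versus H0 (elo = elo0), from the observed trinomial results.
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0  # not enough information yet
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var)

def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    # Wald's SPRT bounds: stop as soon as the LLR leaves (lower, upper).
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= upper:
        return "accept H1"   # the change is likely at least elo1
    if llr <= lower:
        return "accept H0"   # the change is likely no better than elo0
    return "continue"
```

For example, a +55% result over 4000 games (1200-800-2000) crosses the upper bound and accepts H1, while a dead-even 500-500-1000 keeps playing: the test runs exactly as long as the evidence requires, which is where the factor-of-2 saving comes from.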
- Ferdy (Posts: 4833, Joined: Sun Aug 10, 2008 3:15 pm, Location: Philippines)
Re: Some musings about search
Rebel wrote:
> One thing I learned from my last active period (2012-2013) is that playing too few games can be disastrous. In principle playing 1000-2000 games is good enough in most cases, say 9 out of 10 times. But it is the 10th time that is going to hurt: you get good results, while if you had played more you would have known the change isn't an improvement at all.
> Ferdy wrote:
> > Probably the field is generally contested at the typical CCRL 40/4 and 40/40; I try to look good at 40/4.

I am trying to work out, or at least understand, what really happened in the test. Say this is a self-test.
* Test engine changes with an opening test suite you know well, say the Silver opening test suite, for which Albert described what the suite contains.
* Test engines in steps, as you have done, say 1000 games per step. But one should know what those 500 start positions are (assuming the starting color is reversed). Is it a general test suite? A gambit suite (create separate stats for the gambiting and non-gambiting side)? A suite that tests the program's mobility? Its pawn-structure handling? Its play in closed positions? A suite for a certain opening, say the Sicilian?
* After playing each step, record the stats. Now we know, or at least have an idea of, what this suite is doing to the engine change we made. For a change that increases the score of passers on the 6th rank when the opponent has no bishops, we can see that mobility improved (by examining the mobility test suite) even though it fell back a little in the closed-position suite. This also gives us an idea of the strengths and weaknesses we introduced when implementing a certain feature.
* We can gamble on a change: say the overall result is even, but its mobility test suite result was good. Without knowing such details we could easily have discarded the change and been left with the impression that it was not doing anything at all. It is this impression that would probably hold back our progress, because we tend to say, "no, it doesn't really improve, I tried it".
We have lost a lot of time and resources doing this kind of random testing, just checking whether the Elo increased by +5 or not; at some point there must be a feedback mechanism to examine after running those thousands of test games.
Collecting those kinds of test suites would not be easy at all.
We would challenge our opening test suite creators to create such suites.
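The per-theme bookkeeping described above could be as simple as tallying each suite separately. A minimal sketch, assuming results arrive as (suite, outcome) pairs from the changed engine's point of view (the function names are mine):

```python
import math

def elo_diff(score):
    # Average score in (0, 1) -> implied Elo difference
    # (undefined at exactly 0 or 1).
    return -400.0 * math.log10(1.0 / score - 1.0)

def per_suite_report(results):
    # results: iterable of (suite_name, outcome) pairs, outcome being
    # 1, 0.5 or 0 for the changed engine.  Tally each themed suite
    # separately, so one overall Elo number does not hide a gain in
    # one theme and a loss in another.
    tally = {}
    for suite, outcome in results:
        games, points = tally.get(suite, (0, 0.0))
        tally[suite] = (games + 1, points + outcome)
    return {suite: (games, points / games, elo_diff(points / games))
            for suite, (games, points) in sorted(tally.items())}
```

Each suite then maps to (games, score, Elo), so a change that scores 50% overall but clearly better on the mobility suite and worse on the closed-position suite is visible at a glance instead of vanishing into a single even result.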
- Ozymandias (Posts: 1535, Joined: Sun Oct 25, 2009 2:30 am)
Re: Some musings about search
cdani wrote:
> Some days ago it happened to me that three computers, with something like 2000 games each already played (so 6000 games), were at +10 Elo for a patch on every one of the three. I was tempted to stop the test and accept the patch as good, but I let it continue. When the total reached 20,000 games, the patch clearly showed as a regression!

That's in line with what I told you. You can't trust the error boundaries suggested by rating programs; no matter how many games you play, once you get to "new" openings the ratings will fluctuate.
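The +10 Elo after 6000 games illustrates how wide the error bars still are at that stage. A rough sketch, assuming a normal approximation and a guessed 50% draw ratio (both assumptions mine):

```python
import math

def score_to_elo(score):
    # Average score in (0, 1) -> implied Elo difference (logistic model).
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(games, score, draw_ratio=0.5, z=1.96):
    # Confidence interval (z = 1.96 gives ~95%) for the Elo difference
    # behind an observed average score.  For a trinomial result with
    # draw ratio d, the per-game score variance is s*(1 - s) - d/4.
    var = score * (1.0 - score) - draw_ratio / 4.0
    se = math.sqrt(var / games)
    return (score_to_elo(score - z * se),
            score_to_elo(score),
            score_to_elo(score + z * se))
```

At 6000 games and a 51.4% score (about +10 Elo), the 95% interval is still well over 10 Elo wide, and even that nominal coverage assumes the stopping point was fixed in advance; stopping because the result looks good, as in the example above, makes the real error larger still.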
- Bloodbane (Posts: 154, Joined: Thu Oct 03, 2013 4:17 pm)
Re: Some musings about search
cdani wrote:
> Bloodbane wrote:
> > I've seen some patches which were bad at short TC but good at long TC, but I only accept patches which are good at all time controls I use.
> Really curious. I always accept those patches, as they are even better at longer time controls, not by much, of course.

Whenever I've had a patch like that, it behaves almost randomly: 5+0.05 is bad, 15+0.05 is good, 40/20s is good, 40/80s is bad, etc. I just don't take the risk anymore.
Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics.
https://github.com/mAarnos
- cdani (Posts: 2204, Joined: Sat Jan 18, 2014 10:24 am, Location: Andorra)
Re: Some musings about search
Ozymandias wrote:
> That's in line with what I told you. You can't trust the error boundaries suggested by rating programs; no matter how many games you play, once you get to "new" openings the ratings will fluctuate. [...]

Hello!
Yes, the openings are another very important factor to test with. Anyway, there are different types of changes that tend to be good across all openings in general, and those are the ones that most engine coders tend to work on.
Daniel José - http://www.andscacs.com
- cdani (Posts: 2204, Joined: Sat Jan 18, 2014 10:24 am, Location: Andorra)
Re: Some musings about search
Bloodbane wrote:
> Whenever I've had a patch like that, it behaves almost randomly: 5+0.05 is bad, 15+0.05 is good, 40/20s is good, 40/80s is bad, etc. I just don't take the risk anymore. [...]

Of course I will do the same in this case. But doing 40/80s takes a lot of time! I will most probably not play 10,000 games at it.
Daniel José - http://www.andscacs.com