Rebel wrote:
> After a search change I used to get a first impression by manually going through 40-50 positions before starting a self-play match. And sometimes that first glimpse looked so good it was reasonable to assume the change was an improvement. And then, after playing 10,000 games at 40/15s, it turned out not to be, say a 48% result as an example.
> Then, stubbornly, I couldn't believe it (not a bad attitude in computer chess) and played the match at 40/30s and even 40/1m.
> And I have never seen such a change become an improvement after doubling or quadrupling the time control.
> Is this a global experience?

Yes.
Some musings about search
- lucasart (Posts: 3232, Joined: Mon May 31, 2010 1:29 pm)
Re: Some musings about search
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
- Rebel (Posts: 6995, Joined: Thu Aug 18, 2011 12:04 pm)
Re: Some musings about search
Laskos wrote:
> That's why the SPRT framework is important. Or, if too cumbersome, keep 3 standard deviations at the stop of your choosing. Not 2, at least 3.

Is SPRT superior to LOS?
- Kai Laskos (Posts: 10948, Joined: Wed Jul 26, 2006 10:21 pm)
Re: Some musings about search
Rebel wrote:
> Is SPRT superior to LOS?

Yes, Type I and II errors are controlled. Also, if you use an LOS stopping rule (basically a standard-deviation or p-value stopping rule) with subjective criteria for stopping, it accumulates a critical Type I error. One can stop whenever LOS passes a certain threshold, but then 3 standard deviations (LOS of 99.86%) must be used as the threshold; the usual 2 standard deviations (LOS of 97.7%) are too few and accumulate errors rapidly.
With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, given that you set H0 and H1 reasonably.
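For reference, the LOS figures above come from the normal approximation LOS = Phi((W - L) / sqrt(W + L)), and the error accumulation from repeated peeking can be shown with a small Monte Carlo sketch (function names and parameters are mine, for illustration only): simulate matches between two equal engines, check LOS every 100 games, and stop at the first crossing of the threshold.

```python
import math
import random

def los(wins, losses):
    # Likelihood of superiority from decisive games:
    # LOS = Phi((W - L) / sqrt(W + L)), Phi the standard normal CDF.
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf(
        (wins - losses) / math.sqrt(2.0 * (wins + losses))))

def peeking_type1_rate(threshold, n_trials=1000, n_games=2000,
                       peek_every=100, seed=1):
    # Simulate matches between two truly equal engines (no draws, for
    # simplicity) and stop at the first peek where LOS crosses the
    # threshold.  The returned rate is the accumulated Type I error.
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_trials):
        wins = losses = 0
        for game in range(1, n_games + 1):
            if rng.random() < 0.5:
                wins += 1
            else:
                losses += 1
            if game % peek_every == 0 and los(wins, losses) >= threshold:
                false_positives += 1
                break
    return false_positives / n_trials
```

In runs of this sketch, the 2-standard-deviation threshold (LOS 97.7%) fires on equal engines several times more often than its nominal 2.3%, while the 3-standard-deviation threshold (LOS 99.86%) stays far lower, which is Kai's point about 2 sigma accumulating errors rapidly.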
- bob (Posts: 20943, Joined: Mon Feb 27, 2006 7:30 pm, Location: Birmingham, AL)
Re: Some musings about search
Laskos wrote:
> Yes, Type I and II errors are controlled. [...] With SPRT (0.05, 0.05) you will save time compared to the 3-standard-deviation rule, given that you set H0 and H1 reasonably.

However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that testing with a normal huge number of games doesn't suffer from.
- Kai Laskos (Posts: 10948, Joined: Wed Jul 26, 2006 10:21 pm)
Re: Some musings about search
bob wrote:
> However, SPRT is used in self-testing, which is not the optimal way to test chess engines... It has its good points, but it also has faults that testing with a normal huge number of games doesn't suffer from.

SPRT is not necessarily used in self-testing; self-testing just helps because it is easier to set the hypotheses, as the engines are very close in strength. A huge number of games is needed only when the differences are small, and SPRT will save roughly a factor of 2 in the number of games, if not more, even compared to a rigorously applied standard-deviation stopping rule, given that the SPRT hypotheses are reasonable. A loosely applied standard-deviation (or LOS, or p-value) stopping rule can ruin the testing framework. SPRT has well-defined Type I and II errors: one sets the hypotheses and does pretty much nothing more.
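The SPRT (0.05, 0.05) rule discussed above can be sketched as follows. This is a simplified illustration using the common normal approximation to the log-likelihood ratio (the approach popularized by fishtest-style frameworks); the helper names and the crude draw handling are mine:

```python
import math

def elo_to_score(elo):
    # Expected score for a given Elo advantage (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    # Normal-approximation log-likelihood ratio of H1 (elo = elo1)
    # versus H0 (elo = elo0), from the observed trinomial results.
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0  # not enough information yet
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var)

def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    # Wald's SPRT bounds: stop as soon as the LLR leaves (lower, upper).
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= upper:
        return "accept H1"   # the change is likely at least elo1
    if llr <= lower:
        return "accept H0"   # the change is likely no better than elo0
    return "continue"
```

For example, a +55% result over 4000 games (1200-800-2000) crosses the upper bound and accepts H1, while a dead-even 500-500-1000 keeps playing: the test runs exactly as long as the evidence requires, which is where the factor-of-2 saving comes from.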
- Ferdy (Posts: 4833, Joined: Sun Aug 10, 2008 3:15 pm, Location: Philippines)
Re: Some musings about search
Rebel wrote:
> One thing I learned from my last active period (2012-2013) is that playing too few games can be disastrous. In principle playing 1000-2000 games is good enough in most cases, say 9 out of 10 times. But it is the 10th time that is going to hurt: you get good results, while if you had played more you would have known the change isn't an improvement at all.
> Ferdy wrote:
> > Probably the field is generally contested at the typical CCRL 40/4 and 40/40; I try to look good at 40/4.

I am trying to work out, or at least understand, what really happened in the test. Say this is a self-test.
* Test engine changes with an opening test suite you know well, say the Silver opening test suite, for which Albert described what the suite contains.
* Test engines in steps, as you have done, say 1000 games per step. But one should know what those 500 start positions are (assuming the starting color is reversed). Is it a general test suite? A gambit suite (create separate stats for the gambiting and non-gambiting side)? A suite that tests the program's mobility? Its pawn-structure handling? Its play in closed positions? A suite for a certain opening, say the Sicilian?
* After playing each step, record the stats. Now we know, or at least have an idea of, what this suite is doing to the engine change we made. For a change that increases the score of passers on the 6th rank when the opponent has no bishops, we can see that mobility improved (by examining the mobility test suite) even though it fell back a little in the closed-position suite. This also gives us an idea of the strengths and weaknesses we introduced when implementing a certain feature.
* We can gamble on a change: say the overall result is even, but its mobility test suite result was good. Without knowing such details we could easily have discarded the change and been left with the impression that it was not doing anything at all. It is this impression that would probably hold back our progress, because we tend to say, "no, it doesn't really improve, I tried it".
We have lost a lot of time and resources doing this kind of random testing, just checking whether the Elo increased by +5 or not; at some point there must be a feedback mechanism to examine after running those thousands of test games.
Collecting those kinds of test suites would not be easy at all.
We would challenge our opening test suite creators to create such suites.
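The per-theme bookkeeping described above could be as simple as tallying each suite separately. A minimal sketch, assuming results arrive as (suite, outcome) pairs from the changed engine's point of view (the function names are mine):

```python
import math

def elo_diff(score):
    # Average score in (0, 1) -> implied Elo difference
    # (undefined at exactly 0 or 1).
    return -400.0 * math.log10(1.0 / score - 1.0)

def per_suite_report(results):
    # results: iterable of (suite_name, outcome) pairs, outcome being
    # 1, 0.5 or 0 for the changed engine.  Tally each themed suite
    # separately, so one overall Elo number does not hide a gain in
    # one theme and a loss in another.
    tally = {}
    for suite, outcome in results:
        games, points = tally.get(suite, (0, 0.0))
        tally[suite] = (games + 1, points + outcome)
    return {suite: (games, points / games, elo_diff(points / games))
            for suite, (games, points) in sorted(tally.items())}
```

Each suite then maps to (games, score, Elo), so a change that scores 50% overall but clearly better on the mobility suite and worse on the closed-position suite is visible at a glance instead of vanishing into a single even result.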
- Ozymandias (Posts: 1535, Joined: Sun Oct 25, 2009 2:30 am)
Re: Some musings about search
cdani wrote:
> Some days ago it happened to me that three computers, with something like 2000 games each already played (so 6000 games), were at +10 Elo for a patch on every one of the three. I was tempted to stop the test and accept the patch as good, but I let it continue. When the total reached 20,000 games, the patch clearly showed as a regression!

That's in line with what I told you. You can't trust the error boundaries suggested by rating programs; no matter how many games you play, once you get to "new" openings the ratings will fluctuate.
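The +10 Elo after 6000 games illustrates how wide the error bars still are at that stage. A rough sketch, assuming a normal approximation and a guessed 50% draw ratio (both assumptions mine):

```python
import math

def score_to_elo(score):
    # Average score in (0, 1) -> implied Elo difference (logistic model).
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(games, score, draw_ratio=0.5, z=1.96):
    # Confidence interval (z = 1.96 gives ~95%) for the Elo difference
    # behind an observed average score.  For a trinomial result with
    # draw ratio d, the per-game score variance is s*(1 - s) - d/4.
    var = score * (1.0 - score) - draw_ratio / 4.0
    se = math.sqrt(var / games)
    return (score_to_elo(score - z * se),
            score_to_elo(score),
            score_to_elo(score + z * se))
```

At 6000 games and a 51.4% score (about +10 Elo), the 95% interval is still well over 10 Elo wide, and even that nominal coverage assumes the stopping point was fixed in advance; stopping because the result looks good, as in the example above, makes the real error larger still.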
- Bloodbane (Posts: 154, Joined: Thu Oct 03, 2013 4:17 pm)
Re: Some musings about search
cdani wrote:
> Bloodbane wrote:
> > I've seen some patches which were bad at short TC but good at long TC, but I only accept patches which are good at all time controls I use.
> Really curious. I always accept those patches, as they are even better at longer time controls, not by much, of course.

Whenever I've had a patch like that, it behaves almost randomly: 5+0.05 is bad, 15+0.05 is good, 40/20s is good, 40/80s is bad, etc. I just don't take the risk anymore.
Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics.
https://github.com/mAarnos
- cdani (Posts: 2204, Joined: Sat Jan 18, 2014 10:24 am, Location: Andorra)
Re: Some musings about search
Ozymandias wrote:
> That's in line with what I told you. You can't trust the error boundaries suggested by rating programs; no matter how many games you play, once you get to "new" openings the ratings will fluctuate. [...]

Hello!
Yes, the openings are another very important factor to test with. Anyway, there are different types of changes that tend to be good across all openings in general, and those are the ones that most engine coders tend to work on.
Daniel José - http://www.andscacs.com
- cdani (Posts: 2204, Joined: Sat Jan 18, 2014 10:24 am, Location: Andorra)
Re: Some musings about search
Bloodbane wrote:
> Whenever I've had a patch like that, it behaves almost randomly: 5+0.05 is bad, 15+0.05 is good, 40/20s is good, 40/80s is bad, etc. I just don't take the risk anymore. [...]

Of course I will do the same in this case. But doing 40/80s takes a lot of time! I will most probably not play 10,000 games at it.
Daniel José - http://www.andscacs.com