Rein Halbersma wrote: He indeed calculates things correctly, the difference is that I standardize everything to a single game to look at the scaling behaviour for multiple games.
Well, glad you put that right...
But this might be a good point to reiterate an issue that I brought up before: how desirable is it actually to have very good confidence and resolving power? Your own example shows that you need an excessively large number of games to achieve such confidence levels; 27,000 games represents a lifetime of testing to most of us. So after that we can be really sure that the change was an improvement. Well, much good will it do, as we are dead and buried by that time, and in the meantime all development has been stopped because our computer was busy running test matches. 7 Elo in 10 years, is that good progress?
Bob says that going with smaller confidence makes your engine development like a random walk, as you will take many steps in the wrong direction, discarding improvements and adopting changes for the worse.
But I say: "So what?"! A random walk can bring you to your destination faster than going in a straight line, if you take a step every second where the straight walker takes only one step every week. An angry hornet will get to the honey earlier than a snail. So the picture is not complete if you don't consider progress per unit of time.
So taking one step backward out of every 20 (as you would have with 95% confidence testing) still brings you 18 steps forward (19 forward minus 1 back), 90% of what you could have done through perfect testing. If testing twice as long would beef up confidence to 99%, you would accept one backward step out of 100, so 98% progress. But that is with twice the effort, so only 49% progress in the same time in which the lower confidence made 90% progress. Going from 95% confidence to 99% confidence testing was a losing strategy!
So in fact one could do the opposite, and reduce confidence. Only do 1.15-sigma testing instead of 1.65-sigma, for only 87.5% confidence. One of every 8 steps is backwards. But you only need half the number of games, so you make 75% progress in half the time, 150% in the same time as you otherwise advanced 90%. Halving the number of games once more brings you to the 79% confidence level (0.83 sigma, one-sided): 58% progress in one quarter of the time, which is 232% in the time you originally achieved 90%.
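The arithmetic above can be redone in a few lines. This is only a rough sketch of the same simplified model, not anyone's actual test harness: the function name relative_progress is made up for the illustration, and it assumes that the number of games needed scales with the square of the sigma threshold (which is what "half the games" for 1.65 to 1.15 sigma amounts to) and that the one-sided confidence equals the fraction of accepted changes that really are improvements.

```python
from statistics import NormalDist


def relative_progress(z_threshold, z_reference=1.65):
    """Progress per unit of testing time under the simplified model.

    Assumptions (illustrative, not from any real testing framework):
    - a fraction `confidence` of accepted changes are real improvements,
      the rest are equally large steps backward;
    - the number of games needed scales with z_threshold squared.
    """
    # One-sided confidence for the given sigma threshold.
    confidence = NormalDist().cdf(z_threshold)
    # Forward steps minus backward steps, per accepted change.
    net_per_step = 2 * confidence - 1
    # Testing cost relative to the 1.65-sigma (95%) reference.
    relative_games = (z_threshold / z_reference) ** 2
    return confidence, net_per_step, net_per_step / relative_games


for z in (2.33, 1.65, 1.15, 0.83):
    conf, net, rate = relative_progress(z)
    print(f"{z:.2f} sigma: confidence {conf:.1%}, "
          f"net {net:.0%} per step, {rate:.0%} per unit of time")
```

The numbers come out slightly different from the rounded ones in the text, because the exact game ratios are not precisely 2 and 4, but the ordering is the same: the lower the threshold, the more progress per unit of testing time in this model.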
You see that low-confidence testing is extremely competitive, and leads to much faster improvement of your engine than high-confidence testing. At some point it stops, of course. And this approach assumes that testing is the limiting factor. The picture changes when you are running out of ideas to test. But in this method you should not shy away from re-trying ideas that were thrown out earlier, as they might have just been unlucky. You would have to do that anyway, if you wanted to test combinations of ideas. What might not work initially, might work after you changed many other things. So retesting rejected ideas, and continuing to test whether the accepted ideas are still useful, is a necessity anyway. Also in the high-confidence approach.
Things might also become different when you get really close to the optimum, where almost no change gives an improvement, and almost every change is detrimental. Then the math above doesn't apply, and you would have to take into account the natural bias against improvements. But for engines around 1600 Elo, we are very far from such an optimum, and almost any change can have an effect in either direction. It is just a completely different game from improving engines of 2700 Elo. And as a consequence, it needs a different type of testing.