Dann Corbit wrote: First, if you have less than 32 games, the statistical significance of wins and losses is so small you should ignore it and rely purely on the computer evaluation.

I disagree. For example, if one move has been played 4 times and scored 4 points, while another move from the same position has been played 4 times and scored 0 points, I believe the data is statistically significant and the former move should be preferred over the latter. The Wilson 95% score interval for the first move is 0.51 to 1.00, while for the second it is 0.00 to 0.49. This is enough to exclude the second move from consideration. This assumes all games were played by the same engine under the same time control against approximately equal opponents. Any move whose interval's upper bound is less than the lower bound of the best move's interval can safely be excluded from further consideration. When using a 95% confidence interval, as few as 8 games are useful for excluding moves in this way.
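The exclusion test above can be sketched in a few lines of Python. This is just an illustration of the standard Wilson score formula (the function name is mine); z = 1.96 gives the two-sided 95% interval:

```python
import math

def wilson_interval(score: float, games: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a move scoring `score` points from `games` games."""
    if games == 0:
        return (0.0, 1.0)
    p = score / games
    denom = 1 + z * z / games
    center = p + z * z / (2 * games)
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games))
    return ((center - half) / denom, (center + half) / denom)

# Move A: 4 points from 4 games; move B: 0 points from 4 games.
lo_a, hi_a = wilson_interval(4, 4)   # roughly (0.51, 1.00)
lo_b, hi_b = wilson_interval(0, 4)   # roughly (0.00, 0.49)

# Exclusion rule: drop any move whose upper bound falls below
# the lower bound of the best move's interval.
assert hi_b < lo_a  # move B can safely be excluded
```

Note that the intervals do not merely fail to overlap by luck: with only 8 total games, an all-wins move and an all-losses move already separate cleanly at 95% confidence.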
There is no requirement to use a 95% confidence interval; a 50% interval is adequate. The advantage of such a loose interval is that it requires fewer games from any position to be useful. Some may object to this, but I would note that it is by far the most efficient use of resources: analyzing opening positions is very time-consuming and resource-heavy, and playing full games is even more so. A good example of using a 50% confidence interval in a critical situation is the US military's almost exclusive use of a 50% interval in its Joint Munitions Effectiveness Manuals (JMEMs), which are used to calculate how many weapons should be used, and how they should be employed, to achieve the CINC's intent. Military resources are at least as constrained as an engine programmer's, and few situations are more critical than when someone's life is on the line.
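To illustrate why a looser interval needs fewer games, here is a sketch comparing the 95% and 50% Wilson intervals for two moves from the same position, one scoring 3 of 4 and the other 1 of 4. The record counts are my own illustrative numbers; the z-values are the standard two-sided normal quantiles:

```python
import math

def wilson_interval(score, games, z):
    """Wilson score interval for a move scoring `score` points from `games` games."""
    p = score / games
    denom = 1 + z * z / games
    center = p + z * z / (2 * games)
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games))
    return ((center - half) / denom, (center + half) / denom)

Z95, Z50 = 1.96, 0.6745  # two-sided normal quantiles for 95% and 50% confidence

good_95, bad_95 = wilson_interval(3, 4, Z95), wilson_interval(1, 4, Z95)
good_50, bad_50 = wilson_interval(3, 4, Z50), wilson_interval(1, 4, Z50)

# At 95% the two intervals still overlap, so neither move can be excluded yet...
assert good_95[0] < bad_95[1]
# ...but at 50% they already separate, so the 1-of-4 move can be dropped.
assert bad_50[1] < good_50[0]
```

The trade-off is the usual one: the 50% interval reaches a decision with fewer games but is wrong more often, which is acceptable when the cost of playing out more games dominates.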
Dann Corbit wrote: If the computer evaluation is so shallow that at the current time control you can out-think it, then you are officially out of book.

I'm not a big fan of statistical methods applied to human games. The primary problem is that the data violates most, if not all, of the assumptions underlying common statistical methods, so the results are of dubious value. This is, no doubt, one of the reasons books based on human games are chock-full of losing lines of play.
I think the question you are asking is: how do you know the breakpoint at which to trust the data?
Clearly, this depends on how good your data is.
For instance, with human games: is the data from correspondence chess between world championship candidates? Is it from FICS games between 'beanhead' and 'gizmo' with Elos of 800 and 750?
Is the data from bullet games? Is the data from TCEC games?
If you separate the data you collect into wins/losses and draws by type, then you can gain a lot more from it.
Dann Corbit wrote: You should also store your own engine's wins/losses/draws from the given position. Even if it is a good position, you may want to avoid it if your engine does not play it well.

A very good point! You should never mix other games in with the engine's own games in the statistics area of the book. Always keep your engine's statistics separate from all other games; mixing dilutes them to the point that they become as worthless and dubious as the rest of the data.
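That separation can be kept with two tallies per book entry, one for the engine's own games and one for everything else. This is a minimal sketch, not any particular book format; the names, the result encoding, and the position keys are all my own invention:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class MoveStats:
    # [wins, draws, losses] — our engine's own games, kept apart
    # from reference games so neither dilutes the other.
    own: list = field(default_factory=lambda: [0, 0, 0])
    other: list = field(default_factory=lambda: [0, 0, 0])

# Hypothetical book: keyed by (position key, move).
book: dict = defaultdict(MoveStats)

WIN, DRAW, LOSS = 0, 1, 2

def record(position_key: str, move: str, result: int, own_engine: bool) -> None:
    """Tally one game result under the correct bucket for this position/move."""
    stats = book[(position_key, move)]
    tally = stats.own if own_engine else stats.other
    tally[result] += 1

# One win by our engine, one loss from an imported reference game.
record("startpos", "e2e4", WIN, own_engine=True)
record("startpos", "e2e4", LOSS, own_engine=False)
assert book[("startpos", "e2e4")].own == [1, 0, 0]
assert book[("startpos", "e2e4")].other == [0, 0, 1]
```

Move selection can then weight the `own` tally heavily (or exclusively), falling back on `other` only where the engine has no games of its own.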