Re: Engine Testing - Statistics
Posted: Sun Jan 17, 2010 7:28 pm
I think that the target is not correct.

zamar wrote:
bob wrote:
That said, one can compute how bad things might get, worst case or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. It simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overall results.

I think that most of us in here agree with what you are saying, but I want to express things from a different point of view:
* As an inexperienced programmer I _often_ write patches which are extremely bad. In Stockfish development we have a rule that each patch which gets accepted must be tested in a 1000-game match versus the original.
* Now if after 150 games the result is Mod - Orig: 50 - 100, there is no point in continuing; the patch is complete garbage. But I'd like to have a scientifically valid condition for when to stop.
* Another example: if after 1000 games the result is +30 elo (which is rare, but sometimes does happen!), there is no need for further games. But if it is only, say, +6 elo, there might be need for another 1000-game match. But I'd really like to know what happens to the error bars in this case. Because it is now a question of conditional probability, I cannot use the usual formula for the error bars.
In short this comes down to the question: We want to commit patches which are improvements and discard the rest. We want the success rate to be X% (for example 90%). What is the optimal testing strategy when you want to minimize the number of games?
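For reference, the usual fixed-length error bar mentioned above can be sketched roughly as follows. This is only an illustration: the function names are made up here, the 50 - 100 score is treated as 150 decisive games (the post does not say how many draws there were), and the interval is only meaningful when the number of games was fixed before the match started, which is exactly the limitation under discussion.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score fraction must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(points, games, z=1.96):
    """Return (low, mid, high) Elo estimates for a match of fixed length.

    Assumption: every game is a win or a loss (binomial variance).
    Draws can only lower the variance, so this interval is on the wide side.
    """
    s = points / games
    se = math.sqrt(s * (1.0 - s) / games)
    eps = 1e-9  # keep the bounds inside (0, 1) so the logarithm is defined
    low = min(max(s - z * se, eps), 1.0 - eps)
    high = min(max(s + z * se, eps), 1.0 - eps)
    return elo_from_score(low), elo_from_score(s), elo_from_score(high)

# The "Mod - Orig: 50 - 100" example: 50 points out of 150 games.
low, mid, high = elo_interval(50, 150)
print(round(low, 1), round(mid, 1), round(high, 1))  # mid is about -120 elo
```

If the whole interval lies below zero, as it does for this example, the patch looks bad even after allowing for noise, but the moment you decide to stop because of what you see, this simple interval no longer has its nominal coverage.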
The target is to improve the engine, and I see no reason to insist on a constant X% success rate.
If after many games you do not get a significant result, you may still want to accept the version that scored better, even with less than X% success rate: even if 6 of these changes gain 1 elo each and 4 of them lose 1 elo each, you still have a 2 elo improvement overall.
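The arithmetic behind that example can be written out in a few lines. This is a sketch under the assumption that small elo changes simply add up, which is only approximately true; the variable names are illustrative.

```python
# Ten accepted changes: six are worth +1 elo each, four cost 1 elo each.
changes = [+1] * 6 + [-1] * 4

# Fraction of accepted changes that were real improvements.
success_rate = sum(1 for c in changes if c > 0) / len(changes)

# Assumption: small elo effects are additive.
net_gain = sum(changes)

print(f"success rate: {success_rate:.0%}, net gain: {net_gain:+d} elo")
```

So a 60% success rate still moves the engine forward, as long as the bad changes that slip through are no bigger than the good ones.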
I think that a good testing strategy may be something like this (the numbers 10,000 and 200 can be changed to different values):
1) For every change, play at most 10,000 games.
2) If the new program is leading by 200 points (relative to the old program), accept the change (this means that if you see a result like 2600:2400 you accept the change).
3) If the old program is leading by 200 points, reject the change.
4) In other cases choose the winner (in these cases you do not know if the change is good or bad, but in most cases the program that scored better is better, and you do not care if the probability that the change is positive is only 60%).
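The four steps above could be sketched like this. Two things are my reading rather than something stated in the steps: that the 200-point lead is checked while the match is still running (the 2600:2400 example suggests this), and that an exact tie after the last game keeps the old version. The function and parameter names are hypothetical.

```python
def decide(new_points, old_points, games_played, max_games=10_000, margin=200):
    """Return "accept", "reject" or "continue" for a new-vs-old match.

    new_points / old_points are match points (win = 1, draw = 0.5).
    """
    lead = new_points - old_points
    if lead >= margin:
        return "accept"        # step 2: new version clearly ahead
    if -lead >= margin:
        return "reject"        # step 3: old version clearly ahead
    if games_played >= max_games:
        # step 4: no clear result, so take whichever version scored more.
        # Assumption: an exact tie keeps the old version.
        return "accept" if lead > 0 else "reject"
    return "continue"          # step 1: keep playing up to max_games

print(decide(2600, 2400, 5000))    # accept
print(decide(2400, 2600, 5000))    # reject
print(decide(5050, 4950, 10000))   # accept
print(decide(100, 90, 190))        # continue
```

Both `max_games` and `margin` are left as parameters because the numbers 10,000 and 200 are only meant as a starting point.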
Note that this strategy also works for people who do not like testing an engine against its previous version: they can simply compare the result of version X against some opponents after N games with the result of version X+1 against the same opponents after N games.
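A rough sketch of that variant follows. The dictionary layout, the opponent names and the scores are invented for illustration, and reusing a 200-point margin for the difference of two separate runs is an assumption that would need tuning.

```python
def compare_versions(old_results, new_results, margin=200):
    """Compare two versions that played the same opponents, N games each.

    Each argument maps an opponent name to the points scored against it.
    Returns (decision, lead), where lead = new total - old total.
    """
    if old_results.keys() != new_results.keys():
        raise ValueError("both versions must face the same opponents")
    lead = sum(new_results.values()) - sum(old_results.values())
    if lead >= margin:
        return "accept", lead    # new version clearly ahead
    if -lead >= margin:
        return "reject", lead    # old version clearly ahead
    # No clear result: choose whichever version scored more points.
    # Assumption: an exact tie keeps the old version.
    return ("accept" if lead > 0 else "reject"), lead

# Hypothetical opponents and scores, for illustration only.
version_x = {"EngineA": 1210.0, "EngineB": 1185.5, "EngineC": 1302.0}
version_x_plus_1 = {"EngineA": 1240.5, "EngineB": 1190.0, "EngineC": 1310.0}
print(compare_versions(version_x, version_x_plus_1))
```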