http://talkchess.com/forum/viewtopic.php?t=36979
http://talkchess.com/forum/viewtopic.php?t=31699
but no one has discussed it in detail. I will attempt to do so here, in
the hope that it may be helpful to any author who wants to reliably test
new versions of their engine but has limited CPU resources.
A brief description, though not very clear to the layman, is given here:
http://en.wikipedia.org/wiki/SPRT
The basic idea is to evaluate the data as it is accumulated and determine,
based on predetermined decision rules, whether to accept the null
hypothesis, reject the null hypothesis in favor of the alternative, or to
keep testing. The goal is to determine which hypothesis is correct more
efficiently than a fixed-size test would.
In the context of interest to everybody here, the hypotheses are:
H_o : The current version and the new version are similar in strength
H_a : The new version is X elo stronger than the current version
A test statistic must be formulated in order to test the two hypotheses.
Let S_n be that statistic. The decision rules are:
If S_n > log A, then stop testing and accept H_a
If S_n < log B, then stop testing and accept H_o
If log B < S_n < log A, then keep testing
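To make the rules concrete, here is a minimal sketch in Python (the
function name and return values are my own, not from any particular
testing tool):

Code:
def sprt_step(s_n, log_a, log_b):
    """Apply the SPRT decision rules to the current statistic S_n."""
    if s_n > log_a:
        return "accept H_a"   # new version is stronger; stop testing
    if s_n < log_b:
        return "accept H_o"   # versions are similar; stop testing
    return "continue"         # not enough evidence yet; keep testing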
Now to determine what S_n, A, and B should be:
First, let's forget about draws for now in order to make the math simpler
and more understandable.
Let X_1, X_2, ..., X_n be random variables such that X_i = 1 with
probability p and X_i = 0 with probability (1-p). This can be thought of
as representing a game where if you win a round, you get a point, and
if you lose the round, you get no points. Now, let's test the hypothesis
that p = p_o versus p = p_1.
Let Z_i = log( P(X_i = x | p_1) / P(X_i = x | p_o) )
Then Z_i = log (p_1/p_o) when x = 1
and Z_i = log ( (1- p_1)/(1 - p_o)) when x=0
Let S_n = Sum ( Z_i ) where 1 <= i <= n
Then S_n = w*log (p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) where w
is the number of wins and n is the number of rounds.
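In Python, the statistic is one line (draws ignored, as above; the
function name is mine):

Code:
import math

def s_n(w, n, p0, p1):
    """Log-likelihood ratio after n rounds with w wins (no draws)."""
    return w * math.log(p1 / p0) + (n - w) * math.log((1 - p1) / (1 - p0))

# e.g. 80 wins in 150 rounds, testing p_o = 0.5 vs p_1 = 0.5133:
print(s_n(80, 150, 0.5, 0.5133))   # ~0.21, well inside the thresholds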
A and B are chosen so that Type I errors ( false positive ) and
Type II errors ( false negative ) are held to a certain level.
If a = P( Type I error ) and b = P( Type II error ), then it can be shown
that A ~= (1-b)/a and B ~= b/(1-a), so log A ~= log((1-b)/a) and
log B ~= log(b/(1-a)).
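For example, with a = b = 0.05, log A = log(0.95/0.05) = log 19 ~= 2.944
and log B = log(0.05/0.95) ~= -2.944 (natural logarithms throughout).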
So, if log(b/(1-a)) < w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) < log((1-b)/a),
keep testing. If not, the test is stopped and a determination is made.
With a little manipulation, it can be seen that:
If w > (log((1-b)/a) + n*log((1-p_o)/(1-p_1)))/ log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_a.
If w < (log(b/(1-a)) + n*log((1-p_o)/(1-p_1)))/log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_o.
If w falls between those two bounds, then keep testing.
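Note that both bounds are linear in n with the same slope, which is why
they grow by a constant amount per round in the table below.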
If we were testing p_o = 0.5 versus p_1 = 0.5133 (roughly a 9 elo
improvement under the usual logistic rating model) and we set
a = P(Type I error) = 0.05 and b = P(Type II error) = 0.05,
we would get the following bounds:
Code:
Rounds Lower Bound Upper Bound
150 20.6641 131.3312
200 45.9966 156.6637
250 71.3292 181.9962
300 96.6617 207.3288
350 121.9942 232.6613
400 147.3268 257.9939
450 172.6593 283.3264
500 197.9919 308.6589
550 223.3244 333.9915
600 248.6569 359.3240
650 273.9895 384.6566
700 299.3220 409.9891
750 324.6545 435.3216
800 349.9871 460.6542
850 375.3196 485.9867
900 400.6522 511.3192
950 425.9847 536.6518
1000 451.3172 561.9843
If w stayed within the bounds through 1000 rounds, then nothing has been
determined and testing should continue. However, if w escapes the
bounds at some point, testing can be stopped. If w is greater than the upper bound
at that point, then we can accept that p = 0.5133 with a 5% probability
of a Type I error ( false positive ). If w is less than the lower bound,
then we can accept that p = 0.5 with a 5% probability of a Type II error
( false negative ). It can be shown, and I will show it later, that in this
case the expected stopping time (end of testing) with these parameters
is ~7490 rounds. If we conducted a fixed-size test with a=b=0.05,
the size of the test would be ~15290 rounds.
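For anyone who wants to verify the numbers, here is a short Python
sketch (natural logs, my own variable names) that reproduces the table
above and applies Wald's standard approximation for the expected number
of rounds:

Code:
import math

p0, p1 = 0.5, 0.5133             # H_o and H_a win probabilities
a, b = 0.05, 0.05                # Type I and Type II error rates

log_A = math.log((1 - b) / a)    # upper threshold, ~2.944
log_B = math.log(b / (1 - a))    # lower threshold, ~-2.944

slope = math.log((1 - p0) / (1 - p1))
denom = math.log((p1 * (1 - p0)) / (p0 * (1 - p1)))

# bounds on w as a function of n (matches the table above)
for n in range(150, 1001, 50):
    lower = (log_B + n * slope) / denom
    upper = (log_A + n * slope) / denom
    print(f"{n:6d} {lower:12.4f} {upper:12.4f}")

def mean_z(p):
    """Expected per-round increment of S_n when the true win rate is p."""
    return p * math.log(p1 / p0) + (1 - p) * math.log((1 - p1) / (1 - p0))

# Wald's approximation for the expected stopping time:
print(((1 - b) * log_A + b * log_B) / mean_z(p1))   # ~7487 rounds under H_a
print((a * log_A + (1 - a) * log_B) / mean_z(p0))   # ~7490 rounds under H_o

Both figures agree with the ~7490 quoted above up to rounding.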
I am out of time tonight. I plan to continue the discussion in the next
couple of days, provided that this interests anyone. I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000-game
gauntlet, the probability of a false negative is 20% or more when
there is an actual improvement of 5 to 6 elo.
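To put a rough number on that last claim, here is a normal-approximation
power calculation for a fixed-size test, ignoring draws as in the rest
of this post (the logistic elo-to-probability conversion and the
function name are my own assumptions):

Code:
import math
from statistics import NormalDist

def false_negative_rate(n_games, elo, a=0.025):
    """Approximate Type II error of a fixed n_games test of p = 0.5
    against an engine that is actually `elo` points stronger
    (normal approximation, draws ignored)."""
    nd = NormalDist()
    p = 1 / (1 + 10 ** (-elo / 400))     # logistic elo model
    # smallest win count that rejects H_o at level a:
    crit = 0.5 * n_games + nd.inv_cdf(1 - a) * math.sqrt(n_games * 0.25)
    return nd.cdf((crit - n_games * p) / math.sqrt(n_games * p * (1 - p)))

print(false_negative_rate(30000, 5))   # ~0.30
print(false_negative_rate(30000, 6))   # ~0.15

So somewhere between 15% and 30% of genuine 5-6 elo improvements would
be missed, which is consistent with the 20% figure above.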