but no one has discussed it in detail. I will attempt to do so here, in
the hope that it may be helpful to any author who wants to reliably test
new versions of their engine but has limited CPU resources.
A brief description, though not a very clear one for the layman, is given here:
The basic idea is to evaluate the data as it is accumulated and determine,
based on predetermined decision rules, whether to accept the null
hypothesis, reject the null hypothesis in favor of the alternative, or to
keep testing. The goal is to determine which hypothesis is correct
more efficiently than a fixed-size test can.
In the context of what interests everybody here, the hypotheses are:
H_o : The current version and the new version are similar in strength
H_a : The new version is X elo stronger than the current version
A test statistic must be formulated in order to test the two hypotheses.
Let S_n be that statistic. The decision rules are:
If S_n > log A, then stop testing and accept H_a
If S_n < log B, then stop testing and accept H_o
If log B < S_n < log A, then keep testing
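As a sketch in code, the three decision rules look like this (the function name is my own; S_n, A, and B are defined below):

```python
def sprt_decision(s_n, log_a, log_b):
    # Wald's sequential decision rule: compare the running statistic
    # S_n against the two (log) stopping thresholds.
    if s_n > log_a:
        return "accept H_a"   # strong evidence for the new version
    if s_n < log_b:
        return "accept H_o"   # strong evidence the versions are similar
    return "keep testing"     # not enough evidence either way yet
```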
Now to determine what S_n, A, and B should be:
First, let's forget about draws for now in order to make the math simpler
and more understandable.
Let X_1, X_2, ..., X_n be independent random variables such that X_i = 1
with probability p and X_i = 0 with probability (1-p). These can be
thought of as representing games where you get a point if you win a
round and no points if you lose it. Now, let's test the hypothesis
that p = p_o versus p = p_1.
Let Z_i = log( P(X_i = x | p_1) / P(X_i = x | p_o) )
Then Z_i = log( p_1/p_o ) when x = 1
and Z_i = log( (1-p_1)/(1-p_o) ) when x = 0
Let S_n = Sum ( Z_i ) where 1 <= i <= n
Then S_n = w*log (p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) where w
is the number of wins and n is the number of rounds.
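That sum is simple to compute; here is a small sketch (the function name `llr` is mine):

```python
import math

def llr(w, n, p0, p1):
    # S_n after w wins in n rounds:
    #   S_n = w*log(p1/p0) + (n - w)*log((1-p1)/(1-p0))
    return w * math.log(p1 / p0) + (n - w) * math.log((1 - p1) / (1 - p0))
```

For example, with p_o = 0.5 and p_1 = 0.5133, a score of 80 wins in 150 rounds yields a small positive S_n: weak evidence for H_a, but not enough to stop.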
A and B are chosen so that Type I errors ( false positive ) and
Type II errors ( false negative ) are held to a certain level.
If a = P( Type I error ) and b = P( Type II error ), then it can be shown
that A ~= (1-b)/a and B ~= b/(1-a), so log A ~= log((1-b)/a) and
log B ~= log(b/(1-a)).
So, if log(b/(1-a)) < w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) < log((1-b)/a),
keep testing. If not, the test is stopped and a determination is made.
With a little manipulation, it can be seen that:
If w > (log((1-b)/a) + n*log((1-p_o)/(1-p_1))) / log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_a.
If w < (log(b/(1-a)) + n*log((1-p_o)/(1-p_1))) / log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_o.
If w falls between those two bounds, then keep testing.
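Those two bounds are straightforward to compute for any round count (a sketch; `sprt_bounds` is my own name, and it uses Wald's approximations A ~= (1-b)/a and B ~= b/(1-a)):

```python
import math

def sprt_bounds(n, p0, p1, a, b):
    # Win-count stopping bounds at round n, rearranged from the
    # inequalities above. Both bounds grow linearly in n.
    denom = math.log((p1 * (1 - p0)) / (p0 * (1 - p1)))
    drift = n * math.log((1 - p0) / (1 - p1))
    lower = (math.log(b / (1 - a)) + drift) / denom
    upper = (math.log((1 - b) / a) + drift) / denom
    return lower, upper
```

For instance, sprt_bounds(150, 0.5, 0.5133, 0.05, 0.05) gives roughly (20.66, 131.33), matching the first row of the table below.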
If we were testing p=0.5 versus p=0.5133 and we set
a=P(Type I error)=0.05 and b=P(Type II error)=0.05,
we would get the following bounds:
Code:
Rounds   Lower Bound   Upper Bound
150         20.6641      131.3312
200         45.9966      156.6637
250         71.3292      181.9962
300         96.6617      207.3288
350        121.9942      232.6613
400        147.3268      257.9939
450        172.6593      283.3264
500        197.9919      308.6589
550        223.3244      333.9915
600        248.6569      359.3240
650        273.9895      384.6566
700        299.3220      409.9891
750        324.6545      435.3216
800        349.9871      460.6542
850        375.3196      485.9867
900        400.6522      511.3192
950        425.9847      536.6518
1000       451.3172      561.9843
If w stayed within the bounds through 1000 rounds, then nothing has been
determined and testing should continue. However, if w escapes the
bounds at some point, testing can be stopped. If w is greater than the upper bound
at that point, then we can accept that p = 0.5133 with a 5% probability
of a Type I error ( false positive ). If w is less than the lower bound,
then we can accept that p = 0.5 with a 5% probability of a Type II error
( false negative ). It can be shown, and I will show it later, that in this
case the expected stopping time (end of testing) with these parameters
is ~7490 rounds. If we conducted a fixed-size test with a = b = 0.05,
the size of the test would be ~15290 rounds.
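Both figures can be roughly reproduced from Wald's approximate expected-sample-size formula and the usual fixed-sample-size formula (a sketch under the same no-draw simplification; the normal quantile 1.6449 is hard-coded rather than computed):

```python
import math

p0, p1, a, b = 0.5, 0.5133, 0.05, 0.05
log_A = math.log((1 - b) / a)      # upper stopping threshold
log_B = math.log(b / (1 - a))      # lower stopping threshold

# Expected value of one increment Z_i when H_a is true:
ez1 = p1 * math.log(p1 / p0) + (1 - p1) * math.log((1 - p1) / (1 - p0))

# Wald's approximation: E[N | H_a] ~ ((1-b)*log A + b*log B) / E[Z | H_a]
asn_h1 = ((1 - b) * log_A + b * log_B) / ez1   # roughly 7490 rounds

# Fixed-sample-size comparison: n ~ ((z_a*sd0 + z_b*sd1)/(p1-p0))^2,
# with z = 1.6449 (the 95th percentile of the standard normal).
z = 1.6449
n_fixed = ((z * math.sqrt(p0 * (1 - p0)) + z * math.sqrt(p1 * (1 - p1)))
           / (p1 - p0)) ** 2                   # roughly 15290 rounds
```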
I am out of time tonight. I plan to continue the discussion in the next
couple of days, provided this interests anyone. I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the probability of a false negative is 20% or more when
there is an actual improvement of 5 to 6 elo.
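That last claim can be sanity-checked under the same no-draw simplification, using a normal approximation (the helper names are my own; real gauntlets include draws, which reduce the variance and change these numbers somewhat):

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def false_negative_rate(elo, n, z_crit=1.96):
    # Approximate Type II error of a fixed n-game match when the true
    # edge is `elo` points and we declare an improvement only if the
    # win count exceeds the one-sided critical value for a = 0.025
    # (i.e. the usual 95% confidence interval). Draws are ignored.
    p = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))         # win prob from elo
    w_crit = n * 0.5 + z_crit * math.sqrt(n * 0.25)  # cutoff under H_o
    mean, sd = n * p, math.sqrt(n * p * (1 - p))     # under the true p
    return norm_cdf((w_crit - mean) / sd)            # P(w below cutoff)
```

Under these assumptions, a true 5 elo edge over 30,000 games is missed roughly 30% of the time.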