http://talkchess.com/forum/viewtopic.php?t=36979

http://talkchess.com/forum/viewtopic.php?t=31699

but no one has discussed it in detail. I will attempt to do so here, in

the hope that it may be helpful to any author who wants to reliably test

new versions of their engine but has limited CPU resources.

A brief description, not very clear to the layman, is given here:

http://en.wikipedia.org/wiki/SPRT

The basic idea is to evaluate the data as it is accumulated and determine,

based on predetermined decision rules, whether to accept the null

hypothesis, reject the null hypothesis in favor of the alternative, or to

keep testing. The goal is to more efficiently determine, compared to

a fixed size test, which hypothesis is correct.

In the context of interest to everybody here, the hypotheses are:

H_o : The current version and the new version are similar in strength

H_a : The new version is X elo stronger than the current version

A test statistic must be formulated in order to test the two hypotheses.

Let S_n be that statistic. The decision rules are:

If S_n > log A, then stop testing and accept H_a

If S_n < log B, then stop testing and accept H_o

If log B < S_n < log A, then keep testing

Now to determine what S_n, A, and B should be:

First, let's forget about draws for now in order to make the math simpler

and more understandable.

Let X_1, X_2, ..., X_i be random variables such that X_i = 1 with

probability p and = 0 with probability (1-p). This could be thought of

as representing a game where if you win a round, you get a point, and

if you lose the round, you get no points. Now, let's test the hypothesis

that p = p_o versus p = p_1.

Let Z_i = log (P( X=x|p_1) / P(X=x|p_o))

Then Z_i = log (p_1/p_o) when x = 1

and Z_i = log ( (1- p_1)/(1 - p_o)) when x=0

Let S_n = Sum ( Z_i ) where 1 <= i <= n

Then S_n = w*log (p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) where w

is the number of wins and n is the number of rounds.
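For concreteness, S_n is trivial to compute from the win count; here is a small Python sketch (the function name is my own):

```python
import math

def log_likelihood_ratio(w, n, p0, p1):
    """S_n after n rounds with w wins:
    S_n = w*log(p1/p0) + (n-w)*log((1-p1)/(1-p0))."""
    return w * math.log(p1 / p0) + (n - w) * math.log((1 - p1) / (1 - p0))
```

Note that each win pushes S_n up by log(p_1/p_o) and each loss pushes it down by log((1-p_o)/(1-p_1)), so S_n drifts up when the true p is near p_1 and down when it is near p_o.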

A and B are chosen so that Type I errors ( false positive ) and

Type II errors ( false negative ) are held to a certain level.

If a = P( Type I error ) and b = P( Type II error ), then it can be shown

that A ~= (1-b)/a and B ~= b/(1-a).

So, if log(b/(1-a)) < w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) < log((1-b)/a),

keep testing. If not, the test is stopped and a determination is made.

With a little manipulation, it can be seen that:

If w > (log((1-b)/a) + n*log((1-p_o)/(1-p_1)))/ log((p_1*(1-p_o))/(p_o*(1-p_1))),

then stop testing and accept H_a.

If w < (log(b/(1-a)) + n*log((1-p_o)/(1-p_1)))/log((p_1*(1-p_o))/(p_o*(1-p_1))),

then stop testing and accept H_o.

If w falls between those two bounds, then keep testing.

If we were testing p=0.5 versus p=0.5133 and we set

a=P(Type I error)=0.05 and b=P(Type II error)=0.05,

we would get the following bounds:

```
Rounds   Lower Bound   Upper Bound
   150       20.6641      131.3312
   200       45.9966      156.6637
   250       71.3292      181.9962
   300       96.6617      207.3288
   350      121.9942      232.6613
   400      147.3268      257.9939
   450      172.6593      283.3264
   500      197.9919      308.6589
   550      223.3244      333.9915
   600      248.6569      359.3240
   650      273.9895      384.6566
   700      299.3220      409.9891
   750      324.6545      435.3216
   800      349.9871      460.6542
   850      375.3196      485.9867
   900      400.6522      511.3192
   950      425.9847      536.6518
  1000      451.3172      561.9843
```
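The table above can be reproduced directly from those bounds; here is a Python sketch (names are mine):

```python
import math

def win_bounds(n, p0, p1, a, b):
    # The SPRT decision rules rearranged into bounds on the win count w after n rounds.
    denom = math.log((p1 * (1 - p0)) / (p0 * (1 - p1)))
    drift = n * math.log((1 - p0) / (1 - p1))
    upper = (math.log((1 - b) / a) + drift) / denom   # cross it: accept H_a
    lower = (math.log(b / (1 - a)) + drift) / denom   # cross it: accept H_o
    return lower, upper

# p_o = 0.5, p_1 = 0.5133, a = b = 0.05, as in the table above
for n in range(150, 1001, 50):
    lo, hi = win_bounds(n, 0.5, 0.5133, 0.05, 0.05)
    print(f"{n:6d}   {lo:11.4f}   {hi:11.4f}")
```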

If w stayed within the bounds through 1000 rounds, then nothing has been

determined and testing should continue. However, if w escapes the

bounds at some point, testing can be stopped. If w is greater than the upper bound

at that point, then we can accept that p = 0.5133 with a 5% probability

of a Type I error ( false positive ). If w is less than the lower bound,

then we can accept that p = 0.5 with a 5% probability of a Type II error

( false negative ). It can be shown, and I will show it later, that in this

case the expected stopping time (end of testing) with these parameters

is ~ 7490 rounds. If we conducted a fixed sized test with a=b=0.05,

the size of the test would be ~ 15290 rounds.
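Those two figures can be checked numerically. A sketch using Wald's approximation for the expected sample size and the usual normal approximation for the fixed-size test (the variable names are mine):

```python
import math
from statistics import NormalDist

p0, p1, a, b = 0.5, 0.5133, 0.05, 0.05
log_A = math.log((1 - b) / a)    # upper decision threshold on S_n
log_B = math.log(b / (1 - a))    # lower decision threshold on S_n

# Wald's approximation for the expected number of rounds when H_o is true
# (the H_a case comes out nearly the same here since a = b):
# E[N | H_o] ~= ((1-a)*log_B + a*log_A) / E[Z_i | H_o]
z_win  = math.log(p1 / p0)
z_loss = math.log((1 - p1) / (1 - p0))
mean_z = p0 * z_win + (1 - p0) * z_loss
expected_rounds = ((1 - a) * log_B + a * log_A) / mean_z   # ~7490

# Size of a fixed-sample test with the same error rates (normal approximation):
z = NormalDist().inv_cdf
n_fixed = ((z(1 - a) + z(1 - b)) * math.sqrt(p0 * (1 - p0)) / (p1 - p0)) ** 2   # ~15290
```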

I am out of time tonight. I plan to continue the discussion in the next

couple of days, provided that this interests anyone. I do have a question.

Is there some practical reason why no one seems concerned about

false negatives? Everybody uses a 95% confidence interval, which

means ( unless I am mistaken ) a=0.025. Even if you run a 30,000

game gauntlet, the possibility of a false negative is 20% or more when

there is an actual positive improvement of 5 to 6 elo.
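For what it's worth, here is the rough calculation behind that last claim, ignoring draws and assuming a one-sided fixed-size test at a = 0.025 (the function and the elo-to-score conversion are my own sketch, not anyone's actual testing setup):

```python
import math
from statistics import NormalDist

def false_negative_rate(elo_gain, n_games, a=0.025):
    # Type II error of a one-sided fixed-size test of p = 0.5 at level a,
    # ignoring draws (each game scored 1 for a win, 0 for a loss).
    nd = NormalDist()
    p = 1 / (1 + 10 ** (-elo_gain / 400))    # expected score for an elo_gain edge
    sd0 = math.sqrt(0.25 / n_games)          # std. dev. of the mean score under H_o
    crit = 0.5 + nd.inv_cdf(1 - a) * sd0     # rejection threshold on the mean score
    sd1 = math.sqrt(p * (1 - p) / n_games)
    return nd.cdf((crit - p) / sd1)          # P(mean score stays below threshold)
```

Under these simplifying assumptions, 30,000 games give a false-negative rate of roughly 30% at +5 elo, falling to about 15% at +6 elo.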