SPRT and Engine testing

Adam Hair · Post by **Adam Hair** » Mon Dec 13, 2010 6:02 am

SPRT, " Sequential Probability Ratio Test", has been mentioned a couple of times before,

http://talkchess.com/forum/viewtopic.php?t=36979
http://talkchess.com/forum/viewtopic.php?t=31699

but no one has discussed it in detail. I will attempt to do so here, in
the hope that it may be helpful to any author how wants to reliably test
new versions of their engines but have limited CPU resources.

A brief, and not very clear to the layman, description is given here:
http://en.wikipedia.org/wiki/SPRT

The basic idea is to evaluate the data as it is accumulated and determine,
based on predetermined decision rules, whether to accept the null
hypothesis, reject the null hypothesis in favor of the alternative, or to
keep testing. The goal is to more efficiently determine, compared to
a fixed size test, which hypothesis is correct.

In context with the interest everybody here, the hypotheses are:

H_o : The current version and the new version are similar in strength
H_a : The new version is X elo stronger than the current version

A test statistic must be formulated in order to test the two hypotheses.
Let S_n be that statistic. The decision rules are:

If S_n > log A, then stop testing and accept H_a
If S_n < log B, then stop testing and accept H_o
If log B < S_n < log A, then keep testing

Now to determine what S_n, A, and B should be:

First, let's forget about draws for now in order to make the math simpler
and more understandable.

Let X_1, X_2, ..., X_i be random variables such that X_i = 1 with
probability p and = 0 with probability (1-p). This could thought of
as representing a game where if you win a round, you get a point, and
if you lose the round, you get no points. Now, lets test the hypothesis
that p = p_o versus p = p_1.
Let Z_i = log (P( X=x|p_1) / P(X=x|p_o))
Then Z_i = log (p_1/p_o) when x = 1
and Z_i = log ( (1- p_1)/(1 - p_o)) when x=0

Let S_n = Sum ( Z_i ) where 1 <= i <= n

Then S_n = w*log (p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) where w
is the number of wins and n is the number of rounds.

A and B are chosen so that Type I errors ( false positive ) and
Type II errors ( false negative ) are held to a certain level.
If a = P( Type I error ) and b = P( Type II error ), then it can be shown
that A ~= log ((1-b)/a) and B ~= log ( a/(1-b))

So, if log(a/(1-b))< w*log(p_1/p_o) + (n-w)log((1-p_1)/(1-p_o))<log((1-b)/a),
keep testing. If not, the test is stopped and a determination is made.

With a little manipulation, it can be seen that:

If w > (log((1-b)/a) + n*log((1-p_o)/(1-p_1)))/ log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_a.

If w<(-log((1-b)/a) + n*log((1-p_o)/(1-p_1))/log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_o.

If w falls between those two bounds, then keep testing.

If we were testing p=0.5 versus p=0.5133 and we set
a=P(Type I error)=0.05 and b=P(Type II error)=0.05,
we would get the following bounds:

Code: Select all

Rounds  Lower Bound    Upper Bound
150	20.6641	  131.3312
200	45.9966	  156.6637
250	71.3292	  181.9962
300	96.6617	  207.3288
350	121.9942	  232.6613
400	147.3268	  257.9939
450	172.6593	  283.3264
500	197.9919	  308.6589
550	223.3244	  333.9915
600	248.6569	  359.3240
650	273.9895	  384.6566
700	299.3220	  409.9891
750	324.6545	  435.3216
800	349.9871	  460.6542
850	375.3196	  485.9867
900	400.6522	  511.3192
950	425.9847	  536.6518
1000	451.3172	  561.9843

Of course, the bounds change with each round. If the number of wins
stayed within the bounds through 1000 rounds, then nothing has been
determined and testing should continue. However, if w escapes the
bounds at some point, testing can be stopped. If w is greater than the upper bound
at that point, then we can accept that p = 0.5133 with a 5% probability
of a Type I error ( false positive ). If w is less than the lower bound,
then we can accept that p = 0.5 with a 5% probability of a Type II error
( false negative ). It can be shown, and I will show it later, that in this
case the expected stopping time (end of testing) with these parameters
is ~ 7490 rounds. If we conducted a fixed sized test with a=b=0.05,
the size of the test would be ~ 15290 rounds.

I am out of time tonight. I plan to continue in the discussion in the next
couple of days, provided that this interests anyone. I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.

Milos · Post by **Milos** » Mon Dec 13, 2010 8:24 am

Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.

Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile the answer to your question, coz calculating false negative is wrong in this way (I don't understand how you came up with 20% number, but that's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2xsigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.

Houdini · Post by **Houdini** » Mon Dec 13, 2010 10:32 am

Adam Hair wrote:Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.

Your numbers are slightly pessimistic.
30,000 game gauntlets typically produce 95% confidence interval of about +- 4 Elo, according to BayesElo.

Adam Hair · Post by **Adam Hair** » Mon Dec 13, 2010 1:57 pm

Milos wrote:
Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.
Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile the answer to your question, coz calculating false negative is wrong in this way (I don't understand how you came up with 20% number, but that's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2xsigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.

Actually, the calculation of the true probablility of false negatives is
much more complicated than what I had in my head last night.
The distribution wins, losses, and draws is approximately a
bivariate normal. So I won't disagree with you on the actual percentage
of false negatives.

However, if we look at the distribution of scores ( Wins + 0.5*Draws ),
then ~20% false negative is not so wrong. That is what I had in mind
last night.

Adam Hair · Post by **Adam Hair** » Tue Dec 14, 2010 4:34 am

Milos wrote:
Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.
Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile the answer to your question, coz calculating false negative is wrong in this way (I don't understand how you came up with 20% number, but that's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2xsigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.

I now realize the answer to my question about false negatives ( Type II
errors). The fact is that P(Type II error) can not be computed for
composite alternative hypotheses, such as H_a: p > p_o. The alternative
value of p has to be specified, as well as specifying the P(Type I error).

Let p_hat be the measured estimate of p
Let a = P(Type I error)

Then a = P(Type I error) = P((p_hat - p_o)/sigma) > ((k-p_o)/sigma_o)) = P(Z>z_a) when p = p_o
That means that (k-p_o)/sigma = z_a
and then k= z_a*sigma - p_o ( z_a standard deviations from p_o)

Let b = P(Type II error) = P((p_hat-p_a)/sigma_a)<=((k-p_a)/sigma_a)) = P(Z<=-z_b) when p = p_a
So (k-p_a)/sigma_a = - z_b
(z_a*sigma_o - p_o -p_a)/sigma_a= -z_b which then can be used to compute b.

To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)

Milos, can you explain to me why you say (1-LOS) gives the probability
of a false negative?

Adam Hair · Post by **Adam Hair** » Tue Dec 14, 2010 4:41 am

Houdini wrote:
Adam Hair wrote:Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvment of 5 to 6 elo.
Your numbers are slightly pessimistic.
30,000 game gauntlets typically produce 95% confidence interval of about +- 4 Elo, according to BayesElo.

The only connection that the error bars have with the probability of a
false negative is that it gives the value of k for computing P(false negative).

Adam Hair · Post by **Adam Hair** » Tue Dec 14, 2010 5:24 am

Adam Hair wrote: To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)

Actually, this makes no sense.

If sigma_o = sqrt(n*p_o*(1-p_o))
sigma_a = sqrt(n*p_a*(1-p_a))
and instead of p_o and p_a we had n*p_o and n*p_a, then

n=((z_a*sqrt(p_o*(1-p_o)) + z_b*sqrt(p_a*(1-p_a)))/(p_a - p_o))^2,
which gives the size of the test needed to ensure that
P(Type I error) = a and P(Type II error) = b.

Milos · Post by **Milos** » Tue Dec 14, 2010 6:11 am

Adam Hair wrote:I now realize the answer to my question about false negatives ( Type II errors). The fact is that P(Type II error) can not be computed for composite alternative hypotheses, such as H_a: p > p_o. The alternative value of p has to be specified, as well as specifying the P(Type I error).

Let p_hat be the measured estimate of p
Let a = P(Type I error)

Then a = P(Type I error) = P((p_hat - p_o)/sigma) > ((k-p_o)/sigma_o)) = P(Z>z_a) when p = p_o
That means that (k-p_o)/sigma = z_a
and then k= z_a*sigma - p_o ( z_a standard deviations from p_o)

Let b = P(Type II error) = P((p_hat-p_a)/sigma_a)<=((k-p_a)/sigma_a)) = P(Z<=-z_b) when p = p_a
So (k-p_a)/sigma_a = - z_b
(z_a*sigma_o - p_o -p_a)/sigma_a= -z_b which then can be used to compute b.

To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)

Milos, can you explain to me why you say (1-LOS) gives the probability
of a false negative?

I was wrong coz I had a different set of hypothesis in mind (a more standard I would say):
H0: Engine A is X Elo stronger than engine B
H1: Engine A is not stronger than engine B

micron · Post by **micron** » Tue Dec 14, 2010 8:58 am

This reminds me of sequential paired trials in epidemiology and pharmacology. At one time these were analysed with the aid of specially prepared graph paper resembling this (but with a finer mesh):

Code: Select all

xxxxxxxxxxxxxxxxxxxxxx.......
xxxx A > B xxxxxx..........xx
xxxxxxxxxxxxx............xxxx
xxxxxxxx...............xxxxxx
xxxx.................xxxxxxxx
x..................xxxxxxxxxx
.................xxxxxxxxxxxx
..............xxxxxxxxxxxxxxx
0..........xxxxxxx A = B xxxx
..............xxxxxxxxxxxxxxx
.................xxxxxxxxxxxx
x..................xxxxxxxxxx
xxxx................xxxxxxxxx
xxxxxxxx...............xxxxxx
xxxxxxxxxxxxx............xxxx
xxxx B > A xxxxxx..........xx
xxxxxxxxxxxxxxxxxxxxxx.......

In engine-testing terms, the instructions would be
Starting from point 0, each time engine A wins, move to the right and up one square. Each time engine B wins, move to the right and down one square. Ignore draws.

The match ends if and when one of the three xxx regions is hit, the conclusion being the appropriate one of
[1] engine A is better than engine B, at such-and-such confidence level.
[2] engine B is better than engine A, at such-and-such confidence level.
[3] the match failed to detect a significant difference. A score difference of x or greater (if it existed) would be detected in so-and-so proportion of matches.

In those days, the graph was designed by a biostatistician, who would quiz you closely about type I and II errors, maximum sample size and so on. It should be possible nowadays to dispense with both the statistician and the graph paper.

Robert P.

Adam Hair · Post by **Adam Hair** » Wed Dec 15, 2010 4:19 am

micron wrote:This reminds me of sequential paired trials in epidemiology and pharmacology. At one time these were analysed with the aid of specially prepared graph paper resembling this (but with a finer mesh):
Code: Select all
xxxxxxxxxxxxxxxxxxxxxx.......
xxxx A > B xxxxxx..........xx
xxxxxxxxxxxxx............xxxx
xxxxxxxx...............xxxxxx
xxxx.................xxxxxxxx
x..................xxxxxxxxxx
.................xxxxxxxxxxxx
..............xxxxxxxxxxxxxxx
0..........xxxxxxx A = B xxxx
..............xxxxxxxxxxxxxxx
.................xxxxxxxxxxxx
x..................xxxxxxxxxx
xxxx................xxxxxxxxx
xxxxxxxx...............xxxxxx
xxxxxxxxxxxxx............xxxx
xxxx B > A xxxxxx..........xx
xxxxxxxxxxxxxxxxxxxxxx.......
In engine-testing terms, the instructions would be
Starting from point 0, each time engine A wins, move to the right and up one square. Each time engine B wins, move to the right and down one square. Ignore draws.

The match ends if and when one of the three xxx regions is hit, the conclusion being the appropriate one of
[1] engine A is better than engine B, at such-and-such confidence level.
[2] engine B is better than engine A, at such-and-such confidence level.
[3] the match failed to detect a significant difference. A score difference of x or greater (if it existed) would be detected in so-and-so proportion of matches.

In those days, the graph was designed by a biostatistician, who would quiz you closely about type I and II errors, maximum sample size and so on. It should be possible nowadays to dispense with both the statistician and the graph paper.

Robert P.

It is very much related. The example I gave and what you are talking
about are related to random walks.

Back to my example:

S_n = w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o))
Of course, n-w = # of losses = l
And log((1-p_1)/(1-p_o)) = - log((1-p_o)/(1-p_1)) < 0 when p_1 > p_o
So,
S_n = w*log(p_1/p_o) - l*log((1-p_o)/(1-p_1))
With each win, S_n moves +log(p_1/p_o) units.
With each loss, it moves -log((1-p_o)/1-p_1)) units.

The upper and lower bounds are chosen in such a way so that
the type I and type II errors held to a predetermined level.

In fact, P(S_n>= log((1-b)/a)| H_o)= a
P(S_n<=log(a/(1-b))=-log((1-b)/a) | H_o) = 1-a
P(S_n>=log((1-b)/a) | H_a) = 1-b
P(S_n<=-log((1-b)/a) | H_a) = b

SPRT and Engine testing

SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing

Re: SPRT and Engine testing