SPRT and Engine testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

SPRT and Engine testing

Post by Adam Hair »

SPRT, " Sequential Probability Ratio Test", has been mentioned a couple of times before,

http://talkchess.com/forum/viewtopic.php?t=36979
http://talkchess.com/forum/viewtopic.php?t=31699

but no one has discussed it in detail. I will attempt to do so here, in
the hope that it may be helpful to any author who wants to reliably test
new versions of their engine but has limited CPU resources.

A brief description, not very clear to the layman, is given here:
http://en.wikipedia.org/wiki/SPRT

The basic idea is to evaluate the data as it is accumulated and determine,
based on predetermined decision rules, whether to accept the null
hypothesis, reject the null hypothesis in favor of the alternative, or to
keep testing. The goal is to determine which hypothesis is correct more
efficiently than a fixed-size test.

In the context of interest to everybody here, the hypotheses are:

H_o : The current version and the new version are similar in strength
H_a : The new version is X elo stronger than the current version

A test statistic must be formulated in order to test the two hypotheses.
Let S_n be that statistic. The decision rules are:

If S_n > log A, then stop testing and accept H_a
If S_n < log B, then stop testing and accept H_o
If log B < S_n < log A, then keep testing
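
In code form, the decision rule is just a three-way branch (a minimal Python sketch; s_n, log_a, and log_b are the quantities derived below):

Code: Select all

def sprt_decision(s_n, log_a, log_b):
    """Wald's SPRT decision rule for the running statistic S_n."""
    if s_n > log_a:
        return "accept H_a"       # new version is stronger
    if s_n < log_b:
        return "accept H_o"       # versions are similar in strength
    return "continue testing"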

Now to determine what S_n, A, and B should be:

First, let's forget about draws for now in order to make the math simpler
and more understandable.

Let X_1, X_2, ..., X_n be random variables such that X_i = 1 with
probability p and X_i = 0 with probability (1-p). This could be thought of
as representing a game where if you win a round, you get a point, and
if you lose the round, you get no points. Now, let's test the hypothesis
that p = p_o versus p = p_1.
Let Z_i = log( P(X_i = x | p_1) / P(X_i = x | p_o) )
Then Z_i = log(p_1/p_o) when x = 1
and Z_i = log((1-p_1)/(1-p_o)) when x = 0

Let S_n = Sum ( Z_i ) where 1 <= i <= n

Then S_n = w*log (p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) where w
is the number of wins and n is the number of rounds.
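
In Python, with natural logarithms (a minimal sketch of the formula above):

Code: Select all

from math import log

def llr(wins, rounds, p_o, p_1):
    """S_n: log-likelihood ratio after `rounds` games with `wins` wins."""
    losses = rounds - wins
    return wins * log(p_1 / p_o) + losses * log((1 - p_1) / (1 - p_o))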

A and B are chosen so that Type I errors ( false positives ) and
Type II errors ( false negatives ) are held to a certain level.
If a = P( Type I error ) and b = P( Type II error ), then it can be shown
that A ~= (1-b)/a and B ~= b/(1-a). Note that when a = b, log B = -log A.

So, if log(b/(1-a)) < w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o)) < log((1-b)/a),
keep testing. If not, the test is stopped and a determination is made.

With a little manipulation, it can be seen that:

If w > (log((1-b)/a) + n*log((1-p_o)/(1-p_1)))/ log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_a.

If w < (log(b/(1-a)) + n*log((1-p_o)/(1-p_1)))/log((p_1*(1-p_o))/(p_o*(1-p_1))),
then stop testing and accept H_o.

If w falls between those two bounds, then keep testing.

If we were testing p=0.5 versus p=0.5133 and we set
a=P(Type I error)=0.05 and b=P(Type II error)=0.05,
we would get the following bounds:

Code: Select all

Rounds   Lower Bound   Upper Bound
  150      20.6641      131.3312
  200      45.9966      156.6637
  250      71.3292      181.9962
  300      96.6617      207.3288
  350     121.9942      232.6613
  400     147.3268      257.9939
  450     172.6593      283.3264
  500     197.9919      308.6589
  550     223.3244      333.9915
  600     248.6569      359.3240
  650     273.9895      384.6566
  700     299.3220      409.9891
  750     324.6545      435.3216
  800     349.9871      460.6542
  850     375.3196      485.9867
  900     400.6522      511.3192
  950     425.9847      536.6518
 1000     451.3172      561.9843
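
The table above can be reproduced with a short script (a sketch using natural logarithms; with a = b = 0.05 the two thresholds are symmetric):

Code: Select all

from math import log

def win_bounds(n, p_o=0.5, p_1=0.5133, a=0.05, b=0.05):
    """Bounds on the win count w after n rounds, from the formulas above."""
    log_a = log((1 - b) / a)                       # upper threshold for S_n
    log_b = log(b / (1 - a))                       # lower threshold for S_n
    drift = log((1 - p_o) / (1 - p_1))
    step = log((p_1 * (1 - p_o)) / (p_o * (1 - p_1)))
    lower = (log_b + n * drift) / step
    upper = (log_a + n * drift) / step
    return lower, upper

for n in range(150, 1001, 50):
    lo, hi = win_bounds(n)
    print(f"{n:5d}  {lo:9.4f}  {hi:9.4f}")
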
Of course, the bounds change with each round. If the number of wins
stayed within the bounds through 1000 rounds, then nothing has been
determined and testing should continue. However, if w escapes the
bounds at some point, testing can be stopped. If w is greater than the upper bound
at that point, then we can accept that p = 0.5133 with a 5% probability
of a Type I error ( false positive ). If w is less than the lower bound,
then we can accept that p = 0.5 with a 5% probability of a Type II error
( false negative ). It can be shown, and I will show it later, that in this
case the expected stopping time (end of testing) with these parameters
is ~7490 rounds. If we conducted a fixed-size test with a=b=0.05,
the size of the test would be ~15290 rounds.
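
For anyone who wants to check these figures, Wald's approximation for the expected number of rounds can be computed directly (a sketch; the exact expectation depends on how far S_n overshoots the bounds, so the numbers come out near, not exactly at, the ~7490 quoted above):

Code: Select all

from math import log, sqrt
from statistics import NormalDist

p_o, p_1, a, b = 0.5, 0.5133, 0.05, 0.05
log_a, log_b = log((1 - b) / a), log(b / (1 - a))

# expected increment of S_n per round under each hypothesis
z_win, z_loss = log(p_1 / p_o), log((1 - p_1) / (1 - p_o))
ez_o = p_o * z_win + (1 - p_o) * z_loss
ez_a = p_1 * z_win + (1 - p_1) * z_loss

# Wald: E[N] ~= ( P(accept H_a)*log A + P(accept H_o)*log B ) / E[Z]
en_o = (a * log_a + (1 - a) * log_b) / ez_o       # ~7474 rounds under H_o
en_a = ((1 - b) * log_a + b * log_b) / ez_a       # ~7503 rounds under H_a

# fixed-size test with the same error rates (see the formula later in the thread)
z = NormalDist().inv_cdf(1 - a)                   # one-sided critical value, ~1.645
n_fixed = ((z * sqrt(p_o * (1 - p_o)) + z * sqrt(p_1 * (1 - p_1)))
           / (p_1 - p_o)) ** 2                    # ~15291 rounds
print(round(en_o), round(en_a), round(n_fixed))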

I am out of time tonight. I plan to continue the discussion in the next
couple of days, provided that this interests anyone. I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: SPRT and Engine testing

Post by Milos »

Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile, the answer to your question: calculating the false negative this way is wrong (I don't understand how you came up with the 20% number, but it's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2*sigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: SPRT and Engine testing

Post by Houdini »

Adam Hair wrote:Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Your numbers are slightly pessimistic.
30,000 game gauntlets typically produce a 95% confidence interval of about +/- 4 Elo, according to BayesElo.
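
For what it's worth, the +/- 4 Elo figure is roughly what the normal approximation predicts (a sketch assuming a given draw rate and a score near 50%; BayesElo's actual interval depends on the games played):

Code: Select all

from math import log10, sqrt

def ci_halfwidth_elo(n, draw_rate):
    """Approximate 95% CI half-width in Elo for an n-game match near 50%."""
    var = 0.25 * (1 - draw_rate)      # per-game score variance at mu = 0.5
    s = 0.5 + 1.96 * sqrt(var / n)    # upper end of the 95% CI on the score
    return -400 * log10(1 / s - 1)    # logistic Elo of that score

print(ci_halfwidth_elo(30000, 0.0))   # ~3.9 Elo with no draws
print(ci_halfwidth_elo(30000, 0.3))   # ~3.3 Elo with 30% draws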
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SPRT and Engine testing

Post by Adam Hair »

Milos wrote:
Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile, the answer to your question: calculating the false negative this way is wrong (I don't understand how you came up with the 20% number, but it's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2*sigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.
Actually, the calculation of the true probability of false negatives is
much more complicated than what I had in my head last night.
The joint distribution of wins, losses, and draws is approximately
bivariate normal. So I won't disagree with you on the actual percentage
of false negatives.

However, if we look at the distribution of scores ( Wins + 0.5*Draws ),
then ~20% false negative is not so wrong. That is what I had in mind
last night.
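
As a rough check of that figure (a sketch assuming no draws, a one-sided test at a = 0.025, and a true improvement of 5 Elo):

Code: Select all

from math import sqrt
from statistics import NormalDist

nd = NormalDist()
n, p_o = 30000, 0.5
p_a = 1 / (1 + 10 ** (-5 / 400))          # ~0.5072 score, a 5 Elo improvement
k = p_o + nd.inv_cdf(0.975) * sqrt(p_o * (1 - p_o) / n)   # critical score
beta = nd.cdf((k - p_a) / sqrt(p_a * (1 - p_a) / n))      # P(false negative)
print(beta)                               # ~0.30 with these assumptions

Draws shrink the per-game variance, which lowers the false-negative rate somewhat, but 20% or more is plausible under these assumptions.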
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SPRT and Engine testing

Post by Adam Hair »

Milos wrote:
Adam Hair wrote:I do have a question.
Is there some practical reason why no one seems concerned about
false negatives? Everybody uses a 95% confidence interval, which
means ( unless I am mistaken ) a=0.025. Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Interesting stuff, I'll also come back to it when I have a bit more time. In the meanwhile, the answer to your question: calculating the false negative this way is wrong (I don't understand how you came up with the 20% number, but it's completely wrong). 95% is the probability that the improvement is in the range of X Elo +/- 2*sigma.
If you want false negative, then (1-LOS) is what gives it, and for 30k games it's really, really small.
I now realize the answer to my question about false negatives ( Type II
errors ). The fact is that P(Type II error) cannot be computed for a
composite alternative hypothesis, such as H_a: p > p_o. The alternative
value of p has to be specified, as well as P(Type I error).

Let p_hat be the measured estimate of p, and let k be the critical value of the test
Let a = P(Type I error)

Then a = P(Type I error) = P( (p_hat - p_o)/sigma_o > (k - p_o)/sigma_o ) = P(Z > z_a) when p = p_o
That means that (k - p_o)/sigma_o = z_a
and then k = p_o + z_a*sigma_o ( k is z_a standard deviations above p_o )

Let b = P(Type II error) = P( (p_hat - p_a)/sigma_a <= (k - p_a)/sigma_a ) = P(Z <= -z_b) when p = p_a
So (k - p_a)/sigma_a = -z_b
(p_o + z_a*sigma_o - p_a)/sigma_a = -z_b, which then can be used to compute b.

To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)

Milos, can you explain to me why you say (1-LOS) gives the probability
of a false negative?
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SPRT and Engine testing

Post by Adam Hair »

Houdini wrote:
Adam Hair wrote:Even if you run a 30,000
game gauntlet, the possibility of a false negative is 20% or more when
there is an actual positive improvement of 5 to 6 elo.
Your numbers are slightly pessimistic.
30,000 game gauntlets typically produce a 95% confidence interval of about +/- 4 Elo, according to BayesElo.
The only connection that the error bars have with the probability of a
false negative is that they give the value of k for computing P(false negative).
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SPRT and Engine testing

Post by Adam Hair »

Adam Hair wrote: To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)
Actually, this makes no sense.

If we work with counts instead of proportions, so that
sigma_o = sqrt(n*p_o*(1-p_o)),
sigma_a = sqrt(n*p_a*(1-p_a)),
and the means are n*p_o and n*p_a instead of p_o and p_a, then solving for n gives

n = ((z_a*sqrt(p_o*(1-p_o)) + z_b*sqrt(p_a*(1-p_a)))/(p_a - p_o))^2,

which gives the size of the test needed to ensure that
P(Type I error) = a and P(Type II error) = b.
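
Plugging in the earlier example (p_o = 0.5, p_a = 0.5133, a = b = 0.05) recovers the ~15290 rounds mentioned in the first post (a minimal sketch):

Code: Select all

from math import sqrt
from statistics import NormalDist

def fixed_test_size(p_o, p_a, a, b):
    """Rounds needed so that P(Type I) = a and P(Type II) = b (normal approx.)."""
    z_a = NormalDist().inv_cdf(1 - a)
    z_b = NormalDist().inv_cdf(1 - b)
    return ((z_a * sqrt(p_o * (1 - p_o)) + z_b * sqrt(p_a * (1 - p_a)))
            / (p_a - p_o)) ** 2

print(round(fixed_test_size(0.5, 0.5133, 0.05, 0.05)))   # ~15291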
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: SPRT and Engine testing

Post by Milos »

Adam Hair wrote:I now realize the answer to my question about false negatives ( Type II errors ). The fact is that P(Type II error) cannot be computed for a composite alternative hypothesis, such as H_a: p > p_o. The alternative value of p has to be specified, as well as P(Type I error).

Let p_hat be the measured estimate of p, and let k be the critical value of the test
Let a = P(Type I error)

Then a = P(Type I error) = P( (p_hat - p_o)/sigma_o > (k - p_o)/sigma_o ) = P(Z > z_a) when p = p_o
That means that (k - p_o)/sigma_o = z_a
and then k = p_o + z_a*sigma_o ( k is z_a standard deviations above p_o )

Let b = P(Type II error) = P( (p_hat - p_a)/sigma_a <= (k - p_a)/sigma_a ) = P(Z <= -z_b) when p = p_a
So (k - p_a)/sigma_a = -z_b
(p_o + z_a*sigma_o - p_a)/sigma_a = -z_b, which then can be used to compute b.

To determine n for specified a and b, use the two equations above and
solve for n.

n= (z_a*sigma_o + z_b*sigma_a)/(p_a - p_o)

Milos, can you explain to me why you say (1-LOS) gives the probability
of a false negative?
I was wrong because I had a different set of hypotheses in mind (a more standard one, I would say):
H0: Engine A is X Elo stronger than engine B
H1: Engine A is not stronger than engine B
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: SPRT and Engine testing

Post by micron »

This reminds me of sequential paired trials in epidemiology and pharmacology. At one time these were analysed with the aid of specially prepared graph paper resembling this (but with a finer mesh):

Code: Select all

xxxxxxxxxxxxxxxxxxxxxx.......
xxxx A > B xxxxxx..........xx
xxxxxxxxxxxxx............xxxx
xxxxxxxx...............xxxxxx
xxxx.................xxxxxxxx
x..................xxxxxxxxxx
.................xxxxxxxxxxxx
..............xxxxxxxxxxxxxxx
0..........xxxxxxx A = B xxxx
..............xxxxxxxxxxxxxxx
.................xxxxxxxxxxxx
x..................xxxxxxxxxx
xxxx................xxxxxxxxx
xxxxxxxx...............xxxxxx
xxxxxxxxxxxxx............xxxx
xxxx B > A xxxxxx..........xx
xxxxxxxxxxxxxxxxxxxxxx.......
In engine-testing terms, the instructions would be
Starting from point 0, each time engine A wins, move to the right and up one square. Each time engine B wins, move to the right and down one square. Ignore draws.

The match ends if and when one of the three xxx regions is hit, the conclusion being the appropriate one of
[1] engine A is better than engine B, at such-and-such confidence level.
[2] engine B is better than engine A, at such-and-such confidence level.
[3] the match failed to detect a significant difference. A score difference of x or greater (if it existed) would be detected in so-and-so proportion of matches.

In those days, the graph was designed by a biostatistician, who would quiz you closely about type I and II errors, maximum sample size and so on. It should be possible nowadays to dispense with both the statistician and the graph paper.

Robert P.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: SPRT and Engine testing

Post by Adam Hair »

micron wrote:This reminds me of sequential paired trials in epidemiology and pharmacology. At one time these were analysed with the aid of specially prepared graph paper resembling this (but with a finer mesh):

Code: Select all

xxxxxxxxxxxxxxxxxxxxxx.......
xxxx A > B xxxxxx..........xx
xxxxxxxxxxxxx............xxxx
xxxxxxxx...............xxxxxx
xxxx.................xxxxxxxx
x..................xxxxxxxxxx
.................xxxxxxxxxxxx
..............xxxxxxxxxxxxxxx
0..........xxxxxxx A = B xxxx
..............xxxxxxxxxxxxxxx
.................xxxxxxxxxxxx
x..................xxxxxxxxxx
xxxx................xxxxxxxxx
xxxxxxxx...............xxxxxx
xxxxxxxxxxxxx............xxxx
xxxx B > A xxxxxx..........xx
xxxxxxxxxxxxxxxxxxxxxx.......
In engine-testing terms, the instructions would be
Starting from point 0, each time engine A wins, move to the right and up one square. Each time engine B wins, move to the right and down one square. Ignore draws.

The match ends if and when one of the three xxx regions is hit, the conclusion being the appropriate one of
[1] engine A is better than engine B, at such-and-such confidence level.
[2] engine B is better than engine A, at such-and-such confidence level.
[3] the match failed to detect a significant difference. A score difference of x or greater (if it existed) would be detected in so-and-so proportion of matches.

In those days, the graph was designed by a biostatistician, who would quiz you closely about type I and II errors, maximum sample size and so on. It should be possible nowadays to dispense with both the statistician and the graph paper.

Robert P.
It is very much related. The example I gave and the procedure you
describe are both random walks.

Back to my example:

S_n = w*log(p_1/p_o) + (n-w)*log((1-p_1)/(1-p_o))
Of course, n-w = # of losses = l
And log((1-p_1)/(1-p_o)) = - log((1-p_o)/(1-p_1)) < 0 when p_1 > p_o
So,
S_n = w*log(p_1/p_o) - l*log((1-p_o)/(1-p_1))
With each win, S_n moves +log(p_1/p_o) units.
With each loss, it moves -log((1-p_o)/(1-p_1)) units.

The upper and lower bounds are chosen in such a way that
the Type I and Type II errors are held to a predetermined level.

In fact, P(S_n >= log A | H_o) ~= a
P(S_n <= log B | H_o) ~= 1-a
P(S_n >= log A | H_a) ~= 1-b
P(S_n <= log B | H_a) ~= b

where, as before, log A = log((1-b)/a) and log B = log(b/(1-a)).
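
The random-walk view also lends itself to direct simulation (a sketch; p is the true win probability being simulated, and the accept frequencies should come out near the design error rates):

Code: Select all

import random
from math import log

def sprt_match(p, p_o=0.5, p_1=0.5133, a=0.05, b=0.05, max_rounds=100000):
    """Simulate one SPRT run: S_n steps up on a win, down on a loss."""
    log_a, log_b = log((1 - b) / a), log(b / (1 - a))
    up, down = log(p_1 / p_o), log((1 - p_1) / (1 - p_o))   # down < 0
    s = 0.0
    for n in range(1, max_rounds + 1):
        s += up if random.random() < p else down
        if s >= log_a:
            return "H_a", n
        if s <= log_b:
            return "H_o", n
    return "undecided", max_rounds

# fraction of runs (falsely) accepting H_a when p really is 0.5:
runs = [sprt_match(0.5) for _ in range(1000)]
print(sum(r == "H_a" for r, _ in runs) / len(runs))   # should be near a = 0.05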