sprt and margin of error

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: sprt and margin of error

Post by Laskos »

Michel wrote:
I am amazed how very smart people don't understand that principle.
I think that is overstating it.

It is not hard to explain why LOS based stopping is wrong. The LOS computation is based on a uniform (or almost uniform) prior. If it were really true that for example a 10000 elo patch is just as likely as a 0 elo patch then LOS based stopping would be correct.

In practice you do not know the true elo distribution of patches. So your testing methodology must be robust to cope with that.
Could you explain why LOS based stopping is wrong? If it is wrong theoretically because it has an unbounded Type I error, then I will say that practically, in a certain range of a number of games, say 50-50,000, Type I error is easily controlled. I think you are messing up things, and LOS based stopping rule is good for unknown ELO changes in detecting true positives with a desired Type I error (false positive). It's not an optimal stopping rule, maybe that's what you are saying, and has only one objective, to detect the true positive, while SPRT optimizes for 2 objectives.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: sprt and margin of error

Post by Michel »

I meant to say that stopping at a "Likelihood of Superiority" of 95% does not at all imply that a patch is an improvement with probability 95%.

The 95% is computed with respect to a uniform prior which does not correspond to the "real" elo distribution of patches (the latter is unknown).

If you read back then you will see that I gave Larry a reference for the theoretical treatment of stopping rules based on p-value.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: sprt and margin of error

Post by Laskos »

Michel wrote:I meant to say that stopping at a "Likelihood of Superiority" of 95% does not at all imply that a patch is an improvement with probability 95%.

The 95% is computed with respect to a uniform prior which does not correspond to the "real" elo distribution of patches (the latter is unknown).

If you read back then you will see that I gave Larry a reference for the theoretical treatment of stopping rules based on p-value.
Ah, ok. It's amazing how LOS of 95% accumulates Type I errors. To 3,000 games this stop gives with 68% a false positive. But I am managing to dampen this error with high LOS of 99.9 or 99.95%. These values control the Type I error on the target span of number of games, say 50-50,000.
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: SPRT and margin of error.

Post by Ajedrecista »

Hello Larry:
lkaufman wrote:In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws.
Ajedrecista wrote:I am very busy so I can not provide numbers right now. I will try to run some simulations the next weekend but I do not promise anything.
I ran some SPRT simulations:

Code: Select all

alpha = beta = 5%
drawelo = 165; bayeselo = 200 (it gives around 160.9 Elo difference)

SPRT&#40;-1.5, 4.5&#41; --> <Games>/simulation ~ 274 &#40;100% of passes after 80000 simulations&#41;.
Shortest simulation&#58; 171 games (+111 -5 =55&#41;.
Longest simulation&#58; 432 games (+193 -72 =167&#41;.

SPRT&#40;-1, 3&#41; --> <Games>/simulation ~ 408 &#40;100% of passes after 80000 simulations&#41;.
Shortest simulation&#58; 276 games (+177 -12 =87&#41;.
Longest simulation&#58; 574 games (+266 -86 =222&#41;.

SPRT&#40;0, 6&#41; --> <Games>/simulation ~ 276 &#40;100% of passes after 80000 simulations&#41;.
Shortest simulation&#58; 176 games (+118 -7 =51&#41;.
Longest simulation&#58; 414 games (+189 -66 =159&#41;.

SPRT&#40;0, 4&#41; --> <Games>/simulation ~ 410 &#40;100% of passes after 80000 simulations&#41;.
Shortest simulation&#58; 269 games (+173 -10 =86&#41;.
longest simulation&#58; 643 games (+288 -106 =249&#41;.
My old suggestion of SPRT(-15, 45) would bring an average number of games of 33 or 34 (too few games). Anyway, someone here said that SPRT is more intended for small Elo changes and I must agree.

If I simulate SPRT for smaller Elo changes:

Code: Select all

alpha = beta = 5%
drawelo = 240; bayeselo = 7.79 &#40;it gives around 5 Elo difference&#41;

SPRT&#40;-1.5, 4.5&#41; --> <Games>/simulation ~ 9209 &#40;99.92% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1329 games (+331 -220 =778&#41;.
Longest simulation&#58; 50337 games (+10300 -10052 =29985&#41;.

SPRT &#40;0, 6&#41; --> <Games>/simulation ~ 12011 &#40;99.14% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1554 games (+383 -266 =905&#41;.
Longest simulation&#58; 86150 games (+17624 -17039 =51487&#41;.

-----------------------------------------------------------------------------------------

alpha = beta = 5%
drawelo = 270; bayeselo = 8.68 &#40;it gives around 5 Elo difference&#41;

SPRT&#40;-1.5, 4.5&#41; --> <Games>/simulation ~ 8740 &#40;99.95% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1161 games (+279 -170 =712&#41;.
Longest simulation&#58; 46453 games (+8288 -8068 =30097&#41;.

SPRT&#40;0, 6&#41; --> <Games>/simulation ~ 11023 &#40;99.65% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1669 games (+358 -245 =1066&#41;.
Longest simulation&#58; 56650 games (+9983 -9600 =37067&#41;.

-----------------------------------------------------------------------------------------

alpha = beta = 5%
drawelo = 240; bayeselo = 15.58 &#40;it gives around 10 Elo difference&#41;

SPRT&#40;-1.5, 4.5&#41; --> <Games>/simulation ~ 4130 &#40;100% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 916 games (+239 -130 =547&#41;.
Longest simulation&#58; 16773 games (+3398 -3245 =10130&#41;.

SPRT&#40;0, 6&#41; --> <Games>/simulation ~ 4621 &#40;100% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1197 games (+295 -182 =720&#41;.
Longest simulation&#58; 18676 games (+3803 -3594 =11279&#41;.

-----------------------------------------------------------------------------------------

alpha = beta = 5%
drawelo = 270; bayeselo = 17.36 &#40;it gives around 10 Elo difference&#41;

SPRT&#40;-1.5, 4.5&#41; --> <Games>/simulation ~ 3955 &#40;100% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 1197 games (+269 -162 =766&#41;.
Longest simulation&#58; 13266 games (+2367 -2231 =8668&#41;.

SPRT&#40;0, 6&#41; --> <Games>/simulation ~ 4371 &#40;100% of passes after 10000 simulations&#41;.
Shortest simulation&#58; 999 games (+238 -129 =632&#41;.
Longest simulation&#58; 14152 games (+2532 -2359 =9261&#41;.
I hope that my results are in line with others' results and wish that they are somewhat useful to you.

------------------------

Regarding error bars in SPRT, it is familiar to me that Michel did some graphs for 95% confidence error bars and the x axis had the variable LLR/games, but I have not found this graph right now.

Regards from Spain.

Ajedrecista.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

Code: Select all

My old suggestion of SPRT&#40;-15, 45&#41; would bring an average number of games of 33 or 34 &#40;too few games&#41;. Anyway, someone here said that SPRT is more intended for small Elo changes and I must agree. 
I think what goes on is this. The optimality proof of the SPRT starts with a prior which can take just two values: elo0,elo1.

On the other hand the concept of LOS starts from a uniform prior. This may be the reason that in case elo is very different from elo0,elo1 LOS bases stopping (with suitable tresholds) performs better than the SPRT (as has been reported by Kai, I have not verified this).

[[ The theory behind the 2-SPRT starts with a prior which can take three values: el0,elo1 and the elo point at which you want to optimize the
average sample number. ]]
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

Regarding error bars in SPRT, it is familiar to me that Michel did some graphs for 95% confidence error bars and the x axis had the variable LLR/games, but I have not found this graph right now.
They are here. Note that they are 90% confidence intervals.

http://hardy.uhasselt.be/Toga/cb_fishtest_10_STC.png
http://hardy.uhasselt.be/Toga/cb_fishtest_10_LTC.png
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

Note that if you are _really_ concerned about performance for large elo differences you _should_ use the Schwarz sequential test.

This is a 2-SPRT where you optimize at the elo value estimated from the sample.

The added noise that goes into the estimated elo value makes it less efficient than either the SPRT or pure 2-SPRT at small elo values (although it is still much better than fixed length). However it should be better for large elo values.

Here is the paper that introduces the Schwarz test:

http://projecteuclid.org/euclid.aoms/1177704726
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

Unlike talkchess claims, statistical tests are never plucked out of thin air. They are optimal answers to certain well posed problems.

So what optimization problem does the Schwarz test solve?

The answer is this: among the tests with prescribed error probabilities at elo0,elo1 it is the one with asymptotically minimal expected average sample number assuming a uniform prior.

This statement is still true if we replace "uniform prior" with any prior that is non-zero everywhere.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

The answer is this: among the tests with prescribed error probabilities at elo0,elo1 it is the one with asymptotically minimal expected average sample number assuming a uniform prior.
Sigh... "expected average" is of course redundant. One of those words is enough.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: SPRT and margin of error.

Post by Michel »

I wrote a small simulator for the Schwarz test in python. The simulator is too slow for release but some testing shows that the Schwarz test is actually quite good. For small elo differences its performance seems to be similar to that of the SPRT and for large elo differences it finishes much more quickly.

For example for the 200 (Bayes)elo difference that started this topic it finishes in 47 games on average.