sprt and margin of error

lkaufman · Post by **lkaufman** » Tue Oct 15, 2013 9:04 pm

In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?

Ajedrecista · Post by **Ajedrecista** » Tue Oct 15, 2013 9:43 pm

Hello Larry:

lkaufman wrote:In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?

I am not an expert in SPRT but you can take a look at this topic:

SPRT and narrowing of (elo1 - elo0) difference.

It basically says that the average expected duration of the SPRT test is proportional to (elo1 - elo0)^(-2). I think that the problem is that the resolution (elo1 - elo0 = 6 Bayeselo units, not common Elo) is too high because everyone can expect a big difference (certainly more than 6 Bayeselo) in fixed depth 8 versus fixed depth 7. The solution in this case could be raise this resolution for reducing the expected length of the test. What about giving a try to SPRT(-15, 45)? It is only a random suggestion.

If you want to translate Bayeselo into common Elo (writing from memory):

Code: Select all

x = 10^&#40;drawelo/400&#41;.
Elo = &#123;4x/&#91;&#40;1 + x&#41;^2&#93;&#125;*Bayeselo.

Regards from Spain.

Ajedrecista.

lkaufman · Post by **lkaufman** » Tue Oct 15, 2013 9:54 pm

Ajedrecista wrote:Hello Larry:

lkaufman wrote:In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?
I am not an expert in SPRT but you can take a look at this topic:

SPRT and narrowing of (elo1 - elo0) difference.

It basically says that the average expected duration of the SPRT test is proportional to (elo1 - elo0)^(-2). I think that the problem is that the resolution (elo1 - elo0 = 6 Bayeselo units, not common Elo) is too high because everyone can expect a big difference (certainly more than 6 Bayeselo) in fixed depth 8 versus fixed depth 7. The solution in this case could be raise this resolution for reducing the expected length of the test. What about giving a try to SPRT(-15, 45)? It is only a random suggestion.

If you want to translate Bayeselo into common Elo (writing from memory):
Code: Select all
x = 10^&#40;drawelo/400&#41;.
Elo = &#123;4x/&#91;&#40;1 + x&#41;^2&#93;&#125;*Bayeselo.
Regards from Spain.

Ajedrecista.

Thanks. Yes, I understand that a radical change of parameters would solve the problem. But in general, when we don't know if a change is worth one elo or ten, if we use values intended for a small change but we actually have a ten elo change, it will take way too long to confirm. It doesn't seem right to me that you must already know how good a change is to get SPRT to behave properly. Is there any solution that doesn't require knowing the result beforehand?

gladius · Post by **gladius** » Tue Oct 15, 2013 10:00 pm

lkaufman wrote:In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?

The SPRT parameters used in fishtest are optimized for measuring small improvements. If you are expecting to test changes worth 60-70 ELO, you would probably pick different ones

.

I'm not a stats expert by any means, but imagine you had played your run and gotten 10 wins, 1 loss and 8 draws. Would you still be willing to call this test finished? If your expected test distribution was a 0 elo change, with std. dev of 4 elo, I certainly wouldn't. There is still a really high chance that the test is equal or better.

Observing the behavior of SPRT on fishtest, with the parameters we are using, it is quite forgiving for the first 1000 games or so. After that, if the score starts to drift below 50% it gets more and more unforgiving.

Laskos · Post by **Laskos** » Tue Oct 15, 2013 10:33 pm

lkaufman wrote:
Thanks. Yes, I understand that a radical change of parameters would solve the problem. But in general, when we don't know if a change is worth one elo or ten, if we use values intended for a small change but we actually have a ten elo change, it will take way too long to confirm. It doesn't seem right to me that you must already know how good a change is to get SPRT to behave properly. Is there any solution that doesn't require knowing the result beforehand?

You have to guess a bit with SPRT. There is a problem too if you set Elo0=0, Elo1=10, and the actual difference is 4 Elo points, it will take many games and will show the Elo0=0 as true in more than half of the cases. What I am using for unknown Elo changes is LOS of 99.9% as a stopping rule with more than 50 wins+losses, and less than 30,000 wins+losses. It has less than 5% false positives in this range of the number of games. But for well guessed changes SPRT is the way to go. So, -1.5 and 4.5 Elo points is excellent with SF testing framework, where the changes are small, although it cannot detect improvement which are less than 1.5 Elo points and will have hard time with 20 Elo points improvement.

Michel · Post by **Michel** » Wed Oct 16, 2013 12:32 am

In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?

You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT. draw_elo=240 is a realistic value for self play (this is what
is used in fishtest).

But all this does not matter. Why would you worry about a test
of a few hundred games to detect a huge and unrealistic elo difference???

The big savings are in efficiently recognizing very small elo differences.
That is how the [-1.5,4.5] and [0,6] margins in fishtest were selected.

lkaufman · Post by **lkaufman** » Wed Oct 16, 2013 12:43 am

Michel wrote:
In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?
You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT.

But all this does not matter. Why would you worry about a test
of a few hundred games to detect a huge and unrealistic elo difference???

The big savings are in efficiently recognizing very small elo differences.
That is how the [-1.5,4.5] and [0,6] margins in fishtest were selected.

The issue arose when a change was testing at + 13 elo after a couple thousand games, enough that the error margin was about 8 elo. So this was above 3 sigma as normally calculated. But SPRT had the LL only around 2, only about 2/3 of the way to a conclusion. This seemed strange to me. We eventually got a positive result after a couple thousand more games. We used (-2, +4) and 200 for the draw value. Most likely this change was in reality only worth something like the five elo you get by subtracting 8 from 13. Anyway, can you comment on whether the values were inappropriate, or whether this is just normal behavior for SPRT?
Thanks.

Larry

lucasart · Post by **lucasart** » Wed Oct 16, 2013 1:29 am

Michel wrote:You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT. draw_elo=240 is a realistic value for self play (this is what
is used in fishtest).

I can reproduce exactly Larry's result by computing draw_elo out of sample. That is what fishtest and cutechess-cli do. It is very dangerous to hardcode draw_elo, because in this case the value is mich lower (165.74), because they are using some low fixed depth testing producing poor quality games, hence not enough draws.

Don · Post by **Don** » Wed Oct 16, 2013 1:33 am

lucasart wrote:
Michel wrote:You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT. draw_elo=240 is a realistic value for self play (this is what
is used in fishtest).
I can reproduce exactly Larry's result by computing draw_elo out of sample. That is what fishtest and cutechess-cli do. It is very dangerous to hardcode draw_elo, because in this case the value is mich lower (165.74), because they are using some low fixed depth testing producing poor quality games, hence not enough draws.

Changing drawelo in any direction didn't make much difference on the behavior of not accepting the change even when it was many times over the error margin.

Don

Uri Blass · Post by **Uri Blass** » Wed Oct 16, 2013 2:43 am

lkaufman wrote:
Ajedrecista wrote:Hello Larry:

lkaufman wrote:In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write I think) SPRT still had not accepted the 9 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion" everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?
I am not an expert in SPRT but you can take a look at this topic:

SPRT and narrowing of (elo1 - elo0) difference.

It basically says that the average expected duration of the SPRT test is proportional to (elo1 - elo0)^(-2). I think that the problem is that the resolution (elo1 - elo0 = 6 Bayeselo units, not common Elo) is too high because everyone can expect a big difference (certainly more than 6 Bayeselo) in fixed depth 8 versus fixed depth 7. The solution in this case could be raise this resolution for reducing the expected length of the test. What about giving a try to SPRT(-15, 45)? It is only a random suggestion.

If you want to translate Bayeselo into common Elo (writing from memory):
Code: Select all
x = 10^&#40;drawelo/400&#41;.
Elo = &#123;4x/&#91;&#40;1 + x&#41;^2&#93;&#125;*Bayeselo.
Regards from Spain.

Ajedrecista.
Thanks. Yes, I understand that a radical change of parameters would solve the problem. But in general, when we don't know if a change is worth one elo or ten, if we use values intended for a small change but we actually have a ten elo change, it will take way too long to confirm. It doesn't seem right to me that you must already know how good a change is to get SPRT to behave properly. Is there any solution that doesn't require knowing the result beforehand?

SPRT is certainly not optimal but basically you are going to reject very bad changes (loss of 20 elo that can happen) relatively fast and you are going to accept very good changes relatively fast.

If you finish a test in 2000 games instead of 1000 games then the time that you lost is relatively small part of the time that you spend on tests(because in most of your tests you are going to have at least 10,000 games) so my guess is that there is no effective way to save more than 10% of the time by a better test and I think that it is practically good to have less games for very good or very bad changes because you practically get the information that the change is very good or very bad.

If you change SPRT to have less games for very good or very bad changes then it means that it is possible that you practically do not know if a test that failed relatively fast is a small regression or big regression and
I think that it is bad for future tests because the consequence can be different.

With SPRT that stockfish use today
if we reject a change after 1000 games
my thoughts are:
"probably there is some bug in the implementation"

if we reject a change after 10000 games
my thoughts are:
"maybe the parameters are not optimal and I can test the same idea with different parameters"

Rejecting the first change faster means losing the confidence that you believe that there is some bug in the implementation.

In the extreme case it is not very important if the confidence is 99.9999% or 99.99% but in less extreme cases that are more common it may be important if the confidence is 99% or 70%.

sprt and margin of error

sprt and margin of error

Re: SPRT and margin of error.

Re: SPRT and margin of error.

Re: sprt and margin of error

Re: SPRT and margin of error.

Re: sprt and margin of error

Re: sprt and margin of error

Re: sprt and margin of error

Re: sprt and margin of error

Re: SPRT and margin of error.