Note that the error margin is good for drawing conclusions mainly when the number of games is fixed in advance.

lkaufman wrote (earlier):
In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin-of-error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words, when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write, I think), SPRT still had not accepted the 8-ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion", everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?

lkaufman wrote:
The issue arose when a change was testing at +13 elo after a couple of thousand games, enough that the error margin was about 8 elo. So this was above 3 sigma as normally calculated. But SPRT had the LLR only around 2, only about 2/3 of the way to a conclusion. This seemed strange to me. We eventually got a positive result after a couple of thousand more games. We used (-2, +4) and 200 for the draw value. Most likely this change was in reality only worth something like the five elo you get by subtracting 8 from 13. Anyway, can you comment on whether the values were inappropriate, or whether this is just normal behavior for SPRT?

Michel wrote:
You did not specify the draw ratio you were using, but if you used the
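For reference, the "standard margin of error" calculation for a W/D/L result can be sketched as follows. This is a minimal sketch assuming the usual normal approximation on the per-game score (the exact figures quoted in the thread may come from a slightly different formula):

```python
import math

def score_stats(wins, draws, losses):
    """Mean score, logistic elo estimate, and z-score vs. 50% for a W/D/L result."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n          # mean score per game
    elo = -400 * math.log10(1 / score - 1)    # logistic elo from the score
    # Per-game variance of the {1, 0.5, 0} outcome distribution.
    var = (wins * 1.0 + draws * 0.25) / n - score ** 2
    se = math.sqrt(var / n)                   # standard error of the mean score
    z = (score - 0.5) / se                    # sigmas above equality
    return score, elo, z

score, elo, z = score_stats(149, 94, 30)
print(f"score={score:.3f}, elo={elo:+.0f}, z={z:.1f}")
```

For 149 wins, 94 draws, 30 losses this gives a score around 72%, roughly +160 elo, and a z-score above 10 in score terms, far beyond any conventional significance threshold, which is exactly the disparity being asked about.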
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT.
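The correspondence between draw_elo=240 and a 60% draw ratio can be checked with the BayesElo outcome model (a sketch; `proba` is an illustrative helper name, not an API from BayesElo or fishtest):

```python
def proba(bayeselo, draw_elo):
    """Win/draw/loss probabilities under the BayesElo model."""
    p_win = 1 / (1 + 10 ** ((draw_elo - bayeselo) / 400))
    p_loss = 1 / (1 + 10 ** ((draw_elo + bayeselo) / 400))
    return p_win, 1 - p_win - p_loss, p_loss

# Two equal players (bayeselo = 0) with draw_elo = 240:
p_win, p_draw, p_loss = proba(0, 240)
print(f"draw ratio = {p_draw:.1%}")
```

With bayeselo = 0 and draw_elo = 240 the model gives a draw ratio of about 59.8%, i.e. the "standard 60%" mentioned here.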
But all this does not matter. Why would you worry about a test
of a few hundred games to detect a huge and unrealistic elo difference???
The big savings are in efficiently recognizing very small elo differences.
That is how the [-1.5,4.5] and [0,6] margins in fishtest were selected.
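Michel's remark can be sanity-checked by computing the trinomial log-likelihood ratio that an SPRT accumulates. The sketch below assumes the BayesElo model with the elo bounds taken in BayesElo units; the function names are illustrative, not fishtest's actual API:

```python
import math

def bayeselo_proba(bayeselo, draw_elo):
    """Win/draw/loss probabilities under the BayesElo model."""
    p_win = 1 / (1 + 10 ** ((draw_elo - bayeselo) / 400))
    p_loss = 1 / (1 + 10 ** ((draw_elo + bayeselo) / 400))
    return p_win, 1 - p_win - p_loss, p_loss

def sprt_llr(wins, draws, losses, elo0, elo1, draw_elo):
    """Log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0)."""
    p0 = bayeselo_proba(elo0, draw_elo)
    p1 = bayeselo_proba(elo1, draw_elo)
    counts = (wins, draws, losses)
    return sum(n * math.log(b / a) for n, a, b in zip(counts, p0, p1))

# The Komodo 8-ply vs 7-ply result with the Stockfish bounds [-1.5, 4.5]:
llr = sprt_llr(149, 94, 30, -1.5, 4.5, draw_elo=240)
upper = math.log((1 - 0.05) / 0.05)   # accept H1 when LLR crosses ~2.94
print(llr, llr > upper)
```

With draw_elo = 240 the LLR comes out around 3.3, above the roughly 2.94 acceptance bound for alpha = beta = 0.05, consistent with Michel's statement that under the standard 60% draw ratio the result would have been accepted.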
Thanks.
Larry
When you see 13+-8 at the end of the test, with the margin quoted at 95% confidence, you can have 95% confidence that the change is between 5 and 21 elo, and something near 97.5% confidence that the change is worth at least 5 elo (only the lower tail, half of the remaining 5%, falls below 5).
If you see the same numbers in the middle of the test, then your real confidence is clearly smaller, assuming that you look at the results of the test many times: every look is an extra opportunity for random noise to cross the threshold.
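The "looking many times" effect is easy to demonstrate by simulation. The sketch below uses assumed parameters (1000-game tests of a zero-elo change, ignoring draws, checked every 100 games) and counts how often a naive one-look 95% criterion fires at least once mid-run versus only at the scheduled end:

```python
import math
import random

def simulate(trials=2000, games=1000, peek_every=100, seed=1):
    """False-positive rates for end-only vs. repeated significance checks."""
    rng = random.Random(seed)
    peek_hits = end_hits = 0
    for _ in range(trials):
        wins = 0
        hit_mid = False
        for n in range(1, games + 1):
            wins += rng.random() < 0.5       # zero-elo change: a fair coin
            if n % peek_every == 0:
                z = (wins - n / 2) / (0.5 * math.sqrt(n))
                if z > 1.96:                 # naive one-look 95% criterion
                    hit_mid = True
        z_end = (wins - games / 2) / (0.5 * math.sqrt(games))
        peek_hits += hit_mid
        end_hits += z_end > 1.96             # the single scheduled look
    return peek_hits / trials, end_hits / trials

peek_rate, end_rate = simulate()
print(f"significant at some peek: {peek_rate:.1%}, at the end only: {end_rate:.1%}")
```

The same null data that crosses the 95% bar a few percent of the time at one scheduled look crosses it several times more often when you get ten chances; controlling this multiple-looks inflation while still allowing early stopping is precisely what a sequential test like SPRT is designed for.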