sprt and margin of error

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: sprt and margin of error

Post by Uri Blass »

lkaufman wrote:
Michel wrote:
In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words, when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write, I think), SPRT still had not accepted the 8 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion", everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?
You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT.

But all this does not matter. Why would you worry about a test
of a few hundred games to detect a huge and unrealistic elo difference???

The big savings are in efficiently recognizing very small elo differences.
That is how the [-1.5,4.5] and [0,6] margins in fishtest were selected.
The issue arose when a change was testing at +13 elo after a couple thousand games, enough that the error margin was about 8 elo. So this was above 3 sigma as normally calculated. But SPRT had the LLR only around 2, only about 2/3 of the way to a conclusion. This seemed strange to me. We eventually got a positive result after a couple thousand more games. We used (-2, +4) and 200 for the draw value. Most likely this change was in reality only worth something like the 5 elo you get by subtracting 8 from 13. Anyway, can you comment on whether the values were inappropriate, or whether this is just normal behavior for SPRT?
Thanks.

Larry
Note that the error margin is valid for drawing conclusions mainly with a fixed number of games.

When you see 13 +/- 8 at the end of the test with a 95% confidence interval, you can have 95% confidence that the change is between 5 and 21 elo, and roughly 97.5% confidence that the change is worth at least 5 elo.

If you see it in the middle of the test, then your confidence is clearly smaller, assuming that you look at the results of the test many times.
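To make the fixed-N calculation concrete, here is a sketch (my own illustration, with made-up numbers chosen to land in the same ballpark as the 13 +/- 8 result under discussion) of how an elo estimate and its 95% margin are derived from a win/loss/draw count:

```python
import math

def elo_with_margin(wins, losses, draws, z=1.96):
    """Elo estimate and 95% half-width from a fixed-N match result,
    using the usual normal approximation on the mean score."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n              # mean points per game
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n     # per-game variance
    se = math.sqrt(var / n)                       # standard error of the mean
    to_elo = lambda s: -400.0 * math.log10(1.0 / s - 1.0)
    half_width = (to_elo(score + z * se) - to_elo(score - z * se)) / 2.0
    return to_elo(score), half_width

# made-up 2000-game result with ~60% draws (not Larry's actual data)
elo, margin = elo_with_margin(wins=437, losses=363, draws=1200)
print(round(elo, 1), round(margin, 1))  # -> 12.9 9.6
```

As Uri notes, this interval is only valid if the number of games was fixed in advance; checking it repeatedly during the run invalidates the 95% guarantee.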
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: sprt and margin of error

Post by lkaufman »

Uri Blass wrote:
lkaufman wrote:
Michel wrote:
In order to better understand the behavior of SPRT, we ran the following test: Komodo at 8 ply vs Komodo at 7 ply. SPRT (using the Stockfish parameters of -1.5 and +4.5) stopped the test when the score was 149 wins, 30 losses, and 94 draws. The standard margin of error calculation showed that this result was more than 7 times the margin needed for the usual 95% confidence. So, in other words, when the score was just one win less than this, although the result was about 14 standard deviations ahead (probability 99.99999999xxxx%, too many nines to write, I think), SPRT still had not accepted the 8 ply version as a keeper. If I published a result of 148 wins to 30 losses and 94 draws and said "more games are needed to draw a conclusion", everyone would say that is ridiculous.
This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give? I think it has something to do with the fact that in this case the elo difference was huge, but still, I would want a test to be able to detect this and stop once superiority was clear. Is there a better way, or some modification to SPRT that would make it behave more reasonably?
You did not specify the draw ratio you were using but if you used the
standard 60% (draw_elo=240) then the result would have been accepted by the SPRT.

But all this does not matter. Why would you worry about a test
of a few hundred games to detect a huge and unrealistic elo difference???

The big savings are in efficiently recognizing very small elo differences.
That is how the [-1.5,4.5] and [0,6] margins in fishtest were selected.
The issue arose when a change was testing at +13 elo after a couple thousand games, enough that the error margin was about 8 elo. So this was above 3 sigma as normally calculated. But SPRT had the LLR only around 2, only about 2/3 of the way to a conclusion. This seemed strange to me. We eventually got a positive result after a couple thousand more games. We used (-2, +4) and 200 for the draw value. Most likely this change was in reality only worth something like the 5 elo you get by subtracting 8 from 13. Anyway, can you comment on whether the values were inappropriate, or whether this is just normal behavior for SPRT?
Thanks.

Larry
Note that the error margin is valid for drawing conclusions mainly with a fixed number of games.

When you see 13 +/- 8 at the end of the test with a 95% confidence interval, you can have 95% confidence that the change is between 5 and 21 elo, and roughly 97.5% confidence that the change is worth at least 5 elo.

If you see it in the middle of the test, then your confidence is clearly smaller, assuming that you look at the results of the test many times.
Yes, we have discussed the issue of checking the result often, and we reached the same conclusion as you. But I don't think it's a major effect if you just check now and then, especially if you require that the result stay above the error margin for two consecutive checks (not too close in time). However, I do like your point about learning something of the magnitude of the gain or loss by running SPRT well beyond the margin of error; that is indeed useful information. So in view of that, I'm inclined to be satisfied with SPRT, despite some inefficiency when your estimate of the gain is way off.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: sprt and margin of error

Post by Michel »

I can reproduce Larry's result exactly by computing draw_elo out of sample. That is what fishtest and cutechess-cli do. It is very dangerous to hardcode draw_elo, because in this case the value is much lower (165.74): they are using low fixed-depth testing, which produces poor-quality games and hence not enough draws.
Ok I see! I agree that 165.74 is the best estimate for draw_elo.

So you are in fact doing an SPRT with "nuisance parameters" (draw_elo). The theory behind this is a bit more complicated than the ordinary SPRT, since you have to take into account the variation created by the changing estimates for the nuisance parameters. For small elo differences this does not matter, but for extremely large elo differences (as is the case here) it might perhaps have some effect.

What is your precise procedure? Do you recompute draw_elo after every game? Do you retroactively correct the log likelihood ratios with changing draw_elo?
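For reference, the 165.74 figure can be reproduced directly from Larry's 149/30/94 result with the BayesElo inversion formulas (the same relation as the proba_to_bayeselo function in the fishtest code quoted later in this thread); a minimal sketch:

```python
import math

# Out-of-sample estimate of elo and draw_elo (BayesElo model) from
# Larry's 8-ply vs 7-ply result: 149 wins, 30 losses, 94 draws.
wins, losses, draws = 149, 30, 94
n = wins + losses + draws
p_win, p_loss = wins / n, losses / n

elo = 200.0 * math.log10(p_win / p_loss * (1.0 - p_loss) / (1.0 - p_win))
drawelo = 200.0 * math.log10((1.0 - p_loss) / p_loss * (1.0 - p_win) / p_win)

print(round(elo, 2), round(drawelo, 2))  # drawelo comes out at about 165.74
```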
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: SPRT and margin of error.

Post by Ajedrecista »

Hello Michel:
Michel wrote:
I can reproduce Larry's result exactly by computing draw_elo out of sample. That is what fishtest and cutechess-cli do. It is very dangerous to hardcode draw_elo, because in this case the value is much lower (165.74): they are using low fixed-depth testing, which produces poor-quality games and hence not enough draws.
Ok I see! I agree that 165.74 is the best estimate for draw_elo.

So you are in fact doing an SPRT with "nuisance parameters" (draw_elo). The theory behind this is a bit more complicated than the ordinary SPRT, since you have to take into account the variation created by the changing estimates for the nuisance parameters. For small elo differences this does not matter, but for extremely large elo differences (as is the case here) it might perhaps have some effect.

What is your precise procedure? Do you recompute draw_elo after every game? Do you retroactively correct the log likelihood ratios with changing draw_elo?
This is what I do in my SPRT simulator: I assume a drawelo parameter and a score difference of a certain number of BayesElo units. These two input parameters are used to assign probabilities of win, draw and loss via pseudo-random numbers given by an intrinsic built-in function of my compiler (random@() IIRC in Silverfrost Plato IDE for Fortran 95). Then I recompute drawelo from the sample of games after each simulated game, and this new info is used in the definition of the LLR. In other words, drawelo changes dynamically after each game (I think the SF testing framework does exactly the same).
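As a rough illustration of such a simulator (my own Python sketch, not Ajedrecista's Fortran code, and with a simplified tuple-returning helper): draw game outcomes from fixed BayesElo parameters, re-estimate drawelo after every game, and recompute the LLR from scratch each time.

```python
import math
import random

def bayeselo_to_proba(elo, drawelo):
    """Win/loss/draw probabilities from BayesElo parameters."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def llr(wins, losses, draws, elo0, elo1, drawelo):
    """Log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0)."""
    w0, l0, d0 = bayeselo_to_proba(elo0, drawelo)
    w1, l1, d1 = bayeselo_to_proba(elo1, drawelo)
    return (wins * math.log(w1 / w0) + losses * math.log(l1 / l0)
            + draws * math.log(d1 / d0))

def simulate_sprt(true_elo, true_drawelo, elo0=0.0, elo1=6.0,
                  alpha=0.05, beta=0.05, max_games=200000):
    """Simulate one SPRT run, re-estimating drawelo after every game."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p_win, p_loss, _ = bayeselo_to_proba(true_elo, true_drawelo)
    wins = losses = draws = 0
    for _ in range(max_games):
        u = random.random()
        if u < p_win:
            wins += 1
        elif u < p_win + p_loss:
            losses += 1
        else:
            draws += 1
        if wins and losses and draws:
            n = wins + losses + draws
            pw, pl = wins / n, losses / n
            # drawelo re-estimated out of sample, as fishtest does
            est_drawelo = 200.0 * math.log10((1 - pl) / pl * (1 - pw) / pw)
            val = llr(wins, losses, draws, elo0, elo1, est_drawelo)
            if val >= upper:
                return 'accepted', n
            if val <= lower:
                return 'rejected', n
    return 'inconclusive', max_games

print(simulate_sprt(true_elo=10.0, true_drawelo=240.0))
```

Note that with drawelo = 240 this model produces roughly 60% draws between equal engines, the "standard" figure mentioned earlier in the thread.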

I am very busy, so I cannot provide numbers right now. I will try to run some simulations next weekend, but I do not promise anything. I can release my SPRT simulator (sources included) if there is some interest, although I seriously doubt there will be. Needless to say, my simulator is far from perfect, although I get reasonable results.

Regards from Spain.

Ajedrecista.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: sprt and margin of error

Post by zamar »

lkaufman wrote: This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give?
There is an ENORMOUS difference between A) stopping after a fixed number of games and B) stopping as soon as a certain threshold is reached.
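The difference is easy to demonstrate by simulation. The sketch below (my own illustration; wins and losses only, draws ignored for simplicity) compares the false-positive rate of a single 95% significance check after N games against checking the same threshold after every game, for two exactly equal engines:

```python
import math
import random

def false_positive_rates(n_games=1000, n_trials=2000, seed=42):
    """Compare a single 95% check after n_games with checking the same
    threshold after every game, when the true elo difference is zero."""
    rng = random.Random(seed)
    fixed_hits = repeated_hits = 0
    for _ in range(n_trials):
        score = 0          # wins minus losses
        tripped = False
        for g in range(1, n_games + 1):
            score += 1 if rng.random() < 0.5 else -1
            if score / math.sqrt(g) > 1.96:   # nominal 95% threshold
                tripped = True
        if tripped:
            repeated_hits += 1
        if score / math.sqrt(n_games) > 1.96:
            fixed_hits += 1
    return fixed_hits / n_trials, repeated_hits / n_trials

fixed, repeated = false_positive_rates()
print(fixed, repeated)
```

The single check fires on only a few percent of null runs, while the repeated check fires on a much larger fraction of them, which is why "stop as soon as the result exceeds the error margin" is not a 5% procedure.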
Joona Kiiski
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: sprt and margin of error

Post by Michel »

This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give?
The point is of course that there is no optimal test for all cases. Normally you can only optimize for one objective. The SPRT is already exceptional in that it is able to optimize for two objectives (minimal average sample number at elo = elo0 _and_ elo = elo1).

The 2-SPRT optimizes (simplifying a bit) for the worst case average sample number at the cost of worse performance at elo0 and elo1.

What you are advocating is officially called a "repeated significance test". Such tests have their uses, but the theory is much more complicated than the SPRT's, and I don't think they are useful for chess engine testing.

If you are really interested: the theory is described in this book (Chapter IV).

http://books.google.be/books/about/Sequ ... edir_esc=y

This book is the definitive reference on sequential testing. It contains everything you would ever want to know about the subject. Unfortunately it is not easy to read.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: sprt and margin of error

Post by Don »

zamar wrote:
lkaufman wrote: This seems totally ridiculous to me. Can anyone explain this and/or reconcile the enormous disparity between what SPRT concludes and what normal error calculations give?
There is an ENORMOUS difference between A) stopping after fixed number of games B) stopping after reaching a certain threshold.
I am amazed that very smart people don't understand that principle. Many people seem to believe that a good rule is to stop whenever the elo gain exceeds or falls below the error margin, or they make up rules after looking at the current result, such as deciding to stop after they get back from doing chores if the result is still bad, or if it still remains good.

But it's a lot like giving some event one chance to happen versus giving it many chances to happen: it becomes a totally different calculation.

I don't know which is worse: this, or the 100-game sample that many testers swear by.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: sprt and margin of error

Post by Michel »

I am amazed how very smart people don't understand that principle.
I think that is overstating it.

It is not hard to explain why LOS-based stopping is wrong. The LOS computation is based on a uniform (or almost uniform) prior. If it were really true that, for example, a 10000 elo patch is just as likely as a 0 elo patch, then LOS-based stopping would be correct.

In practice you do not know the true elo distribution of patches, so your testing methodology must be robust enough to cope with that.
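For concreteness, LOS (likelihood of superiority) is commonly computed with a normal approximation in which draws cancel out; this particular formula is my illustration, not something Michel specifies:

```python
import math

# A common approximation for LOS from a match result: draws cancel,
# and the win/loss difference is treated as approximately normal.
# This is the quantity whose use as a stopping rule is criticized above.
def los(wins, losses):
    return 0.5 * (1.0 + math.erf((wins - losses)
                                 / math.sqrt(2.0 * (wins + losses))))

print(los(149, 30))  # Larry's result: LOS is essentially 1
```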
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: sprt and margin of error

Post by lucasart »

Michel wrote: What is your precise procedure? Do you recompute draw_elo after every game? Do you retroactively correct the log likelihood ratios with changing draw_elo?
yes, and yes.

It's not completely sound theoretically, but in *practice* it is preferable to hardcoding draw_elo:
* when SPRT reaches the stopping point, it never happens after only a few games; it is always after at least a few thousand. At that point draw_elo is precise enough that typical draw_elo variations have a negligible effect on the LLR.
* on the other hand, draw_elo changes a lot with the time control, and also if you introduce contempt (when testing contempt patches).

Here is the relevant part of code from fishtest:

Code:

import math

def bayeselo_to_proba(elo, drawelo):
  """
  elo is expressed in BayesELO (relative to the choice drawelo).
  Returns a probability, P['win'], P['loss'], P['draw']
  """
  P = {}
  P['win'] = 1.0 / (1.0 + pow(10.0, (-elo + drawelo) / 400.0))
  P['loss'] = 1.0 / (1.0 + pow(10.0, (elo + drawelo) / 400.0))
  P['draw'] = 1.0 - P['win'] - P['loss']
  return P

def proba_to_bayeselo(P):
  """
  Takes a probability: P['win'], P['loss']
  Returns elo, drawelo
  """
  assert(0 < P['win'] and P['win'] < 1 and 0 < P['loss'] and P['loss'] < 1)
  elo = 200 * math.log10(P['win']/P['loss'] * (1-P['loss'])/(1-P['win']))
  drawelo = 200 * math.log10((1-P['loss'])/P['loss'] * (1-P['win'])/P['win'])
  return elo, drawelo

def SPRT(R, elo0, alpha, elo1, beta, drawelo):
  """
  Sequential Probability Ratio Test
  H0: elo = elo0
  H1: elo = elo1
  alpha = max typeI error (reached on elo = elo0)
  beta = max typeII error for elo >= elo1 (reached on elo = elo1)
  R['wins'], R['losses'], R['draws'] contains the number of wins, losses and draws

  Returns a dict:
  finished - bool, True means test is finished, False means continue sampling
  state - string, 'accepted', 'rejected' or ''
  llr - Log-likelihood ratio
  lower_bound/upper_bound - SPRT bounds
  """

  result = {
    'finished': False,
    'state': '',
    'llr': 0.0,
    'lower_bound': math.log(beta/(1-alpha)),
    'upper_bound': math.log((1-beta)/alpha),
  }

  # Estimate drawelo out of sample
  if (R['wins'] > 0 and R['losses'] > 0 and R['draws'] > 0):
    N = R['wins'] + R['losses'] + R['draws']
    P = {'win': float(R['wins'])/N, 'loss': float(R['losses'])/N, 'draw': float(R['draws'])/N}
    elo, drawelo = proba_to_bayeselo(P)
  else:
    return result

  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)

  # Log-likelihood ratio
  result['llr'] = R['wins']*math.log(P1['win']/P0['win']) + R['losses']*math.log(P1['loss']/P0['loss']) + R['draws']*math.log(P1['draw']/P0['draw'])

  if result['llr'] < result['lower_bound']:
    result['finished'] = True
    result['state'] = 'rejected'
  elif result['llr'] > result['upper_bound']:
    result['finished'] = True
    result['state'] = 'accepted'

  return result
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: sprt and margin of error

Post by Michel »

Ok that is nice. It seems that the code has changed a lot since I last looked at it!