Error margins via resampling (jackknifing)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Error margins via resampling (jackknifing)

Post by Laskos »

Michel wrote:After a long and complicated computation I came up with a surprisingly simple formula for the log likelihood ratio for the GSPRT in the multinomial case (for small score differences). Since any distribution can be approximated by a multinomial one I assume it is a general fact (and the proof should be much easier than one given here).

See the boxed formula at the end of this file

http://hardy.uhasselt.be/Toga/MLE_for_multinomial.pdf
Great, I can't believe it generalizes to the multinomial case by simplifying with an expansion for small differences. I hope there is no need for an Elo model, draw model, bias model and all that; they seemed to me to complicate the SPRT a bit. What are the nuisance parameters? They must be related to the draw ratio and bias (the empirical variance?). The empirical variance must be input, right? Will the draw ratio, subject to the condition, be estimated by MLE, or is it input (without a draw model)? Can we include a large bias (unbalanced openings, in BayesElo terms of the order of 200 Elo, probably again via the empirical variance)? A simulator would be needed.

I jackknifed another database containing ultra-fast games at fixed nodes with balanced openings, and the effect on the variance is even larger. Probably the games are more deterministic and correlated. I would need some databases of SF self-play games at a reasonable time control.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Error margins via resampling (jackknifing)

Post by Michel »

To apply the formula to design a sequential test H0:elo=elo0 vs H1:elo=elo1 one needs to know the value of "expected score" as a function of elo. This depends on the elo model and the test set (e.g. balanced vs unbalanced).

But as you say: if the test conditions are not too extreme, a crude logistic approximation may be sufficient in practice (the precise values of elo0 and elo1 are more like guideposts anyway).

After this setup, the test becomes H0:expected_score=score0 versus H1:expected_score=score1. Then my formula applies.
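As an illustration, such a crude logistic approximation can be sketched as follows (a minimal sketch; the standard logistic curve is assumed here, which may differ from the exact Elo model in use):

```python
# Standard logistic curve mapping an Elo difference to an expected score.
# This is an assumption: the exact Elo model (and the effect of the test
# set, e.g. unbalanced books) may call for a different mapping.
def expected_score(elo: float) -> float:
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
```

With such a mapping, H0:elo=elo0 vs H1:elo=elo1 turns into a test on the expected score, as described above.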

The formula applies just as well to sequential testing against foreign engines. I am quite happy with this. I had already written code to do the GSPRT against foreign engines

http://talkchess.com/forum/viewtopic.ph ... 1&start=30

(responding to a question from Hyatt, who in the end did not seem interested in the answer...) but it relied on scipy. Now one can do away with scipy.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Error margins via resampling (jackknifing)

Post by Michel »

Yes the formula is general and easy to prove.

http://hardy.uhasselt.be/Toga/GSPRT_approximation.pdf

It is probably known in some form. But I am amazed it has not been noticed before in the computer chess community.

Example: a standard GSPRT for H0:elo=elo0 versus H1:elo=elo1 (elo=logistic elo) with alpha=beta=0.05 (with unknown draw ratio) corresponds to tracking the value of the simple expression

Code: Select all

(1/2)(elo1-elo0)(2*elo-elo0-elo1)/Var(elo)
The test stops when this value leaves the interval [-2.94,2.94]. The point is that, formulated in this way, the test remains optimal in more general testing scenarios, for example when using paired games which are correlated or which use unbalanced positions (in this case the computation of the variance is governed by a 5-nomial distribution instead of a 3-nomial one).

Note: one may just as well track

Code: Select all

(1/2)(score1-score0)(2*score-score0-score1)/Var(score)
This is equivalent and even much simpler.
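To make the stopping rule concrete, here is a minimal sketch (my own, not code from the paper) that tracks the boxed expression under the plain trinomial model and computes the Wald-style bounds; function and variable names are illustrative:

```python
import math

def sprt_bounds(alpha: float = 0.05, beta: float = 0.05) -> tuple:
    """Wald bounds; for alpha = beta = 0.05 this is roughly (-2.94, 2.94)."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def gsprt_llr(wins: int, draws: int, losses: int,
              score0: float, score1: float) -> float:
    """Approximate LLR for H0: expected_score=score0 vs
    H1: expected_score=score1, i.e. the tracked expression
    (1/2)(score1-score0)(2*score-score0-score1)/Var(score),
    with Var(score) the empirical variance of the mean score."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n
    # empirical per-game variance under the trinomial (w/d/l) model
    var = (wins * (1 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    if var == 0:
        return 0.0
    # Var of the mean score over n games is var / n
    return 0.5 * (score1 - score0) * (2 * score - score0 - score1) / (var / n)
```

The sequential test would then accept H1 when the tracked value crosses the upper bound and accept H0 when it crosses the lower one.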
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Error margins via resampling (jackknifing)

Post by Laskos »

Michel wrote:
Note: one may just as well track

Code: Select all

(1/2)(score1-score0)(2*score-score0-score1)/Var(score)
This is equivalent and even much simpler.
An amazing expression for a general stopping rule with well-defined Type I and II errors. This must be spread to all testers. It is carried out sequentially, right? I am a bit at a loss calculating the 5-nomial variance. For the 3-nomial case we have the variance:

Code: Select all

W*(1-score)*(1-score) + D*(0.5-score)*(0.5-score) + L*(0-score)*(0-score)
Could you show how it looks for 5-nomial with correlations or bias?
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Error margins via resampling (jackknifing)

Post by Michel »

An amazing expression for a general stopping rule with well-defined Type I and II errors. This must be spread to all testers. It is carried out sequentially, right?
Well, for the trinomial model (w/d/l) it is equivalent to the SPRT as implemented in cutechess-cli and fishtest, except that (with hindsight) the implementation in these programs is completely circuitous. Recall that it goes as follows: (1) convert logistic Elo to BayesElo, (2) estimate the draw ratio d from the sample, (3) use this d to compute the draw Elo for the BayesElo model, and finally (4) compute the likelihood ratios. It is now clear that all of this can be done in a single easy step.

A test based on likelihood ratios can indeed always be carried out sequentially. This follows from Wald's original proof for the SPRT. This is not true for a test based on a p-value.
Could you show how it looks for 5-nomial with correlations or bias?
Well any estimator for the variance of the estimator can be used in the formula, including a non-parametric one like the jackknife.

To use the 5-nomial model you can just work naively. Compute the 5 empirical frequencies for the possible outcomes of the paired games (0, 1/2, 1, 3/2, 2) and compute the variance of the resulting empirical distribution. Then divide by 2N (we divide by 2N to compensate for the fact that we have treated two games together).
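A minimal sketch of this 5-nomial computation (assuming the games have already been grouped into pairs; the function and variable names are mine):

```python
from collections import Counter

def pentanomial_variance(pair_scores):
    """Var(score) of the mean per-game score, from paired-game outcomes.
    pair_scores: combined scores of each game pair, in {0, 0.5, 1, 1.5, 2}."""
    pairs = list(pair_scores)
    n_pairs = len(pairs)
    # the 5 empirical frequencies of the paired outcomes
    freqs = {k: v / n_pairs for k, v in Counter(pairs).items()}
    mean = sum(k * f for k, f in freqs.items())
    var_pair = sum(f * (k - mean) ** 2 for k, f in freqs.items())
    # divide by 2N (N = total games = 2 * n_pairs), compensating for the
    # fact that two games were treated together
    return var_pair / (2 * (2 * n_pairs))
```

For uncorrelated pairs this reproduces the trinomial value, while fully correlated pairs (every pair scoring 0 or 2) give twice the trinomial variance, which is the kind of inflation the jackknife picks up.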

Unless something is terribly wrong the result should be the same as your jackknife computation.

Except that there is one more thing the 5-nomial model does not take into account: repeated opening positions (since the opening book is typically smaller than the number of games). This could introduce further correlation between the games.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Error margins via resampling (jackknifing)

Post by Laskos »

Michel wrote:
An amazing expression for a general stopping rule with well-defined Type I and II errors. This must be spread to all testers. It is carried out sequentially, right?
Well, for the trinomial model (w/d/l) it is equivalent to the SPRT as implemented in cutechess-cli and fishtest, except that (with hindsight) the implementation in these programs is completely circuitous. Recall that it goes as follows: (1) convert logistic Elo to BayesElo, (2) estimate the draw ratio d from the sample, (3) use this d to compute the draw Elo for the BayesElo model, and finally (4) compute the likelihood ratios. It is now clear that all of this can be done in a single easy step.

A test based on likelihood ratios can indeed always be carried out sequentially. This follows from Wald's original proof for the SPRT. This is not true for a test based on a p-value.
Could you show how it looks for 5-nomial with correlations or bias?
Well any estimator for the variance of the estimator can be used in the formula, including a non-parametric one like the jackknife.

To use the 5-nomial model you can just work naively. Compute the 5 empirical frequencies for the possible outcomes of the paired games (0, 1/2, 1, 3/2, 2) and compute the variance of the resulting empirical distribution. Then divide by 2N (we divide by 2N to compensate for the fact that we have treated two games together).

Unless something is terribly wrong the result should be the same as your jackknife computation.

Except that there is one more thing the 5-nomial model does not take into account: repeated opening positions (since the opening book is typically smaller than the number of games). This could introduce further correlation between the games.
I played a bit with the "naive" 5-nomial variance. The unbalanced variance is indeed smaller than the balanced one, but not to the extent given by jackknifing, where a factor of 2 in variance is common. Also, in the balanced case there is no way to input the correlation between the games. The problem is that the probability of getting the same result in a game and its reversed pair is higher than the W^2+D^2+L^2 implied by the variance computation (for perfectly equal engines, the total score over 2 games is 1). I will not show the expressions, they are long and not illuminating, but I seem to have detected an empirical factor to fit to the jackknifed data: the factor in the variance of the score=1 term. I would need some large PGN databases of games at reasonable time controls. People may stay away from approximate empirical constants, though, and if the naive approach doesn't work well, one would have to find the variance case by case, e.g. by jackknifing the current data.

Anyway, I am still amazed by your master formula; it must be exploited:

Code: Select all

(1/2)(score1-score0)(2*score-score0-score1)/Var(score)
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Error margins via resampling (jackknifing)

Post by cdani »

I don't understand most of what you are talking about but this:
Laskos wrote:I would need some large PGN databases of games at reasonable time controls.
Here you have a 280 MB file with self-play games between similar versions of Andscacs at various time controls:

www.andscacs.com/varis/games.zip

40_05 means the game is played at 40 seconds with a 0.05 second increment.

I hope this is the type of games you are searching for.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Error margins via resampling (jackknifing)

Post by Laskos »

cdani wrote:I don't understand most of what you are talking about but this:
Laskos wrote:I would need some large PGN databases of games at reasonable time controls.
Here you have a 280 MB file with self-play games between similar versions of Andscacs at various time controls:

www.andscacs.com/varis/games.zip

40_05 means the game is played at 40 seconds with a 0.05 second increment.

I hope this is the type of games you are searching for.
Excellent, Daniel, thank you very much! From what I see, you play from a 5-8 move balanced book with sides reversed, right? These databases will help with the balanced case; for unbalanced openings I have to rely on my small matches. I hope to jackknife some of your databases today. Thanks!

PS Congratulations on the beautiful Andscacs win over Komodo in TCEC!
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Error margins via resampling (jackknifing)

Post by cdani »

Laskos wrote:Excellent, Daniel, thank you very much! From what I see, you play from a 5-8 move balanced book with sides reversed, right? These databases will help with the balanced case; for unbalanced openings I have to rely on my small matches. I hope to jackknife some of your databases today. Thanks!
Yes, more or less balanced openings, played with sides reversed.
Laskos wrote:PS Congratulations on the beautiful Andscacs win over Komodo in TCEC!
Thanks! :-)
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Error margins via resampling (jackknifing)

Post by Michel »

I played a bit with the "naive" 5-nomial variance. The unbalanced variance is indeed smaller than the balanced one, but not to the extent given by jackknifing, where a factor of 2 in variance is common.
This is weird. There is nothing special about the jackknife. It is just a (non-parametric) estimator for the variance, so it should give the same value as any other estimator.

As I said, the 5-nomial model does not take into account correlations from repeated opening positions. But in that case the jackknife is also wrong, since it assumes independent, identically distributed trials.
Also, in the balanced case there is no way to input the correlation between the games.
Not in the 3-nomial case, no. But with the 5-nomial model it is perfectly possible, since you have 5 empirical frequencies to play with. For example, in the case of 100% correlation the outcome would be 1 (out of 2) with 100% probability.
Anyway, I am still amazed by your master formula; it must be exploited.
Thanks. With hindsight the formula can easily be explained heuristically. The formula is exact when computing the LLR for the mean of a normal distribution with known variance (and then it is standard).

If the variance is unknown, the formula says you are allowed to estimate it from the sample. This can be explained by observing that, for a normal distribution, the maximum likelihood estimators for the mean and the variance are uncorrelated.

Finally, by the central limit theorem, pretty much anything can be approximated by a normal distribution.
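For completeness, the normal-with-known-variance case mentioned above works out as follows (a standard computation, written out here for n i.i.d. samples):

```latex
\mathrm{LLR}
  = \sum_{i=1}^{n}\frac{(x_i-\theta_0)^2-(x_i-\theta_1)^2}{2\sigma^2}
  = \frac{n(\theta_1-\theta_0)(2\bar{x}-\theta_0-\theta_1)}{2\sigma^2}
  = \frac{1}{2}\,\frac{(\theta_1-\theta_0)(2\bar{x}-\theta_0-\theta_1)}
                      {\operatorname{Var}(\bar{x})},
\qquad \operatorname{Var}(\bar{x})=\frac{\sigma^2}{n}.
```

The middle step uses a^2 - b^2 = (a-b)(a+b) with a = x_i - \theta_0 and b = x_i - \theta_1, which recovers exactly the tracked expression.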
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.