## Error margins via resampling (jackknifing)

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote:
I played a bit with "naive" 5-nomial variance, the unbalanced is indeed smaller than balanced, but not to extent given by jackknifing, where a factor of 2 in variance is common.
This is weird. There is nothing special about the jackknife. It is just a (non parametric) estimator for the variance. So should give the same value as any other estimator.

As I said the 5-nomial model does not take into accound correllations between repeated opening positions. But in that case the jackknife is also wrong since it assumes independent identically distributed trials.
Usually not the case, most of databases are drawn from number of available openings much larger than the number of games.
Also, in case of balanced, there is no way to input correlation between the games.
Not in the 3-nomial cas no. But with the 5-nomial model it is perfectly possible since you have 5 empiric frequencies to play with. For example in case of 100% correllation the outcome would be 1 (out of 2) with 100% probability.
In case of completely balanced openings and equal engines W1=W2=L1=L2 there is no difference between trinomial and 5-nomial, but I found by jackknifing a systematic and stable compression of 12-15% in variance for balanced (both my "ultrabalanced" openings and "2moves_v1" from Stockfish testing framework) compared to naive trinomial. Naive 5-nomial doesn't account for compression of variance for completely balanced and equal. Also, the common factor of 2 in variance for unbalanced of order 90cp given by jackknifing is hard to come by with naive 5-nomial. I seem to have a modified empirical form of 5-nomial with just 1 empirical parameter which fits well the jackknifed data for all sorts of openings and strength differences I fed. But I guess jackknifing data on the test run is even more efficient.
Anyway, I am still amazed by your master formula, this must be exploited
Thanks. With hindsight the formula can be easily explained heuristically. The formula is exact in case we are computing the LLR for the mean of a normal distribution with known variance (and then it is standard).

If the variance is unknown then the formula says you are allowed to estimate it from the sample. This can be explained by observing that for a normal distribution the maximum likelihood estimators for the mean and the variance are uncorrelated.

Finally using the law of large numbers pretty much anything can be approximated by a normal distribution.

Michel
Posts: 2081
Joined: Sun Sep 28, 2008 11:50 pm

### Re: Error margins via resampling (jackknifing)

Usually not the case, most of databases are drawn from number of available openings much larger than the number of games.
Are you sure about that? For verifying small elo differences one often needs 50000-100000 games. I don't think the books that are being used are _that_ large (but I have not checked it).
In case of completely balanced openings and equal engines W1=W2=L1=L2 there is no difference between trinomial and 5-nomial
There _is_ a difference in case of correllations. In that case the 5-nomial empiric frequencies will _not_ be of the form

Code: Select all

``````0&#58;l1*l2
1/2&#58;l1*d2+d1*l2
1&#58;l1*w2+d1*d2+w1*l2
3/2&#58;d1*w2+w1*d2
2&#58;w1*w2 ``````
That is why you have to measure the 5-nomial frequencies, rather than computing them from the 3-nomial ones.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote:
Usually not the case, most of databases are drawn from number of available openings much larger than the number of games.
Are you sure about that? For verifying small elo differences one often needs 50000-100000 games. I don't think the books that are being used are _that_ large (but I have not checked it).
I expressed it incorrectly, I meant most of databases I used to measure variance. I real world like SF testing framework, the number of games and openings is comparable
In case of completely balanced openings and equal engines W1=W2=L1=L2 there is no difference between trinomial and 5-nomial
There _is_ a difference in case of correllations. In that case the 5-nomial empiric frequencies will _not_ be of the form

Code: Select all

``````0&#58;l1*l2
1/2&#58;l1*d2+d1*l2
1&#58;l1*w2+d1*d2+w1*l2
3/2&#58;d1*w2+w1*d2
2&#58;w1*w2 ``````
That is why you have to measure the 5-nomial frequencies, rather than computing them from the 3-nomial ones.
I meant these, as you put them down, as frequencies for "naive 5-nomial", and these frequencies don't fit well the jackknifed data. To input correlations I added just one empirical parameter to one point in two games frequency, and the variance already fits well the jackknifed variance. But this parameter is not of such universality, for example it depends very mildly on time control and engine (besides other dependencies).

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote:
Usually not the case, most of databases are drawn from number of available openings much larger than the number of games.
Are you sure about that? For verifying small elo differences one often needs 50000-100000 games. I don't think the books that are being used are _that_ large (but I have not checked it).
I expressed it incorrectly, I meant most of databases I used to measure variance. I real world like SF testing framework, the number of games and openings is comparable
In case of completely balanced openings and equal engines W1=W2=L1=L2 there is no difference between trinomial and 5-nomial
There _is_ a difference in case of correllations. In that case the 5-nomial empiric frequencies will _not_ be of the form

Code: Select all

``````0&#58;l1*l2
1/2&#58;l1*d2+d1*l2
1&#58;l1*w2+d1*d2+w1*l2
3/2&#58;d1*w2+w1*d2
2&#58;w1*w2 ``````
That is why you have to measure the 5-nomial frequencies, rather than computing them from the 3-nomial ones.
I meant these, as you put them down, as frequencies for "naive 5-nomial", and these frequencies don't fit well the jackknifed data. To input correlations I added just one empirical parameter to one point in two games frequency, and the variance already fits well the jackknifed variance. But this parameter is not of such universality, for example it depends very mildly on time control and engine (besides other dependencies).
I will try later today to input counted 5-nomial frequencies to variance and compare to the jackknifed one. I mistook the term "naive" for derived from trinomial.

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Laskos wrote:I will try later today to input counted 5-nomial frequencies to variance and compare to the jackknifed one. I mistook the term "naive" for derived from trinomial.
Yes, 5-nomial with counted pairs of games fits well the jackknifed values. For the balanced ones the variance, as in the data, is 10%-15% smaller than the trinomial one, for unbalanced, a factor of 2-2.5 compared to trinomial. This adequacy of 5-nomial is great, and should be used for side - reversed matches. Also, it means that unbalanced positions will stop much faster (are more relevant), as from all my past tests, the (w-l) value is not much different in cases balanced-unbalanced.

PS Thanks for bringing 5-nomial upfront immediately, I had for a year or two only gut feeling that the games must be counted pairwise, but had very weird ideas how (for example, awhile ago, counting 50% outcome in 2 games as 2 draws, and still using trinomial frequencies).

Michel
Posts: 2081
Joined: Sun Sep 28, 2008 11:50 pm

### Re: Error margins via resampling (jackknifing)

Yes, 5-nomial with counted pairs of games fits well the jackknifed values.
Great. So theory and practice agree!
For the balanced ones the variance, as in the data, is 10%-15% smaller than the trinomial one, for unbalanced, a factor of 2-2.5 compared to trinomial. This adequacy of 5-nomial is great, and should be used for side - reversed matches. Also, it means that unbalanced positions will stop much faster (are more relevant),
You are right that all other things being equal, there is no down side to computing the variance correctly and it makes the (G)SPRT more efficient. If your findings are confirmed (I hope people try) then it shows that the trinomial model is quite wrong for paired games in typical testing scenarios and should be replaced by the 5-nomial model.
as from all my past tests, the (w-l) value is not much different in cases balanced-unbalanced.
Well an elo model can make some predictions about this. Alas not about the effect of correlations (your tests seem to indicate this is the dominant effect). So this is uncharted territory I think.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote:
Yes, 5-nomial with counted pairs of games fits well the jackknifed values.
Great. So theory and practice agree!
For the balanced ones the variance, as in the data, is 10%-15% smaller than the trinomial one, for unbalanced, a factor of 2-2.5 compared to trinomial. This adequacy of 5-nomial is great, and should be used for side - reversed matches. Also, it means that unbalanced positions will stop much faster (are more relevant),
You are right that all other things being equal, there is no down side to computing the variance correctly and it makes the (G)SPRT more efficient. If your findings are confirmed (I hope people try) then it shows that the trinomial model is quite wrong for paired games in typical testing scenarios and should be replaced by the 5-nomial model.
as from all my past tests, the (w-l) value is not much different in cases balanced-unbalanced.
Well an elo model can make some predictions about this. Alas not about the effect of correlations (your tests seem to indicate this is the dominant effect). So this is uncharted territory I think.
Yes, 5-nomial fits the jackknifed data _very_ well, to 0-3% in accuracy in variance across the tests. 3-nomial is off by minimum 8% for ultrabalanced and up to a factor of 2.5 in variance for unbalanced. I hope other people will confirm these findings, and both your "master LLR formula" and 5-nomial variance will be included in tests with side - reversed games (most of tests people do). I have some small tools for performing jackknifing and I have sets of openings ranging from extremely balanced to very unbalanced, if people are interested, just PM me.

There is an issue with repeated openings, where both multinomial and jackknifing fail. In some cases it's not important, when the number of openings is much larger than the number of games, but say in SF framework the sets are comparable in size. I took the following approach: I played sequentially the same opening twice (side-reversed) TWO times, total 4 consecutive games from the same opening. I performed jackknifing by slicing correctly in consecutive 8 games, 4 games, and incorrectly in only 2 consecutive games, and computed variances. Slicing in 2 games is equivalent to the jackknifing of a database with repeated openings where we don't know the place of repeats.

Fixed time control (ultra-fast 2''+0.02''):

Variance for jackknife with 8 games: 0.0378226
Variance for jackknife with 4 games: 0.0377792
Variance for jackknife with 2 games: 0.0366197

With fixed time, the error from incorrect sampling is 3%.

Sanity check - fixed nodes, stronger correlation between games:

Variance for jackknife with 8 games: 0.0515040
Variance for jackknife with 4 games: 0.0515589
Variance for jackknife with 2 games: 0.0279226

With fixed nodes the wrong jackknifing with 2 games gives a factor of 1.8 smaller variance than the real one.

All in all, at least at ultra-fast time controls, repeated openings seem to not be a large issue (but only one test performed).

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote: Well an elo model can make some predictions about this. Alas not about the effect of correlations (your tests seem to indicate this is the dominant effect). So this is uncharted territory I think.
Without correlations it is straightforward. I took Bayeselo model (ELO model, draw model, bias model) and found that for

eloDraw > (400/log(10)) * log(2+sqrt(3)) ~ 228.779 ELO points,

there is a non-trivial maximum function of eloBias in (w-l)/(w0-l0), where w,l are win/loss rates for unbalanced (biased) openings, w0,l0 are for eloBias=0 (perfectly balanced). (w-l) and eloDelta are assumed to be small. In the SF testing, typical eloDraw is 300 ELO points, above the critical eloDraw. I plot here the (w-l)/(w0-l0) ratio, which is independent of eloDelta for small eloDelta. As a function of eloBias and for eloDraw=300:

The correlations will change the picture, but I guess the curve will stay close to the value of 1. It stays anyway close to 1 for wide range of eloDraw and eloBias, only that there is no non-trivial maximum for (w-l)/(w0-l0) below critical eloDraw of ~229 ELO points.

Michel
Posts: 2081
Joined: Sun Sep 28, 2008 11:50 pm

### Re: Error margins via resampling (jackknifing)

Thanks! I did not compute anything but drew some graphs.

I indeed get a similar picture as you for the limit of elo->0 of (w-l)/sigma(w-l)/elo (which determines the sensitivity of a test) in the BayesElo model. For larger draw_elo it becomes more and more advantageous to use unbalanced positions (but _only_ if the variance is correctly computed). The optimal bias depends on the value of draw_elo.

This is of course without correlations (elo models do not account for correlations).
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.

Posts: 9762
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Re: Error margins via resampling (jackknifing)

Michel wrote:Thanks! I did not compute anything but drew some graphs.

I indeed get a similar picture as you for the limit of elo->0 of (w-l)/sigma(w-l)/elo (which determines the sensitivity of a test) in the BayesElo model. For larger draw_elo it becomes more and more advantageous to use unbalanced positions (but _only_ if the variance is correctly computed). The optimal bias depends on the value of draw_elo.

This is of course without correlations (elo models do not account for correlations).
Yes, with Bayeselo model, only by computing the correct 5-nomial variance, and for eloDraw above the critical 229 ELO points, the (w-l)/sigma(w-l) shows improvement with unbalanced (biased) openings for certain eloDraw and eloBias. I tried to see empirically if this is not ruined by correlations, but it seems Bayeselo model describes well even the full (correlations) cases. I observed that for very large eloDraw and eloBias, (w-l)/(w0-l0) (where w0-l0 is for balanced case) is no longer tame and close to 1, so Bayeselo model and the advantage of biased are testable. Here is the plot for (w-l)/(w0-l0) as a function of eloDraw and eloBias on a large range according to Bayeselo:

First test was an equivalent of LTC test for SF testing framework, where eloDraw is about 350 ELO points. For openings bias of order of 100cp (about 300 ELO points) the predicted by Bayeselo enhancement in (w-l)/(w0-l0) is about a factor of 1.4. The result is:

Code: Select all

``````Balanced&#58;
Score of SF2 vs SF1&#58; 291 - 178 - 1531  &#91;0.528&#93; 2000
ELO difference&#58; 19.65    +/- 3.67 1SD error by 3-nomial
19.65    +/- 3.35 1SD error by 5-nomial

t-value=5.87

Unbalanced&#58;
Score of SF2 vs SF1&#58; 750 - 596 - 654  &#91;0.538&#93; 2000
ELO difference&#58; 26.81    +/- 6.25 1SD error by 3-nomial
26.81    +/- 3.65 1SD error by 5-nomial

t-value=7.34``````
The values are in reasonable agreement with Bayeselo model, but still within error margins, and I cannot play much more LTC games. To check for large factors of (w-l)/(w0-l0) appearing for eloDraw of order of 600 ELO points, I need very drawish matches, and I played from balanced and unbalanced early endgame positions, they have a large eloDraw. eloDraw now is about 600 ELO points, eloBias about 500 ELO points (about 150cp advantage). The predicted by Bayeselo enhancement in (w-l)/(w0-l0) is about a factor of 4.The result is:

Code: Select all

``````Balanced&#58;
Score of SF2 vs SF1&#58; 58 - 33 - 1909  &#91;0.506&#93; 2000
ELO difference&#58; 4.34    +/- 1.62 1SD error by 3-nomial
4.34    +/- 1.52 1SD error by 5-nomial

t-value=2.86

Unbalanced&#58;
Score of SF2 vs SF1&#58; 693 - 563 - 744  &#91;0.532&#93; 2000
ELO difference&#58; 22.62    +/- 6.03 1SD error by 3-nomial
22.62    +/- 2.96 1SD error by 5-nomial

t-value=7.64
``````
Now the agreement with Bayeselo model is beyond error margins, and in this case the number of games for the same t- or p-value is reduced by a factor of 7 with biased postions. This might be important in the future of testing and tournaments in 10-20 years, when eloDraw will routinely be 500-600 points.
--------------------

All in all: it seems correlations don't spoil the predicted by Bayeselo behavior of (w-l)/sigma(w-l).