Error margins via resampling (jackknifing)

Laskos · Post by **Laskos** » Fri Aug 12, 2016 5:50 pm

It occured to me that in engine matches I trust so much the theoretical SD = sqrt(W*(N-W)+L*(N-L)+2*W*L)/sqrt(N-1) that I am not aware of any empirical computation of standard deviation in chess matches via resampling like bootstrapping. In fact, awhile ago, when playing with unbalanced starting positions, I did have a gut feeling that the computed by formula error margins are well off, but I failed to formulate the problem in some meaningful way.

I used jackknifing on two databases of 2000 games each between recent Stockfishes. The games are played side and reverse (as is usual in testing) for each opening, so the games are not completely independent.

Two cases of starting positions, for which I expected a bit different results:

* Very balanced postions, less than 10cp unbalance
* Unbalanced positions of the order 120cp

1/ Balanced

Code: Select all

Score of SF2 vs SF1: 485 - 316 - 1199  [0.542] 2000
ELO difference: 29.43 +/- 9.61
Finished match

Computed by jackknifing: error margins (2SD) = 8.01 ELO points

1.20 times smaller than computed by theoretical formula

2/ Unbalansed:

Code: Select all

Score of SF2 vs SF1: 712 - 579 - 709  [0.533] 2000
ELO difference: 23.14 +/- 12.23
Finished match

Computed by jackknifing: error margins (2SD) = 5.22 ELO points

2.34 times smaller than computed by theoretical formula

Usually in testing people use fairly balanced positions, and I expect smaller error margins than shown by rating calculators by a factor 1.3-1.4. Also, AFAIK SPRT as used in SF Testing Framework doesn't take into account the correlations between games. If it took (but I don't know how to implement this knowledge in SPRT), the length of the match to a SPRT stop would occur almost two times faster. It is very plausible (based on older experiments and "new" error margins) that using unbalanced positions will shorten further the matches. One more interesting thing: many people, even developers, use rating calculators' 2SD error margins as stopping rule, and I was puzzled about their good progress with such bad stops. Now it seems that they are translated to 2.6-2.8SD in reality, close to 3SD stopping rule, which is a reasonable stopping rule for less than 5% Type I error up to reasonable number of games (although unbounded Type I error for ngames -> infinity).

Michel · Post by **Michel** » Sat Aug 13, 2016 12:01 pm

We already know that in case games are replayed with opposite colors, the variance computed as if the distribution is trinomial is bigger than the correctly computed variance of the elo estimator (which can indeed be obtained by resampling). Note for the Jackkife you should consider the sample as a set of pairs of games, in order to have independent trials.

I think the effect will be small for balanced positions but I have not checked it.

For unbalanced positions the elo estimator becomes biased so the variance of the estimator is not the only thing that should be considered.

Example. Assume the positions are so unbalanced that the outcome is determined beforehand (one color always a queen ahead). Then the calculated elo difference will be 0 with variance 0 (in case of replayed games) independently of the real elo difference. So it is a meaningless statistic.

Laskos · Post by **Laskos** » Sat Aug 13, 2016 6:29 pm

Michel wrote:We already know that in case games are replayed with opposite colors, the variance computed as if the distribution is trinomial is bigger than the correctly computed variance of the elo estimator (which can indeed be obtained by resampling).

I was confused about this issue, and it seems to me crucial in calculating p-values and suitable stop rules.

Note for the Jackkife you should consider the sample as a set of pairs of games, in order to have independent trials.

This is how I did, in pairs.

I think the effect will be small for balanced positions but I have not checked it.

Well, a factor of 1.2 for balanced, not extremely small, and probably increases with time control (for sure decreases in faster games). This is the same p-value for 1.44 less games.

For unbalanced positions the elo estimator becomes biased so the variance of the estimator is not the only thing that should be considered.

Example. Assume the positions are so unbalanced that the outcome is determined beforehand (one color always a queen ahead). Then the calculated elo difference will be 0 with variance 0 (in case of replayed games) independently of the real elo difference. So it is a meaningless statistic.

What is "real ELO difference"? It depends on openings and time control. What I am interested in is p-value or LOS that ELO2>ELO1 fixing those. LOS and p-value seem to have a well defined limit 0/0 even in that extreme case. If that limit is 0, it seems to have non-trivial shape as a function of balance of openings. I don't understand why p-value is not well defined here.

Michel · Post by **Michel** » Sat Aug 13, 2016 6:52 pm

Well, a factor of 1.2 for balanced, not extremely small, and probably increases with time control (for sure decreases in faster games). This is the same p-value for 1.44 less games.

You should test for engines of nearly equal strength. This is what is relevant for engine testing. In that case there should be very little difference - I think.

What is "real ELO difference"? It depends on openings and time control. What I am interested in is p-value or LOS that ELO2>ELO1 fixing those. LOS and p-value seem to have a well defined limit 0/0 even in that extreme case. If that limit is 0, it seems to have non-trivial shape as a function of balance of openings. I don't understand why p-value is not well defined here.

Well you need an elo model if you step away from the trinomial case (unpaired games with randomly selected openings). I am just pointing out that the variance of the estimator is not all that matters.

Laskos · Post by **Laskos** » Sat Aug 13, 2016 7:10 pm

Michel wrote:
Well, a factor of 1.2 for balanced, not extremely small, and probably increases with time control (for sure decreases in faster games). This is the same p-value for 1.44 less games.
You should test for engines of nearly equal strength. This is what is relevant for engine testing. In that case there should be very little difference - I think.

I tested earlier two consecutive Stockfishes (3-4 ELO difference) with very similar (1.2 for balanced, 1.9-2.3 for unbalanced of order 100cp) results. I posted this, a larger difference, because I hoped to get a significant result showing that unbalanced has larger t-value, but they turned out within error margins.

Well you need an elo model if you step away from the trinomial case (unpaired games with randomly selected openings). I am just pointing out that the variance of the estimator is not all that matters.

Can't I get rid of ELO model (linearize) for such small ELO differences?

Michel · Post by **Michel** » Sat Aug 13, 2016 9:36 pm

I tested earlier two consecutive Stockfishes (3-4 ELO difference) with very similar (1.2 for balanced, 1.9-2.3 for unbalanced of order 100cp) results. I posted this, a larger difference, because I hoped to get a significant result showing that unbalanced has larger t-value, but they turned out within error margins.

This is weird. If the outcome probabilities are symmetrical (w=l) then both methods for computing the variance (the incorrect 3-nomial or the correct 5-nomial) should give the same result.

Unless the paired games are not independent. Which I guess could happen for engines which are very close to each other (not just in elo but also in code). Note that even if the paired games are dependent the 5-nomial computation is still correct. IMHO there is no need to use the full jackknife.

Perhaps you are right that more attention should be paid to the correct calculation of the variance.

Michel · Post by **Michel** » Sun Aug 14, 2016 6:48 am

Also, AFAIK SPRT as used in SF Testing Framework doesn't take into account the correlations between games. If it took (but I don't know how to implement this knowledge in SPRT),

Well one could use the GSPRT for a 5-nomial distributiion. How to do this is explained here for a 3-nomial distribution (with sample code)

http://talkchess.com/forum/viewtopic.ph ... =elo+model

This model has 3 nuissance parameters.

One could also use the GSPRT for a model with parameters elo,drawelo,expected bias (2 nuissance parameter).

The second option does not defend against the possibility that paired games are not independent and moreover relies on the correctness of the elo model. It is an interesting question which of two approaches would have the best performance in practice.

Michel · Post by **Michel** » Sun Aug 14, 2016 7:40 am

Ah no, it seems one may an elo model anyway to design a test.

One wants to test elo=elo0 against elo=elo1.

In the case of unbalanced positions the score of a test is no longer an unbiased estimator for elo (and the bias depends on the bias of the positions). This is true whether games are paired or not. Think of the example of extremely unbalanced positions.

Perhaps this is a second order effect which comes only into play for very unbalanced positions. I have not checked this.

Laskos · Post by **Laskos** » Sun Aug 14, 2016 10:49 am

Michel wrote:Ah no, it seems one may an elo model anyway to design a test.

One wants to test elo=elo0 against elo=elo1.

In the case of unbalanced positions the score of a test is no longer an unbiased estimator for elo (and the bias depends on the bias of the positions). This is true whether games are paired or not. Think of the example of extremely unbalanced positions.

Perhaps this is a second order effect which comes only into play for very unbalanced positions. I have not checked this.

I don't think this is a huge problem. The ELO of the position of say 100cp is no larger than 200, and on this range all reasonable ELO models are pretty linear, the effect seems indeed second order and small. For much larger bias, it's not clear to me that the ELO bias of the position can be additively combined with the ELO strength of the engine, this is model-dependent, and needs some empirical data. As for ELO model itself, logistic seems not only the simplest, but also the most empirically confirmed for computer engines.

Meanwhile I found a database of Komodo TB versus Komodo no TB (meaning very close engines) at longer time control (blitz) to jackknive for another engine with balanced positions. The result is 1.34 smaller SD to this longer TC for Komodo compared to trinomial one. Up to now seems consistent.

Thanks for pointing to GSPRT, interesting, I will read it.

Michel · Post by **Michel** » Sun Aug 14, 2016 7:52 pm

After a long and complicated computation I came up with a surprisingly simple formula for the log likelihood ratio for the GSPRT in the multinomial case (for small score differences). Since any distribution can be approximated by a multinomial one I assume it is a general fact (and the proof should be much easier than one given here).

See the boxed formula at the end of this file

http://hardy.uhasselt.be/Toga/MLE_for_multinomial.pdf

Error margins via resampling (jackknifing)

Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)

Re: Error margins via resampling (jackknifing)