It occurred to me that in engine matches I trust the theoretical SD = sqrt(W*(N-W)+L*(N-L)+2*W*L)/sqrt(N-1) so much that I am not aware of any empirical computation of the standard deviation in chess matches via resampling, such as bootstrapping. In fact, a while ago, when playing with unbalanced starting positions, I had a gut feeling that the error margins computed by the formula were well off, but I failed to formulate the problem in a meaningful way.
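As a sanity check, the formula and a score-to-Elo conversion can be put into a short script. This is my own sketch, not any rating calculator's code: the formula above is the SD of (W - L), so the SD of the mean score per game is that value divided by 2N, and I convert a score interval to Elo via the slope of the logistic curve (calculators differ slightly in this conversion step, so the result is close to, but not exactly, the margins reported below).

```python
import math

def elo_margin(W, L, N, z=2.0):
    """z-SD error margin in Elo from the theoretical formula.
    The formula gives the SD of (W - L); the SD of the mean
    score per game is then that value divided by 2*N."""
    D = N - W - L
    sd_wl = math.sqrt(W*(N - W) + L*(N - L) + 2*W*L) / math.sqrt(N - 1)
    sd_score = sd_wl / (2.0 * N)        # SD of the mean score in [0, 1]
    s = (W + 0.5 * D) / N               # observed mean score
    # convert a small score interval to Elo via the logistic slope
    dElo_ds = 400.0 / (math.log(10) * s * (1.0 - s))
    return z * sd_score * dElo_ds

# Balanced case from the post: 485 - 316 - 1199 in 2000 games
print(round(elo_margin(485, 316, 2000), 2))
```

Run on the balanced match below this gives about 9.8 Elo, in the same ballpark as the calculator's 9.61.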
I used jackknifing on two databases of 2000 games each between recent Stockfishes. As is usual in testing, each opening is played twice with colors reversed, so the games are not completely independent.
Two sets of starting positions, for which I expected somewhat different results:
* Very balanced positions, less than 10cp of imbalance
* Unbalanced positions, on the order of 120cp
1/ Balanced:
Code:
Score of SF2 vs SF1: 485 - 316 - 1199 [0.542] 2000
ELO difference: 29.43 +/- 9.61
Finished match
Computed by jackknifing: error margins (2SD) = 8.01 ELO points
- 1.20 times smaller than given by the theoretical formula
2/ Unbalanced:
Code:
Score of SF2 vs SF1: 712 - 579 - 709 [0.533] 2000
ELO difference: 23.14 +/- 12.23
Finished match
Computed by jackknifing: error margins (2SD) = 5.22 ELO points
- 2.34 times smaller than given by the theoretical formula
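For reference, the pair-level jackknife looks roughly like this. It is a sketch with synthetic data, since the game databases aren't attached; `jackknife_elo_sd` and the made-up `pairs` numbers are mine for illustration only. The key point is that each opening contributes one observation, the summed score of its two reversed-color games, so the within-pair correlation survives when an observation is deleted:

```python
import math, random

def elo(s):
    """Logistic Elo difference from a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def jackknife_elo_sd(pair_scores):
    """Delete-one-pair jackknife SD of the Elo estimate.
    pair_scores[i] is the summed score (0..2) of the two games
    played from opening i, once from each side."""
    n = len(pair_scores)
    total = sum(pair_scores)
    # Elo estimate with pair i left out
    loo = [elo((total - x) / (2.0 * (n - 1))) for x in pair_scores]
    mean = sum(loo) / n
    var = (n - 1) / n * sum((v - mean) ** 2 for v in loo)
    return math.sqrt(var)

# Synthetic illustration: 1000 openings with correlated pair results
random.seed(1)
pairs = [min(2.0, max(0.0, random.gauss(1.08, 0.6))) for _ in range(1000)]
print("2SD error margin:", 2 * jackknife_elo_sd(pairs))
```

Feeding in real pair scores instead of the synthetic ones gives the jackknife margins quoted above.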
Usually people test with fairly balanced positions, so I expect the real error margins to be smaller than those shown by rating calculators by a factor of 1.3-1.4. Also, AFAIK the SPRT as used in the SF testing framework doesn't take the correlations between games into account. If it did (though I don't know how to build this knowledge into SPRT), the match would reach an SPRT stop almost twice as fast. It is very plausible (based on older experiments and the "new" error margins) that using unbalanced positions would shorten the matches even further.

One more interesting thing: many people, even developers, use the 2SD error margins from rating calculators as a stopping rule, and I was puzzled by their good progress with such bad stops. Now it seems those margins translate to 2.6-2.8SD in reality, close to a 3SD stopping rule, which is a reasonable rule for keeping the Type I error below 5% up to a reasonable number of games (although the Type I error is unbounded as ngames -> infinity).
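To make concrete what "SPRT treating games as independent" means, here is a minimal trinomial SPRT log-likelihood-ratio sketch. The H0/H1 outcome probabilities are taken as given inputs (real frameworks derive them from Elo via a draw model, which I don't reproduce here), and the per-game probability numbers are illustrative, not from any actual test:

```python
import math

def sprt_llr(W, D, L, p0, p1):
    """Trinomial SPRT log-likelihood ratio.
    p0, p1 are (win, draw, loss) probabilities under H0 and H1.
    Each game contributes independently to the sum -- exactly the
    assumption that paired reversed-color openings violate."""
    return (W * math.log(p1[0] / p0[0])
            + D * math.log(p1[1] / p0[1])
            + L * math.log(p1[2] / p0[2]))

def sprt_bounds(alpha=0.05, beta=0.05):
    """Accept-H0 / accept-H1 thresholds for the LLR."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

# Illustrative numbers only: H0 = equal strength, H1 = slightly stronger
p0 = (0.20, 0.60, 0.20)
p1 = (0.22, 0.60, 0.18)
lo, hi = sprt_bounds()
llr = sprt_llr(485, 1199, 316, p0, p1)
print(lo, llr, hi)   # stop and accept H1 once llr exceeds hi
```

If game results are positively correlated within pairs, the per-game LLR increments overstate the information per game, which is why a correlation-aware version could stop sooner for the same error guarantees.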