It is anecdotally known that the reliability of the estimate we can infer from a series of games between 2 engines varies from one series to another.
Here I would like to introduce a practical way to "measure" this reliability.
Say we have a series S0 of 2000 games between 2 engines. For each game we store the result as Win/Loss/Draw. We feed this series to an ELO estimator (Bayesian or something else, it does not matter as long as it converges). The ELO estimator outputs an estimated ELO difference between the 2 engines, call it Elo0.
Now we split the series in 2: the first 1000 games and the second 1000 games. We feed these 2 sub-series (S1_0, S1_1) to the same ELO estimator and get the corresponding 2 estimated ELO differences: Elo1_0, Elo1_1.
We repeat the previous step, splitting each S1 series in 2 series of 500 games each, getting a total of 4 sub-series: S2_0, S2_1, S2_2, S2_3. As in the previous step we run the ELO estimator on them, getting the 4 estimated ELO differences: Elo2_0, Elo2_1, Elo2_2, Elo2_3.
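As a sketch of the splitting procedure, here is some Python. The names (elo_from_results, split_levels) are my own, and the estimator is a deliberately simplified one that just inverts the logistic expected-score formula, not BayesElo:

```python
import math
import random

def elo_from_results(results):
    """Very simplified ELO estimator: invert the logistic expected-score
    formula on the mean score (1.0 = win, 0.5 = draw, 0.0 = loss)."""
    score = sum(results) / len(results)
    # Clamp away from 0 and 1 so the logarithm stays finite on lopsided sub-series.
    score = min(max(score, 1e-3), 1.0 - 1e-3)
    return -400.0 * math.log10(1.0 / score - 1.0)

def split_levels(series, depth):
    """Level 0 is the full series; level k holds the 2^k equal
    sub-series obtained by repeated halving."""
    levels = [[series]]
    for _ in range(depth):
        nxt = []
        for s in levels[-1]:
            h = len(s) // 2
            nxt.extend([s[:h], s[h:]])
        levels.append(nxt)
    return levels

# Example with a simulated 2000-game series (engine A scores ~57.5%).
random.seed(42)
S0 = [random.choices([1.0, 0.5, 0.0], weights=[45, 25, 30])[0] for _ in range(2000)]
levels = split_levels(S0, 2)               # [S0], [S1_0, S1_1], [S2_0 .. S2_3]
elos = [[elo_from_results(s) for s in lvl] for lvl in levels]
print(elos[0][0])                          # Elo0, the full-series estimate
```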
We could divide further, but for this example we stop here. Now we consider the following quantity:
Code:
(Elo0 - Elo0)/1 + (Elo1_0 - Elo0)/2 + (Elo1_1 - Elo0)/2 + (Elo2_0 - Elo0)/4 + (Elo2_1 - Elo0)/4 + (Elo2_2 - Elo0)/4 + (Elo2_3 - Elo0)/4
Because we are interested in a measure of the power of the noise, we consider the square of each term getting (first term is zero):
Code:
Noise2 = (Elo1_0 - Elo0)^2/4 + (Elo1_1 - Elo0)^2/4 + (Elo2_0 - Elo0)^2/16 + (Elo2_1 - Elo0)^2/16 + (Elo2_2 - Elo0)^2/16 + (Elo2_3 - Elo0)^2/16
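Given the six sub-series estimates, Noise2 is straightforward to compute. A small Python sketch (the numeric ELO values below are purely hypothetical, just for illustration):

```python
def noise2(elo0, elo1, elo2):
    """Noise2 from the full-series estimate elo0, the two half-series
    estimates elo1 and the four quarter-series estimates elo2."""
    total = sum((e - elo0) ** 2 / 4.0 for e in elo1)    # level 1: weight 1/2^2
    total += sum((e - elo0) ** 2 / 16.0 for e in elo2)  # level 2: weight 1/4^2
    return total

# Hypothetical estimates, in ELO points:
print(noise2(20.0, [25.0, 15.0], [30.0, 10.0, 22.0, 18.0]))   # -> 25.5
```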
My guess is that this NoiseX quantity:
1) Is independent of the 2 engines under test; in particular it is independent of their ELO difference.
2) Converges to zero when the number of games goes to infinity (this is a consequence of the convergence of the ELO estimator on the sub-series).
3) Is non-negative, and is higher when the individual sub-series have widely scattered results; it is smaller when the individual sub-series closely approach the full-series estimate.
4) At the moment it does not converge with an increasing number of splits, because of the series of n terms a1/n^2 + a2/n^2 + ... + an/n^2, which is asymptotically similar to n*ax/n^2 = ax/n, and that is known not to converge. But perhaps with a steeper denominator we could prove that NoiseN is convergent for N -> infinity. This would be a very important result, because we could extrapolate a NoiseINFINITY level that is independent of the number of games played and due only to the quality of the testing framework!
In practical terms it would be very interesting to add some code to, say, cutechess to calculate Noise2 (or NoiseX, with X as an input parameter) on the fly while games are in progress (like it already does with ELO), so as to verify that:
a) For a given 'order' of Noise, say 2, Noise2 decreases as the number of played games increases (the limit is zero at infinity).
b) For a fixed number of games, say 5000, a very noisy environment yields a higher Noise2 figure than a quality test framework.
c) Given a) and b), to find an equivalent reliability between two different testing frameworks with different numbers of games. IOW, to be able to say that with testing framework A (which is very noisy) I need to play 10000 games to reach the same reliability that testing framework B (which is good) gives with just 2000 games: this happens when the Noise2 of the two (framework, number of games) pairs is the same.
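As a rough sketch of what such on-the-fly tracking could look like (simulated games standing in for a real cutechess hook; the simplified logistic estimator is my assumption):

```python
import math
import random

def elo_from_results(results):
    """Very simplified logistic ELO estimate from per-game scores."""
    score = sum(results) / len(results)
    score = min(max(score, 1e-3), 1.0 - 1e-3)
    return -400.0 * math.log10(1.0 / score - 1.0)

def noise2(results):
    """Noise2 (two levels of halving) over the games seen so far."""
    elo0 = elo_from_results(results)
    h = len(results) // 2
    halves = [results[:h], results[h:]]
    quarters = []
    for half in halves:
        q = len(half) // 2
        quarters.extend([half[:q], half[q:]])
    total = sum((elo_from_results(s) - elo0) ** 2 / 4.0 for s in halves)
    total += sum((elo_from_results(s) - elo0) ** 2 / 16.0 for s in quarters)
    return total

# Simulate 5000 games arriving one by one; report Noise2 every 1000 games,
# much like the periodic ELO report a tournament manager prints.
random.seed(1)
results = []
for game in range(1, 5001):
    results.append(random.choices([1.0, 0.5, 0.0], weights=[40, 30, 30])[0])
    if game % 1000 == 0:
        print(game, round(noise2(results), 2))
```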