Noise in ELO estimators: a quantitative approach
Posted: Sun Jan 06, 2013 6:46 pm
It is anecdotally known that the noise level influences the convergence speed of ELO estimators: playing 1000 games in a controlled, single-threaded environment is not the same as playing the same 1000 games in a very noisy one, say 8 threads per engine on a dual core with pondering on, possibly with some heavy activity in the background (like watching a movie).
It is anecdotally known that the reliability of the estimate that can be inferred from the two game series is different.
Here I would like to introduce a practical way to "measure" this reliability.
Say we have a series S0 of 2000 games between two engines. For each game we store the result as Win/Loss/Draw. We feed this series to an ELO estimator (Bayesian or otherwise; it does not matter, as long as it converges). The estimator outputs an estimated ELO difference between the two engines; call it Elo0.
Now we split the series in two: the first 1000 games and the second 1000 games. We feed these two sub-series (S1_0, S1_1) to the same ELO estimator and get the corresponding two estimated ELO differences: Elo1_0 and Elo1_1.
We repeat the previous step, splitting each S1 sub-series into two series of 500 games, for a total of 4 sub-series: S2_0, S2_1, S2_2, S2_3. As in the previous step, we run the ELO estimator on them, getting the 4 estimated ELO differences: Elo2_0, Elo2_1, Elo2_2, Elo2_3.
We could further divide, but for this example we stop here. Now consider the following quantity:

Code: Select all
(Elo0 - Elo0)/1 + (Elo1_0 - Elo0)/2 + (Elo1_1 - Elo0)/2 + (Elo2_0 - Elo0)/4 + (Elo2_1 - Elo0)/4 + (Elo2_2 - Elo0)/4 + (Elo2_3 - Elo0)/4

The above represents the sum of the errors, where each error is the difference between the best ELO estimate (Elo0) and an ELO estimate computed from a sub-series, divided (weighted) by the number of sub-series of the same size.
Because we are interested in a measure of the power of the noise, we take the square of each term, getting (the first term is zero):

Code: Select all
Noise2 = (Elo1_0 - Elo0)^2/4 + (Elo1_1 - Elo0)^2/4 + (Elo2_0 - Elo0)^2/16 + (Elo2_1 - Elo0)^2/16 + (Elo2_2 - Elo0)^2/16 + (Elo2_3 - Elo0)^2/16

where Noise2 is the power of the noise computed with 2 successive splits. We can likewise compute Noise3 and, in general, NoiseN.
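The whole procedure can be sketched in a few lines of Python. This is only an illustration: a plain logistic score-to-ELO conversion stands in for the real estimator (BayesElo or whatever is actually used), and the +1/0/-1 result encoding, as well as the function names, are my own convention:

```python
import math

def elo_estimate(results):
    """Estimate the ELO difference from a list of game results,
    each +1 (win), 0 (draw) or -1 (loss) for the first engine.
    A plain logistic model stands in for the real estimator."""
    score = sum((r + 1) / 2 for r in results) / len(results)  # scoring rate in [0, 1]
    score = min(max(score, 1e-6), 1 - 1e-6)                   # avoid infinities at 0 or 1
    return -400 * math.log10(1 / score - 1)

def noise(results, depth):
    """NoiseN as defined above: repeatedly halve the series,
    estimate ELO on every sub-series, and sum the squared
    deviations from Elo0, each weighted by 1 / (number of
    sub-series at that level) squared."""
    elo0 = elo_estimate(results)
    total = 0.0
    series = [results]
    for level in range(1, depth + 1):
        # split every sub-series of the previous level into two halves
        series = [half for s in series
                  for half in (s[:len(s) // 2], s[len(s) // 2:])]
        k = len(series)  # 2**level sub-series at this level
        total += sum((elo_estimate(s) - elo0) ** 2 / k ** 2 for s in series)
    return total
```

With depth=2 this reproduces the Noise2 formula above: the two level-1 terms are weighted 1/4 and the four level-2 terms 1/16. A perfectly balanced series gives zero noise, since every sub-series produces the same estimate as the full series.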
My guess is that this NoiseN quantity:
1) Is independent of the two engines under test; in particular, it is independent of their ELO difference.
2) Converges to zero as the number of games goes to infinity (this is a consequence of the ELO estimator converging on the sub-series).
3) Is non-negative, and is larger when the individual sub-series have widely spread results; it is smaller when each sub-series closely approximates the full series.
4) At the moment does not obviously converge as the number of splits increases, because a level of n terms a1/n^2 + a2/n^2 + ... + an/n^2 is asymptotically similar to n*a/n^2 = a/n, and the resulting series of such terms (a harmonic series) is known not to converge; but perhaps with a steeper denominator we could prove that NoiseN converges for N -> infinity. This would be a very important result, because we could then extrapolate a NoiseINFINITY level that is independent of the games played and due only to the quality of the testing framework!
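The worry in point 4, and the hope that a steeper denominator would fix it, is the classic contrast between the harmonic series and the convergent series of inverse squares; a few partial sums (with a = 1 for simplicity) illustrate it:

```python
# Partial sums of 1/n (the rough per-level contribution from point 4,
# taking a = 1) versus the "steeper denominator" variant 1/n^2.
limits = (10, 100, 1000, 10000)
harmonic = [sum(1 / n for n in range(1, N + 1)) for N in limits]
steeper = [sum(1 / n ** 2 for n in range(1, N + 1)) for N in limits]

# The harmonic partial sums keep growing by roughly log(10) ~ 2.3 per
# decade of terms, while the 1/n^2 sums settle toward pi^2/6 ~ 1.6449.
```

So if the squared errors per level were bounded, a 1/k^2-style weighting over levels would indeed be enough for NoiseN to converge; whether they are bounded in practice is exactly the open question.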
In practical terms it would be very interesting to add some code to, say, cutechess, to compute Noise2 (or NoiseN, with N an input parameter) on the fly while games are in progress (as it already does with ELO), so as to verify that:
a) For a given 'order' of noise, say 2, Noise2 decreases as the number of played games increases (the limit is zero at infinity).
b) For a fixed number of games, say 5000, a very noisy environment shows a higher Noise2 figure than a quality testing framework.
c) Given a) and b), we can establish an equivalent reliability between two different testing frameworks with different numbers of games. IOW, we would be able to say that on testing framework A (which is very noisy) I need to play 10000 games to reach the same reliability that testing framework B (which is good) gives with just 2000 games: this happens when the Noise2 of the two (framework, number of games) pairs is the same.
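The on-the-fly idea could be prototyped outside the tournament manager first. Below is a hypothetical monitor (the class and method names are mine, and the same simple logistic stand-in estimator is assumed rather than cutechess's actual ELO code): feed it one result per finished game and query the current noise figure whenever desired.

```python
import math

def elo(results):
    # simple logistic stand-in estimator, not the real ELO code
    score = sum((r + 1) / 2 for r in results) / len(results)
    score = min(max(score, 1e-6), 1 - 1e-6)
    return -400 * math.log10(1 / score - 1)

class NoiseMonitor:
    """Running NoiseN tracker: feed it results one game at a time,
    as a tournament manager such as cutechess could."""

    def __init__(self, order=2):
        self.order = order    # number of successive splits (2 -> Noise2)
        self.results = []     # +1 win, 0 draw, -1 loss for engine A

    def add_game(self, result):
        self.results.append(result)

    def noise(self):
        # assumes at least 2**order games have been recorded,
        # so that no sub-series is empty
        elo0 = elo(self.results)
        total, series = 0.0, [self.results]
        for _ in range(self.order):
            series = [half for s in series
                      for half in (s[:len(s) // 2], s[len(s) // 2:])]
            k = len(series)
            total += sum((elo(s) - elo0) ** 2 / k ** 2 for s in series)
        return total
```

Logging monitor.noise() every, say, 100 games would directly test expectation a): on a quiet framework the curve should fall toward zero, while on a noisy one it should stay visibly higher at the same game count, which is expectation b).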