Wilo rating properties from FGRL rating lists
Posted: Mon May 01, 2017 7:26 pm
"Wilo" is the name given by Miguel Ballicora to drawless Elo (draws are discarded). It has been shown that a rating model based on Wilo is sound and that Wilos are additive under the logistic model, just as Elos are. No draw model is needed, since there are no draws. What I will show empirically from Andreas Strangmüller's FGRL Top 10 rating lists at 60''+0.6'' and 60'+15'' (roughly a 60x time factor between the two lists) is:
1/ Wilo ratings do not compress or dilate from STC to LTC, while Elo ratings do compress from STC to LTC.
2/ Wilo ratings give higher LOS (p-values), showing more sensitivity than Elo ratings, so fewer games are needed for the Wilo model to show significant differences between engines than for the Elo model.
Therefore considering draws as non-games is better both for the calibration of ratings (no time control dependence) and for the number of games needed for significance.
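To make the difference between the two ratings concrete, here is a minimal sketch of how an Elo and a Wilo difference can be computed from one match result with the logistic formula. The function names and the win/draw/loss counts are made up for illustration and are not taken from FGRL.

```python
import math

def elo_from_result(wins, draws, losses):
    """Logistic Elo difference, counting each draw as half a point."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Undefined for a 0% or 100% score (division by zero / log of zero).
    return 400.0 * math.log10(score / (1.0 - score))

def wilo_from_result(wins, losses):
    """Wilo difference: same logistic formula, but draws are not counted as games."""
    score = wins / (wins + losses)
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical match result (illustration only, not FGRL data).
wins, draws, losses = 300, 600, 100
elo = elo_from_result(wins, draws, losses)
wilo = wilo_from_result(wins, losses)
print(f"Elo: {elo:.1f}  Wilo: {wilo:.1f}")
```

With the same wins and losses, a higher draw count pulls the Elo difference toward zero, while the Wilo difference stays the same.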
------------------------------------
Results from FGRL rating lists ( http://www.fastgm.de/ ):
------------------------------------
I/ Compression/Dilation of ratings
Elo ratings:
Mean deviation of ratings 60''+0.6'': 119 Elo points
Mean deviation of ratings 60'+15'': 84 Elo points
Compression: 29.5% +/- 4% ---> significant compression
Wilo ratings (discarding draws):
Mean deviation of ratings 60''+0.6'': 251 Wilo points
Mean deviation of ratings 60'+15'': 260 Wilo points
Dilation: 3.6% +/- 8% ---> insignificant dilation
We see that, within error margins, Wilo ratings neither compress nor dilate when the time control is 60x longer. Elo ratings do compress significantly over the same 60x factor.
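The compression and dilation percentages are simply the relative change in the spread of the ratings between the two lists. A small sketch, assuming that "mean deviation" means the mean absolute deviation from the list average (the post does not spell this out) and using hypothetical function names:

```python
def mean_abs_deviation(ratings):
    """Mean absolute deviation of a rating list from its own average (assumed definition)."""
    mean = sum(ratings) / len(ratings)
    return sum(abs(r - mean) for r in ratings) / len(ratings)

def relative_change_percent(spread_stc, spread_ltc):
    """Relative change of the spread from STC to LTC: negative = compression, positive = dilation."""
    return 100.0 * (spread_ltc - spread_stc) / spread_stc

# Rounded spreads quoted in the post.
elo_change = relative_change_percent(119, 84)
wilo_change = relative_change_percent(251, 260)
print(f"Elo: {elo_change:+.1f}%  Wilo: {wilo_change:+.1f}%")
```

From the rounded spreads this gives about 29.4% compression for Elo, close to the 29.5% quoted (presumably computed from unrounded values), and about 3.6% dilation for Wilo.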
II/ Norm (regular Frobenius) of LOS (p-values) matrix.
The minimal norm of the p-value matrix is 0 (all engines are perfectly equal in strength, LOS=50% between any two engines, the null hypothesis is accepted with 100% probability).
The maximal norm of the 10x10 symmetric p-value matrix is sqrt(90) ~ 9.487 (all engines are clearly separated by strength, LOS=100% between any two engines, the null hypothesis is rejected with 100% probability).
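The two bounds above can be reproduced with a short sketch. It rests on two assumptions of mine, since the post does not give the exact construction: LOS is taken from the normal approximation commonly used in computer chess (based on wins and losses only), and each off-diagonal matrix entry is |2*LOS - 1|, which is symmetric, 0 at LOS=50% and 1 at LOS=100% (or 0%). The function names are hypothetical.

```python
import math

def los(wins, losses):
    """Likelihood of superiority, normal approximation; draws do not enter the formula."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def separation_norm(los_matrix):
    """Frobenius norm of the matrix with entries |2*LOS - 1| (assumed), diagonal excluded."""
    n = len(los_matrix)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                entry = abs(2.0 * los_matrix[i][j] - 1.0)
                total += entry * entry
    return math.sqrt(total)

n = 10
# All engines equal: LOS = 50% everywhere.
all_equal = [[0.5] * n for _ in range(n)]
# All engines fully separated: the higher-ranked engine always has LOS = 100%.
all_separated = [[1.0 if i < j else 0.0 for j in range(n)] for i in range(n)]

print(separation_norm(all_equal))      # lower bound
print(separation_norm(all_separated))  # upper bound, sqrt(90)
```

A 10x10 matrix has 90 off-diagonal entries, which is where the sqrt(90) upper bound comes from.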
Elo list 60''+0.6'': Norm of p-value matrix is 9.15
Wilo list 60''+0.6'': Norm of p-value matrix is 9.22
The Wilo list shows generally higher LOS values (more sensitivity).
Elo list 60'+15'': Norm of p-value matrix is 9.09
Wilo list 60'+15'': Norm of p-value matrix is 9.13
The Wilo list again shows higher sensitivity.
Both comparisons confirm that Wilo ratings are more sensitive, so fewer games are needed for the same confidence than with Elo ratings.
III/ Scaling
With Wilo empirically shown to be a better rating system than Elo for computer chess on our FGRL data, being independent of time control and more sensitive, the table at the end of this post gives the scaling of the top 10 engines between bullet (60''+0.6'') and LTC (60'+15''), a time factor of roughly 60x.
The highly significant results are that Andscacs 0.90 is the best-scaling engine, with Fritz 15 and Fizbo 1.9 the worst-scaling. Also, Houdini 5 scales significantly worse than the other top engines.
     Engine                Scaling (Wilo)
--------------------------------------------
 1   Andscacs 0.90       :    91
 2   Komodo 10.4         :    39
 3   Stockfish 8         :    30
 4   Deep Shredder 13    :    19
 5   Fire 5              :    18
 6   Gull 3              :     2
 7   Houdini 5.01        :   -22
 8   Chiron 4            :   -24
 9   Fizbo 1.9           :   -76
10   Fritz 15            :   -77