There is a simple rule for bad data science: throw away the data that doesn't fit your theory.
It makes perfect sense that STC-to-LTC compression exists, because it shows up in real data. Given more time, any entity can improve its answer, but the best don't gain as much from extra time as a weaker entity does: once you have found the right answer, extra time is not needed.
Against much weaker players, I handicap the matches with time odds. It really does reduce the strength gap.
There is too much data showing that STC-to-LTC compression is real. And your error margins being worse than the Elo error margins suggests that the Elo system is the better measure.
Wilo rating properties from FGRL rating lists
Moderators: hgm, Rebel, chrisw
-
- Posts: 2056
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Wilo rating properties from FGRL rating lists
I did the computation for Davidson and in that case elo seems to be proportional to eps/a (with the notation introduced above). So Davidson behaves like pure wilo for small elo differences.
In this discussion we have to realize that an elo model is just a mapping (w,d)-->elo. Nothing more!!!
To me normalized elo is the most objective one, as it represents the effort needed to separate one engine from another with a given LOS.
This means that normalized elo is precisely what is important for testing and it would be more logical for fishtest to use normalized elo bounds instead of BayesElo bounds.
Another objective elo measure is the time handicap needed to equalize elo (which can be defined independently of any elo model as w=l). It would be interesting to compare this to the standard elo models and also to measure its scaling behaviour.
The above discussion is for two engines. For more engines there is also the additivity requirement
elo_diff(eng1,eng2)+elo_diff(eng2,eng3)=elo_diff(eng1,eng3).
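The point that an Elo model is just a mapping (w,d) → elo can be made concrete. The sketch below (function names are my own) shows the classical logistic map, the same map applied to wins and losses only (which is essentially what Wilo amounts to), and the per-match t-statistic and LOS that normalized Elo is built on. Normalized Elo is, up to a scaling constant, this t-value per game; I omit the exact constant fishtest would use, since I am not certain of it.

```python
import math

def logistic_elo(score: float) -> float:
    """Map an expected score in (0, 1) to an Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_from_wdl(w: int, d: int, l: int) -> float:
    """Classical Elo: draws count as half a point."""
    return logistic_elo((w + 0.5 * d) / (w + d + l))

def wilo_from_wdl(w: int, d: int, l: int) -> float:
    """Wilo-style rating: the same logistic map applied to wins/losses only."""
    return logistic_elo(w / (w + l))

def t_and_los(w: int, d: int, l: int):
    """t-statistic of the mean score against 0.5, plus the corresponding
    likelihood of superiority (normal approximation, trinomial variance)."""
    n = w + d + l
    mean = (w + 0.5 * d) / n
    var = (w + 0.25 * d) / n - mean * mean  # E[x^2] - E[x]^2, x in {1, 0.5, 0}
    t = (mean - 0.5) / math.sqrt(var / n)
    los = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return t, los
```

The "effort needed to separate one engine from another with a given LOS" is exactly what the t-statistic measures, which is why a normalized (t-based) rating is invariant to the choice of (w,d) → elo map.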
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
From the most recent FGRL Top 10 (10min+6s and 60min+15s) rating lists, restricted to the top 9 engines (not 10, because one engine differs between the two lists), it again appears that Wilo is invariant under longer time controls: the Win/Loss ratio stays basically constant out to LTC. At least the FGRL results show that repeatedly.
Elo, as expected, is compressed with time control: the regression of (Elo at 60min) versus (Elo at 10min) has a slope of 0.9351. Wilo, in contrast, is practically invariant with time control: the regression of (Wilo at 60min) versus (Wilo at 10min) has a slope of 1.0061, very close to 1. [Plots of both regressions over the 9 engines were attached; the rating tables follow.]
Elo:
Top 9 10min + 6s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 181.6 9.2 1794.5 2400 74.8 100
2 Houdini 6 : 141.5 8.8 1676.5 2400 69.9 100
3 Komodo 11.2 : 122.0 8.9 1616.0 2400 67.3 100
4 Fire 6.1 : -18.5 8.3 1142.0 2400 47.6 100
5 Deep Shredder 13 : -44.4 8.1 1052.5 2400 43.9 100
6 Fizbo 2 : -61.4 8.2 994.0 2400 41.4 100
7 Andscacs 0.92 : -96.1 8.6 877.0 2400 36.5 88
8 Gull 3 : -103.6 8.4 852.0 2400 35.5 100
9 Booot 6.2 : -121.0 8.7 795.5 2400 33.1 ---
White advantage = 39.87 +/- 2.40
Draw rate (equal opponents) = 64.80 % +/- 0.61
Top 9 60min + 15s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 165.6 12.4 876.5 1200 73.0 100
2 Houdini 6 : 138.8 12.2 836.0 1200 69.7 100
3 Komodo 11.2 : 115.0 11.8 798.5 1200 66.5 100
4 Fire 6.1 : -23.7 11.4 561.0 1200 46.8 75
5 Deep Shredder 13 : -29.4 11.2 551.0 1200 45.9 100
6 Fizbo 2 : -71.8 10.7 477.5 1200 39.8 86
7 Andscacs 0.92 : -80.6 10.9 462.5 1200 38.5 100
8 Booot 6.2 : -102.6 11.5 425.5 1200 35.5 84
9 Gull 3 : -111.1 11.6 411.5 1200 34.3 ---
White advantage = 38.30 +/- 3.12
Draw rate (equal opponents) = 67.99 % +/- 0.83
=============================================
Wilo:
Top 9 10min + 6s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 579.9 51.9 1239.0 1289 96.1 100
2 Houdini 6 : 405.4 40.0 1074.0 1195 89.9 100
3 Komodo 11.2 : 329.2 36.4 1008.0 1184 85.1 100
4 Fire 6.1 : -64.9 28.0 479.0 1074 44.6 100
5 Deep Shredder 13 : -144.4 27.7 379.0 1053 36.0 100
6 Fizbo 2 : -199.0 25.8 385.0 1182 32.6 100
7 Andscacs 0.92 : -277.8 27.6 271.0 1188 22.8 89
8 Gull 3 : -300.8 29.3 247.0 1190 20.8 92
9 Booot 6.2 : -327.6 28.3 218.0 1245 17.5 ---
White advantage = 91.34 +/- 7.50
Draw rate (equal opponents) = 0.00 % +/- 0.00
Top 9 60min + 15s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 579.5 72.5 577.0 601 96.0 100
2 Houdini 6 : 409.1 60.7 527.0 582 90.5 95
3 Komodo 11.2 : 343.2 55.3 472.0 547 86.3 100
4 Fire 6.1 : -89.8 41.8 188.0 454 41.4 85
5 Deep Shredder 13 : -118.9 38.7 203.0 504 40.3 100
6 Fizbo 2 : -233.7 35.9 169.0 583 29.0 80
7 Andscacs 0.92 : -254.7 40.9 137.0 549 25.0 98
8 Booot 6.2 : -311.3 42.0 112.0 573 19.5 66
9 Gull 3 : -323.4 42.3 101.0 579 17.4 ---
White advantage = 85.94 +/- 10.81
Draw rate (equal opponents) = 0.00 % +/- 0.00
=================================
From the plots one can also see which engine scales better to LTC, although most points are within error margins in scaling.
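The quoted compression factor can be reproduced from the Elo tables above with a simple through-origin least-squares fit (slope = Σxy/Σx²). The lists below are the two rating columns paired by engine; note that Gull 3 and Booot 6.2 swap places between the 10min and 60min lists, so the pairing is by name, not by rank. The result agrees with the quoted 0.9351 to within rounding; the exact value depends on the regression details (e.g. whether an intercept is fitted).

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = b*x with no intercept: b = sum(xy)/sum(xx)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Elo at 10min+6s and 60min+15s, paired by engine (from the FGRL tables above):
# SF9, Houdini 6, Komodo 11.2, Fire 6.1, Shredder 13, Fizbo 2, Andscacs, Gull 3, Booot 6.2
elo_10min = [181.6, 141.5, 122.0, -18.5, -44.4, -61.4, -96.1, -103.6, -121.0]
elo_60min = [165.6, 138.8, 115.0, -23.7, -29.4, -71.8, -80.6, -111.1, -102.6]

b = slope_through_origin(elo_10min, elo_60min)
print(round(b, 4))  # close to the quoted 0.9351: the LTC Elo spread is ~6.5% narrower
```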
-
- Posts: 5961
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote:From the most recent FGRL Top 10 (10min+6s and 60min+15s) rating lists of top 9 engines (not 10, because 1 engine is different), it seems again that Wilo is invariant to long time controls, and basically the Win/Loss ratio stays pretty constant out to long time controls. At least the FGRL results show that repeatedly.
That is very interesting; it would indeed be nice if we could use Wilo and expect similar results at any time control. I suspect that Wilo might also make the rating differences less dependent on the choice of opening book. With normal Elo, books that end in equal positions will show smaller Elo differences than books that end with one side half a pawn or so ahead. Maybe this would not be true for Wilo.
Just curious, how would the conclusion have changed if you included the tenth engine?
Komodo rules!
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
lkaufman wrote:I suspect that WILO might also make the elo differences less dependent on choice of opening book. With normal elo, books that end in equal positions will show smaller elo differences than books that end with one side half a pawn or so ahead. Maybe this would not be true for WILO.
Yes, maybe your guess on opening books is correct. Books with positions closer to perfectly balanced show both higher draw rates and smaller Elo differences. Somewhat more unbalanced books, with lower draw rates and larger Elo differences, probably exhibit more wins and more losses, with a slightly larger effect on wins. This might mean that the Win/Loss ratio is fairly stable for both sorts of books.
lkaufman wrote:Just curious, how would the conclusion have changed if you included the tenth engine?
In the downloaded databases, the tenth engine in one list was Fritz and in the other Chiron, so I couldn't check their Elo/Wilo scaling from 10min to 60min. I may ask Andreas to include both in a list of 11 engines and send me new databases; it requires some work from him, but in the past he has been very helpful.
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
lkaufman wrote:Consider the following (exaggerated) scenario: Engine A scores 75 wins, 25 losses, no draws against Engine C. Engine B scores 75 wins, 25 losses, and 100 draws against Engine C. Wilo would say that A and B are equal, while normal Elo would say that Engine A is much stronger, roughly twice as far above C as B is. It seems obvious that A is the stronger engine. Does this show a flaw in WILO?
The answer is obviously yes.
lkaufman wrote:WILO would solve the problem of elo differences shrinking with longer time controls, as you show.
It is "solving" a non-problem, since the playing-strength difference between engines (and humans) obviously does shrink at longer time controls.
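Kaufman's exaggerated scenario is easy to check numerically. The sketch below assumes the standard logistic Elo map; the variable names are mine.

```python
import math

def logistic_elo(score):
    # Elo difference implied by an expected score under the logistic model
    return 400.0 * math.log10(score / (1.0 - score))

# Engine A vs C: 75 wins, 25 losses, 0 draws -> score 0.75
elo_A = logistic_elo(75 / 100)           # ~190.8 Elo
# Engine B vs C: 75 wins, 25 losses, 100 draws -> score 125/200 = 0.625
elo_B = logistic_elo((75 + 50) / 200)    # ~88.7 Elo

# Wilo ignores draws, so both engines use w/(w+l) = 0.75
wilo_A = wilo_B = logistic_elo(75 / 100)  # identical for A and B
```

Elo indeed puts A roughly twice as far above C as B (about 191 vs 89), while Wilo rates A and B identically, which is exactly the disagreement the two posts are arguing about.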
-Carl
Last edited by clumma on Fri Mar 09, 2018 7:01 pm, edited 1 time in total.
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
hgm wrote:It seems to me that contempt should always be detrimental in a true strength measurement, where the engine is tested against equal numbers of stronger and weaker opponents. It can only be helpful in a flawed, lopsided measurement, which does not reflect playing strength, but some meaningless artifact. You can then use it to drive up the meaningless artifact.
Correct. It's just a way to take advantage of statistically underpowered tournaments.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
clumma wrote:The answer is obviously yes.
Hypothetical, extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists, with consistent participants and a significant number of games, from short to long time controls.
clumma wrote:"Solving" a non-problem, since the playing strength difference between engines (and humans) obviously does shrink with LTC.
What is "playing strength" in your understanding?
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote:What is "playing strength" in your understanding?
In another thread, on intrinsic ratings, it appeared to me that you and I have different conceptions of Elo. For me it is not merely a statistical procedure for predicting the outcome of games. That is how it is defined, but it measures something much more important: playing strength.
This playing strength could be rigorously defined for individual moves. I won't attempt to do so here but consider a hypothetical 32-man tablebase. In any position, it can order legal moves by depth to mate, then inverse of depth to draw, and finally inverse of depth to mate. The "strength" of any of these moves could be defined by this ordering. The "strength" of any list of moves could be some average of the individual move scores.
My assertion is that Elo approximates this score, and that WILO does so comparatively poorly.
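The move ordering described above can be sketched in code. Everything here is hypothetical, since no 32-man tablebase exists: moves are tagged with invented (outcome, depth) results, and I read "inverse of depth" as sorting those moves in descending depth order, which the post does not spell out.

```python
# Each legal move is tagged with a hypothetical 32-man tablebase result:
#   ("win", dtm)  - mates in dtm moves
#   ("draw", dtd) - the game is drawn, with dtd moves until the draw is fixed
#   ("loss", dtm) - gets mated in dtm moves
def move_rank_key(result):
    outcome, depth = result
    if outcome == "win":
        return (0, depth)    # fastest mate is the strongest move
    if outcome == "draw":
        return (1, -depth)   # "inverse of depth to draw": longer holds rank higher
    return (2, -depth)       # among losing moves, the slowest loss is best

moves = [("loss", 5), ("win", 12), ("draw", 3), ("win", 4), ("draw", 30), ("loss", 60)]
ranked = sorted(moves, key=move_rank_key)
# → wins (mate in 4, then 12), draws (30, then 3), losses (mated in 60, then 5)
```

The "strength" of a move would then be its position in this ranking, and the strength of a player some average over the moves it chooses.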
-Carl
-
- Posts: 12562
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote: {snip}
Hypothetical extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists with consistent participants and significant number of games, from short to long time controls.
{snip}
I think it is essential that these things are addressed by the model.
For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) that are all draws except 100, and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes. Despite the win "domination" by A, it is not stronger than B.
On the other hand, a model has exactly the value of its predictive behavior. So if the wins and losses only model can predict better than a model which also includes draws, then the win loss only model is better.
If the model makes absurd predictions (such as "A is stronger than B" after the above experiment) then the model needs a tweak to be able to predict correctly.
IMO-YMMV.
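The mole-of-games example is easy to quantify. The sketch below uses the linearized logistic Elo map, since adding ~1e-23 to 0.5 underflows in ordinary floating point; variable names are mine.

```python
import math

wins, draws = 100, 6.0e23   # engine A beats B 100 times; every other game is drawn
n = wins + draws
eps = 0.5 * wins / n        # (score - 0.5), far below float resolution around 0.5

# Near s = 0.5 the logistic map elo(s) = 400*log10(s/(1-s)) is linear, with
# slope d(elo)/ds = 1600/ln(10) ≈ 695 Elo per unit of score.
elo_diff = (1600.0 / math.log(10)) * eps
print(elo_diff)             # ~6e-20 Elo: identical strength for any practical purpose

# A wins-and-losses-only rating, by contrast, uses w/(w+l) = 100/100 = 1 and
# diverges to +infinity: exactly the kind of absurd prediction the post says
# a model needs a tweak to avoid.
```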
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.