There is a simple rule for bad data science: throw away the data that doesn't fit your theory.
It makes perfect sense that STC-to-LTC compression exists, because it shows up in real data. Given more time, any entity can improve its answer, but the best don't gain as much from extra time as a weaker entity does: once you have found the right answer, extra time is not needed.
Against much weaker players, I handicap the matches with time odds. It really does reduce the strength gap.
There is too much data showing that STC-to-LTC compression is real. And your error margins being worse than the Elo error margins suggests that the Elo system is the better measure.
Wilo rating properties from FGRL rating lists
Moderators: hgm, Rebel, chrisw
-
- Posts: 2056
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Wilo rating properties from FGRL rating lists
I did the computation for Davidson and in that case elo seems to be proportional to eps/a (with the notation introduced above). So Davidson behaves like pure wilo for small elo differences.
In this discussion we have to realize that an elo model is just a mapping (w,d)-->elo. Nothing more!!!
To me normalized elo is the most objective one, as it represents the effort needed to separate one engine from another with a given LOS.
This means that normalized elo is precisely what is important for testing and it would be more logical for fishtest to use normalized elo bounds instead of BayesElo bounds.
Another objective elo measure is the time handicap needed to equalize elo (which can be defined independently of any elo model as w=l). It would be interesting to compare this to the standard elo models and also to measure its scaling behaviour.
The above discussion is for two engines. For more engines there is also the additivity requirement
elo_diff(eng1,eng2)+elo_diff(eng2,eng3)=elo_diff(eng1,eng3).
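The point that an Elo model is just a mapping (w,d) → elo can be made concrete. The sketch below (function names are my own) shows the classical logistic map, the same map applied to wins and losses only (which is essentially what Wilo amounts to), and the per-match t-statistic and LOS that normalized Elo is built on. Normalized Elo is, up to a scaling constant, this t-value per game; I omit the exact constant fishtest would use, since I am not certain of it.

```python
import math

def logistic_elo(score: float) -> float:
    """Map an expected score in (0, 1) to an Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_from_wdl(w: int, d: int, l: int) -> float:
    """Classical Elo: draws count as half a point."""
    return logistic_elo((w + 0.5 * d) / (w + d + l))

def wilo_from_wdl(w: int, d: int, l: int) -> float:
    """Wilo-style rating: the same logistic map applied to wins/losses only."""
    return logistic_elo(w / (w + l))

def t_and_los(w: int, d: int, l: int):
    """t-statistic of the mean score against 0.5, plus the corresponding
    likelihood of superiority (normal approximation, trinomial variance)."""
    n = w + d + l
    mean = (w + 0.5 * d) / n
    var = (w + 0.25 * d) / n - mean * mean  # E[x^2] - E[x]^2, x in {1, 0.5, 0}
    t = (mean - 0.5) / math.sqrt(var / n)
    los = 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    return t, los
```

The "effort needed to separate one engine from another with a given LOS" is exactly what the t-statistic measures, which is why a normalized (t-based) rating is invariant to the choice of (w,d) → elo map.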
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
From the most recent FGRL Top 10 (10min+6s and 60min+15s) rating lists, restricted to the top 9 engines (not 10, because one engine differs between the two lists), it again appears that Wilo is invariant under longer time controls: the Win/Loss ratio stays basically constant out to LTC. At least the FGRL results show that repeatedly.
Elo, as expected, is compressed with time control: the regression of (Elo at 60min) versus (Elo at 10min) has a slope of 0.9351. Wilo, in contrast, is practically invariant with time control: the regression of (Wilo at 60min) versus (Wilo at 10min) has a slope of 1.0061, very close to 1. [Plots of both regressions over the 9 engines were attached; the rating tables follow.]
Elo:
Top 9 10min + 6s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 181.6 9.2 1794.5 2400 74.8 100
2 Houdini 6 : 141.5 8.8 1676.5 2400 69.9 100
3 Komodo 11.2 : 122.0 8.9 1616.0 2400 67.3 100
4 Fire 6.1 : -18.5 8.3 1142.0 2400 47.6 100
5 Deep Shredder 13 : -44.4 8.1 1052.5 2400 43.9 100
6 Fizbo 2 : -61.4 8.2 994.0 2400 41.4 100
7 Andscacs 0.92 : -96.1 8.6 877.0 2400 36.5 88
8 Gull 3 : -103.6 8.4 852.0 2400 35.5 100
9 Booot 6.2 : -121.0 8.7 795.5 2400 33.1 ---
White advantage = 39.87 +/- 2.40
Draw rate (equal opponents) = 64.80 % +/- 0.61
Top 9 60min + 15s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 165.6 12.4 876.5 1200 73.0 100
2 Houdini 6 : 138.8 12.2 836.0 1200 69.7 100
3 Komodo 11.2 : 115.0 11.8 798.5 1200 66.5 100
4 Fire 6.1 : -23.7 11.4 561.0 1200 46.8 75
5 Deep Shredder 13 : -29.4 11.2 551.0 1200 45.9 100
6 Fizbo 2 : -71.8 10.7 477.5 1200 39.8 86
7 Andscacs 0.92 : -80.6 10.9 462.5 1200 38.5 100
8 Booot 6.2 : -102.6 11.5 425.5 1200 35.5 84
9 Gull 3 : -111.1 11.6 411.5 1200 34.3 ---
White advantage = 38.30 +/- 3.12
Draw rate (equal opponents) = 67.99 % +/- 0.83
=============================================
Wilo:
Top 9 10min + 6s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 579.9 51.9 1239.0 1289 96.1 100
2 Houdini 6 : 405.4 40.0 1074.0 1195 89.9 100
3 Komodo 11.2 : 329.2 36.4 1008.0 1184 85.1 100
4 Fire 6.1 : -64.9 28.0 479.0 1074 44.6 100
5 Deep Shredder 13 : -144.4 27.7 379.0 1053 36.0 100
6 Fizbo 2 : -199.0 25.8 385.0 1182 32.6 100
7 Andscacs 0.92 : -277.8 27.6 271.0 1188 22.8 89
8 Gull 3 : -300.8 29.3 247.0 1190 20.8 92
9 Booot 6.2 : -327.6 28.3 218.0 1245 17.5 ---
White advantage = 91.34 +/- 7.50
Draw rate (equal opponents) = 0.00 % +/- 0.00
Top 9 60min + 15s:
Code:
# PLAYER : RATING ERROR POINTS PLAYED (%) CFS(next)
1 Stockfish 9 : 579.5 72.5 577.0 601 96.0 100
2 Houdini 6 : 409.1 60.7 527.0 582 90.5 95
3 Komodo 11.2 : 343.2 55.3 472.0 547 86.3 100
4 Fire 6.1 : -89.8 41.8 188.0 454 41.4 85
5 Deep Shredder 13 : -118.9 38.7 203.0 504 40.3 100
6 Fizbo 2 : -233.7 35.9 169.0 583 29.0 80
7 Andscacs 0.92 : -254.7 40.9 137.0 549 25.0 98
8 Booot 6.2 : -311.3 42.0 112.0 573 19.5 66
9 Gull 3 : -323.4 42.3 101.0 579 17.4 ---
White advantage = 85.94 +/- 10.81
Draw rate (equal opponents) = 0.00 % +/- 0.00
=================================
From the plots one can also see which engine scales better to LTC, although most points are within error margins in scaling.
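The quoted compression factor can be reproduced from the Elo tables above with a simple through-origin least-squares fit (slope = Σxy/Σx²). The lists below are the two rating columns paired by engine; note that Gull 3 and Booot 6.2 swap places between the 10min and 60min lists, so the pairing is by name, not by rank. The result agrees with the quoted 0.9351 to within rounding; the exact value depends on the regression details (e.g. whether an intercept is fitted).

```python
def slope_through_origin(x, y):
    """Least-squares slope of y = b*x with no intercept: b = sum(xy)/sum(xx)."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Elo at 10min+6s and 60min+15s, paired by engine (from the FGRL tables above):
# SF9, Houdini 6, Komodo 11.2, Fire 6.1, Shredder 13, Fizbo 2, Andscacs, Gull 3, Booot 6.2
elo_10min = [181.6, 141.5, 122.0, -18.5, -44.4, -61.4, -96.1, -103.6, -121.0]
elo_60min = [165.6, 138.8, 115.0, -23.7, -29.4, -71.8, -80.6, -111.1, -102.6]

b = slope_through_origin(elo_10min, elo_60min)
print(round(b, 4))  # close to the quoted 0.9351: the LTC Elo spread is ~6.5% narrower
```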
-
- Posts: 5961
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote:From the most recent FGRL Top 10 (10min+6s and 60min+15s) rating lists of top 9 engines (not 10, because 1 engine is different), it seems again that Wilo is invariant to long time controls, and basically the Win/Loss ratio stays pretty constant out to long time controls. At least the FGRL results show that repeatedly.
That is very interesting; it would indeed be nice if we could use Wilo and expect similar results at any time control. I suspect that Wilo might also make the rating differences less dependent on the choice of opening book. With normal Elo, books that end in equal positions will show smaller Elo differences than books that end with one side half a pawn or so ahead. Maybe this would not be true for Wilo.
Just curious, how would the conclusion have changed if you included the tenth engine?
Komodo rules!
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
lkaufman wrote:I suspect that WILO might also make the elo differences less dependent on choice of opening book. With normal elo, books that end in equal positions will show smaller elo differences than books that end with one side half a pawn or so ahead. Maybe this would not be true for WILO.
Yes, maybe your guess on opening books is correct. Books with positions closer to perfectly balanced show both higher draw rates and smaller Elo differences. Somewhat more unbalanced books, with lower draw rates and larger Elo differences, probably exhibit more wins and more losses, with a slightly larger effect on wins. This might mean that the Win/Loss ratio is fairly stable for both sorts of books.
lkaufman wrote:Just curious, how would the conclusion have changed if you included the tenth engine?
In the downloaded databases, the tenth engine in one list was Fritz and in the other Chiron, so I couldn't check their Elo/Wilo scaling from 10min to 60min. I may ask Andreas to include both in a list of 11 engines and send me new databases; it requires some work from him, but in the past he has been very helpful.
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
lkaufman wrote:Consider the following (exaggerated) scenario: Engine A scores 75 wins, 25 losses, no draws against Engine C. Engine B scores 75 wins, 25 losses, and 100 draws against Engine C. Wilo would say that A and B are equal, while normal Elo would say that Engine A is much stronger, roughly twice as far above C as B is. It seems obvious that A is the stronger engine. Does this show a flaw in WILO?
The answer is obviously yes.
lkaufman wrote:WILO would solve the problem of elo differences shrinking with longer time controls, as you show.
It is "solving" a non-problem, since the playing-strength difference between engines (and humans) obviously does shrink at longer time controls.
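Kaufman's exaggerated scenario is easy to check numerically. The sketch below assumes the standard logistic Elo map; the variable names are mine.

```python
import math

def logistic_elo(score):
    # Elo difference implied by an expected score under the logistic model
    return 400.0 * math.log10(score / (1.0 - score))

# Engine A vs C: 75 wins, 25 losses, 0 draws -> score 0.75
elo_A = logistic_elo(75 / 100)           # ~190.8 Elo
# Engine B vs C: 75 wins, 25 losses, 100 draws -> score 125/200 = 0.625
elo_B = logistic_elo((75 + 50) / 200)    # ~88.7 Elo

# Wilo ignores draws, so both engines use w/(w+l) = 0.75
wilo_A = wilo_B = logistic_elo(75 / 100)  # identical for A and B
```

Elo indeed puts A roughly twice as far above C as B (about 191 vs 89), while Wilo rates A and B identically, which is exactly the disagreement the two posts are arguing about.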
-Carl
Last edited by clumma on Fri Mar 09, 2018 7:01 pm, edited 1 time in total.
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
hgm wrote:It seems to me that contempt should always be detrimental in a true strength measurement, where the engine is tested against equal numbers of stronger and weaker opponents. It can only be helpful in a flawed, lopsided measurement, which does not reflect playing strength, but some meaningless artifact. You can then use it to drive up the meaningless artifact.
Correct. It's just a way to take advantage of statistically underpowered tournaments.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Wilo rating properties from FGRL rating lists
clumma wrote:The answer is obviously yes.
Hypothetical, extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists, with consistent participants and a significant number of games, from short to long time controls.
clumma wrote:"Solving" a non-problem, since the playing strength difference between engines (and humans) obviously does shrink with LTC.
What is "playing strength" in your understanding?
-
- Posts: 186
- Joined: Fri Oct 10, 2014 10:05 pm
- Location: Berkeley, CA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote:What is "playing strength" in your understanding?
In another thread, on intrinsic ratings, it appeared to me that you and I have different conceptions of Elo. For me it is not merely a statistical procedure for predicting the outcome of games. That is how it is defined, but it measures something much more important: playing strength.
This playing strength could be rigorously defined for individual moves. I won't attempt to do so here but consider a hypothetical 32-man tablebase. In any position, it can order legal moves by depth to mate, then inverse of depth to draw, and finally inverse of depth to mate. The "strength" of any of these moves could be defined by this ordering. The "strength" of any list of moves could be some average of the individual move scores.
My assertion is that Elo approximates this score, and that WILO does so comparatively poorly.
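The move ordering described above can be sketched in code. Everything here is hypothetical, since no 32-man tablebase exists: moves are tagged with invented (outcome, depth) results, and I read "inverse of depth" as sorting those moves in descending depth order, which the post does not spell out.

```python
# Each legal move is tagged with a hypothetical 32-man tablebase result:
#   ("win", dtm)  - mates in dtm moves
#   ("draw", dtd) - the game is drawn, with dtd moves until the draw is fixed
#   ("loss", dtm) - gets mated in dtm moves
def move_rank_key(result):
    outcome, depth = result
    if outcome == "win":
        return (0, depth)    # fastest mate is the strongest move
    if outcome == "draw":
        return (1, -depth)   # "inverse of depth to draw": longer holds rank higher
    return (2, -depth)       # among losing moves, the slowest loss is best

moves = [("loss", 5), ("win", 12), ("draw", 3), ("win", 4), ("draw", 30), ("loss", 60)]
ranked = sorted(moves, key=move_rank_key)
# → wins (mate in 4, then 12), draws (30, then 3), losses (mated in 60, then 5)
```

The "strength" of a move would then be its position in this ranking, and the strength of a player some average over the moves it chooses.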
-Carl
-
- Posts: 12562
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Wilo rating properties from FGRL rating lists
Laskos wrote: {snip}
Hypothetical extreme (to absurd) examples don't help much. That's why I am using empirical data from Andreas' excellent FGRL rating lists with consistent participants and significant number of games, from short to long time controls.
{snip}
I think it is essential that these things are addressed by the model.
For instance, if two opponents (engine A and engine B) hypothetically play a mole of games (6x10^23) that are all draws except 100, and engine A wins all 100, I maintain that the two engines have exactly the same strength for all practical purposes. Despite the win "domination" by A, it is not stronger than B.
On the other hand, a model has exactly the value of its predictive behavior. So if the wins and losses only model can predict better than a model which also includes draws, then the win loss only model is better.
If the model makes absurd predictions (such as "A is stronger than B" after the above experiment) then the model needs a tweak to be able to predict correctly.
IMO-YMMV.
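The mole-of-games example is easy to quantify. The sketch below uses the linearized logistic Elo map, since adding ~1e-23 to 0.5 underflows in ordinary floating point; variable names are mine.

```python
import math

wins, draws = 100, 6.0e23   # engine A beats B 100 times; every other game is drawn
n = wins + draws
eps = 0.5 * wins / n        # (score - 0.5), far below float resolution around 0.5

# Near s = 0.5 the logistic map elo(s) = 400*log10(s/(1-s)) is linear, with
# slope d(elo)/ds = 1600/ln(10) ≈ 695 Elo per unit of score.
elo_diff = (1600.0 / math.log(10)) * eps
print(elo_diff)             # ~6e-20 Elo: identical strength for any practical purpose

# A wins-and-losses-only rating, by contrast, uses w/(w+l) = 100/100 = 1 and
# diverges to +infinity: exactly the kind of absurd prediction the post says
# a model needs a tweak to avoid.
```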
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.