Wilo rating properties from FGRL rating lists

Laskos · Post by **Laskos** » Mon May 01, 2017 7:26 pm

"Wilo" is the name Miguel Ballicora gave to drawless Elo (draws are discarded). It has been shown that a rating model based on Wilo is sound, and that Wilos are additive over the logistic, just as Elos are. No draw model is needed, as there are no draws. What I will show empirically from Andreas Strangmüller's FGRL Top 10 rating lists at 60''+0.6'' and 60'+15'' (roughly a 60x time factor between the two lists) is:

1/ Wilo rating doesn't compress or dilate ratings from STC to LTC. Elo rating does compress ratings from STC to LTC.
2/ Wilo rating gives higher LOS (p-values), showing more sensitivity than Elo rating; therefore fewer games are needed for Wilo rating to show significant differences between engines than with the Elo model.

Therefore considering draws as non-games is better both for the calibration of ratings (no time control dependence) and for the number of games needed for significance.

------------------------------------

Results from FGRL rating lists ( http://www.fastgm.de/ ):

------------------------------------

I/ Compression/Dilation of ratings

Elo ratings:

Mean deviation of ratings 60''+0.6'': 119 Elo points
Mean deviation of ratings 60'+15'': 84 Elo points

Compression: 29.5% +/- 4% ---> significant compression

Wilo ratings (discarding draws):

Mean deviation of ratings 60''+0.6'': 251 Wilo points
Mean deviation of ratings 60'+15'': 260 Wilo points

Dilation: 3.6% +/- 8% ---> insignificant dilation

We see that Wilo ratings show no compression or dilation beyond the error margins at a 60x time factor, while Elo ratings compress significantly.
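As a quick check of the percentages above, here is a minimal sketch that recomputes the relative change of the rating spread from the rounded mean deviations quoted in this post. The function name `relative_change` is only illustrative, and the error margins (+/- 4%, +/- 8%) are not reproduced because they depend on the underlying game data.

```python
def relative_change(stc_spread: float, ltc_spread: float) -> float:
    """Relative change of the rating spread from STC to LTC, in percent.

    Negative values mean compression, positive values mean dilation.
    """
    return 100.0 * (ltc_spread - stc_spread) / stc_spread


# Mean deviations quoted above (already rounded to whole points).
elo_change = relative_change(119.0, 84.0)
wilo_change = relative_change(251.0, 260.0)

print(f"Elo : {elo_change:+.1f}%")
print(f"Wilo: {wilo_change:+.1f}%")
```

With the rounded inputs the Elo figure comes out at about 29.4% compression, marginally different from the 29.5% quoted, which was presumably computed from unrounded spreads.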

II/ Norm (regular Frobenius) of LOS (p-values) matrix.

The minimal Norm of p-value matrix is 0 (all engines are perfectly equal in strength, LOS=50% between any two engines, Null Hypothesis is accepted with 100% probability).
The maximal Norm of p-value 10x10 symmetric matrix is sqrt(90) ~ 9.487 (all engines are clearly separated by strength, LOS=100% between any two engines, Null Hypothesis is rejected with 100% probability)
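The exact matrix entries are not spelled out here, so the following is only a sketch under an assumption: each off-diagonal entry is taken as |2*LOS - 1| (0 at LOS=50%, 1 at LOS=100%) with a zero diagonal. That choice is one reading that gives a symmetric matrix and reproduces both limits stated above; the names `separation_norm` and `ranked_los` are hypothetical.

```python
import math


def separation_norm(los):
    """Frobenius norm of the matrix with entries |2*LOS_ij - 1|, zero diagonal.

    ASSUMPTION: this entry definition is inferred from the stated limits
    (0 when every LOS is 50%, sqrt(90) for ten fully separated engines).
    """
    n = len(los)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (2.0 * los[i][j] - 1.0) ** 2
    return math.sqrt(total)


def ranked_los(n, p):
    """LOS matrix where each higher-ranked engine has LOS p over lower-ranked ones."""
    return [[0.5 if i == j else (p if i < j else 1.0 - p) for j in range(n)]
            for i in range(n)]


print(separation_norm(ranked_los(10, 0.5)))  # all engines equal
print(separation_norm(ranked_los(10, 1.0)))  # all engines fully separated
```

Under this reading, a higher norm simply means more engine pairs are separated with high confidence.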

Elo list 60''+0.6'': Norm of p-value matrix is 9.15
Wilo list 60''+0.6'': Norm of p-value matrix is 9.22

Wilo list shows generally higher LOS values (more sensitivity).

Elo list 60'+15'': Norm of p-value matrix is 9.09
Wilo list 60'+15'': Norm of p-value matrix is 9.13

Wilo list shows again higher sensitivity.

Both confirm that Wilo ratings show more sensitivity, therefore fewer games are needed for the same confidence compared to Elo ratings.

III/ Scaling

With Wilo empirically shown to be a better rating system than Elo in computer chess on our FGRL data, being independent of time control and having higher sensitivity, I tabulate the scaling of the top 10 engines between bullet (60''+0.6'') and LTC (60'+15''), roughly a 60x time factor:

     Engine                    Scaling Wilo
  --------------------------------------------
   1 Andscacs 0.90       :          91
   2 Komodo 10.4         :          39
   3 Stockfish 8         :          30
   4 Deep Shredder 13    :          19
   5 Fire 5              :          18
   6 Gull 3              :           2
   7 Houdini 5.01        :         -22
   8 Chiron 4            :         -24
   9 Fizbo 1.9           :         -76
  10 Fritz 15            :         -77

The highly significant results are that Andscacs 0.90 is the best scaling engine, with Fritz 15 and Fizbo 1.9 the worst scaling. Also, Houdini 5.01 scales significantly worse than the other top engines.

lkaufman · Post by **lkaufman** » Tue May 02, 2017 6:49 am

I like your analysis, and I've always suspected that WILO would solve the problem of Elo differences shrinking with longer time controls, as you show. But consider the following (exaggerated) scenario: Engine A scores 75 wins, 25 losses, no draws against Engine C. Engine B scores 75 wins, 25 losses, and 100 draws against Engine C. Wilo would say that A and B are equal, while normal Elo would say that Engine A is much stronger, roughly twice as far above C as B is above C. It seems obvious that A is the stronger engine. Does this show a flaw in WILO, or can you make a case that A is not clearly stronger than B, or perhaps you will say that in this case WILO is inferior, but in more typical cases it is superior? I'm looking forward to your reply, this is interesting. I suspect that Komodo would look worse on WILO than on Elo because it draws less often against weaker opponents, which helps with Elo but not with WILO. But perhaps this is not a big effect if it wins most of the games that might have been draws.

Laskos · Post by **Laskos** » Tue May 02, 2017 2:26 pm

lkaufman wrote:I like your analysis, and I've always suspected that WILO would solve the problem of elo differences shrinking with longer time controls, as you show. But consider the following (exaggerated) scenario: Engine A scores 75 wins, 25 losses, no draws against Engine C. Engine B score 75 wins, 25 losses, and 100 draws against Engine C. Wilo would say that both A and B are equal, while normal Elo would say that Engine A is much stronger, roughly twice as far above C as B is above C. It seems obvious that A is the stronger engine. Does this show a flaw in WILO, or can you make a case that A is not clearly stronger than B, or perhaps you will say that in this case WILO is inferior, but in more typical cases it is superior? I'm looking forward to your reply, this is interesting. I suspect that Komodo would look worse on WILO than on Elo because it draws less often against weaker opponents, which helps with Elo but not with WILO. But perhaps this is not a big effect if it wins most of the games that might have been draws.

The question about superiority is interesting. While it seems obvious and common sense that A is stronger than B, the LOS (p-value) derived from the t-value (Wins-Losses)/sqrt(Wins+Losses) is independent of Draws (for an unbiased prior). So engine A is as likely to be superior to engine C as engine B is. Moreover, the SPRT LLR (the rigorous stopping rule for testing under pre-defined error conditions) scaling is also independent of Draws. The apparent contradiction seems to come from a good human intuition that the Elo difference is larger for A, combined with a poor intuition that the error margins are similarly larger for A; the significance is given by the ratio (Elo_difference)/(Elo_errors).
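To make the draw-independence concrete, here is a minimal sketch of the LOS computed from the t-value above. It assumes the usual normal approximation, so that LOS is the standard normal CDF evaluated at the t-value; the `draws` argument is accepted only to show that it plays no role.

```python
import math


def t_value(wins: int, losses: int) -> float:
    """t-value from the post: (Wins - Losses) / sqrt(Wins + Losses)."""
    return (wins - losses) / math.sqrt(wins + losses)


def los(wins: int, losses: int, draws: int = 0) -> float:
    """Likelihood of superiority under a normal approximation (an assumption).

    `draws` is deliberately unused: the t-value does not depend on it.
    """
    t = t_value(wins, losses)
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


# The hypothetical scenario from the quoted post: A and B against C.
print("A vs C:", t_value(75, 25), los(75, 25, draws=0))
print("B vs C:", t_value(75, 25), los(75, 25, draws=100))
```

Both lines give the same t-value of 5 and the same LOS, since the 100 extra draws never enter the formula.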

Concerning Komodo, it indeed comes off worse in the FGRL Wilo rating list than in the Elo rating list, at least relative to Stockfish. One issue here is Contempt. It seems to help in the Elo rating list and to harm in the Wilo rating list. This morning I played a gauntlet of 6000 games at 5''+0.05'' between Andscacs and 3 versions of Komodo: one with Contempt=30 (adequate against Andscacs), one with Contempt=0 and one with Contempt=-10. Here are the Elo list and the Wilo list:

ELO:

   # PLAYER            : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 K Contempt 30     : 3103.1   15.3    1762.0    2000    88.1      92
   2 K No Contempt     : 3083.8   15.3    1738.0    2000    86.9      95
   3 K Contempt -10    : 3063.0   14.4    1710.0    2000    85.5     100
   4 Andscacs 0.90     : 2750.1    8.1     790.0    6000    13.2     ---

WILO:

   # PLAYER            : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 K No Contempt     : 3138.1   29.8    1569.0    1662    94.4      84
   2 K Contempt 30     : 3113.3   26.9    1636.0    1748    93.6      62
   3 K Contempt -10    : 3105.9   27.3    1529.0    1638    93.3     100
   4 Andscacs 0.90     : 2642.7   15.7     314.0    5048     6.2     ---

While in the Elo rating list Contempt=30 Komodo came out ahead of Contempt=0 by about 20 Elo points, in the Wilo rating list it came out behind by about 25 Wilo points. This effect must be present to an even higher degree in longer TC lists. Against a weaker engine, Contempt seems to convert more draws into wins than into losses (so the converted games score better than a draw), but at a lower ratio of wins to losses. For those converted games the ratio lies somewhere between 1 and the Wins/Losses ratio of the non-Contempt version. Well, in that sense Wilo might not be very satisfying for you, but I am not sure which is better: having a Contempt factor involved in a rating list or avoiding it. I even tried a negative Contempt, because I was not sure Wilo wouldn't favor it.
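The win/draw/loss split behind the two lists can be recovered from the POINTS and PLAYED columns. The sketch below assumes that in the Wilo list POINTS equals wins and PLAYED equals decisive games, which is consistent with the Elo points (wins plus half the draws); the function name `split_results` is illustrative.

```python
# (Elo points, Elo games, Wilo points, Wilo games), copied from the lists above.
rows = {
    "K Contempt 30":  (1762.0, 2000, 1636.0, 1748),
    "K No Contempt":  (1738.0, 2000, 1569.0, 1662),
    "K Contempt -10": (1710.0, 2000, 1529.0, 1638),
}


def split_results(elo_points, elo_games, wilo_points, wilo_games):
    """Recover (wins, draws, losses), assuming Wilo points are plain wins."""
    wins = int(wilo_points)
    losses = wilo_games - wins
    draws = elo_games - wilo_games
    # Consistency check against the Elo list: wins + draws/2 must match.
    if wins + 0.5 * draws != elo_points:
        raise ValueError("Elo and Wilo rows are inconsistent")
    return wins, draws, losses


for name, row in rows.items():
    wins, draws, losses = split_results(*row)
    print(f"{name:15s} W={wins} D={draws} L={losses} W/L={wins / losses:.2f}")
```

Comparing Contempt=30 with Contempt=0 this way shows fewer draws but also a lower overall win/loss ratio, which is the pattern the paragraph above describes.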

lkaufman · Post by **lkaufman** » Tue May 02, 2017 6:10 pm

Everything you say seems correct, but the issue is that we are not talking about the LOS of A or B over C, which is probably 99.999% or more in either case. The question is what would you expect the result of a match between A and B to be, given the above? I imagine everyone would bet on A; do you think it would really be a toss-up? It seems to me that the strength gap between B and C is clearly less than between A and C, but since B played twice as many games the LOS is the same despite this. Basically, while a draw does not help determine which of two engines is stronger, it does suggest that the strength difference is less than you would have estimated without this new information. What do you think?
About Komodo and contempt, it all makes sense, and if the rating lists ever switch to WILO I guess we would set the default to zero.

hgm · Post by **hgm** » Tue May 02, 2017 7:07 pm

It seems to me that contempt should always be detrimental in a true strength measurement, where the engine is tested against equal numbers of stronger and weaker opponents. It can only be helpful in a flawed, lopsided measurement, which does not reflect playing strength, but some meaningless artifact. You can then use it to drive up the meaningless artifact.

mjlef · Post by **mjlef** » Wed May 03, 2017 12:03 am

hgm wrote:It seems to me that contempt should always be detrimental in a true strength measurement, where the engine is tested against equal numbers of stronger and weaker opponents. It can only be helpful in a flawed, lopsided measurement, which does not reflect playing strength, but some meaningless artifact. You can then use it to drive up the meaningless artifact.

We tried to make Komodo's Contempt act like a human Grandmaster playing against a weaker opponent. I think a GM who knows his opponent is a weaker player is more likely to choose different moves than if he were playing an equal player. The GM is likely to avoid exchanges and keep the board open and complex to increase the chance that his opponent will make a goof or a move based on his lesser understanding of chess, which the GM will then exploit. In an ideal world the chess engines would know the rating of their opponents and decide for themselves how to play. The UCI spec even has commands for this, but very few GUIs use them.

I completely agree that against a range of opponents (stronger and weaker), Contempt should be set to 0. And 0 is best also against closely matched programs.

Dirt · Post by **Dirt** » Wed May 03, 2017 12:40 am

lkaufman wrote:Everything you say seems correct, but the issue is that we are not talking about the LOS of A or B over C, which is probably 99.999% or more in either case. The question is what would you expect the result of a match between A and B to be, given the above? I imagine everyone would bet on A; do you think it would really be a toss-up?

I think I do. You can simulate a small part (perhaps 20 vs. 25 draws) of this by choosing fighting openings for A and drawish openings for B. If the openings are balanced then the extra draws against C will tell you nothing.

lkaufman · Post by **lkaufman** » Wed May 03, 2017 1:24 am

Dirt wrote:
lkaufman wrote:Everything you say seems correct, but the issue is that we are not talking about the LOS of A or B over C, which is probably 99.999% or more in either case. The question is what would you expect the result of a match between A and B to be, given the above? I imagine everyone would bet on A; do you think it would really be a toss-up?
I think I do. You can simulate a small part (perhaps 20 vs. 25 draws) of this by choosing fighting opening for A and drawish openings for B. If the openings are balanced then the extra draws against C will tell you nothing.

That may be true, but I'm not convinced that your model of fighting vs drawish openings has much to do with two matches using the same openings but getting very different results.

Laskos · Post by **Laskos** » Wed May 03, 2017 6:53 am

The issue of transitivity in chess ratings is always thorny to me. Although engine A might come out stronger than B in a direct match in this highly hypothetical scenario (I cannot find engine candidates to model it even remotely, and the evidence must be empirical), the predictions of these models are ratings and error margins. In your example, Elo rating predicts that engine A is stronger than engine B by 102 +/- Elo error margins. Wilo rating predicts that the difference between them is 0 +/- Wilo error margins. In this case the Wilo error margins are a little larger than the Elo error margins. If the empirical result comes at 1.5 Elo SD from the Elo value of 102 and 1.0 Wilo SD from the Wilo value of 0, then although engine A might indeed be stronger than engine B, Wilo still predicted the difference between them better. I don't know which model has more empirical transitivity in it. Observe that we have a small piece of empirical evidence that the Wilo ratings, which show a larger difference between Stockfish and Komodo in the rating lists, predict the outcome of the direct match between Stockfish and Komodo better than the Elo ratings do, showing a better transitivity property here.
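The 102 and 0 figures can be reproduced with the plain logistic formula applied to each engine's score against C. This is only a sketch of that arithmetic; it treats each pairing in isolation rather than running a full rating-list fit, and the helper names are illustrative.

```python
import math


def logistic_diff(score: float) -> float:
    """Rating difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_score(wins, draws, losses):
    """Elo counts a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def wilo_score(wins, draws, losses):
    """Wilo discards draws entirely."""
    return wins / (wins + losses)


a_vs_c = (75, 0, 25)     # engine A: no draws
b_vs_c = (75, 100, 25)   # engine B: same wins and losses, plus 100 draws

elo_gap = logistic_diff(elo_score(*a_vs_c)) - logistic_diff(elo_score(*b_vs_c))
wilo_gap = logistic_diff(wilo_score(*a_vs_c)) - logistic_diff(wilo_score(*b_vs_c))

print(f"Elo  predicts A - B = {elo_gap:.0f}")
print(f"Wilo predicts A - B = {wilo_gap:.0f}")
```

Under Elo, A sits about 191 points above C and B about 89, giving the gap of 102; under Wilo both have the same 75% decisive score, so the gap is 0.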

lkaufman · Post by **lkaufman** » Wed May 03, 2017 7:34 am

Laskos wrote:The issue of transitivity in Chess ratings is always spinous to me. Although it might be that engine A will come stronger than B in direct match in this highly hypothetical scenario (I cannot find engine candidates to model even remotely it, and the evidence must be empiric), the predictions of these models are ratings and error margins. And in your example, Elo rating predicts that Engine A is stronger than engine B by 102 +/- Elo error margins. Wilo rating predicts that the difference between them in 0 +/- Wilo error margins. In this case Wilo error margins are a little larger than Elo error margins. If the empirical result comes at 1.5 Elo SD from Elo value of 102 and 1.0 Wilo SD from Wilo value of 0, although Engine A might be indeed stronger than engine B, the Wilo still predicted better the difference between them. I don't know which model has more empirical transitivity in it. Observe, that we have a small empirical evidence that the Wilo ratings, showing larger difference between Stockfish and Komodo in rating lists, predict better than the Elo ratings the outcome of the direct match between Stockfish and Komodo, showing here a better transitivity property.

So it seems that even if it is illogical/unreasonable to ignore all evidence from draws, it may well be that WILO is superior to Elo even in an extreme case like this. I would say that your results show that draws get too much weight in normal Elo, even if zero weight is too little. This incidentally means (unless I am mistaken) that BayesElo is worse than normal ELO, because (it was once explained by HGM) BayesElo effectively weights draws twice as heavily as normal Elo. Probably you could come up with a system that predicts results even better than WILO (something intermediate between WILO and Elo, but closer to WILO), but probably you feel that WILO is good enough. One question: which of the two (WILO or Elo) would be less sensitive to the size of White's opening advantage coming out of book (assuming reversal testing)?
