Wilo rating properties from FGRL rating lists

hgm
Posts: 23624
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: Wilo rating properties from FGRL rating lists

Post by hgm » Wed May 03, 2017 7:18 am

mjlef wrote:We tried to make Komodo's Contempt act like a human Grandmaster playing against a weaker opponent. I think a GM who knows his opponent is a weaker player is more likely to choose different moves than if he were playing an equal player. The GM is likely to avoid exchanges and keep the board open and complex to increase the chance his opponent will make a goof or a move based on his lesser understanding of chess, which the GM will then exploit. In an ideal world the chess engines would know the rating of their opponents and decide for themselves how to play. The UCI spec even has commands for this, but very few GUIs use them.

I completely agree that against a range of opponents (stronger and weaker), Contempt should be set to 0. And 0 is best also against closely matched programs.
Yes, I understand that. My point is mainly that the rating lists are flawed at the top and do not represent the true rating. Setting contempt is just a way to exploit that flaw, to fake an even more erroneous rating, while in fact it makes the engine weaker.

If the engine developed its contempt during the game, based on the quality of its opponent's moves, it would be a different matter.

Laskos
Posts: 9414
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Wilo rating properties from FGRL rating lists

Post by Laskos » Wed May 03, 2017 4:09 pm

lkaufman wrote:
Laskos wrote:The issue of transitivity in chess ratings has always been thorny to me. Although engine A might come out stronger than B in a direct match in this highly hypothetical scenario (I cannot find engine candidates that model it even remotely, and the evidence must be empirical), the predictions of these models are ratings and error margins. In your example, the Elo rating predicts that engine A is stronger than engine B by 102 +/- the Elo error margin. The Wilo rating predicts that the difference between them is 0 +/- the Wilo error margin. In this case the Wilo error margins are a little larger than the Elo error margins. If the empirical result comes out 1.5 Elo SD from the Elo value of 102 and 1.0 Wilo SD from the Wilo value of 0, then even though engine A might indeed be stronger than engine B, Wilo still predicted the difference between them better. I don't know which model has more empirical transitivity in it. Note that we have a little empirical evidence that the Wilo ratings, which show a larger difference between Stockfish and Komodo in the rating lists, predict the outcome of the direct Stockfish-Komodo match better than the Elo ratings do, showing better transitivity here.
So it seems that even if it is illogical/unreasonable to ignore all evidence from draws, it may well be that WILO is superior to Elo even in an extreme case like this. I would say that your results show that draws get too much weight in normal Elo, even if zero weight is too little. This incidentally means (unless I am mistaken) that BayesElo is worse than normal ELO, because (it was once explained by HGM) BayesElo effectively weights draws twice as heavily as normal Elo. Probably you could come up with a system that predicts results even better than WILO (something intermediate between WILO and Elo, but closer to WILO), but probably you feel that WILO is good enough. One question: which of the two (WILO or Elo) would be less sensitive to the size of White's opening advantage coming out of book (assuming reversal testing)?
I don't know which model is more accurate in ratings and their transitivity. Looking at the data, your question about sensitivity to White's advantage is interesting. From the LTC FGRL list, Ordo gives a White advantage of 39.66 +/- 2.87 Elo points for Elo and 95.69 +/- 9.88 Wilo points for Wilo. In other words, the White advantage is at 13.8 SD in Elo (39.66 / 2.87) and at 9.7 SD in Wilo (95.69 / 9.88), so Wilo is less sensitive to White's advantage. One reservation: I am not sure how Ordo calculates the error margins in this case.

Also, here is another result showing the advantage of WILO: testing from regular balanced openings and from balanced endgame openings. On the latter, ELO goes completely off, while WILO still behaves reasonably:

Regular openings:
Score of Stockfish 8 vs Andscacs 0.90: +279 -29 =92 [0.812] 400
ELO difference: 254.73 +/- 34.61

Endgame openings:
Score of Stockfish 8 vs Andscacs 0.90: +26 -3 =371 [0.529] 400
ELO difference: 20.00 +/- 8.98

We see that ELO is completely off, WILO is fine.
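
As a rough check of the two results above, here is a minimal Python sketch. It assumes that Wilo is simply the logistic Elo formula applied to wins and losses only, with draws discarded (a reading of this thread, not a confirmed definition). The function names are made up for illustration.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_diff(wins, losses, draws):
    """Standard Elo: draws count as half a point."""
    games = wins + losses + draws
    return elo_from_score((wins + 0.5 * draws) / games)

def wilo_diff(wins, losses, draws=0):
    """Assumed Wilo: same formula, but draws are ignored entirely."""
    return elo_from_score(wins / (wins + losses))

results = {"regular": (279, 29, 92), "endgame": (26, 3, 371)}
for label, (w, l, d) in results.items():
    print(f"{label}: Elo {elo_diff(w, l, d):7.2f}   Wilo {wilo_diff(w, l, d):7.2f}")
```

The Elo figures reproduce the 254.73 and 20.00 quoted above. Under the stated assumption, the Wilo figures come out at roughly 393 and 375, so they move far less between the two opening sets.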

The question of draws and our intuition is again thorny; things can be counterintuitive. All computations here (and in all rating schemes, AFAIK) assume a uniform prior (that's why LOS and p-value are independent of draws), but humans rarely have uniform priors. Say you pit Komodo against some strong opponent: you probably already know that the Elo difference is not that large. That prior would introduce a draw dependence in LOS, and it can be counterintuitive. For example, in that case a score of +4 -1 =12 has a higher LOS than a score of +4 -1 =2. They have equal LOS under the uniform prior that is usually assumed. If I find time, I may post simulations showing this effect for very small expected Elo differences (of the order of 10 Elo points) in development frameworks, which usually self-test closely related versions.
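
The uniform-prior half of this claim can be checked numerically. The sketch below is an illustration only, not the simulation mentioned in the post: it assumes a flat prior over (p_win, p_draw, p_loss), takes LOS to be the posterior probability that p_win > p_loss, and integrates on a coarse grid. The name `los_grid` and the grid size are hypothetical choices.

```python
def los_grid(wins, losses, draws, n=200):
    """P(p_win > p_loss | result) under a flat prior on (p_win, p_draw, p_loss).

    Midpoint-rule integration of the trinomial likelihood over the
    probability simplex; cells on the diagonal p_win == p_loss get half weight.
    """
    above = total = 0.0
    for i in range(n):
        pw = (i + 0.5) / n
        for j in range(n):
            if i + j + 1 >= n:
                continue  # cell midpoint lies on or outside the simplex
            pl = (j + 0.5) / n
            pd = 1.0 - pw - pl
            weight = pw**wins * pl**losses * pd**draws
            total += weight
            if i > j:
                above += weight
            elif i == j:
                above += 0.5 * weight
    return above / total

print(los_grid(4, 1, 2))    # +4 -1 =2
print(los_grid(4, 1, 12))   # +4 -1 =12
```

Both calls should print values close to 57/64 (about 0.891), up to grid error. Showing the draw dependence would require replacing the flat prior with an informative one, which is not attempted here.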

clumma
Posts: 177
Joined: Fri Oct 10, 2014 8:05 pm
Location: Berkeley, CA

Re: Wilo rating properties from FGRL rating lists

Post by clumma » Thu May 04, 2017 5:49 am

Laskos wrote:LOS and p-value are independent of Draws
I don't know where you're getting that but it's clearly not correct in this context. Consider an even simpler example than Larry's, and let's be sure to keep the number of games constant: A scores 100 wins against C; B scores 50 wins and 50 draws against C. Clearly A is much stronger than B, and B is closer to C's strength than Wilo would suggest. We are also more certain of B's strength with Elo since we use all 100 samples.

There's no getting around that draws happen in chess, and that they carry information.

Dann Corbit
Posts: 9994
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Wilo rating properties from FGRL rating lists

Post by Dann Corbit » Thu May 04, 2017 7:05 am

A plays B and they draw a trillion times
A plays C one trillion games and A gets 0 wins, 1 loss, the rest draws.

A B C are equal in strength. If anyone thinks the one loss matters, I would argue that math is useless when thrust upon them.

I hate with a flaming, boiling, angry, venomous passion the idea that we should simply throw out draws. That's because we are throwing out the math, along with it.

IMO, YMMV, BIID, YATOOYBM.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

Laskos
Posts: 9414
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Wilo rating properties from FGRL rating lists

Post by Laskos » Thu May 04, 2017 8:22 am

clumma wrote:
Laskos wrote:LOS and p-value are independent of Draws
I don't know where you're getting that but it's clearly not correct in this context. Consider an even simpler example than Larry's, and let's be sure to keep the number of games constant: A scores 100 wins against C; B scores 50 wins and 50 draws against C. Clearly A is much stronger than B, and B is closer to C's strength than Wilo would suggest. We are also more certain of B's strength with Elo since we use all 100 samples.

There's no getting around that draws happen in chess, and that they carry information.
That LOS with a uniform prior is independent of draws? The math is not that complicated, probably best presented here:
http://www.talkchess.com/forum/viewtopi ... 05&t=30624
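
For readers who cannot follow the link, here is one standard way to see it. This is a sketch assuming a flat prior on the win, draw and loss probabilities $w, d, l$ and a result of $W$ wins, $D$ draws and $L$ losses; the linked post may argue differently.

```latex
\begin{aligned}
p(w,d,l \mid W,D,L) &\propto w^{W} d^{D} l^{L}, \qquad w+d+l=1 \\
\text{substitute } w &= t\,x,\quad l = t\,(1-x),\quad d = 1-t, \qquad
  \left|\frac{\partial(w,l)}{\partial(t,x)}\right| = t \\
p(t,x \mid W,D,L) &\propto
  \underbrace{x^{W}(1-x)^{L}}_{\text{depends on } x \text{ only}} \cdot
  \underbrace{t^{\,W+L+1}(1-t)^{D}}_{\text{depends on } t \text{ only}} \\
\mathrm{LOS} = P\!\left(x > \tfrac12\right) &=
  \frac{\int_{1/2}^{1} x^{W}(1-x)^{L}\,dx}{\int_{0}^{1} x^{W}(1-x)^{L}\,dx}
\end{aligned}
```

Here $x = w/(w+l)$ is the share of decisive games won, so $w > l$ exactly when $x > 1/2$. The draw count $D$ appears only in the $t$ factor, which integrates out, leaving a LOS that depends on $W$ and $L$ alone.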

Laskos
Posts: 9414
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Wilo rating properties from FGRL rating lists

Post by Laskos » Thu May 04, 2017 8:25 am

Dann Corbit wrote:A plays B and they draw a trillion times
A plays C one trillion games and A gets 0 wins, 1 loss, the rest draws.

A B C are equal in strength. If anyone thinks the one loss matters, I would argue that math is useless when thrust upon them.

I hate with a flaming, boiling, angry, venomous passion the idea that we should simply throw out draws. That's because we are throwing out the math, along with it.

IMO, YMMV, BIID, YATOOYBM.
What you hate and what you love might be relevant with girls on some evenings.

Dann Corbit
Posts: 9994
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Wilo rating properties from FGRL rating lists

Post by Dann Corbit » Thu May 04, 2017 8:30 am

Laskos wrote:
Dann Corbit wrote:A plays B and they draw a trillion times
A plays C one trillion games and A gets 0 wins, 1 loss, the rest draws.

A B C are equal in strength. If anyone thinks the one loss matters, I would argue that math is useless when thrust upon them.

I hate with a flaming, boiling, angry, venomous passion the idea that we should simply throw out draws. That's because we are throwing out the math, along with it.

IMO, YMMV, BIID, YATOOYBM.
What you hate and what you love might be relevant with girls on some evenings.
Tell me truthfully, you do not see the single loss as noise?
This is not feeling. It is one of those obvious things.
If you do not see that A,B,C have equal strength, then what can you possibly imagine?
When the math tells you something you can instantly and intuitively feel is wrong, then examine the model and you will find the flaws.

You are better at math than I am (though my degree is in applied mathematics).

And yet you must see what I say is correct or you have some strange fog.

Laskos
Posts: 9414
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Wilo rating properties from FGRL rating lists

Post by Laskos » Thu May 04, 2017 8:56 am

Dann Corbit wrote:
Laskos wrote:
Dann Corbit wrote:A plays B and they draw a trillion times
A plays C one trillion games and A gets 0 wins, 1 loss, the rest draws.

A B C are equal in strength. If anyone thinks the one loss matters, I would argue that math is useless when thrust upon them.

I hate with a flaming, boiling, angry, venomous passion the idea that we should simply throw out draws. That's because we are throwing out the math, along with it.

IMO, YMMV, BIID, YATOOYBM.
What you hate and what you love might be relevant with girls on some evenings.
Tell me truthfully, you do not see the single loss as noise?
This is not feeling. It is one of those obvious things.
If you do not see that A,B,C have equal strength, then what can you possibly imagine?
When the math tells you something you can instantly and intuitively feel is wrong, then examine the model and you will find the flaws.

You are better at math than I am (though my degree is in applied mathematics).

And yet you must see what I say is correct or you have some strange fog.
Probably equal (and they might even have been equal at that time, but not later):

In 1995, Chinook defended its man-machine title against Don Lafferty in a 32-game match. The final score was 1–0 with 31 draws for Chinook over Lafferty. After the match, Jonathan Schaeffer decided not to let Chinook compete any more, but instead try to solve checkers.
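
To put a number on "probably equal", here is a minimal sketch assuming the usual flat-prior LOS (the posterior probability that the win probability exceeds the loss probability). The function name `los_uniform` is made up for illustration.

```python
from math import comb

def los_uniform(wins, losses):
    """LOS under a flat prior: P(X > 1/2) for X ~ Beta(wins + 1, losses + 1).

    For integer counts this equals P(Binomial(wins + losses + 1, 1/2) <= wins).
    The number of draws does not enter the formula at all.
    """
    n = wins + losses + 1
    return sum(comb(n, k) for k in range(wins + 1)) / 2**n

print(los_uniform(1, 0))   # Chinook vs Lafferty: +1 -0 =31
print(los_uniform(0, 1))   # A vs C in the trillion-game example: +0 -1
print(los_uniform(4, 1))   # the +4 -1 scores mentioned earlier
```

Under these assumptions a single decisive game gives the winner a LOS of 75% (25% for the loser), whatever the number of draws. That is weak evidence either way, which fits "probably equal".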

lkaufman
Posts: 3684
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

Re: Wilo rating properties from FGRL rating lists

Post by lkaufman » Thu May 04, 2017 2:49 pm

Laskos wrote:
lkaufman wrote:
Laskos wrote:The issue of transitivity in chess ratings has always been thorny to me. Although engine A might come out stronger than B in a direct match in this highly hypothetical scenario (I cannot find engine candidates that model it even remotely, and the evidence must be empirical), the predictions of these models are ratings and error margins. In your example, the Elo rating predicts that engine A is stronger than engine B by 102 +/- the Elo error margin. The Wilo rating predicts that the difference between them is 0 +/- the Wilo error margin. In this case the Wilo error margins are a little larger than the Elo error margins. If the empirical result comes out 1.5 Elo SD from the Elo value of 102 and 1.0 Wilo SD from the Wilo value of 0, then even though engine A might indeed be stronger than engine B, Wilo still predicted the difference between them better. I don't know which model has more empirical transitivity in it. Note that we have a little empirical evidence that the Wilo ratings, which show a larger difference between Stockfish and Komodo in the rating lists, predict the outcome of the direct Stockfish-Komodo match better than the Elo ratings do, showing better transitivity here.
So it seems that even if it is illogical/unreasonable to ignore all evidence from draws, it may well be that WILO is superior to Elo even in an extreme case like this. I would say that your results show that draws get too much weight in normal Elo, even if zero weight is too little. This incidentally means (unless I am mistaken) that BayesElo is worse than normal ELO, because (it was once explained by HGM) BayesElo effectively weights draws twice as heavily as normal Elo. Probably you could come up with a system that predicts results even better than WILO (something intermediate between WILO and Elo, but closer to WILO), but probably you feel that WILO is good enough. One question: which of the two (WILO or Elo) would be less sensitive to the size of White's opening advantage coming out of book (assuming reversal testing)?
I don't know which model is more accurate in ratings and their transitivity. Looking at the data, your question about sensitivity to White's advantage is interesting. From the LTC FGRL list, Ordo gives a White advantage of 39.66 +/- 2.87 Elo points for Elo and 95.69 +/- 9.88 Wilo points for Wilo. In other words, the White advantage is at 13.8 SD in Elo (39.66 / 2.87) and at 9.7 SD in Wilo (95.69 / 9.88), so Wilo is less sensitive to White's advantage. One reservation: I am not sure how Ordo calculates the error margins in this case.

Also, here is another result showing the advantage of WILO: testing from regular balanced openings and from balanced endgame openings. On the latter, ELO goes completely off, while WILO still behaves reasonably:

Regular openings:
Score of Stockfish 8 vs Andscacs 0.90: +279 -29 =92 [0.812] 400
ELO difference: 254.73 +/- 34.61

Endgame openings:
Score of Stockfish 8 vs Andscacs 0.90: +26 -3 =371 [0.529] 400
ELO difference: 20.00 +/- 8.98

We see that ELO is completely off, WILO is fine.

The question of draws and our intuition is again thorny; things can be counterintuitive. All computations here (and in all rating schemes, AFAIK) assume a uniform prior (that's why LOS and p-value are independent of draws), but humans rarely have uniform priors. Say you pit Komodo against some strong opponent: you probably already know that the Elo difference is not that large. That prior would introduce a draw dependence in LOS, and it can be counterintuitive. For example, in that case a score of +4 -1 =12 has a higher LOS than a score of +4 -1 =2. They have equal LOS under the uniform prior that is usually assumed. If I find time, I may post simulations showing this effect for very small expected Elo differences (of the order of 10 Elo points) in development frameworks, which usually self-test closely related versions.
Another issue is that the programming of engines, as well as the play of human grandmasters, aims to maximize the score with draws counting as 1/2, rather than just the number of wins (although wins are sometimes used as a tiebreak). WILO might be better mathematically, but it does not correspond to the actual scoring of tournaments. This is not a minor issue. Suppose Komodo (or even Carlsen) reaches a middlegame position with a half-pawn advantage or so. He has to decide between two plans. He can retain queens, with, let's say, a 60% winning chance, a 20% losing chance, and a 20% drawing chance. Or he can simplify to an endgame with a 24% winning chance, a 75% drawing chance, and a 1% losing chance (i.e. a gross blunder or flag fall). In any normal tournament or match, he should keep queens on (assuming a neutral tournament/match situation) to maximize his expected score. But to maximize WILO, he should trade queens. Komodo has code to try to avoid simplifying in such a situation (maybe not very effective, but that's irrelevant); if we wanted to maximize WILO we would have to make significant program changes. In my view, to justify switching to WILO we would have to return to the old practice of replaying draws until someone wins. Elimination tournaments with playoffs at faster time limits to break ties are a version of this, but then you are rating blitz games together with slow ones. This is also my objection to BayesElo; it too makes an assumption that does not correspond to normal match/tournament scoring.
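
The arithmetic behind the queen-trade example can be sketched as follows. The probabilities are the ones given in the post. Treating "maximize WILO" as "maximize the share of decisive games won" is an assumption about how a draw-ignoring rating behaves, and the function names are hypothetical.

```python
def expected_score(p_win, p_draw, p_loss):
    """Tournament scoring: win = 1, draw = 1/2, loss = 0."""
    return p_win + 0.5 * p_draw

def decisive_win_share(p_win, p_draw, p_loss):
    """Share of decisive games won; what a draw-ignoring rating would reward."""
    return p_win / (p_win + p_loss)

options = {
    "keep queens": (0.60, 0.20, 0.20),
    "trade queens": (0.24, 0.75, 0.01),
}
for name, probs in options.items():
    print(f"{name}: score {expected_score(*probs):.3f}, "
          f"decisive win share {decisive_win_share(*probs):.2f}")
```

Keeping queens gives the higher expected score (0.700 against 0.615), while trading gives the higher decisive win share (0.96 against 0.75), so the two objectives pull in opposite directions.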
Komodo rules!

hgm
Posts: 23624
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: Wilo rating properties from FGRL rating lists

Post by hgm » Thu May 04, 2017 2:59 pm

Good point. You could be better than all your opponents in a tourney with a LOS of 99%, and still have less than 1% probability to win the tourney.

So how you play should depend on the type of tourney you are in. And even if you are in a two-player match, on the current score and how many games are left.
