Laskos wrote:The issue of transitivity in Chess ratings has always been thorny to me. Although it might be that engine A comes out stronger than engine B in a direct match in this highly hypothetical scenario (I cannot find engine candidates to model it even remotely, and the evidence must be empirical), the predictions of these models are ratings and error margins. In your example, the Elo rating predicts that engine A is stronger than engine B by 102 points +/- the Elo error margins, while the Wilo rating predicts that the difference between them is 0 +/- the Wilo error margins. In this case the Wilo error margins are a little larger than the Elo error margins. If the empirical result comes in at 1.5 Elo SD from the Elo value of 102 and at 1.0 Wilo SD from the Wilo value of 0, then although engine A might indeed be stronger than engine B, Wilo still predicted the difference between them better. I don't know which model has more empirical transitivity in it. Observe that we do have some small empirical evidence that the Wilo ratings, which show a larger difference between Stockfish and Komodo in rating lists, predict the outcome of a direct match between Stockfish and Komodo better than the Elo ratings do, showing a better transitivity property here.
So it seems that even if it is illogical/unreasonable to ignore all evidence from draws, it may well be that WILO is superior to Elo even in an extreme case like this. I would say that your results show that draws get too much weight in normal Elo, even if zero weight is too little. This incidentally means (unless I am mistaken) that BayesElo is worse than normal Elo, because (as HGM once explained) BayesElo effectively weights draws twice as heavily as normal Elo. Probably you could come up with a system that predicts results even better than WILO (something intermediate between WILO and Elo, but closer to WILO), but probably you feel that WILO is good enough. One question: which of the two (WILO or Elo) would be less sensitive to the size of White's opening advantage coming out of book (assuming reversal testing)?
I don't know which model is more accurate in its ratings and their transitivity. Looking at the data, your question on sensitivity to White's advantage is interesting. From the LTC FGRL list, Ordo puts the White advantage at 39.66 +/- 2.87 Elo points, and at 95.69 +/- 9.88 Wilo points. In SD terms, Elo places the White advantage at 13.8 SD while Wilo places it at 9.7 SD, so WILO is less sensitive to the White advantage. With one caveat: I am not sure how Ordo calculates error margins in this case.
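For clarity, the SD figures above are just the central estimate divided by its quoted error margin (treating the Ordo margin as one SD, as the post does); a quick check:

```python
# White-advantage estimates from the LTC FGRL list (Ordo output quoted above)
elo_adv, elo_err = 39.66, 2.87     # Elo points +/- error margin
wilo_adv, wilo_err = 95.69, 9.88   # Wilo points +/- error margin

# How many error-margin units ("SD") each estimate sits away from zero.
print(f"Elo:  {elo_adv / elo_err:.1f} SD")    # -> 13.8
print(f"Wilo: {wilo_adv / wilo_err:.1f} SD")  # -> 9.7
```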
Also, here is another result showing the advantage of WILO: testing from regular balanced openings and from balanced endgame openings, where ELO goes completely off while WILO still behaves reasonably:
Regular balanced openings:
Score of Stockfish 8 vs Andscacs 0.90: +279 -29 =92 [0.812] 400
ELO difference: 254.73 +/- 34.61
Balanced endgame openings:
Score of Stockfish 8 vs Andscacs 0.90: +26 -3 =371 [0.529] 400
ELO difference: 20.00 +/- 8.98
We see that ELO is completely off, giving wildly different ratings for the same pair of engines depending on the openings, while WILO stays consistent.
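As a quick check of that claim (my own computation, not from the original post): applying the usual logistic Elo formula to the full score, and the same formula to decisive games only for Wilo, the two ELO differences disagree by more than a factor of ten, while the two WILO differences land close together.

```python
import math

def elo_diff(wins, losses, draws):
    """Logistic Elo difference from a match score (draws count half a point)."""
    score = (wins + 0.5 * draws) / (wins + losses + draws)
    return -400.0 * math.log10(1.0 / score - 1.0)

def wilo_diff(wins, losses):
    """Wilo: the same logistic formula, but draws are ignored entirely."""
    score = wins / (wins + losses)
    return -400.0 * math.log10(1.0 / score - 1.0)

# Regular balanced openings: +279 -29 =92
print(f"Elo {elo_diff(279, 29, 92):7.2f}   Wilo {wilo_diff(279, 29):7.2f}")
# Balanced endgame openings: +26 -3 =371
print(f"Elo {elo_diff(26, 3, 371):7.2f}   Wilo {wilo_diff(26, 3):7.2f}")
# Elo: ~255 vs ~20 (inconsistent); Wilo: ~393 vs ~375 (consistent)
```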
The question of draws and our intuition is again thorny; things can be counterintuitive. All computations here (and in all rating schemes AFAIK) assume a uniform prior (that is why LOS and the p-value are independent of draws), but humans rarely have uniform priors. Say you pit Komodo against some strong opponent: you probably already know that the Elo difference is not that large. That prior would introduce a draw dependence into LOS, and it can be counterintuitive. For example, in that case a score of +4 -1 =12 has a higher LOS than a score of +4 -1 =2, whereas the two have equal LOS under the uniform prior which is usually assumed. If I find time, I may post simulations showing this effect for very small expected Elo differences (of the order of 10 Elo points) in development frameworks, which usually self-test closely related versions.
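To illustrate the uniform-prior case: the usual normal-approximation formula for LOS, 0.5 * (1 + erf((W - L) / sqrt(2*(W + L)))), depends only on the decisive games, so the two scores above come out identical (the formula and the check are mine, not from the original post):

```python
import math

def los(wins, losses, draws):
    """Likelihood of superiority under a uniform prior (normal approximation).

    'draws' is intentionally unused: with a uniform prior, draws carry no
    information about which engine is stronger, so they cancel out."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

print(f"+4 -1 =12 -> LOS {los(4, 1, 12):.4f}")
print(f"+4 -1 =2  -> LOS {los(4, 1, 2):.4f}")
# Both print the same value (~0.91): only the 4 wins and 1 loss matter.
```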