Houdini, much weaker engines, and Arpad Elo

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Houdini, much weaker engines, and Arpad Elo

Post by Laskos » Thu Jan 02, 2014 1:15 pm

mohzus wrote:
Laskos wrote:The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players
There's something I don't really understand about "normal distribution of rating". Does this mean that the "strength" of a single player is assumed to follow a normal distribution with standard deviation 200? Or does it mean that the histogram of all players (number of players with a given Elo vs. Elo) follows a normal distribution?
The former: at any given time, every chess player's level of play is assumed to follow a normal (Gaussian) distribution, with the same standard deviation of 200 points for all players.


Because the latter doesn't seem to be true for the Elo rating system: I downloaded and plotted the ratings of all active FIDE players for 2 months of 2013, and the result wasn't a Gaussian but more like a reversed Maxwell speed distribution, with the tail on the side of the lower ratings. The sample was over 150k players.
It sounds very strange to me to assume that the standard deviation in strength is constant and the same constant for every single player.

I've also plotted a histogram of FICS ratings calculated with BayesElo rather than Glicko or Elo; the plot was very close to what I'd call a Gaussian, or a logistic distribution (honestly, I can't distinguish between the two by looking at a graph).

Wikipedia states that
Wikipedia wrote:Subsequent statistical tests have suggested that chess performance is almost certainly not distributed as a normal distribution, as weaker players have significantly (but not highly significantly) greater winning chances than Elo's model predicts.[citation needed] Therefore, the USCF and some chess sites use a formula based on the logistic distribution. Significant statistical anomalies have also been found when using the logistic distribution in chess.[5] FIDE continues to use the normal distribution. The normal and logistic distribution points are, in a way, arbitrary points in a spectrum of distributions which would work well. In practice, both of these distributions work very well for a number of different games.

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Houdini, much weaker engines, and Arpad Elo

Post by Laskos » Thu Jan 02, 2014 1:25 pm

Rebel wrote:
Laskos wrote:I tested Stockfish DD x64 and Stockfish DD x32 against SOS 5.1 (250ms/move):

    Program                            Score       %

  1 SF DD 64                       : 4858.0/5000  97.2
  2 SF DD 32                       : 4801.0/5000  96.0
  3 SOS 5.1                        : 341.0/10000   3.4
The real difference between SF DD x64 and SF DD x32 is 42 +/- 5 ELO points at this TC (Gaussian or Logistic).

The predictions of the two models based on this match are:
D_ELO Gaussian: 43 points
D_ELO Logistic: 61 points

Again, the Gaussian model is a better predictor on the tails, for large ELO differences.
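The two predictions can be reproduced from the score table above. The sketch below is not from the original post: it assumes the usual conversions, namely the Logistic formula D = 400*log10(p/(1-p)) and a Gaussian model in which each player's performance has a standard deviation of 200 points, so that the difference between two players has a standard deviation of 200*sqrt(2). The function names are hypothetical.

```python
from math import log10, sqrt
from statistics import NormalDist


def elo_logistic(score):
    """Rating difference implied by a score fraction under the Logistic model."""
    return 400.0 * log10(score / (1.0 - score))


def elo_gaussian(score, sigma=200.0):
    """Rating difference implied by a score fraction under the Gaussian model.

    Assumption: each player's performance is normal with standard deviation
    `sigma`, so the difference of two performances has deviation sigma*sqrt(2).
    """
    return sigma * sqrt(2.0) * NormalDist().inv_cdf(score)


# Score fractions against SOS 5.1, taken from the table above.
sf64 = 4858.0 / 5000
sf32 = 4801.0 / 5000

# Predicted SF DD 64 vs SF DD 32 difference, via the common opponent.
d_gaussian = elo_gaussian(sf64) - elo_gaussian(sf32)
d_logistic = elo_logistic(sf64) - elo_logistic(sf32)

print(f"D_ELO Gaussian: {d_gaussian:.0f}")
print(f"D_ELO Logistic: {d_logistic:.0f}")
```

Under these assumptions the differences come out at about 43 and 61 points, in line with the figures quoted, and only the Gaussian figure falls inside the measured 42 +/- 5.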
I am a layman when it comes to the refinements of elo calculations, so allow me a couple of stupid questions.

1. Do these elo differences between Gaussian and Logistic disappear when only engines of (about) equal strength play against each other?
Yes. As can be seen from the derivatives at 0, for small ELO differences (below ~200) the Gaussian and Logistic models differ very little.
2. Rating lists tend to operate in pools of similar strength. Kind of elo incest. What happens to the elo of high rated engines if they are placed into a pool of (much) weaker engines, would their elo drop?
It depends on the model. If we use the Logistic model (as is done today in computer chess) and then discover that the engines really follow the Gaussian model, the large ELO differences will come out smaller under the Gaussian model than under the Logistic one. The important thing is to decide which model to use, so as to have additivity over a large ELO span: predicted(ELO1-ELO2) + predicted(ELO2-ELO3) == predicted(ELO1-ELO3).
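The answer to question 1 can be checked numerically. The sketch below is an illustration, not something from the original posts: it assumes the two standard expected-score curves (Logistic with a 400-point scale, Gaussian with a per-player standard deviation of 200), compares their slopes at zero difference, and tabulates the gap at a few rating differences. All names are hypothetical.

```python
from math import log, pi, sqrt
from statistics import NormalDist

SIGMA = 200.0  # assumed per-player standard deviation in the Gaussian model


def expected_logistic(d):
    """Expected score for a rating difference d under the Logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))


def expected_gaussian(d):
    """Expected score for a rating difference d under the Gaussian model."""
    return NormalDist(0.0, SIGMA * sqrt(2.0)).cdf(d)


# Analytic slopes of the two curves at d = 0 (expected score per ELO point).
slope_logistic = log(10.0) / 1600.0
slope_gaussian = 1.0 / (2.0 * SIGMA * sqrt(pi))

print(f"slope at 0: logistic {slope_logistic:.5f}, gaussian {slope_gaussian:.5f}")
for d in (50, 100, 200, 400, 600, 800):
    gap = expected_gaussian(d) - expected_logistic(d)
    print(f"d={d:4d}  logistic={expected_logistic(d):.4f}  "
          f"gaussian={expected_gaussian(d):.4f}  gap={gap:+.4f}")
```

The slopes at zero differ by about 2%, so up to roughly 200 points the two curves stay within a fraction of a percentage point of each other. The gap only becomes noticeable several hundred points out, where the Gaussian curve predicts a higher score for the stronger side.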

Sven
Posts: 3822
Joined: Thu May 15, 2008 7:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Houdini, much weaker engines, and Arpad Elo

Post by Sven » Thu Jan 02, 2014 11:06 pm

Rebel wrote:2. Rating lists tend to operate in pools of similar strength. Kind of elo incest. What happens to the elo of high rated engines if they are placed into a pool of (much) weaker engines, would their elo drop?
I would say: impossible to answer! The rating of an engine A in pool P1 is completely unrelated to the rating of the same engine A in pool P2. So you can't say whether A's rating "drops". A has a rating relative to pool P1 and another rating relative to P2. There is nothing that connects P1 and P2.

Even if P2 is a subset of P1 that only contains A and "some weaker engines" then P2 is still a different pool. I would never attempt to compare P1 ratings to P2 ratings. Many people actually do things like that, e.g. they compare CCRL 40/4 and CCRL 40/40 ratings; I would never have done that ...

I think there are some people who might believe that an ELO rating is something like an absolute, physical property of a chess player (or engine). It simply isn't. An ELO rating is a number that can only be interpreted in a very limited context, and that context is the set of other ELO ratings in the same rating pool. That rating in relation to those other ratings tells us something about the expected outcome of games against the owners of the other ratings, nothing more. "ELO 3200" of a chess engine within a computer chess rating list does not mean an expected outcome of about 92% against a "2800 ELO" human GM. It means nothing like that, simply because the engine and the human GM have their ratings from different pools. One may argue that computer chess ratings are "meant" to resemble ratings of human chess players, and perhaps some engine ratings were "likely" comparable to human ratings, but in fact nothing tells us how far apart the two pools are, unless you include a significant number of human players with a significant number of games in the engine pool (and even then you run into trouble, since human playing strength changes over time, in contrast to that of engines), or vice versa.

For the same reason, a strong engine that is put into a pool with weaker engines simply gets a different rating there that can't be compared to those in other pools. It does not "drop" or "rise".

Sven
