Houdini, much weaker engines, and Arpad Elo

Laskos · Post by **Laskos** » Fri Nov 29, 2013 11:31 am

1)
Most, if not all, computer chess rating lists are based on the logistic curve,
D_ELO = -400/Log[10] * Log[1/s-1],
where s = (w+d/2)/(w+d+l).
It is based on the assumption that if player A scores k times as many points (with wins counting for 1, draws counting for 1/2) as player B in their match, and player B scores k times as many points as his opponent C, then in a match between A and C, A should score k*k times as many points as C. Its inverse is

s=1/(1+10^(-D_ELO/400))

FIDE, however, uses a Gaussian distribution for calculating ELO, following Arpad Elo:
http://www.fide.com/fide/handbook.html? ... ew=article
Arpad Elo assumed that, at any given time, every chess player has a normal (Gaussian) distribution of chess levels (i.e. ratings), all with the same standard deviation, sigma = 200 points, but each with a specific mean level. The distribution of the difference of two Gaussian-distributed variables of the same standard deviation is itself a Gaussian, whose mean is the difference of the two means, and whose standard deviation is sqrt(2)*sigma. With sigma = 200 this gives

s = (1+Erf[D_ELO/400])/2
D_ELO = 400 * InverseErf[2*s-1]
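
The two models above can be sketched in a few lines of Python. The function names are illustrative and not taken from any rating tool. InverseErf is not in the Python standard library, so the sketch assumes it can be built from the standard normal quantile, using erfinv(x) = Phi^-1((x+1)/2)/sqrt(2).

```python
import math
from statistics import NormalDist

def inverse_erf(x):
    # Assumption: erfinv built from the standard normal quantile,
    # erfinv(x) = Phi^-1((x + 1) / 2) / sqrt(2); valid for -1 < x < 1.
    return NormalDist().inv_cdf((x + 1.0) / 2.0) / math.sqrt(2.0)

def logistic_elo(s):
    # D_ELO = -400/Log[10] * Log[1/s - 1]
    return -400.0 / math.log(10.0) * math.log(1.0 / s - 1.0)

def logistic_score(d_elo):
    # s = 1/(1 + 10^(-D_ELO/400))
    return 1.0 / (1.0 + 10.0 ** (-d_elo / 400.0))

def gaussian_elo(s):
    # D_ELO = 400 * InverseErf[2*s - 1]
    return 400.0 * inverse_erf(2.0 * s - 1.0)

def gaussian_score(d_elo):
    # s = (1 + Erf[D_ELO/400]) / 2
    return (1.0 + math.erf(d_elo / 400.0)) / 2.0

# Expected score under each model for a few rating differences.
for d in (50, 200, 600):
    print(d, round(logistic_score(d), 4), round(gaussian_score(d), 4))
```

Each `*_elo` function is the inverse of the matching `*_score` function, so converting a score to a rating difference and back should return the original score.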

2)
What started as a game to compare two engines of widely different strength turned out to be more serious. I took Houdini 1.5a 64 bit, Houdini 1.5a 32 bit, and the SOS 5.1 engine (and the AnMon engine, which behaves similarly to SOS, but I will talk mainly about SOS 5.1). I know how H15a x64 compares to H15a x32 at the desired time control (250ms per move) on my PC: the 64 bit version is 26% faster and 36 +/- 3 ELO points stronger than the 32 bit one in these conditions. I ran H1.5a x64 against SOS 5.1 for 10,000 games at 250ms per move:

Code: Select all

    Program                            Score  

  1 Houdini 1.5a 64                : 9732.5/10000
  2 SOS 5.1                        :  267.5/10000

Then I ran H1.5a x32 against SOS 5.1:

Code: Select all

    Program                            Score  

  1 Houdini 1.5a 32                : 9638.0/10000
  2 SOS 5.1                        :  362.0/10000

The prediction of the Logistic Model is that the difference between H15a x64 and H15a x32 is
-400/Log[10]* Log[1/0.97325-1]+400/Log[10]* Log[1/0.9638-1] = 54.2 ELO points
The prediction of the Gaussian Model is that the difference is
400 * InverseErf[-1+2*0.97325]-400 * InverseErf[-1+2*0.9638] = 38.0 ELO points

The real difference in both ELO models is 36 +/- 3 points, which is predicted well by the Gaussian Model. The Logistic Model is completely off. The same happened with another engine, AnMon, and it seems the Gaussian Model (Arpad Elo's) is a better predictor for engine ratings comprising large ELO differences. A larger study including many engines would be nice, taking something like the CCRL or CEGT database to verify this. To be statistically consistent, one needs either a few engines with many games played, or many engines with fewer games played.
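
The two predictions can be reproduced from the match scores with a short Python sketch. The helper names are illustrative; the Gaussian formula uses the identity 400*InverseErf[2*s-1] = 400*Phi^-1(s)/sqrt(2), with Phi^-1 taken from the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

def logistic_elo(s):
    # D_ELO = -400/Log[10] * Log[1/s - 1]
    return -400.0 / math.log(10.0) * math.log(1.0 / s - 1.0)

def gaussian_elo(s):
    # 400 * InverseErf[2*s - 1], written via the normal quantile.
    return 400.0 * NormalDist().inv_cdf(s) / math.sqrt(2.0)

# Scores against SOS 5.1 from the two 10,000-game matches.
s64 = 9732.5 / 10000
s32 = 9638.0 / 10000

# Predicted x64 - x32 difference, using SOS 5.1 as the common opponent.
logistic_gap = logistic_elo(s64) - logistic_elo(s32)
gaussian_gap = gaussian_elo(s64) - gaussian_elo(s32)

print(f"Logistic: {logistic_gap:.1f}  Gaussian: {gaussian_gap:.1f}")
```

This should give roughly 54 points for the logistic model and roughly 38 for the Gaussian one, to be compared with the directly measured 36 +/- 3.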

3)
Some properties of the Gaussian Model compared to the Logistic one:
The ratio of D_ELO Logistic over D_ELO Gaussian: (-400/Log[10] * Log[1/s-1]) / (400 * InverseErf[2*s-1])

The ratio of the derivatives Logistic/Gaussian: (400/((-1 + 1/s) s^2 Log[10])) / (400 E^InverseErf[-1 + 2 s]^2 Sqrt[Pi])

On the tails, the ELO differences given by the two models can differ by as much as 50%, so knowing which model to use is important for large ELO differences. As things stand in engine rating lists, the ratings are probably inflated, and a comparison across a wide range gives bad predictions using the Logistic Model.
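
To get a feel for how far apart the two models drift, here is a small Python sketch that tabulates the ratio D_ELO Logistic / D_ELO Gaussian at a few scores. The function names and the chosen scores are illustrative; s = 1/2 itself is skipped because both differences are zero there.

```python
import math
from statistics import NormalDist

def logistic_elo(s):
    return -400.0 / math.log(10.0) * math.log(1.0 / s - 1.0)

def gaussian_elo(s):
    # 400 * InverseErf[2*s - 1], written via the normal quantile.
    return 400.0 * NormalDist().inv_cdf(s) / math.sqrt(2.0)

def ratio(s):
    # D_ELO Logistic divided by D_ELO Gaussian; undefined at s = 0.5.
    return logistic_elo(s) / gaussian_elo(s)

for s in (0.51, 0.75, 0.90, 0.97, 0.99, 0.999, 0.9999):
    print(f"s={s:<7} logistic={logistic_elo(s):7.1f} "
          f"gaussian={gaussian_elo(s):7.1f} ratio={ratio(s):.3f}")
```

Near s = 1/2 the ratio is slightly below 1, and it grows steadily as s approaches 1; the 50% mark is only reached at very lopsided scores.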

For close matches (s around 1/2), first-order Taylor series expansions of the two formulas give

Logistic: D_ELO = 1600/Log[10] * (s-1/2) ~ 694.9*(s-1/2)
Gaussian: D_ELO = 400*Sqrt[Pi] * (s-1/2) ~ 709.0*(s-1/2)

So, for small ELO differences, the model is not that important.
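
The two first-order coefficients can be checked numerically with a central finite difference at s = 1/2. This is only a sanity check under the same formulas as above; the step size `h` is an arbitrary choice.

```python
import math
from statistics import NormalDist

def logistic_elo(s):
    return -400.0 / math.log(10.0) * math.log(1.0 / s - 1.0)

def gaussian_elo(s):
    # 400 * InverseErf[2*s - 1], written via the normal quantile.
    return 400.0 * NormalDist().inv_cdf(s) / math.sqrt(2.0)

def slope_at_half(f, h=1e-4):
    # Central difference approximation of dD_ELO/ds at s = 1/2.
    return (f(0.5 + h) - f(0.5 - h)) / (2.0 * h)

slope_logistic = slope_at_half(logistic_elo)
slope_gaussian = slope_at_half(gaussian_elo)

print(f"Logistic slope: {slope_logistic:.1f}  (1600/Log[10]   = {1600 / math.log(10):.1f})")
print(f"Gaussian slope: {slope_gaussian:.1f}  (400*Sqrt[Pi]   = {400 * math.sqrt(math.pi):.1f})")
```

The two slopes differ by about 2%, which is why the choice of model hardly matters for closely matched engines.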

Eelco de Groot · Post by **Eelco de Groot** » Fri Nov 29, 2013 11:55 am

I had no idea that computer chess rating lists, or at least some of them, use a logistic rating curve, Kai! Are you sure about that? If some of them use BayesElo instead, is that closer to Gaussian or logistic, or is the difference there more in the scaling of small rating differences, as seems to be the case in SPRT?

I think that most national human rating lists will follow FIDE and use Gaussian, so I'm very surprised that computer rating lists apparently took a different turn. Did they maybe all follow the SSDF? I very vaguely remember seeing a logistic kind of formula used there, but that was just some post years ago, on the old CCC forum I think.

Eelco

Laskos · Post by **Laskos** » Fri Nov 29, 2013 12:00 pm

Eelco de Groot wrote:I had no idea that computer chess rating lists, or at least some of them, use a logistic rating curve, Kai! Are you sure about that? If some of them use BayesElo instead, is that closer to Gaussian or logistic, or is the difference there more in the scaling of small rating differences, as seems to be the case in SPRT?

I think that most national human rating lists will follow FIDE and use Gaussian, so I'm very surprised that computer rating lists apparently took a different turn. Did they maybe all follow the SSDF? I very vaguely remember seeing a logistic kind of formula used there, but that was just some post years ago, on the old CCC forum I think.

Eelco

All three, ELOStat, BayesElo and Ordo, use logistic curves, and therefore so do all the rating lists. As far as I know, not a single one in computer chess uses Gaussian, but I may have missed some.

Milos · Post by **Milos** » Sat Nov 30, 2013 12:57 am

That is interesting, however it doesn't prove that logistic curve is bad per se.
Problem is in Houdini contempt (again). It exaggerates the strength against weaker engines, therefore even though both versions have exaggerated strength difference against weak engines x64 is more pronounced.
So it would really be interesting if you could repeat the test this time with contempt 0 of Houdini.

Laskos · Post by **Laskos** » Sat Nov 30, 2013 3:43 am

Milos wrote:That is interesting, however it doesn't prove that logistic curve is bad per se.
Problem is in Houdini contempt (again). It exaggerates the strength against weaker engines, therefore even though both versions have exaggerated strength difference against weak engines x64 is more pronounced.
So it would really be interesting if you could repeat the test this time with contempt 0 of Houdini.

It doesn't matter. FIDE rules are for humans, who have contempt and whatever. The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players, and it's the same for x32 and x64 engines.

Milos · Post by **Milos** » Sat Nov 30, 2013 3:50 am

Laskos wrote:
Milos wrote:That is interesting, however it doesn't prove that logistic curve is bad per se.
Problem is in Houdini contempt (again). It exaggerates the strength against weaker engines, therefore even though both versions have exaggerated strength difference against weak engines x64 is more pronounced.
So it would really be interesting if you could repeat the test this time with contempt 0 of Houdini.
It doesn't matter. FIDE rules are for humans, who have contempt and whatever. The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players, and it's the same for x32 and x64 engines.

I know it doesn't matter for normal distribution, I'm talking about logistic, there it has impact.

Laskos · Post by **Laskos** » Mon Dec 02, 2013 10:29 am

Milos wrote:
Laskos wrote:
Milos wrote:That is interesting, however it doesn't prove that logistic curve is bad per se.
Problem is in Houdini contempt (again). It exaggerates the strength against weaker engines, therefore even though both versions have exaggerated strength difference against weak engines x64 is more pronounced.
So it would really be interesting if you could repeat the test this time with contempt 0 of Houdini.
It doesn't matter. FIDE rules are for humans, who have contempt and whatever. The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players, and it's the same for x32 and x64 engines.
I know it doesn't matter for normal distribution, I'm talking about logistic, there it has impact.

If the predictions of the Gaussian model are correct and contradict those of the Logistic model, there is no way the Logistic model works well for this ELO span.

I post here a plot of D_ELO Gaussian vs. D_ELO Logistic; the curved line is the Gaussian model. The divergence at the tails is fairly pronounced.

Laskos · Post by **Laskos** » Wed Dec 04, 2013 11:44 am

I tested Stockfish DD x64 and Stockfish DD x32 against SOS 5.1 (250ms/move):

Code: Select all

    Program                            Score       %

  1 SF DD 64                       : 4858.0/5000  97.2
  2 SF DD 32                       : 4801.0/5000  96.0
  3 SOS 5.1                        : 341.0/10000   3.4

The real difference between SF DD x64 and SF DD x32 is 42 +/- 5 ELO points at this TC (Gaussian or Logistic).

The predictions of the two models based on this match are:
D_ELO Gaussian: 43 points
D_ELO Logistic: 61 points

Again, the Gaussian model is a better predictor on the tails, for large ELO differences.
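
The same check can be run on the Stockfish table. The sketch below wraps the calculation in a hypothetical helper, `predicted_gaps`, which takes the points each version scored against the common opponent and returns the logistic and Gaussian predictions for the difference between them.

```python
import math
from statistics import NormalDist

def logistic_elo(s):
    return -400.0 / math.log(10.0) * math.log(1.0 / s - 1.0)

def gaussian_elo(s):
    # 400 * InverseErf[2*s - 1], written via the normal quantile.
    return 400.0 * NormalDist().inv_cdf(s) / math.sqrt(2.0)

def predicted_gaps(points_a, points_b, games):
    # Both engines played `games` games against the same opponent.
    s_a, s_b = points_a / games, points_b / games
    return (logistic_elo(s_a) - logistic_elo(s_b),
            gaussian_elo(s_a) - gaussian_elo(s_b))

# SF DD x64 and SF DD x32, 5000 games each against SOS 5.1.
logistic_gap, gaussian_gap = predicted_gaps(4858.0, 4801.0, 5000)
print(f"Logistic: {logistic_gap:.0f}  Gaussian: {gaussian_gap:.0f}")

# Measured difference quoted in the post: 42 +/- 5.
for name, gap in (("Logistic", logistic_gap), ("Gaussian", gaussian_gap)):
    print(name, "within 42 +/- 5:", 37.0 <= gap <= 47.0)
```

Only the Gaussian prediction should fall inside the measured 42 +/- 5 interval.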

mohzus · Post by **mohzus** » Thu Jan 02, 2014 4:04 am

Laskos wrote:The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players

There's something I don't really understand about "normal distribution of rating". Does this mean that the "strength" of a single person is assumed to follow a normal distribution with standard deviation 200? Or does this mean that the histogram (number of people with a given elo vs elo) of all players follows a normal distribution? Because the latter doesn't seem true for the Elo rating system (I downloaded and plotted all the ratings of active FIDE players for 2 months of 2013 and the result wasn't a Gaussian but more like a reversed Maxwell speed distribution, with the tail at the lower ratings. The sample was over 150k people).
It sounds very strange to me to assume that the standard deviation in strength is constant and the same constant for every single player.

I've also plotted a histogram of FICS ratings calculated with BayesElo rather than Glicko or Elo; the plot was very close to what I'd call a Gaussian, or a logistic distribution (honestly I can't distinguish between the 2 from a look at a graph).

Wikipedia states that

Wikipedia wrote:Subsequent statistical tests have suggested that chess performance is almost certainly not distributed as a normal distribution, as weaker players have significantly (but not highly significantly) greater winning chances than Elo's model predicts.[citation needed] Therefore, the USCF and some chess sites use a formula based on the logistic distribution. Significant statistical anomalies have also been found when using the logistic distribution in chess.[5] FIDE continues to use the normal distribution. The normal and logistic distribution points are, in a way, arbitrary points in a spectrum of distributions which would work well. In practice, both of these distributions work very well for a number of different games.

Rebel · Post by **Rebel** » Thu Jan 02, 2014 8:52 am

Laskos wrote:I tested Stockfish DD x64 and Stockfish DD x32 against SOS 5.1 (250ms/move):
Code: Select all
    Program                            Score       %

  1 SF DD 64                       : 4858.0/5000  97.2
  2 SF DD 32                       : 4801.0/5000  96.0
  3 SOS 5.1                        : 341.0/10000   3.4
The real difference between SF DD x64 and SF DD x32 is 42 +/- 5 ELO points at this TC (Gaussian or Logistic).

The predictions of the two models based on this match are:
D_ELO Gaussian: 43 points
D_ELO Logistic: 61 points

Again, the Gaussian model is a better predictor on the tails, for large ELO differences.

I am a layman when it comes to the refinements of elo calculations, therefore allow me a couple of stupid questions.

1. These elo differences between Gaussian and Logistic disappear when only engines of (about) equal strength play against each other?

2. Rating lists tend to operate in pools of similar strength. Kind of elo incest. What happens to the elo of high rated engines if they are placed into a pool of (much) weaker engines, would their elo drop?
