## Houdini, much weaker engines, and Arpad Elo

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 8:21 pm

### Houdini, much weaker engines, and Arpad Elo

1)
Most, if not all, computer chess rating lists are based on the logistic curve:
D_ELO = -400/Log[10] * Log[1/s-1],
where s = (w+d/2)/(w+d+l) is the score fraction (w wins, d draws, l losses) and Log is the natural logarithm (Mathematica notation).
It is based on the assumption that if player A scores k times as many points against player B as B scores against A (wins counting 1, draws counting 1/2), and player B likewise scores k times as many points as his opponent C, then in a match between A and C, A should score k*k times as many points as C. Its inverse is

s=1/(1+10^(-D_ELO/400))
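A minimal Python sketch of the logistic conversion in both directions, together with a numerical check of the k*k property described above. The function names and the 120-point rating differences are assumptions made for illustration, not taken from any rating tool.

```python
import math

def logistic_score(d_elo: float) -> float:
    """Expected score fraction s for a rating advantage d_elo (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-d_elo / 400.0))

def logistic_elo(s: float) -> float:
    """Rating difference implied by a score fraction s, with 0 < s < 1."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def odds(s: float) -> float:
    """Points scored per point conceded: s / (1 - s)."""
    return s / (1.0 - s)

# Hypothetical example: A is 120 ELO above B, and B is 120 ELO above C.
d_ab = 120.0
d_bc = 120.0
k = odds(logistic_score(d_ab))            # A scores k times as many points as B
k_ac = odds(logistic_score(d_ab + d_bc))  # A against C

print(f"k = {k:.4f}, k*k = {k * k:.4f}, A vs C = {k_ac:.4f}")
```

Under the logistic model the ratio s/(1-s) equals 10^(D_ELO/400), which is why adding rating differences multiplies the point ratios.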

FIDE, however, uses a Gaussian distribution for calculating ELO, following Arpad Elo:
http://www.fide.com/fide/handbook.html? ... ew=article
Arpad Elo assumed that, at any given time, every chess player has a normal (Gaussian) distribution of chess levels (i.e. ratings), all with the same standard deviation, sigma = 200 points, but each with a specific mean level. The distribution of the difference of two Gaussian-distributed variables with the same standard deviation is itself a Gaussian, whose mean is the difference of the two means and whose standard deviation is sqrt(2)*sigma.

s = (1+Erf[D_ELO/400])/2
D_ELO = 400 * InverseErf[2*s-1]
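A short Python sketch of the Gaussian conversion, using only the standard library. Since the difference of the two players' levels has standard deviation sqrt(2)*200, the expected score is the normal CDF evaluated at D_ELO, which reduces to (1+Erf[D_ELO/400])/2. The helper names are assumptions, and the inverse error function is built from `statistics.NormalDist.inv_cdf` because Python's `math` module has no `erfinv`.

```python
import math
from statistics import NormalDist

SIGMA = 200.0  # per-player standard deviation assumed by Arpad Elo

def gaussian_score(d_elo: float) -> float:
    """Expected score fraction s for a rating advantage d_elo (Gaussian model)."""
    return 0.5 * (1.0 + math.erf(d_elo / 400.0))

def inverse_erf(x: float) -> float:
    """Inverse error function for -1 < x < 1.

    erf(z) = 2*Phi(z*sqrt(2)) - 1, hence erfinv(x) = Phi^-1((1 + x)/2) / sqrt(2).
    """
    return NormalDist().inv_cdf((1.0 + x) / 2.0) / math.sqrt(2.0)

def gaussian_elo(s: float) -> float:
    """Rating difference implied by a score fraction s, with 0 < s < 1."""
    return 400.0 * inverse_erf(2.0 * s - 1.0)

def gaussian_score_via_cdf(d_elo: float) -> float:
    """Same score, read directly off the distribution of the level difference."""
    return NormalDist(mu=0.0, sigma=math.sqrt(2.0) * SIGMA).cdf(d_elo)

print(gaussian_score(100.0), gaussian_score_via_cdf(100.0))
```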

2)
What started as a game of comparing two engines of widely different strength turned into something more serious. I took Houdini 1.5a 64 bit, Houdini 1.5a 32 bit, and the SOS 5.1 engine (and the AnMon engine, which behaves similarly to SOS, but I will talk mainly about SOS 5.1). I already know how H1.5a x64 compares with H1.5a x32 at the chosen time control (250ms per move) on my PC: the 64 bit version is 26% faster and 36 +/- 3 ELO points stronger than the 32 bit one under these conditions. I ran H1.5a x64 against SOS 5.1 for 10,000 games at 250ms per move:

```
    Program                            Score

1 Houdini 1.5a 64                : 9732.5/10000
2 SOS 5.1                        :  267.5/10000
```
Then I ran H1.5a x32 against SOS 5.1:

```
    Program                            Score

1 Houdini 1.5a 32                : 9638.0/10000
2 SOS 5.1                        :  362.0/10000
```
The prediction of the Logistic Model is that the difference between H15a x64 and H15a x32 is
-400/Log[10]* Log[1/0.97325-1]+400/Log[10]* Log[1/0.9638-1] = 54.2 ELO points
The prediction of the Gaussian Model is that the difference is
400 * InverseErf[-1+2*0.97325]-400 * InverseErf[-1+2*0.9638] = 38.0 ELO points

The real difference, measured directly, is 36 +/- 3 points in both ELO models, and it is predicted well by the Gaussian Model. The Logistic Model is completely off. The same happened with another engine, AnMon, and it seems the Gaussian Model (Arpad Elo's) is a better predictor for engine ratings spanning large ELO differences. A larger study including many engines would be nice, using something like the CCRL or CEGT database to verify this. To be statistically consistent, one needs either a few engines with many games played or many engines with fewer games played.
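The two predictions can be reproduced with a few lines of Python. This is only a sketch: the function names are assumptions, and the inverse error function is derived from the standard library's normal quantile function. The results should land close to the 54.2 and 38.0 points quoted above.

```python
import math
from statistics import NormalDist

def logistic_elo(s: float) -> float:
    """D_ELO under the logistic model for a score fraction s."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def gaussian_elo(s: float) -> float:
    """D_ELO under the Gaussian model: 400 * InverseErf[2*s - 1]."""
    inverse_erf = NormalDist().inv_cdf(s) / math.sqrt(2.0)  # erfinv(2s - 1)
    return 400.0 * inverse_erf

# Score fractions of each Houdini build against SOS 5.1 (10,000 games each).
s_x64 = 9732.5 / 10000
s_x32 = 9638.0 / 10000

logistic_gap = logistic_elo(s_x64) - logistic_elo(s_x32)
gaussian_gap = gaussian_elo(s_x64) - gaussian_elo(s_x32)

print(f"Logistic prediction x64 - x32: {logistic_gap:.1f} ELO")
print(f"Gaussian prediction x64 - x32: {gaussian_gap:.1f} ELO")
```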

3)
Some properties of the Gaussian Model compared to the Logistic one:
The ratio of D_ELO Logistic over D_ELO Gaussian:
(-400/Log[10] * Log[1/s-1]) / (400 * InverseErf[2*s-1])

The ratio of the derivatives, Logistic/Gaussian:
(400/((-1 + 1/s) s^2 Log[10])) / (400 E^(InverseErf[-1 + 2 s]^2) Sqrt[Pi])

As can be seen, on the tails the ELO differences given by the two models can be off by 50%, so knowing which model to use is important for large ELO differences. As things stand in engine ratings, the ratings are probably inflated, and a comparison across a wide range gives bad predictions under the Logistic Model.
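To get a feel for how the two ratios behave without the plots, they can be tabulated numerically. The sketch below is an illustration under assumed helper names; the sample score fractions are arbitrary.

```python
import math
from statistics import NormalDist

def inverse_erf(x: float) -> float:
    """erfinv built from the normal quantile function (stdlib only)."""
    return NormalDist().inv_cdf((1.0 + x) / 2.0) / math.sqrt(2.0)

def logistic_elo(s: float) -> float:
    return -400.0 * math.log10(1.0 / s - 1.0)

def gaussian_elo(s: float) -> float:
    return 400.0 * inverse_erf(2.0 * s - 1.0)

def logistic_slope(s: float) -> float:
    """d(D_ELO)/ds for the logistic model: 400 / (s (1 - s) ln 10)."""
    return 400.0 / (s * (1.0 - s) * math.log(10.0))

def gaussian_slope(s: float) -> float:
    """d(D_ELO)/ds for the Gaussian model: 400 sqrt(pi) exp(erfinv(2s - 1)^2)."""
    z = inverse_erf(2.0 * s - 1.0)
    return 400.0 * math.sqrt(math.pi) * math.exp(z * z)

print("    s   D_log/D_gauss   slope_log/slope_gauss")
for s in (0.55, 0.75, 0.90, 0.97, 0.99):
    ratio_d = logistic_elo(s) / gaussian_elo(s)
    ratio_slope = logistic_slope(s) / gaussian_slope(s)
    print(f"{s:5.2f}   {ratio_d:13.3f}   {ratio_slope:21.3f}")
```

The slope ratio is the relevant one when two strong engines are compared through a common weak opponent, since a small change in s is then converted into an ELO gap.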

For close matches (s around 1/2), a first-order Taylor series expansion of each formula gives

Logistic: D_ELO = 1600/Log[10] * (s-1/2) ~ 694.9*(s-1/2)
Gaussian: D_ELO = 400*Sqrt[Pi] * (s-1/2) ~ 709.0*(s-1/2)

So, for small ELO differences, the model is not that important.
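For reference, the two first-order coefficients can be derived as follows. This is a worked sketch that uses only the formulas given earlier and the standard derivative of the inverse error function; both D_ELO values vanish at s = 1/2, so only the slopes matter.

```latex
\begin{align*}
D_{\mathrm{log}}(s) &= -\frac{400}{\ln 10}\,\ln\!\left(\frac{1}{s}-1\right),
&
\frac{dD_{\mathrm{log}}}{ds} &= \frac{400}{\ln 10}\,\frac{1}{s(1-s)}
\;\xrightarrow{\;s=1/2\;}\; \frac{1600}{\ln 10} \approx 694.9,
\\[4pt]
D_{\mathrm{gauss}}(s) &= 400\,\operatorname{erf}^{-1}(2s-1),
&
\frac{dD_{\mathrm{gauss}}}{ds} &= 400\sqrt{\pi}\,
\exp\!\left[\left(\operatorname{erf}^{-1}(2s-1)\right)^{2}\right]
\;\xrightarrow{\;s=1/2\;}\; 400\sqrt{\pi} \approx 709.0 .
\end{align*}
```

The two slopes differ by about 2%, which is why the choice of model hardly matters near s = 1/2.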

Eelco de Groot
Posts: 4278
Joined: Sun Mar 12, 2006 1:40 am
Location: Groningen

### Re: Houdini, much weaker engines, and Arpad Elo

I had no idea that computer chess rating lists, or at least some of them, use a logistic rating curve, Kai! Are you sure about that? If some of them use BayesElo instead, is that closer to Gaussian or logistic, or is the difference there more in the scaling of small rating differences, as seems to be the case in SPRT?

I think that most national human rating lists will follow FIDE and use Gaussian, so I'm very surprised that computer rating lists apparently took a different turn. Did they maybe all follow the SSDF? I very vaguely remember seeing a logistic kind of formula used there, but that was just some post years ago, on the old CCC forum I think.

Eelco

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Houdini, much weaker engines, and Arpad Elo

Eelco de Groot wrote:I had no idea that computer chess rating lists, or at least some of them, use a logistic rating curve, Kai! Are you sure about that? If some of them use BayesElo instead, is that closer to Gaussian or logistic, or is the difference there more in the scaling of small rating differences, as seems to be the case in SPRT?

I think that most national human rating lists will follow FIDE and use Gaussian, so I'm very surprised that computer rating lists apparently took a different turn. Did they maybe all follow the SSDF? I very vaguely remember seeing a logistic kind of formula used there, but that was just some post years ago, on the old CCC forum I think.

Eelco
All three, ELOStat, BayesElo and Ordo, use logistic curves, and therefore so do all the rating lists. Not a single one in computer chess uses Gaussian, but I may have missed some.

Milos
Posts: 3988
Joined: Wed Nov 25, 2009 12:47 am

### Re: Houdini, much weaker engines, and Arpad Elo

That is interesting; however, it doesn't prove that the logistic curve is bad per se.
The problem is Houdini's contempt (again). It exaggerates the strength against weaker engines, so even though both versions have their strength difference against weak engines exaggerated, for x64 it is more pronounced.
So it would be really interesting if you could repeat the test, this time with Houdini's contempt set to 0.

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Houdini, much weaker engines, and Arpad Elo

Milos wrote:That is interesting, however it doesn't prove that logistic curve is bad per se.
Problem is in Houdini contempt (again). It exaggerates the strength against weaker engines, therefore even though both versions have exaggerated strength difference against weak engines x64 is more pronounced.
So it would really be interesting if you could repeat the test this time with contempt 0 of Houdini.
It doesn't matter. FIDE rules are for humans, who have contempt and whatever else. The important thing in Elo's derivation is to assume a normal distribution of rating and an equal standard deviation for all players, and that is the same for the x32 and x64 engines.

Milos
Posts: 3988
Joined: Wed Nov 25, 2009 12:47 am

### Re: Houdini, much weaker engines, and Arpad Elo

Laskos wrote:It doesn't matter. FIDE rules are for humans, who have contempt and whatever. The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players, and it's the same for x32 and x64 engines.
I know it doesn't matter for the normal distribution. I'm talking about the logistic; there it has an impact.

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Houdini, much weaker engines, and Arpad Elo

Milos wrote:I know it doesn't matter for normal distribution, I'm talking about logistic, there it has impact.
If the predictions of the Gaussian model are correct, and they contradict the Logistic model, there is no way the Logistic model works well over this ELO span.

I post here a plot of D_ELO Gaussian vs. D_ELO Logistic; the curved line is the Gaussian model. The divergence at the tails is fairly pronounced.

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Houdini, much weaker engines, and Arpad Elo

I tested Stockfish DD x64 and Stockfish DD x32 against SOS 5.1 (250ms/move):

```
    Program                            Score       %

1 SF DD 64                       : 4858.0/5000  97.2
2 SF DD 32                       : 4801.0/5000  96.0
3 SOS 5.1                        : 341.0/10000   3.4
```
The real difference between SF DD x64 and SF DD x32 is 42 +/- 5 ELO points at this TC (Gaussian or Logistic).

The predictions of the two models based on this match are:
D_ELO Gaussian: 43 points
D_ELO Logistic: 61 points

Again, the Gaussian model is a better predictor on the tails, for large ELO differences.
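The same check can be run on the Stockfish DD numbers and compared against the directly measured 42 +/- 5 points. As before, this is a sketch with assumed function and variable names, relying only on the Python standard library.

```python
import math
from statistics import NormalDist

def logistic_elo(s: float) -> float:
    """D_ELO under the logistic model for a score fraction s."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def gaussian_elo(s: float) -> float:
    """D_ELO under the Gaussian model: 400 * InverseErf[2*s - 1]."""
    return 400.0 * NormalDist().inv_cdf(s) / math.sqrt(2.0)

# Score fractions against SOS 5.1 (5,000 games per Stockfish build).
s_x64 = 4858.0 / 5000
s_x32 = 4801.0 / 5000

# Directly measured x64 - x32 difference quoted in the post.
measured, margin = 42.0, 5.0

predictions = {
    "Gaussian": gaussian_elo(s_x64) - gaussian_elo(s_x32),
    "Logistic": logistic_elo(s_x64) - logistic_elo(s_x32),
}

for name, gap in predictions.items():
    inside = abs(gap - measured) <= margin
    print(f"{name}: {gap:.0f} ELO, within 42 +/- 5: {inside}")
```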

mohzus
Posts: 106
Joined: Tue Sep 24, 2013 12:54 am

### Re: Houdini, much weaker engines, and Arpad Elo

Laskos wrote:The important thing in Elo's derivation is to assume normal distribution of rating and equal standard deviation of players
There's something I don't really understand about "normal distribution of rating". Does this mean that the "strength" of a single person is assumed to follow a normal distribution with standard deviation 200? Or does it mean that the histogram of all players (number of people with a given Elo vs. Elo) follows a normal distribution? Because the latter doesn't seem true for the Elo rating system. I downloaded and plotted the ratings of all active FIDE players for 2 months of 2013 (a sample of over 150k people), and the result wasn't a Gaussian but more like a reversed Maxwell speed distribution, with the tail on the lower ratings.
It sounds very strange to me to assume that the standard deviation in strength is constant, and the same constant for every single player.

I've also plotted a histogram of FICS ratings calculated with BayesElo rather than Glicko or Elo; the plot was very close to what I'd call a Gaussian, or a logistic distribution (honestly I can't distinguish between the 2 by looking at a graph).

Wikipedia states that
Wikipedia wrote:Subsequent statistical tests have suggested that chess performance is almost certainly not distributed as a normal distribution, as weaker players have significantly (but not highly significantly) greater winning chances than Elo's model predicts.[citation needed] Therefore, the USCF and some chess sites use a formula based on the logistic distribution. Significant statistical anomalies have also been found when using the logistic distribution in chess.[5] FIDE continues to use the normal distribution. The normal and logistic distribution points are, in a way, arbitrary points in a spectrum of distributions which would work well. In practice, both of these distributions work very well for a number of different games.

Rebel
Posts: 5786
Joined: Thu Aug 18, 2011 10:04 am

### Re: Houdini, much weaker engines, and Arpad Elo

Laskos wrote:I tested Stockfish DD x64 and Stockfish DD x32 against SOS 5.1 (250ms/move):

```
    Program                            Score       %

1 SF DD 64                       : 4858.0/5000  97.2
2 SF DD 32                       : 4801.0/5000  96.0
3 SOS 5.1                        : 341.0/10000   3.4
```
The real difference between SF DD x64 and SF DD x32 is 42 +/- 5 ELO points at this TC (Gaussian or Logistic).

The predictions of the two models based on this match are:
D_ELO Gaussian: 43 points
D_ELO Logistic: 61 points

Again, the Gaussian model is a better predictor on the tails, for large ELO differences.
I am a layman when it comes to the refinements of Elo calculations, so allow me a couple of stupid questions.

1. Do these Elo differences between Gaussian and Logistic disappear when only engines of (about) equal strength play against each other?

2. Rating lists tend to operate in pools of similar strength. Kind of Elo incest. What happens to the Elo of high-rated engines if they are placed into a pool of (much) weaker engines? Would their Elo drop?