Empirically Logistic ELO model better suited than Gaussian

Laskos · Post by **Laskos** » Tue Jul 12, 2016 8:37 am

To get accurate results, I ran a large number of games (105,000 in total) in a round-robin at fixed nodes between different engines such as Stockfish, Texel, Andscacs, etc. The engines were spaced on the order of 200 ELO points apart, so that each individual ELO interval between neighbours is almost linear in score and nearly independent of the ELO model. The largest total difference between engines was on the order of 1400 ELO points; I needed large differences because that is where the ELO models diverge. For each individual match I computed the total Logistic ELO difference over the large ELO interval; this is the horizontal axis. The consistent ELO is then the sum of the small differences between engines, accumulated to give the total difference. If the Logistic model is consistent, these two should be equal, and the diagonal from (0,0) to (1400,1400) would be the fit. If the Gaussian or some other model were more consistent, the dots would deviate from the diagonal. They do not deviate much. The Gaussian model seems ruled out, and the Logistic ELO model for computer chess engines holds up well in this test. My earlier results were mixed because of fewer data points and fewer games per data point.

The data:

Individual statistics:

1 SF2                       : 2381  35000 (+32134,=1275,-1591), 93.6 %

T1                            : 7000 (+6950,= 44,-  6), 99.6 %
Ha1                           : 7000 (+6996,=  4,-  0), 100.0 %
T2                            : 7000 (+4574,=969,-1457), 72.3 %
R2                            : 7000 (+6625,=248,-127), 96.4 %
R1                            : 7000 (+6989,= 10,-  1), 99.9 %

2 T2                        : 2232  35000 (+28760,=1323,-4917), 84.1 %

SF2                           : 7000 (+1457,=969,-4574), 27.7 %
T1                            : 7000 (+6968,= 29,-  3), 99.8 %
Ha1                           : 7000 (+6991,=  8,-  1), 99.9 %
R2                            : 7000 (+6355,=308,-337), 93.0 %
R1                            : 7000 (+6989,=  9,-  2), 99.9 %

3 R2                        : 2051  35000 (+20528,=1016,-13456), 60.1 %

SF2                           : 7000 (+127,=248,-6625),  3.6 %
T1                            : 7000 (+6302,=332,-366), 92.4 %
Ha1                           : 7000 (+6910,= 45,- 45), 99.0 %
T2                            : 7000 (+337,=308,-6355),  7.0 %
R1                            : 7000 (+6852,= 83,- 65), 98.5 %

4 T1                        : 1898  35000 (+11060,=1952,-21988), 34.4 %

SF2                           : 7000 (+  6,= 44,-6950),  0.4 %
Ha1                           : 7000 (+5750,=554,-696), 86.1 %
T2                            : 7000 (+  3,= 29,-6968),  0.2 %
R2                            : 7000 (+366,=332,-6302),  7.6 %
R1                            : 7000 (+4935,=993,-1072), 77.6 %

5 R1                        : 1778  35000 (+5667,=1666,-27667), 18.6 %

SF2                           : 7000 (+  1,= 10,-6989),  0.1 %
T1                            : 7000 (+1072,=993,-4935), 22.4 %
Ha1                           : 7000 (+4527,=571,-1902), 68.8 %
T2                            : 7000 (+  2,=  9,-6989),  0.1 %
R2                            : 7000 (+ 65,= 83,-6852),  1.5 %

6 Ha1                       : 1661  35000 (+2644,=1182,-31174),  9.2 %

SF2                           : 7000 (+  0,=  4,-6996),  0.0 %
T1                            : 7000 (+696,=554,-5750), 13.9 %
T2                            : 7000 (+  1,=  8,-6991),  0.1 %
R2                            : 7000 (+ 45,= 45,-6910),  1.0 %
R1                            : 7000 (+1902,=571,-4527), 31.2 %
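The consistency check described in this post can be sketched roughly as follows. This is only an illustration, not the script actually used: it assumes the "small" intervals are the five matches between neighbouring engines in the listing, takes SF2 vs Ha1 as the single "large" interval, and assumes the Gaussian model has a standard deviation of 200·√2 ELO for the rating difference. The function names are made up for the example.

```python
import math
from statistics import NormalDist


def score(wins: int, draws: int, losses: int) -> float:
    """Fraction of points scored, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def logistic_elo(s: float) -> float:
    """ELO difference implied by score s under the logistic model."""
    return 400.0 * math.log10(s / (1.0 - s))


def gaussian_elo(s: float) -> float:
    """ELO difference implied by score s under a normal model.

    Assumption: sigma of the rating difference is 200 * sqrt(2).
    """
    return 200.0 * math.sqrt(2.0) * NormalDist().inv_cdf(s)


# (wins, draws, losses) of the stronger engine, copied from the listing.
adjacent = {
    "SF2-T2": (4574, 969, 1457),
    "T2-R2": (6355, 308, 337),
    "R2-T1": (6302, 332, 366),
    "T1-R1": (4935, 993, 1072),
    "R1-Ha1": (4527, 571, 1902),
}
direct = (6996, 4, 0)  # SF2 vs Ha1, the largest interval

results = {}
for name, model in (("logistic", logistic_elo), ("gaussian", gaussian_elo)):
    direct_elo = model(score(*direct))
    summed_elo = sum(model(score(*wdl)) for wdl in adjacent.values())
    results[name] = (direct_elo, summed_elo)
    print(f"{name:9s} direct = {direct_elo:7.1f}   summed = {summed_elo:7.1f}")
```

For a self-consistent model the direct and summed values should nearly agree; the larger the gap, the worse that model fits these results.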

The plot:

Laskos · Post by **Laskos** » Wed Jul 13, 2016 7:29 am

These matches between engines of wildly differing strength are also good for seeing how rating tools behave over a wide ELO span. Here are the ratings from the three rating tools used in computer chess: ELOStat, BayesELO and Ordo.

ELOStat:

    Program                            Score     %    Av.Op.  Elo    +   -    Draws

  1 SF2                            : 32771.5/35000  93.6   1914   2381    7   7    3.6 %
  2 T2                             : 29421.5/35000  84.1   1943   2232    5   5    3.8 %
  3 R2                             : 21036.0/35000  60.1   1980   2051    4   4    2.9 %
  4 T1                             : 12036.0/35000  34.4   2010   1898    4   4    5.6 %
  5 R1                             :  6500.0/35000  18.6   2034   1778    4   5    4.8 %
  6 Ha1                            :  3235.0/35000   9.2   2058   1661    6   6    3.4 %

BayesELO (default):

Rank Name   Elo    +    - games score oppo. draws
   1 SF2    691    7    7 35000   94%  -138    4%
   2 T2     553    6    6 35000   84%  -111    4%
   3 R2     143    7    7 35000   60%   -29    3%
   4 T1    -293    6    6 35000   34%    59    6%
   5 R1    -486    6    5 35000   19%    97    5%
   6 Ha1   -607    6    6 35000    9%   121    3%

Ordo:

   # PLAYER    : RATING  ERROR   POINTS  PLAYED    (%)
   1 SF2       : 2745.8    7.7  32771.5   35000   93.6%
   2 T2        : 2585.9    7.3  29421.5   35000   84.1%
   3 R2        : 2145.7    5.7  21036.0   35000   60.1%
   4 T1        : 1689.5    5.7  12036.0   35000   34.4%
   5 R1        : 1479.9    6.0   6500.0   35000   18.6%
   6 Ha1       : 1353.1    6.3   3235.0   35000    9.2%

The consistency plots, where a perfect logistic rating lies on the diagonal line, show that ELOStat ratings over large ELO spans are heavily and non-linearly distorted. BayesELO is off only by a small scaling factor and otherwise behaves well over large ELO spans. Ordo behaves well, without even a scaling issue.
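As a rough numerical companion to the three tables, the snippet below computes the total span (SF2 minus Ha1) and the gaps between neighbouring engines for each tool. It is plain arithmetic on the ratings quoted above; the variable and function names are made up for the example.

```python
# Ratings copied from the ELOStat, BayesELO and Ordo tables.
order = ["SF2", "T2", "R2", "T1", "R1", "Ha1"]
ratings = {
    "ELOStat": dict(zip(order, [2381, 2232, 2051, 1898, 1778, 1661])),
    "BayesELO": dict(zip(order, [691, 553, 143, -293, -486, -607])),
    "Ordo": dict(zip(order, [2745.8, 2585.9, 2145.7, 1689.5, 1479.9, 1353.1])),
}


def gaps(table):
    """Rating difference between each engine and the next one down."""
    return [table[a] - table[b] for a, b in zip(order, order[1:])]


def span(table):
    """Rating difference between the strongest and the weakest engine."""
    return table[order[0]] - table[order[-1]]


for tool, table in ratings.items():
    print(f"{tool:8s} span = {span(table):7.1f}   gaps = "
          + ", ".join(f"{g:6.1f}" for g in gaps(table)))

# Ratio of the BayesELO span to the Ordo span, a crude look at the scale factor.
print(f"BayesELO / Ordo span ratio = {span(ratings['BayesELO']) / span(ratings['Ordo']):.3f}")
```

The spans come out as 720 (ELOStat), 1298 (BayesELO) and 1392.7 (Ordo), so ELOStat compresses the range to roughly half of what the other two report.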

JJJ · Post by **JJJ** » Wed Jul 13, 2016 3:52 pm

Laskos wrote: These matches between engines of wildly differing strength are also good for seeing how rating tools behave over a wide ELO span. Here are the ratings from the three rating tools used in computer chess: ELOStat, BayesELO and Ordo.

The consistency plots, where a perfect logistic rating lies on the diagonal line, show that ELOStat ratings over large ELO spans are heavily and non-linearly distorted. BayesELO is off only by a small scaling factor and otherwise behaves well over large ELO spans. Ordo behaves well, without even a scaling issue.

Both methods seem accurate.

Laskos · Post by **Laskos** » Wed Jul 13, 2016 4:41 pm

JJJ wrote:
Both methods seem accurate.

Yes, both BayesELO and Ordo give fairly accurate, logistic-invertible results for large ELO spans, with BayesELO showing a 5-10% scale issue here. It is only 5-10% because the drawelo here is small; the scale issue is tied to drawelo and could grow for games at longer time controls. This was a known issue and is only one of scale (BayesELO has a scale parameter). ELOStat, on the other hand, is completely off the mark for large ELO spans: it gives good results for very small ELO differences, but departs completely from the logistic for larger ones.
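To illustrate how a drawelo parameter can produce a pure scale effect, here is a minimal sketch. It assumes a Rao-Kupper-style model in which win and loss probabilities are logistic in (±difference − drawelo), which is my understanding of the BayesELO default, and it ignores the White advantage. The scale factor shown is the ratio of slopes of the expected-score curve at zero rating difference, derived from that assumed model; it is not taken from BayesELO's source.

```python
def logistic(x: float) -> float:
    """Logistic expected score for a rating difference of x ELO."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))


def expected_score(delta: float, draw_elo: float) -> float:
    """Expected score in the assumed draw model (no White advantage)."""
    p_win = logistic(delta - draw_elo)
    p_loss = logistic(-delta - draw_elo)
    p_draw = 1.0 - p_win - p_loss
    return p_win + 0.5 * p_draw


def scale_factor(draw_elo: float) -> float:
    """Slope of expected_score at delta = 0, relative to the pure logistic.

    Derived as 4 * L(-d) * (1 - L(-d)) with L the logistic function,
    which simplifies to 4x / (1 + x)^2 for x = 10^(-d / 400).
    """
    x = 10.0 ** (-draw_elo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2


# Hypothetical drawelo values, only to show the trend.
for d in (0.0, 50.0, 100.0, 200.0):
    print(f"drawelo = {d:5.0f}   scale near zero difference = {scale_factor(d):.3f}")
```

With drawelo = 0 the factor is exactly 1, and it shrinks as drawelo grows, which matches the remark that the scale issue is small when drawelo is small.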

Laskos · Post by **Laskos** » Thu Jul 14, 2016 8:22 pm

I also tried to test the draw model, Davidson versus Rao-Kupper. The result seems to favor Davidson, but it is basically inconclusive: there are too few data points, and no significance test passes.
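One crude way to look at the two draw models, sketched here as an assumption rather than as the test actually run: under Davidson the quantity D²/(W·L) is the same for every pairing, while under Rao-Kupper D·N/(W·L) is the same for every pairing (W, D, L are win, draw and loss counts, N their sum). Both invariants follow directly from the model formulas in the code. The pairing subset (only those with at least 40 losses for the stronger side) and the use of relative spread as a yardstick are arbitrary choices for illustration, not a significance test.

```python
from statistics import mean, pstdev


def davidson_probs(pi_i: float, pi_j: float, nu: float):
    """Win/draw/loss probabilities for player i under the Davidson model."""
    tie = nu * (pi_i * pi_j) ** 0.5
    z = pi_i + pi_j + tie
    return pi_i / z, tie / z, pi_j / z


def rao_kupper_probs(pi_i: float, pi_j: float, theta: float):
    """Win/draw/loss probabilities for player i under the Rao-Kupper model."""
    win = pi_i / (pi_i + theta * pi_j)
    loss = pi_j / (pi_j + theta * pi_i)
    return win, 1.0 - win - loss, loss


def davidson_stat(w: float, d: float, l: float) -> float:
    """D^2 / (W * L); equals nu^2 for every pairing if Davidson holds."""
    return d * d / (w * l)


def rao_kupper_stat(w: float, d: float, l: float) -> float:
    """D * N / (W * L); equals theta^2 - 1 for every pairing if Rao-Kupper holds."""
    return d * (w + d + l) / (w * l)


def relative_spread(values) -> float:
    """Standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


# (wins, draws, losses) of the stronger engine, copied from the listing.
pairings = [
    (4574, 969, 1457), (6625, 248, 127), (6355, 308, 337),
    (6302, 332, 366), (6910, 45, 45), (6852, 83, 65),
    (5750, 554, 696), (4935, 993, 1072), (4527, 571, 1902),
]

print("Davidson   relative spread:", relative_spread([davidson_stat(*p) for p in pairings]))
print("Rao-Kupper relative spread:", relative_spread([rao_kupper_stat(*p) for p in pairings]))
```

A smaller spread hints at a better-fitting draw model, but with so few pairings, and with different engine pairs having different draw tendencies, this cannot settle the question.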

lkaufman · Post by **lkaufman** » Thu Jul 14, 2016 9:04 pm

I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.

Uri Blass · Post by **Uri Blass** » Thu Jul 14, 2016 9:23 pm

lkaufman wrote:I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.

I do not think that there is a single elo value for a given handicap, because the elo difference is not constant.

A player rated 2000 may beat a player rated 1000 while giving a queen handicap, but nobody is going to beat a player rated 2000 while giving a queen handicap.

I also think it may be interesting to have programs that base their search on the assumption that the opponent is going to go wrong.

For example, you can make a version of Komodo that is anti-Komodo at small depth: if the remaining depth is more than 10 and it is the opponent's turn to move, you simply prune all of the opponent's moves in the search except the move that Komodo plays at depth 10.

That way you get Komodo with an anti-depth-10 style, and it may be interesting to see, as a function of n, the maximal handicap that Komodo with anti-depth-n style can give to Komodo and still win.

I expect that maximal handicap to be significantly bigger than the maximal handicap that normal Komodo can give to Komodo at depth n.
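The pruning idea above can be sketched on a toy game. This is only an illustration under stated assumptions: the game is a single-pile subtraction game (take 1 to 3 stones, taking the last stone wins) instead of chess, the static evaluation knows nothing except a finished game, and the names `negamax` and `anti_search` are made up. At opponent nodes with more than `model_depth` plies remaining, only the move that a plain search of depth `model_depth` would choose is kept.

```python
from typing import List, Optional, Tuple


def moves(stones: int) -> List[int]:
    """Legal moves: take 1, 2 or 3 stones, never more than are left."""
    return [m for m in (1, 2, 3) if m <= stones]


def evaluate(stones: int) -> int:
    """Score for the side to move: -1 if the pile is empty (the other
    side took the last stone and won), otherwise 0 for 'unknown'."""
    return -1 if stones == 0 else 0


def negamax(stones: int, depth: int) -> Tuple[int, Optional[int]]:
    """Plain fixed-depth negamax; returns (value, best move)."""
    legal = moves(stones)
    if depth == 0 or not legal:
        return evaluate(stones), None
    best_val, best_move = -2, None
    for m in legal:
        val = -negamax(stones - m, depth - 1)[0]
        if val > best_val:
            best_val, best_move = val, m
    return best_val, best_move


def anti_search(stones: int, depth: int, model_depth: int,
                our_turn: bool = True) -> Tuple[int, Optional[int]]:
    """Negamax that assumes the opponent plays like a depth-`model_depth`
    searcher whenever more than `model_depth` plies remain.
    Requires model_depth >= 1 so the shallow search returns a move."""
    legal = moves(stones)
    if depth == 0 or not legal:
        return evaluate(stones), None
    if not our_turn and depth > model_depth:
        # Prune every opponent move except the shallow search's choice.
        legal = [negamax(stones, model_depth)[1]]
    best_val, best_move = -2, None
    for m in legal:
        val = -anti_search(stones - m, depth - 1, model_depth, not our_turn)[0]
        if val > best_val:
            best_val, best_move = val, m
    return best_val, best_move


# 8 stones is a lost position against perfect play ...
print("normal search :", negamax(8, 8))
# ... but looks winnable if the opponent is assumed to search only 1 ply.
print("anti-depth-1  :", anti_search(8, 8, 1))
```

On this toy game the normal search scores the 8-stone position as lost, while the anti-depth-1 search scores it as won, because it counts on the shallow opponent going wrong. Whether anything similar holds for Komodo at depth 10 is exactly the open question raised in the post.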

fern · Post by **fern** » Thu Jul 14, 2016 9:33 pm

mY gOD, kAI, YOU ARE A BLOODY GENIUS!!!

fERN

Laskos · Post by **Laskos** » Thu Jul 14, 2016 10:10 pm

lkaufman wrote:I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.

I don't think so, as I usually equalized strength by time odds, and the time-odds ELO values were derived using small intervals. We did have a discussion when you found a very large ELO difference using a direct handicap match at equal TC. I surmised that the large ELO value came from computing it with the logistic, while the "true ELO" might obey the Gaussian.

Then there is the scaling of the handicap, as Uri noted. We knew that: Knight odds are maybe 1400 ELO points against a 2000 FM, but might be 2000+ ELO points against a 2500 GM.

Laskos · Post by **Laskos** » Thu Jul 14, 2016 10:21 pm

fern wrote:mY gOD, kAI, YOU ARE A BLOODY GENIUS!!!

fERN

Hello Fern,

Gaius Mucius Scaevola regards!
