Empirically Logistic ELO model better suited than Gaussian

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Empirically Logistic ELO model better suited than Gaussian

Post by Laskos » Tue Jul 12, 2016 6:37 am

For accuracy, I ran a large number of games (105,000 in total) in a round-robin at fixed nodes between different engines such as Stockfish, Texel, Andscacs, etc. The engines were spaced on the order of 200 ELO points apart, so that each individual ELO interval between neighbours is almost linear in score and practically independent of the ELO model. The largest total difference between engines was on the order of 1400 ELO points. I needed large differences because the ELO models only diverge noticeably at large ELO differences.

For each individual match I computed the total Logistic ELO difference directly, including over the large ELO intervals. This is the horizontal axis. The consistent ELO, plotted against it, is the sum of the small differences between engines, accumulated to give the total difference. If the Logistic model is consistent, these two should be equal, and the diagonal from (0,0) to (1400,1400) would be the fit. If the Gaussian or another model is more consistent, the dots should deviate from the diagonal. They do not deviate much. The Gaussian model seems ruled out, and the Logistic ELO model for computer chess engines seems to hold up well in this test. My earlier results were mixed because of fewer data points and fewer games per data point.

The data:

Individual statistics:

1 SF2                       : 2381  35000 (+32134,=1275,-1591), 93.6 %

T1                            : 7000 (+6950,= 44,-  6), 99.6 %
Ha1                           : 7000 (+6996,=  4,-  0), 100.0 %
T2                            : 7000 (+4574,=969,-1457), 72.3 %
R2                            : 7000 (+6625,=248,-127), 96.4 %
R1                            : 7000 (+6989,= 10,-  1), 99.9 %

2 T2                        : 2232  35000 (+28760,=1323,-4917), 84.1 %

SF2                           : 7000 (+1457,=969,-4574), 27.7 %
T1                            : 7000 (+6968,= 29,-  3), 99.8 %
Ha1                           : 7000 (+6991,=  8,-  1), 99.9 %
R2                            : 7000 (+6355,=308,-337), 93.0 %
R1                            : 7000 (+6989,=  9,-  2), 99.9 %

3 R2                        : 2051  35000 (+20528,=1016,-13456), 60.1 %

SF2                           : 7000 (+127,=248,-6625),  3.6 %
T1                            : 7000 (+6302,=332,-366), 92.4 %
Ha1                           : 7000 (+6910,= 45,- 45), 99.0 %
T2                            : 7000 (+337,=308,-6355),  7.0 %
R1                            : 7000 (+6852,= 83,- 65), 98.5 %

4 T1                        : 1898  35000 (+11060,=1952,-21988), 34.4 %

SF2                           : 7000 (+  6,= 44,-6950),  0.4 %
Ha1                           : 7000 (+5750,=554,-696), 86.1 %
T2                            : 7000 (+  3,= 29,-6968),  0.2 %
R2                            : 7000 (+366,=332,-6302),  7.6 %
R1                            : 7000 (+4935,=993,-1072), 77.6 %

5 R1                        : 1778  35000 (+5667,=1666,-27667), 18.6 %

SF2                           : 7000 (+  1,= 10,-6989),  0.1 %
T1                            : 7000 (+1072,=993,-4935), 22.4 %
Ha1                           : 7000 (+4527,=571,-1902), 68.8 %
T2                            : 7000 (+  2,=  9,-6989),  0.1 %
R2                            : 7000 (+ 65,= 83,-6852),  1.5 %

6 Ha1                       : 1661  35000 (+2644,=1182,-31174),  9.2 %

SF2                           : 7000 (+  0,=  4,-6996),  0.0 %
T1                            : 7000 (+696,=554,-5750), 13.9 %
T2                            : 7000 (+  1,=  8,-6991),  0.1 %
R2                            : 7000 (+ 45,= 45,-6910),  1.0 %
R1                            : 7000 (+1902,=571,-4527), 31.2 %
The plot:

Image
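
For readers who want to redo the arithmetic, here is a minimal Python sketch of the consistency check described above. It is not the script used to produce the plot. The W/D/L counts are copied from the data listing, the function names are made up for illustration, and the Gaussian conversion assumes the classic normal model in which the performance difference has a standard deviation of 200*sqrt(2). Note that two of the neighbouring gaps in this chain are well above 200 ELO, so this is only a rough check.

```python
import math
from statistics import NormalDist


def score(wins, draws, losses):
    """Fractional score, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def logistic_elo(s):
    """ELO difference implied by score s under the logistic model."""
    return 400.0 * math.log10(s / (1.0 - s))


def gaussian_elo(s):
    """ELO difference implied by score s under a normal model.

    Assumption: the performance difference is normally distributed
    with standard deviation 200 * sqrt(2).
    """
    return 200.0 * math.sqrt(2.0) * NormalDist().inv_cdf(s)


# (wins, draws, losses) seen from the stronger engine, taken from the listing
ADJACENT = [
    (4574, 969, 1457),  # SF2 vs T2
    (6355, 308, 337),   # T2  vs R2
    (6302, 332, 366),   # R2  vs T1
    (4935, 993, 1072),  # T1  vs R1
    (4527, 571, 1902),  # R1  vs Ha1
]
DIRECT = (6996, 4, 0)   # SF2 vs Ha1, the widest single match


def chained(model):
    """Sum of the neighbour-to-neighbour differences under a model."""
    return sum(model(score(*match)) for match in ADJACENT)


def direct(model):
    """Difference computed from the single top-vs-bottom match."""
    return model(score(*DIRECT))


if __name__ == "__main__":
    for name, model in (("logistic", logistic_elo), ("gaussian", gaussian_elo)):
        print(f"{name:9s} direct = {direct(model):7.1f}   chained = {chained(model):7.1f}")
```

A model is additive if the chained and direct values agree. With these counts the two logistic numbers land close to each other, while the two Gaussian numbers are several hundred points apart.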

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Laskos » Wed Jul 13, 2016 5:29 am

These matches between engines of wildly differing strength are also a good way to see how rating tools behave over a wide ELO span. Here are the ratings from the three rating tools used in computer chess: ELOStat, BayesELO and Ordo.

ELOStat:

    Program                            Score     %    Av.Op.  Elo    +   -    Draws

  1 SF2                            : 32771.5/35000  93.6   1914   2381    7   7    3.6 %
  2 T2                             : 29421.5/35000  84.1   1943   2232    5   5    3.8 %
  3 R2                             : 21036.0/35000  60.1   1980   2051    4   4    2.9 %
  4 T1                             : 12036.0/35000  34.4   2010   1898    4   4    5.6 %
  5 R1                             :  6500.0/35000  18.6   2034   1778    4   5    4.8 %
  6 Ha1                            :  3235.0/35000   9.2   2058   1661    6   6    3.4 %
BayesELO (default):

Rank Name   Elo    +    - games score oppo. draws
   1 SF2    691    7    7 35000   94%  -138    4%
   2 T2     553    6    6 35000   84%  -111    4%
   3 R2     143    7    7 35000   60%   -29    3%
   4 T1    -293    6    6 35000   34%    59    6%
   5 R1    -486    6    5 35000   19%    97    5%
   6 Ha1   -607    6    6 35000    9%   121    3%
Ordo:

   # PLAYER    : RATING  ERROR   POINTS  PLAYED    (%)
   1 SF2       : 2745.8    7.7  32771.5   35000   93.6%
   2 T2        : 2585.9    7.3  29421.5   35000   84.1%
   3 R2        : 2145.7    5.7  21036.0   35000   60.1%
   4 T1        : 1689.5    5.7  12036.0   35000   34.4%
   5 R1        : 1479.9    6.0   6500.0   35000   18.6%
   6 Ha1       : 1353.1    6.3   3235.0   35000    9.2%

The consistency plots, where the perfect logistic rating is shown as the diagonal line, show that ELOStat ratings over large ELO spans are heavily and non-linearly distorted. BayesELO is off only by a small scale factor and otherwise behaves well over large ELO spans. Ordo behaves well, without even a scaling issue.

Image


Image


Image
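
A quick way to put numbers on the three tables above is to compare each tool's top-to-bottom rating span with the span obtained by summing logistic differences between neighbouring engines. The sketch below does only that. It is a rough illustration, not the method behind the plots: the ratings are copied from the three tables, the W/D/L counts from the first post, and the helper names are hypothetical.

```python
import math


def logistic_elo(wins, draws, losses):
    """Logistic ELO difference implied by a single match result."""
    s = (wins + 0.5 * draws) / (wins + draws + losses)
    return 400.0 * math.log10(s / (1.0 - s))


# Neighbouring matches, stronger engine first, from the first post
ADJACENT = [
    (4574, 969, 1457),  # SF2 vs T2
    (6355, 308, 337),   # T2  vs R2
    (6302, 332, 366),   # R2  vs T1
    (4935, 993, 1072),  # T1  vs R1
    (4527, 571, 1902),  # R1  vs Ha1
]

# Ratings in the order SF2, T2, R2, T1, R1, Ha1, from the tables above
RATINGS = {
    "ELOStat": [2381, 2232, 2051, 1898, 1778, 1661],
    "BayesELO": [691, 553, 143, -293, -486, -607],
    "Ordo": [2745.8, 2585.9, 2145.7, 1689.5, 1479.9, 1353.1],
}


def reference_span():
    """SF2-to-Ha1 span as a sum of neighbouring logistic differences."""
    return sum(logistic_elo(*match) for match in ADJACENT)


def span_ratio(tool):
    """Tool's SF2-to-Ha1 rating span divided by the reference span."""
    ratings = RATINGS[tool]
    return (ratings[0] - ratings[-1]) / reference_span()


if __name__ == "__main__":
    print(f"reference span: {reference_span():.1f}")
    for tool in RATINGS:
        print(f"{tool:9s} span ratio = {span_ratio(tool):.3f}")
```

A ratio near 1 means the tool reproduces the logistic span. A constant ratio below 1 across the whole range would be a pure scale issue, whereas a ratio that changes with the width of the interval indicates non-linear distortion. This single-span comparison cannot tell those two cases apart on its own.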

JJJ
Posts: 1285
Joined: Sat Apr 19, 2014 11:47 am

Re: Empirically Logistic ELO model better suited than Gaussi

Post by JJJ » Wed Jul 13, 2016 1:52 pm

Laskos wrote:These matches with wildly differing in strength engines are also good to see how rating tools behave on wide ELO span. Here are the ratings of the three rating tools used in computer chess, ELOStat, BayesELO and Ordo.

Both methods seem accurate.

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Laskos » Wed Jul 13, 2016 2:41 pm

JJJ wrote:
Both methods seem accurate.
Yes, both BayesELO and Ordo give pretty accurate logistic-invertible results for large ELO spans, with BayesELO having a 5-10% scale issue here. It is only 5-10% because the drawelo here is small. The scale issue is related to drawelo and could grow larger for games at longer TC. It was a known issue, and one of scale only (BayesELO has a scale parameter). ELOStat, on the other hand, is completely off the mark for large ELO spans: it gives good results for very small ELO differences, but departs completely from the logistic for larger ones.

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Laskos » Thu Jul 14, 2016 6:22 pm

I also tried to check the draw model, Davidson or Rao-Kupper. The result seems to indicate that Davidson is better suited, but it is basically inconclusive: there are too few data points, and no significance test can pass.

Image
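
For reference, here is a small Python sketch of the two draw models in their textbook form, with strengths written as 10^(rating/400). It is not necessarily how the plot above was made, and the parameter names `nu` and `theta` and the function names are chosen here for illustration. Each model has a simple quantity that stays constant whatever the rating difference: D^2/(W*L) for Davidson and D/(W*L) for Rao-Kupper, where W, D and L are the win, draw and loss probabilities.

```python
import math


def davidson(delta, nu):
    """Win/draw/loss probabilities under the Davidson model.

    delta is the rating difference in ELO; nu >= 0 controls the draw rate.
    """
    p_i, p_j = 10.0 ** (delta / 400.0), 1.0
    tie = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie
    return p_i / total, tie / total, p_j / total


def rao_kupper(delta, theta):
    """Win/draw/loss probabilities under the Rao-Kupper model.

    delta is the rating difference in ELO; theta >= 1 controls the draw rate.
    """
    p_i, p_j = 10.0 ** (delta / 400.0), 1.0
    win = p_i / (p_i + theta * p_j)
    loss = p_j / (p_j + theta * p_i)
    return win, 1.0 - win - loss, loss


def davidson_invariant(win, draw, loss):
    """Equals nu**2 for every rating difference if Davidson holds."""
    return draw * draw / (win * loss)


def rao_kupper_invariant(win, draw, loss):
    """Equals theta**2 - 1 for every rating difference if Rao-Kupper holds."""
    return draw / (win * loss)


if __name__ == "__main__":
    for delta in (0, 200, 600):
        w, d, l = davidson(delta, nu=0.3)
        print(delta, round(davidson_invariant(w, d, l), 6),
              round(rao_kupper_invariant(w, d, l), 6))
```

Fed with observed win, draw and loss fractions from matches at different ELO gaps, the statistic that stays flatter would point to the better-fitting model. With only a handful of draws and losses in the widest matches, both statistics are very noisy.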

lkaufman
Posts: 3672
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA
Contact:

Re: Empirically Logistic ELO model better suited than Gaussi

Post by lkaufman » Thu Jul 14, 2016 7:04 pm

I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.
Komodo rules!

Uri Blass
Posts: 8553
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Uri Blass » Thu Jul 14, 2016 7:23 pm

lkaufman wrote:I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.
I do not think that there is a single elo value for a given handicap, because the elo difference is not constant.

A player with rating 2000 may beat a player with rating 1000 while giving a queen handicap, but nobody is going to beat a player with rating 2000 while giving a queen handicap.

I also think that it may be interesting to have programs that base their search on the assumption that the opponent is going to go wrong.

For example, you can make a version of Komodo that is anti-Komodo at small depth. The idea is that if, say, the remaining depth is more than 10 and it is the opponent's turn to move, you simply prune all of the opponent's moves in the search except the move that Komodo plays at depth 10.

That way you get Komodo with an anti-depth-10 style, and it may be interesting to see, as a function of n, the maximal handicap that Komodo with anti-depth-n style can give to Komodo and still win.

I expect that maximal handicap to be significantly bigger than the maximal handicap that normal Komodo can give to Komodo at depth n.

fern
Posts: 8755
Joined: Sun Feb 26, 2006 3:07 pm

Re: Empirically Logistic ELO model better suited than Gaussi

Post by fern » Thu Jul 14, 2016 7:33 pm

mY gOD, kAI, YOU ARE A BLOODY GENIUS!!!

fERN

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Laskos » Thu Jul 14, 2016 8:10 pm

lkaufman wrote:I think this means that the elo values for various handicaps should be higher than you originally thought, especially for larger handicaps.
I don't think so, as I usually equalized strength by time odds, and the time-odds ELO values were derived using small intervals. We did have a discussion when you found a very large ELO difference using a direct handicap match at equal TC. I surmised that the large ELO value came from computing it with the logistic model, while the "true ELO" might obey the Gaussian.

Then there is the scaling of the handicap, as Uri noted. We knew that: Knight odds are maybe 1400 ELO points against a 2000 FM, but might be 2000+ ELO points against a 2500 GM.
Last edited by Laskos on Thu Jul 14, 2016 8:24 pm, edited 1 time in total.

Laskos
Posts: 9408
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Empirically Logistic ELO model better suited than Gaussi

Post by Laskos » Thu Jul 14, 2016 8:21 pm

fern wrote:mY gOD, kAI, YOU ARE A BLOODY GENIUS!!!

fERN
Hello Fern,

Gaius Mucius Scaevola regards!
