Schizophrenic rating model for Leela

Uri Blass · Post by **Uri Blass** » Mon Jan 21, 2019 8:24 pm

Laskos wrote: ↑Mon Jan 21, 2019 11:27 am If we assume that a regular engine sees Leela as a schizophrenic, or double-personality, engine, the scores regular engines get against Leela can be explained by the usual Elo logistic. An "Elo" rating can then be defined for Leela, which we will call Elo_of_Leela, although Leela in a pool of regular engines doesn't obey the Elo logistic.

Let's define this schizophrenic Leela engine by the scores regular engines get against it as:

(1)

Here SCORE is the score a regular engine gets in a match against Leela, ranging from 0 (0%) to 1 (100%).
A is the degree of schizophrenia of Leela: values closer to 0.5 mean a more accentuated double personality, values closer to 0 or 1 mean less schizophrenia (A ranges from 0 to 1).
ELO is the Elo of regular engine.
ELO1 and ELO2 define the two personalities of Leela.
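
Equation (1) did not survive in this copy of the thread. Below is a minimal sketch of what it most plausibly says, assuming the score is a weighted sum of two standard Elo logistics with weight A on the ELO1 personality. That form is an assumption, although it agrees with the "sum of the two logistics" wording later in the thread and reproduces the 3071 figure quoted further down. The function names are illustrative only.

```python
def logistic_score(elo, opp_elo):
    """Expected score of a player rated `elo` against `opp_elo` (standard Elo logistic)."""
    return 1.0 / (1.0 + 10.0 ** ((opp_elo - elo) / 400.0))


def score_vs_leela(elo, elo1, elo2, a):
    """Assumed form of equation (1): a regular engine's score against two-personality Leela."""
    return a * logistic_score(elo, elo1) + (1.0 - a) * logistic_score(elo, elo2)


if __name__ == "__main__":
    # Fitted values quoted in the post: A = 0.53, ELO1 = 3500, ELO2 = 2430.
    for elo in (2800, 3071, 3400):
        print(elo, round(score_vs_leela(elo, 3500, 2430, 0.53), 3))
```

With A = 1 or A = 0 the function collapses to a single ordinary logistic, which is the "no schizophrenia" limit described above.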

We define the Elo_of_Leela in a pool of regular engines as the Elo of the regular engine against which it scores exactly 1/2 (50%).
Setting SCORE=0.5 and solving for ELO, we get Elo_of_Leela as a function of ELO1, ELO2 and A:

(2)
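
The formula itself is missing here. As a hedged reconstruction, assuming the model is a weighted sum of two Elo logistics with weight A on ELO1 (an assumption, not confirmed by the surviving text), the 50% condition reduces to a quadratic. The symbols x, a and b are introduced here only for the derivation.

```latex
% Assumed model (the original equation (1) is missing from this copy):
\mathrm{SCORE} = \frac{A}{1 + 10^{(\mathrm{ELO1} - \mathrm{ELO})/400}}
               + \frac{1 - A}{1 + 10^{(\mathrm{ELO2} - \mathrm{ELO})/400}}

% With x = 10^{ELO/400}, a = 10^{ELO1/400}, b = 10^{ELO2/400}:
\mathrm{SCORE} = \frac{A\,x}{x + a} + \frac{(1 - A)\,x}{x + b}

% Setting SCORE = 1/2 and clearing denominators:
x^2 + (2A - 1)(b - a)\,x - ab = 0

% Taking the positive root:
\mathrm{Elo\_of\_Leela} = 400 \log_{10}\!\left[
  \frac{(2A - 1)(a - b) + \sqrt{(2A - 1)^2 (a - b)^2 + 4ab}}{2}
\right]
```

For A = 0.5 the linear term vanishes and the result is simply (ELO1 + ELO2)/2. With A = 0.53, ELO1 = 3500 and ELO2 = 2430, this expression gives about 3071, the value quoted later in the post.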

Given the score a regular engine gets against Leela, the Elo of that regular engine in a pool of regular engines cannot be read off immediately. We have to derive ELO as a function of SCORE, ELO1, ELO2 and A from equation (1). Against regular engines, Elo as a function of score is given by a simple logistic inversion. Here the solution for the Elo of the regular engine is:

(3)
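
The closed form is missing from this copy. A sketch of the inversion, again assuming the model is a weighted sum of two Elo logistics with weight A on ELO1 (an assumption): substituting x = 10^(ELO/400) turns equation (1) into a quadratic in x whose positive root gives ELO. The helper names and the midpoint shift are choices made here, not taken from the post.

```python
import math


def score_vs_leela(elo, elo1, elo2, a):
    """Assumed form of equation (1): weighted sum of two Elo logistics."""
    def logistic(opp):
        return 1.0 / (1.0 + 10.0 ** ((opp - elo) / 400.0))
    return a * logistic(elo1) + (1.0 - a) * logistic(elo2)


def elo_from_score(score, elo1, elo2, a):
    """Invert the assumed model: Elo of the regular engine that scores `score`."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must lie strictly between 0 and 1")
    # Work relative to the midpoint so the powers of 10 stay moderate.
    mid = (elo1 + elo2) / 2.0
    p = 10.0 ** ((elo1 - mid) / 400.0)
    q = 10.0 ** ((elo2 - mid) / 400.0)
    # Quadratic in x = 10^((elo - mid)/400):
    # (1 - s) x^2 + [a q + (1 - a) p - s (p + q)] x - s p q = 0
    lin = a * q + (1.0 - a) * p - score * (p + q)
    root = math.sqrt(lin * lin + 4.0 * (1.0 - score) * score * p * q)
    # Pick the algebraically equivalent form of the positive root that avoids cancellation.
    if lin > 0.0:
        x = 2.0 * score * p * q / (lin + root)
    else:
        x = (root - lin) / (2.0 * (1.0 - score))
    return mid + 400.0 * math.log10(x)


if __name__ == "__main__":
    # Round trip with the fitted values quoted in the post.
    s = score_vs_leela(3200, 3500, 2430, 0.53)
    print(round(s, 4), round(elo_from_score(s, 3500, 2430, 0.53), 2))
```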

Now one can check the model, by fitting parameters A, ELO1, ELO2 to empirical data. The model is invariant to Elo translations, only Elo differences count, so basically we have only 2 variables in the model.
The best empirical data (rating list of regular engines) at short time control on large Elo span I found are here:
http://fastgm.de/60-0.60.html
The ratings are calculated by Ordo, so they do not suffer from the compression or distortion of BayesElo. Also, the error margins are small. Time control is 60'' + 0.6''.

I used 7 datapoints from this list, from the weakest, Ethereal 8.16 to the strongest, Stockfish 10. For each datapoint (7 different regular engines), I played 1000 games of Leela (one of the latest of test30 nets) against them.
Warning: the time control used in these games was very short, 6'' + 0.1''.

The fit of the model on 7 datapoints on very large Elo span gave (from equation (3)):

A = 0.53 (close to 0.5, very schizophrenic Leela)
ELO1 - ELO2 = 1070

Basically, Leela has two personalities of similar importance in matches, differing by about 1000 Elo points.
If I choose ELO1 equal to 3500, then ELO2 is 2430.

From the equation (2), the Elo_of_Leela is 3071 Elo points. It can be translated to anything by just translating ELO1, ELO2, but keeping their difference constant. I will keep those values, translating the rating list of Andreas (fastgm), and see how the fit works.

Each black datapoint comes from a 1000-game ultra-fast match of one regular engine against Leela. The fit is almost perfect, with just 2 parameters fitting 7 datapoints. So a double personality of Leela, as seen by regular engines, is in almost perfect agreement with the scores one gets when playing differently rated regular engines against it. Again, Leela could be given an "Elo_of_Leela", but Leela doesn't obey the Elo logistic model of regular engines. So, in rating lists, Leela's rating may be almost arbitrary, depending on the opponents. If you give her weak opponents, she will be rated lower; if you give her strong opponents, she will be rated higher. The "Elo compression" when a regular engine plays Leela is very pronounced, especially on small to medium Elo spans.
The two personalities, on a fairly strong GPU and at a reasonable time control, can be described as follows: one super-strong, well above any regular engine; the other at the level of a mediocre regular engine. I do not know whether the double personality is expressed mostly across matches of games or within each game, move by move.
The only warning is that TC I used was very short.
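
The opponent-dependence of Leela's rating can be made concrete with a small sketch. It assumes the weighted two-logistic form for equation (1) (an assumption, since the formula is missing here) and uses the fitted values quoted above. For each opponent it computes the rating an ordinary logistic calculator would assign to Leela from that single match; `implied_leela_rating` is a name made up for this illustration.

```python
import math


def score_vs_leela(elo, elo1=3500.0, elo2=2430.0, a=0.53):
    """Assumed form of equation (1), with the fitted values from the post as defaults."""
    def logistic(opp):
        return 1.0 / (1.0 + 10.0 ** ((opp - elo) / 400.0))
    return a * logistic(elo1) + (1.0 - a) * logistic(elo2)


def implied_leela_rating(opp_elo):
    """Rating a plain logistic model would give Leela from one match against `opp_elo`."""
    s = score_vs_leela(opp_elo)  # opponent's score against Leela
    return opp_elo - 400.0 * math.log10(s / (1.0 - s))


if __name__ == "__main__":
    for opp in (2600, 2800, 3000, 3200, 3400):
        print(opp, round(implied_leela_rating(opp)))
```

Under these assumptions the implied rating rises with the opponent's rating and coincides with the opponent's own rating only near Elo_of_Leela.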

If humans resemble Leela in playing, as many argue, including me, humans too seem schizophrenic to regular engines. The Elo ratings of humans in a pool of regular engines will be compressed, and the best a human can do is to play the best engine to improve his rating. If say top 5 engines with their CCRL ratings are introduced in human FIDE pool, they will inflate the general FIDE human ratings, and the top GMs would better play only engines to improve their FIDE rating. Probably a similar plot can be made of a human playing in a pool of regular engines, but no human will play thousands of games against strong engines in FIDE conditions to have enough empirical data.

I am not sure about humans.
If you use CCRL rating then you should use CCRL conditions(humans do not play with their own book but get moves that they never prepared for them) and I do not think humans play chess in these conditions so we have no data about humans.

Part of the advantage of strong players is that they know better the opening that they play relative to their opponent.
If you force humans to play openings that they never play then the strong humans lose part of their advantage.

carldaman · Post by **carldaman** » Tue Jan 22, 2019 12:30 am

Laskos wrote: ↑Mon Jan 21, 2019 11:27 am
If humans resemble Leela in playing, as many argue, including me, humans too seem schizophrenic to regular engines. The Elo ratings of humans in a pool of regular engines will be compressed, and the best a human can do is to play the best engine to improve his rating. If say top 5 engines with their CCRL ratings are introduced in human FIDE pool, they will inflate the general FIDE human ratings, and the top GMs would better play only engines to improve their FIDE rating. Probably a similar plot can be made of a human playing in a pool of regular engines, but no human will play thousands of games against strong engines in FIDE conditions to have enough empirical data.

I remember the ICC server of 20 years ago when humans and engines could only play in the same rating pool and it led to the inflation of human ratings to well over 3000 blitz Elo, so your conclusion is very valid. It is what I had assumed as well.

However, back then all the chess programs had a very pronounced schizophrenic quality to their strength: clearly stronger than humans tactically, but much weaker positionally, with most of the strength being derived tactically. (I actually made a post about that in the old rgcc forum about 20 years ago.) It is what you would describe as 'monomaniacal', but the effect was more pronounced, since the gap between top engine tactical strength and positional weakness was probably wider than nowadays, and the positional weakness was also more exploitable by humans than it is now. This probably means that a human's Elo gains today in a combined Elo pool would also be smaller than 20 years ago, while still remaining positive.

Of course, what's strikingly different about Leela is the flip-flop in terms of the type of strength/weakness it exhibits, something we could not dream about even just a couple of years ago -- a chess entity being much stronger positionally than tactically. I suspect that its tactical weakness, assuming best available hardware and nets, is only marginally exploitable by the best humans, so it is basically a relative weakness rather than an absolute one.

Regards,
CL

Michel · Post by **Michel** » Tue Jan 22, 2019 1:22 pm

Nice! Lots of ideas!

So a fully schizophrenic Leela is just two players. The score of a regular engine against a fully schizophrenic Leela with given elo1,elo2 would be the same as its score against two regular players with elo1 and elo2 respectively.

A partially schizophrenic Leela is more complicated. It is like two players but one of them gets to play more than the other.

Branko Radovanovic · Post by **Branko Radovanovic** » Tue Jan 22, 2019 1:49 pm

Laskos wrote: ↑Mon Jan 21, 2019 5:31 pm Transitivity is pretty well obeyed by regular engines, I even came to pretty conclusive results that on large Elo spans it obeys the logistic and not Gaussian distribution. Some 2-3 years ago this was unclear to me.
I am not sure what you mean by "ratings in engine-to-engine Elo lists get compressed over time". There are compressing lists, SSDF is one of them, but they since ages tried to accommodate it to human ratings and are using a dubious rating calculator. Also, their conditions varied over time, but they include all conditions in the list. CCRL uses BayesElo which is compressing, especially on large Elo spans. Also, in my experiments, BayesElo draw model (Rao-Kupper) is ruled out, but Davidson draw model (Ordo is using it) is not ruled out. I didn't see this compression in regular engine-engine ratings in identical conditions on ratings lists using Ordo. Elo does get compressed with stronger hardware, time control and stronger engines, due to higher draw rate, but not compressed "over time" in identical conditions. Maybe I missed something. In general, my impression is that in correct conditions, regular engines obey the (logistic) Elo model pretty or very well, even on large Elo spans. Leela is a "weird sick man" in this pool of regular engines (even on small Elo spans).

By "compression", I mean that, even if the rating list is e.g. anchored to a certain engine, over time - as stronger and stronger new releases appear - Elo of a given engine (which used to be near the top) is diminishing.

E.g. today, SF 6 4CPU is rated 3291 in CCRL 40/40. Three years ago, it was 3300. This doesn't seem much, 10 Elo or so, but is IIRC fairly consistent whichever former top-level entry is chosen, so it's not down to random chance.

So I was mistaken then in thinking this is due to the Elo curve itself, rather than the inner workings of rating calculators?

Uri Blass · Post by **Uri Blass** » Tue Jan 22, 2019 3:22 pm

Branko Radovanovic wrote: ↑Tue Jan 22, 2019 1:49 pm
Laskos wrote: ↑Mon Jan 21, 2019 5:31 pm Transitivity is pretty well obeyed by regular engines, I even came to pretty conclusive results that on large Elo spans it obeys the logistic and not Gaussian distribution. Some 2-3 years ago this was unclear to me.
I am not sure what you mean by "ratings in engine-to-engine Elo lists get compressed over time". There are compressing lists, SSDF is one of them, but they since ages tried to accommodate it to human ratings and are using a dubious rating calculator. Also, their conditions varied over time, but they include all conditions in the list. CCRL uses BayesElo which is compressing, especially on large Elo spans. Also, in my experiments, BayesElo draw model (Rao-Kupper) is ruled out, but Davidson draw model (Ordo is using it) is not ruled out. I didn't see this compression in regular engine-engine ratings in identical conditions on ratings lists using Ordo. Elo does get compressed with stronger hardware, time control and stronger engines, due to higher draw rate, but not compressed "over time" in identical conditions. Maybe I missed something. In general, my impression is that in correct conditions, regular engines obey the (logistic) Elo model pretty or very well, even on large Elo spans. Leela is a "weird sick man" in this pool of regular engines (even on small Elo spans).
By "compression", I mean that, even if the rating list is e.g. anchored to a certain engine, over time - as stronger and stronger new releases appear - Elo of a given engine (which used to be near the top) is diminishing.

E.g. today, SF 6 4CPU is rated 3291 in CCRL 40/40. Three years ago, it was 3300. This doesn't seem much, 10 Elo or so, but is IIRC fairly consistent whichever former top-level entry is chosen, so it's not down to random chance.

So I was mistaken then in thinking this is due to the Elo curve itself, rather than the inner workings of rating calculators?

I think it is possible to avoid compression by keeping 2 programs that have many games, and that no longer play, at a fixed rating.

For example, you can decide that shredder13 has a fixed rating of 3199 once it no longer plays games (3943 games are clearly enough).
You can also decide that Ufim8.02 has a fixed rating of 2546 once it no longer plays games (2246 games are enough).

You can also skip fixing the numbers and calculate ratings any way you want, but after you get different numbers for these engines (for example 3190 for shredder13 and 2550 for Ufim8.02), change the ratings of all the programs by a linear formula that maps 3190 back to 3199 and 2550 back to 2546.
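
A minimal sketch of that two-anchor rescaling, reading the intent as mapping the freshly computed ratings of the two anchor engines back onto their fixed values. The numbers are the ones used as examples in the post; `make_rescaler` is a hypothetical helper, not part of any rating tool.

```python
def make_rescaler(computed_hi, computed_lo, anchor_hi, anchor_lo):
    """Return the linear map that sends the two computed ratings to the two anchors."""
    slope = (anchor_hi - anchor_lo) / (computed_hi - computed_lo)

    def rescale(rating):
        return anchor_lo + slope * (rating - computed_lo)

    return rescale


# Computed: shredder13 = 3190, Ufim8.02 = 2550. Fixed anchors: 3199 and 2546.
rescale = make_rescaler(3190.0, 2550.0, 3199.0, 2546.0)

if __name__ == "__main__":
    for rating in (2550.0, 3000.0, 3190.0, 3400.0):
        print(rating, round(rescale(rating), 1))
```

Because the map has a slope as well as an offset, it stretches or shrinks every rating difference in the list by the same factor, which is what undoes a uniform compression between the two anchors.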

N Konstantakis · Post by **N Konstantakis** » Wed Jan 23, 2019 11:45 am

Might it be that Leela looks schizophrenic because it's the odd one out in a pool of ABs, while in a pool of Leelas any AB would look schizophrenic too?
I think it would be very interesting to repeat the experiment with a single AB against a wide Elo range of Leelas.

Laskos · Post by **Laskos** » Wed Jan 23, 2019 1:00 pm

Michel wrote: ↑Tue Jan 22, 2019 1:22 pm Nice! Lots of ideas!

So a fully schizophrenic Leela is just two players. The score of a regular engine against a fully schizophrenic Leela with given elo1,elo2 would be the same as its score against two regular players with elo1 and elo2 respectively.

A partially schizophrenic Leela is more complicated. It is like two players but one of them gets to play more than the other.

"A" is important in schizophrenia, but ELO1 - ELO2 is even more important in defining the degree of illness.

Maybe on the lines like "positional, tactical" (as test-suites show) one can improvise a plausible explanation of this schizophrenia.

On these lines, one can imagine two sorts of distributions of outcomes in many games:

An accumulation of many very small errors/advantages (limited variance), mostly positional advantages in the case of Leela, which leads to Gaussian statistics through the central limit theorem, well mimicked in CDF by a logistic. Very high ELO1 for Leela in a pool of regular engines might derive from this.

Some errors may have a Cauchy or Levy-like distribution (undefined or infinite variance), and then we have a "Levy flight", where the total distance traveled (to the outcome of a game) is almost always dominated by the single largest error, or at most the two largest. Here Leela "excels" in the frequency of this sort of error compared to strong regular engines, which is where its low ELO2 comes from.
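
A toy illustration of the contrast between the two regimes, not a model of chess errors: it compares how much of the total absolute step length comes from the single largest step when steps are Gaussian versus standard Cauchy. Using absolute step sizes as a proxy for "distance traveled", and the step and trial counts, are assumptions made here.

```python
import math
import random


def largest_step_share(draw, n_steps, rng):
    """Fraction of the total absolute distance contributed by the single largest step."""
    steps = [abs(draw(rng)) for _ in range(n_steps)]
    return max(steps) / sum(steps)


def gaussian(rng):
    return rng.gauss(0.0, 1.0)


def cauchy(rng):
    # Inverse-CDF sampling of a standard Cauchy variate.
    return math.tan(math.pi * (rng.random() - 0.5))


def mean_share(draw, n_steps=200, n_trials=500, seed=1):
    """Average largest-step share over many simulated walks."""
    rng = random.Random(seed)
    total = sum(largest_step_share(draw, n_steps, rng) for _ in range(n_trials))
    return total / n_trials


if __name__ == "__main__":
    print("gaussian:", round(mean_share(gaussian), 3))
    print("cauchy:  ", round(mean_share(cauchy), 3))
```

In the Gaussian case no single step matters much, while in the Cauchy case one step typically accounts for a sizeable fraction of the whole walk.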

Both lead, over many games, to CDFs well approximated by logistics, but two different logistics. Regular engines are close enough in their properties to merge both into a single Elo logistic in ratings among themselves, so an individual regular engine in a pool of other regular engines might be schizophrenic with a small ELO1 - ELO2 of, say, 50, 100 or even 200 Elo points. Let's call this very mild schizophrenia "moody"; a regular engine can be at most moody in a pool of regular engines. Its rating will be well described by A*ELO1 + (1-A)*ELO2, and with this rating it will obey the general Elo model. But Leela is so different that ELO1 - ELO2 is above 1000 Elo points (the fit gave 1070), and it cannot fit a logistic Elo model in a pool of regular engines by this simple weighted averaging; it is truly pathologically schizophrenic. I will plot the cases of a moody regular engine (at most 200 points between ELO1 and ELO2) and a schizophrenic Leela-like engine (1000 points between ELO1 and ELO2), at full "split-personality" (A=0.5):

Moody regular engine (ELO1 = 100, ELO2 = -100, A=0.5):

The blue line is the true sum of the two logistics separated by 200 Elo points. The brown line is a logistic with average (ELO1 + ELO2)/2 = 0. This moody regular engine still fits very well with a logistic given by average rating.

Schizophrenic Leela-like engine (ELO1 = 500, ELO2 = -500, A=0.5):

The blue line is the true sum of the two logistics separated by 1000 Elo points. The brown line is a logistic with average (ELO1 + ELO2)/2 = 0. Now we see that 1000+ Elo points difference makes a huge difference, the average logistic fit fails badly, and the true sum gives indeed compressed ratings, only explainable by ELO1, ELO2, A separately, and not averaging.
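
The two plots are not reproduced here, but the comparison they describe can be sketched numerically: the weighted sum of two logistics against a single logistic centered on the weighted average rating A*ELO1 + (1-A)*ELO2. The grid span and step are arbitrary choices made for this sketch.

```python
def logistic_score(elo, opp_elo):
    """Expected score of a player rated `elo` against `opp_elo`."""
    return 1.0 / (1.0 + 10.0 ** ((opp_elo - elo) / 400.0))


def mixture_score(elo, elo1, elo2, a=0.5):
    """Sum of the two logistics, weighted by A and 1 - A."""
    return a * logistic_score(elo, elo1) + (1.0 - a) * logistic_score(elo, elo2)


def max_deviation(gap, a=0.5, span=1500, step=5):
    """Largest score gap between the true sum and the average-rating logistic."""
    elo1, elo2 = gap / 2.0, -gap / 2.0
    average = a * elo1 + (1.0 - a) * elo2
    return max(
        abs(mixture_score(e, elo1, elo2, a) - logistic_score(e, average))
        for e in range(-span, span + 1, step)
    )


if __name__ == "__main__":
    print("moody (200 Elo gap):         ", round(max_deviation(200), 3))
    print("schizophrenic (1000 Elo gap):", round(max_deviation(1000), 3))
```

The deviation stays within a couple of percentage points of score for the 200-point gap and grows to more than twenty for the 1000-point gap, in line with the description of the two plots.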

This sort of explanation is supported not only by test-suites, where we see highly pathological positional/tactical behavior of Leela, but also by comparing evals. We can also see that regular engines probably never diverge by 200+ Elo points in their moodiness, since as a rule of thumb regular engines get stronger positionally and tactically fairly hand in hand.

From an easy experiment, we can probably conclude that, compared to the simplest eval (material + PST), even a top SF10 eval cannot gain a 1000 Elo point difference against a regular engine:

depth=1
Score of SF10 vs Predateur 2.1: 980 - 4 - 16 [0.988] 1000
Elo difference: 766.23 +/- 86.73
Finished match
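
For reference, the Elo difference in that output follows from the match score by plain logistic inversion. A small sketch, reading the three counts as wins, losses and draws for SF10, which is consistent with the bracketed score of 0.988; the error margin is not reproduced here.

```python
import math


def elo_difference(wins, losses, draws):
    """Elo difference implied by a match result under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    # SF10 vs Predateur 2.1 at depth=1: 980 - 4 - 16 over 1000 games.
    print(round(elo_difference(980, 4, 16), 2))
```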

In Predateur 2.1 even PST are somehow dubious.
But the search of Predateur is similarly weaker, by about the same Elo value. So, they will not look pathological in rating lists one to another, at most moody.

OTOH, Leela is some 500 Elo points stronger than SF10 in eval (SF10 at depth=1, about 20-30 nodes searched, versus Leela at nodes=20), and still 350 Elo points stronger at nodes=1 against SF10 at depth=1. Tactically, LC0 is like a weak regular engine, so a full-blown pathology of 1000+ Elo points in ELO1 - ELO2 can be explained.

It would be interesting to have a basic eval with SF search, or vice versa, SF eval with a basic search, to see if they exhibit rating pathology. Almost surely not to the degree Leela exhibits.

Laskos · Post by **Laskos** » Wed Jan 23, 2019 1:16 pm

Branko Radovanovic wrote: ↑Tue Jan 22, 2019 1:49 pm
Laskos wrote: ↑Mon Jan 21, 2019 5:31 pm Transitivity is pretty well obeyed by regular engines, I even came to pretty conclusive results that on large Elo spans it obeys the logistic and not Gaussian distribution. Some 2-3 years ago this was unclear to me.
I am not sure what you mean by "ratings in engine-to-engine Elo lists get compressed over time". There are compressing lists, SSDF is one of them, but they since ages tried to accommodate it to human ratings and are using a dubious rating calculator. Also, their conditions varied over time, but they include all conditions in the list. CCRL uses BayesElo which is compressing, especially on large Elo spans. Also, in my experiments, BayesElo draw model (Rao-Kupper) is ruled out, but Davidson draw model (Ordo is using it) is not ruled out. I didn't see this compression in regular engine-engine ratings in identical conditions on ratings lists using Ordo. Elo does get compressed with stronger hardware, time control and stronger engines, due to higher draw rate, but not compressed "over time" in identical conditions. Maybe I missed something. In general, my impression is that in correct conditions, regular engines obey the (logistic) Elo model pretty or very well, even on large Elo spans. Leela is a "weird sick man" in this pool of regular engines (even on small Elo spans).
By "compression", I mean that, even if the rating list is e.g. anchored to a certain engine, over time - as stronger and stronger new releases appear - Elo of a given engine (which used to be near the top) is diminishing.

E.g. today, SF 6 4CPU is rated 3291 in CCRL 40/40. Three years ago, it was 3300. This doesn't seem much, 10 Elo or so, but is IIRC fairly consistent whichever former top-level entry is chosen, so it's not down to random chance.

So I was mistaken then in thinking this is due to the Elo curve itself, rather than the inner workings of rating calculators?

I don't think this is due to Elo curve (logistic). CCRL uses BayesElo and it compresses (also distorts) the ratings. One can check for this "compression" taking, for example, CCRL database of 10 years ago and the current one, and run them through Ordo rating calculator. Compare the differences between the engines in the two lists, and have a p-value that "the differences in differences" are not randomly distributed.
From what I know, aside some error margins and possible very few outliers, logistic Elo model was not ruled out for regular engines.

Laskos · Post by **Laskos** » Wed Jan 23, 2019 2:27 pm

N Konstantakis wrote: ↑Wed Jan 23, 2019 11:45 am Might it be that Leela looks schizophrenic because it's the odd one out in a pool of ABs, while in a pool of Leelas any AB would look schizophrenic too?
I think it would be very interesting to repeat the experiment with a single AB against a wide Elo range of Leelas.

Yes, it's possible. I did see that among themselves Leelas do seem to obey an Elo logistic, while a regular engine is a weirdo in their pool, also with compressed differences, but it was just an impression.

Laskos · Post by **Laskos** » Wed Jan 23, 2019 2:32 pm

Uri Blass wrote: ↑Mon Jan 21, 2019 8:24 pm
I am not sure about humans.
If you use CCRL rating then you should use CCRL conditions(humans do not play with their own book but get moves that they never prepared for them) and I do not think humans play chess in these conditions so we have no data about humans.

Part of the advantage of strong players is that they know better the opening that they play relative to their opponent.
If you force humans to play openings that they never play then the strong humans lose part of their advantage.

I mean just set FIDE ratings of the top regular CCRL engines to be equal to CCRL ratings, then let the engines play in FIDE conditions.
