Leela doesn't obey well the Elo curve set by regular engines

Laskos · Post by **Laskos** » Wed Sep 26, 2018 12:35 pm

In CCCC successive rounds, I noticed that Lc0 underperforms against weaker engines, often failing to convert won games, or even losing on tactical blunders to a weaker engine.

I took the excellent FGRL rating list of Andreas Strangmüller at 60''+ 0.6'' time control, adapted to my conditions, and having very many games (very small error margins).
http://www.fastgm.de/60-0.60.html

My time control in games is 60''+ 1''. The openings used are 12 positions from the DeepMind paper, each played with both colors. I played a 24-round gauntlet of Lc0 v17 ID 11261 on a GTX 1060 against the top 3 current engines on four i7 cores, a total of 72 games against a top 3 with this FGRL average rating:

=====================

SF dev       : 3454
Houdini 6.03 : 3364
Komodo 12.1  : 3323

==========
Average: 3380

The result of the gauntlet is:

Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                 -58      46      72   41.7%   66.7%
   
   1 SF dev                        120      75      24   66.7%   66.7%
   2 Houdini 6.03                   73      74      24   60.4%   70.8%
   3 Komodo 12.1.1                 -14      87      24   47.9%   62.5%

72 of 72 games finished.

Performance: 3322 +/- 46 Elo points 2SD

========================================================================================

Then, I played under the same conditions against much weaker engines of the same families, but 5 years old. The FGRL average rating of those is:

==========

SF DD      : 3064
Houdini 3  : 3080
Komodo 7a  : 3062

==========
Average: 3069

The result of the gauntlet is:

Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                  68      52      72   59.7%   58.3%
   
   1 Komodo 7a                     -29      92      24   45.8%   58.3%
   2 Houdini 3                     -73      74      24   39.6%   70.8%
   3 SF DD                        -104     107      24   35.4%   45.8%

72 of 72 games finished.

Performance: 3137 +/- 52 Elo points 2SD

========================================================================================

We see that the performance against 300+ Elo points weaker engines is much lower for Leela.
The difference in performance in the two cases is:

Difference in Lc0 performances: 185 +/- 69 Elo points 2SD.

This is pretty huge, and the Elo curve is either not obeyed at all, or severely compressed. From this alone one can already say that there is not much meaning in "Elo strength of Leela against regular engines", never mind the scaling issues with time control, the hardware issues and the opening repertoire issue. It's pretty useless to talk of Leela in Elo terms when comparing it to regular engines, except maybe as an order of magnitude and always specifying the conditions.
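
As a worked check of the arithmetic in this post, here is a minimal Python sketch. It assumes the standard logistic Elo formula with the usual 400-point scale, and it assumes the two error margins are independent so that they combine in quadrature. The function names are made up for illustration; the pool averages, score fractions and margins are the ones from the tables above.

```python
import math

def performance(avg_opp_rating, score, scale=400.0):
    """Rating at which the logistic Elo curve predicts `score` against the pool."""
    return avg_opp_rating + scale * math.log10(score / (1.0 - score))

def combined_margin(*margins):
    """Combine independent error margins in quadrature (an assumption)."""
    return math.sqrt(sum(m * m for m in margins))

# Pool averages and Lc0 score fractions taken from the gauntlet tables.
perf_strong = performance(3380, 0.417)  # vs SF dev, Houdini 6.03, Komodo 12.1
perf_weak = performance(3069, 0.597)    # vs SF DD, Houdini 3, Komodo 7a

diff = round(perf_strong) - round(perf_weak)
margin = combined_margin(46, 52)        # the two 2SD margins from the tables

print(f"{round(perf_strong)} vs {round(perf_weak)}: {diff} +/- {margin:.0f} Elo")
```

Under these assumptions the sketch reproduces the 3322 and 3137 performances and the 185 +/- 69 difference quoted above.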

chrisw · Post by **chrisw** » Wed Sep 26, 2018 12:46 pm

Laskos wrote: ↑Wed Sep 26, 2018 12:35 pm In CCCC successive rounds, I noticed that Lc0 underperforms against weaker engines, often failing to convert won games, or even losing on tactical blunders to a weaker engine.

I took the excellent FGRL rating list of Andreas Strangmüller at 60''+ 0.6'' time control, adapted to my conditions, and having very many games (very small error margins).
http://www.fastgm.de/60-0.60.html

My time control in games is 60''+ 1''. The openings used are 12 positions from the DeepMind paper, each played with both colors. I played a 24-round gauntlet of Lc0 v17 ID 11261 on a GTX 1060 against the top 3 current engines on four i7 cores, a total of 72 games against a top 3 with this FGRL average rating:
=====================

SF dev       : 3454
Houdini 6.03 : 3364
Komodo 12.1  : 3323

==========
Average: 3380
The result of the gauntlet is:
Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                 -58      46      72   41.7%   66.7%
   
   1 SF dev                        120      75      24   66.7%   66.7%
   2 Houdini 6.03                   73      74      24   60.4%   70.8%
   3 Komodo 12.1.1                 -14      87      24   47.9%   62.5%

72 of 72 games finished.
Performance: 3322 +/- 46 Elo points 2SD

========================================================================================

Then, I played under the same conditions against much weaker engines of the same families, but 5 years old. The FGRL average rating of those is:
==========

SF DD      : 3064
Houdini 3  : 3080
Komodo 7a : 3062

==========
Average: 3069
The result of the gauntlet is:
Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                  68      52      72   59.7%   58.3%
   
   1 Komodo 7a                     -29      92      24   45.8%   58.3%
   2 Houdini 3                     -73      74      24   39.6%   70.8%
   3 SF DD                        -104     107      24   35.4%   45.8%

72 of 72 games finished.
Performance: 3137 +/- 52 Elo points 2SD

========================================================================================

We see that the performance against 300+ Elo points weaker engines is much lower for Leela.
The difference in performance in the two cases is:

Difference in Lc0 performances: 185 +/- 69 Elo points 2SD.

This is pretty huge, and the Elo curve is either not obeyed at all, or severely compressed. From this alone one can already say that there is not much meaning in "Elo strength of Leela against regular engines", never mind the scaling issues with time control, the hardware issues and the opening repertoire issue. It's pretty useless to talk of Leela in Elo terms when comparing it to regular engines, except maybe as an order of magnitude and always specifying the conditions.

I posted the same conclusion, from different data, an hour ago:

I want to suggest the elo of LC0 somehow manages to match the elo of whatever it is playing against. Well, that's one "explanation" of its rather curious behaviour:

If one looks at the loss count of the second-placed program, Houdini, against its opponents in order of final ranking, we get, as would be expected, a decreasing gradient: 5,1,0,0,0,0,0
Komodo gets: 2,3,2,0,1,0,0
Ethereal: 6,3,3,1,2,0,1
Fire: 6,6,3,3,1,0,1
Booot: 5,4,4,3,1,3,0
Andscacs: 7,4,4,3,5,1,1

Lc0 is different: 1,1,1,1,1,0,2, almost irrelevant who the opponent is, the loss rate remains almost constant.

Obviously, "non-losses", counting wins and draws together, shows the same pattern in reverse. Which suggests, well, to me, that LC0 doesn't really have an elo that can be mapped onto any particular opponent. It's not behaving itself properly according to the laws of elo ratings.

Great minds thinking alike. Haha!!

Laskos · Post by **Laskos** » Wed Sep 26, 2018 1:15 pm

chrisw wrote: ↑Wed Sep 26, 2018 12:46 pm
I posted the same conclusion, from different data, an hour ago:

I want to suggest the elo of LC0 somehow manages to match the elo of whatever it is playing against. Well, that's one "explanation" of its rather curious behaviour:

If one looks at the loss count of the second-placed program, Houdini, against its opponents in order of final ranking, we get, as would be expected, a decreasing gradient: 5,1,0,0,0,0,0
Komodo gets: 2,3,2,0,1,0,0
Ethereal: 6,3,3,1,2,0,1
Fire: 6,6,3,3,1,0,1
Booot: 5,4,4,3,1,3,0
Andscacs: 7,4,4,3,5,1,1

Lc0 is different: 1,1,1,1,1,0,2, almost irrelevant who the opponent is, the loss rate remains almost constant.

Obviously, "non-losses", counting wins and draws together, shows the same pattern in reverse. Which suggests, well, to me, that LC0 doesn't really have an elo that can be mapped onto any particular opponent. It's not behaving itself properly according to the laws of elo ratings.
Great minds thinking alike. Haha!!

Ah, great, I now saw your post in that thread. I started the experiment yesterday wanting to be well outside error margins, and by morning I had the result. Yes, same conclusions.
Great, Chris!

Branko Radovanovic · Post by **Branko Radovanovic** » Wed Sep 26, 2018 2:09 pm

This is very interesting and may at first appear difficult to explain.

However, consider an engine that randomly crashes in e.g. 20% of the games. Elo-wise, this would create the same phenomenon: such an engine would always lose at least 20% of the games, even against the weakest of opponents, and its Elo performance against strong engines would therefore be significantly higher than against weak engines.

Even without looking at the actual moves, one could therefore conclude that Leela is prone to random severe blunders which produce the same Elo-skewing effect - only milder, presumably - as random crashes.

We know Elo is not perfect, but in fact it is probably impossible to construct a performance scale which produces consistent results in the above-described case (an arbitrary fixed proportion of random losses).
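
The crash argument above can be made concrete with a small Python sketch. It assumes an engine whose "healthy" play follows the standard 400-point logistic curve and which forfeits a fixed fraction of its games regardless of opponent. The true rating of 3400 and the opponent ratings are hypothetical values chosen only for illustration; the 20% rate is the one from the example.

```python
import math

def expected_score(elo_diff, scale=400.0):
    """Standard logistic Elo expectation."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / scale))

def apparent_performance(true_rating, opp_rating, crash_rate):
    """Performance rating measured for an engine that loses a fixed
    fraction of games outright and otherwise obeys the logistic curve."""
    score = (1.0 - crash_rate) * expected_score(true_rating - opp_rating)
    return opp_rating + 400.0 * math.log10(score / (1.0 - score))

TRUE_RATING = 3400   # hypothetical strength when not crashing
CRASH_RATE = 0.20    # 20% of games lost at random

results = {opp: apparent_performance(TRUE_RATING, opp, CRASH_RATE)
           for opp in (3400, 3000, 2600)}

for opp, perf in results.items():
    print(f"vs {opp}: apparent performance {perf:.0f}")
```

In this toy model the measured performance drops steadily as the opponents get weaker, because the score can never exceed 80% no matter how large the true rating gap is.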

Javier Ros · Post by **Javier Ros** » Wed Sep 26, 2018 6:58 pm

Laskos wrote: ↑Wed Sep 26, 2018 12:35 pm In CCCC successive rounds, I noticed that Lc0 underperforms against weaker engines, often failing to convert won games, or even losing on tactical blunders to a weaker engine.

I took the excellent FGRL rating list of Andreas Strangmüller at 60''+ 0.6'' time control, adapted to my conditions, and having very many games (very small error margins).
http://www.fastgm.de/60-0.60.html

My time control in games is 60''+ 1''. The openings used are 12 positions from the DeepMind paper, each played with both colors. I played a 24-round gauntlet of Lc0 v17 ID 11261 on a GTX 1060 against the top 3 current engines on four i7 cores, a total of 72 games against a top 3 with this FGRL average rating:
=====================

SF dev       : 3454
Houdini 6.03 : 3364
Komodo 12.1  : 3323

==========
Average: 3380
The result of the gauntlet is:
Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                 -58      46      72   41.7%   66.7%
   
   1 SF dev                        120      75      24   66.7%   66.7%
   2 Houdini 6.03                   73      74      24   60.4%   70.8%
   3 Komodo 12.1.1                 -14      87      24   47.9%   62.5%

72 of 72 games finished.
Performance: 3322 +/- 46 Elo points 2SD

========================================================================================

Then, I played under the same conditions against much weaker engines of the same families, but 5 years old. The FGRL average rating of those is:
==========

SF DD      : 3064
Houdini 3  : 3080
Komodo 7a  : 3062

==========
Average: 3069
The result of the gauntlet is:
Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                  68      52      72   59.7%   58.3%
   
   1 Komodo 7a                     -29      92      24   45.8%   58.3%
   2 Houdini 3                     -73      74      24   39.6%   70.8%
   3 SF DD                        -104     107      24   35.4%   45.8%

72 of 72 games finished.
Performance: 3137 +/- 52 Elo points 2SD

========================================================================================

We see that the performance against 300+ Elo points weaker engines is much lower for Leela.
The difference in performance in the two cases is:

Difference in Lc0 performances: 185 +/- 69 Elo points 2SD.

This is pretty huge, and the Elo curve is either not obeyed at all, or severely compressed. From this alone one can already say that there is not much meaning in "Elo strength of Leela against regular engines", never mind the scaling issues with time control, the hardware issues and the opening repertoire issue. It's pretty useless to talk of Leela in Elo terms when comparing it to regular engines, except maybe as an order of magnitude and always specifying the conditions.

I have detected a similar phenomenon when playing lc0 against the same program running on 4 cores or on 6 cores: the Elo does not correlate well with the predicted results. That is, the score against 4 cores is very similar to the score against 6 cores. I haven't done any tests with 1 or 2 cores, but I think you would get similar results.

I think the main culprit is the tactical weakness of lc0, which, when it shows up against any alpha-beta program, always leads to defeat regardless of the number of cores.

Laskos · Post by **Laskos** » Wed Sep 26, 2018 7:26 pm

Branko Radovanovic wrote: ↑Wed Sep 26, 2018 2:09 pm This is very interesting and may at first appear difficult to explain.

However, consider an engine that randomly crashes in e.g. 20% of the games. Elo-wise, this would create the same phenomenon: such an engine would always lose at least 20% of the games, even against the weakest of opponents, and its Elo performance against strong engines would therefore be significantly higher than against weak engines.

Even without looking at the actual moves, one could therefore conclude that Leela is prone to random severe blunders which produce the same Elo-skewing effect - only milder, presumably - as random crashes.

We know Elo is not perfect, but in fact it is probably impossible to construct a performance scale which produces consistent results in the above-described case (an arbitrary fixed proportion of random losses).

Yes, and the occurrence of these severe blunders, "milder random crashes", is pretty consistent. I played a bit with numbers, for my particular case to be modeled Elo-wise, the "regular" (Elo-curve obeying) play is very strong, on par with SF dev and even higher, but it occurs only in 60% of the games against regular engines. 40% of the games against regular engines have severe blunders (half-a-point or a full point losing). The erratic Elo performance is modeled pretty well in this case in accordance to my data. But it's an over-simplification, and this 40% "blunder" rate is dependent on the opponents, so for weaker opponents it will be smaller. We can expect that against 1600 CCRL Elo regular engine, the blunder rate is much smaller, as that regular engine will hardly exploit tactical weakness of Lc0, being itself no better even in tactics.

I am curious: if we have many NN engines with similar MCTS search, will they as a group obey some Elo curve of their own? For a long time we have lived in a chess engine rating paradigm that obeys the logistic Elo curve, and it is used in many other games too.

Michel · Post by **Michel** » Wed Sep 26, 2018 8:02 pm

Laskos wrote: ↑Wed Sep 26, 2018 7:26 pm
Branko Radovanovic wrote: ↑Wed Sep 26, 2018 2:09 pm This is very interesting and may at first appear difficult to explain.

However, consider an engine that randomly crashes in e.g. 20% of the games. Elo-wise, this would create the same phenomenon: such an engine would always lose at least 20% of the games, even against the weakest of opponents, and its Elo performance against strong engines would therefore be significantly higher than against weak engines.

Even without looking at the actual moves, one could therefore conclude that Leela is prone to random severe blunders which produce the same Elo-skewing effect - only milder, presumably - as random crashes.

We know Elo is not perfect, but in fact it is probably impossible to construct a performance scale which produces consistent results in the above-described case (an arbitrary fixed proportion of random losses).
Yes, and the occurrence of these severe blunders, "milder random crashes", is pretty consistent. I played a bit with numbers, for my particular case to be modeled Elo-wise, the "regular" (Elo-curve obeying) play is very strong, on par with SF dev and even higher, but it occurs only in 60% of the games against regular engines. 40% of the games against regular engines have severe blunders (half-a-point or a full point losing). The erratic Elo performance is modeled pretty well in this case in accordance to my data. But it's an over-simplification, and this 40% "blunder" rate is dependent on the opponents, so for weaker opponents it will be smaller. We can expect that against 1600 CCRL Elo regular engine, the blunder rate is much smaller, as that regular engine will hardly exploit tactical weakness of Lc0, being itself no better even in tactics.

I am curious: if we have many NN engines with similar MCTS search, will they as a group obey some Elo curve of their own? For a long time we have lived in a chess engine rating paradigm that obeys the logistic Elo curve, and it is used in many other games too.

Assuming the blunder rate is constant (as a first approximation) we could use it as a nuisance parameter in the elo model (like the draw ratio) and estimate it using maximum likelihood estimation... The test for such a model is if the computed elo differences are additive (your tests show that standard logistic elo is not additive when one of the engines involved is Leela).
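
A rough sketch of what such a fit could look like, in Python. Everything here is a simplifying assumption rather than a worked-out model: a "blunder" game is counted as a full loss, draws are folded into a binomial pseudo-likelihood through the score fraction, the fit is a plain grid search, and the three pools are gauntlets reported in this thread. The function names are hypothetical.

```python
import math

# (average opponent rating, Lc0 score fraction, games) from gauntlets in this thread
POOLS = [(3380, 0.417, 72), (3069, 0.597, 72), (2855, 0.708, 72)]

def expected(rating, opp, blunder_rate):
    """Logistic expectation scaled down by a constant blunder rate."""
    logistic = 1.0 / (1.0 + 10.0 ** (-(rating - opp) / 400.0))
    return (1.0 - blunder_rate) * logistic

def log_likelihood(rating, blunder_rate):
    """Binomial pseudo-log-likelihood, treating points as 'successes'."""
    total = 0.0
    for opp, score, games in POOLS:
        e = expected(rating, opp, blunder_rate)
        points = score * games
        total += points * math.log(e) + (games - points) * math.log(1.0 - e)
    return total

def fit():
    """Grid search over rating (step 5) and blunder rate (step 0.01)."""
    ratings = range(3000, 3801, 5)
    rates = [i / 100.0 for i in range(0, 51)]
    return max(((r, b) for r in ratings for b in rates),
               key=lambda rb: log_likelihood(*rb))

best_rating, best_rate = fit()
print(f"rating ~ {best_rating}, blunder rate ~ {best_rate:.2f}")
```

The additivity test mentioned above would then amount to checking that the fitted rating stays stable when pools are added or removed.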

Uri Blass · Post by **Uri Blass** » Wed Sep 26, 2018 10:00 pm

Laskos wrote: ↑Wed Sep 26, 2018 7:26 pm
Branko Radovanovic wrote: ↑Wed Sep 26, 2018 2:09 pm This is very interesting and may at first appear difficult to explain.

However, consider an engine that randomly crashes in e.g. 20% of the games. Elo-wise, this would create the same phenomenon: such an engine would always lose at least 20% of the games, even against the weakest of opponents, and its Elo performance against strong engines would therefore be significantly higher than against weak engines.

Even without looking at the actual moves, one could therefore conclude that Leela is prone to random severe blunders which produce the same Elo-skewing effect - only milder, presumably - as random crashes.

We know Elo is not perfect, but in fact it is probably impossible to construct a performance scale which produces consistent results in the above-described case (an arbitrary fixed proportion of random losses).
Yes, and the occurrence of these severe blunders, "milder random crashes", is pretty consistent. I played a bit with numbers, for my particular case to be modeled Elo-wise, the "regular" (Elo-curve obeying) play is very strong, on par with SF dev and even higher, but it occurs only in 60% of the games against regular engines. 40% of the games against regular engines have severe blunders (half-a-point or a full point losing). The erratic Elo performance is modeled pretty well in this case in accordance to my data. But it's an over-simplification, and this 40% "blunder" rate is dependent on the opponents, so for weaker opponents it will be smaller. We can expect that against 1600 CCRL Elo regular engine, the blunder rate is much smaller, as that regular engine will hardly exploit tactical weakness of Lc0, being itself no better even in tactics.

I am curious: if we have many NN engines with similar MCTS search, will they as a group obey some Elo curve of their own? For a long time we have lived in a chess engine rating paradigm that obeys the logistic Elo curve, and it is used in many other games too.

I wonder what happens if people fix the MCTS search and use LC0 with the normal alphabeta algorithm.

Note that I am not sure that the problem is only tactical blunders caused by MCTS.
The problem is that LC0 can evaluate tablebase draws as more than +10, so even normal alpha-beta is not going to help it avoid the draw and choose a winning move.

Laskos · Post by **Laskos** » Thu Sep 27, 2018 5:00 pm

Michel wrote: ↑Wed Sep 26, 2018 8:02 pm
Laskos wrote: ↑Wed Sep 26, 2018 7:26 pm
Branko Radovanovic wrote: ↑Wed Sep 26, 2018 2:09 pm This is very interesting and may at first appear difficult to explain.

However, consider an engine that randomly crashes in e.g. 20% of the games. Elo-wise, this would create the same phenomenon: such an engine would always lose at least 20% of the games, even against the weakest of opponents, and its Elo performance against strong engines would therefore be significantly higher than against weak engines.

Even without looking at the actual moves, one could therefore conclude that Leela is prone to random severe blunders which produce the same Elo-skewing effect - only milder, presumably - as random crashes.

We know Elo is not perfect, but in fact it is probably impossible to construct a performance scale which produces consistent results in the above-described case (an arbitrary fixed proportion of random losses).
Yes, and the occurrence of these severe blunders, "milder random crashes", is pretty consistent. I played a bit with numbers, for my particular case to be modeled Elo-wise, the "regular" (Elo-curve obeying) play is very strong, on par with SF dev and even higher, but it occurs only in 60% of the games against regular engines. 40% of the games against regular engines have severe blunders (half-a-point or a full point losing). The erratic Elo performance is modeled pretty well in this case in accordance to my data. But it's an over-simplification, and this 40% "blunder" rate is dependent on the opponents, so for weaker opponents it will be smaller. We can expect that against 1600 CCRL Elo regular engine, the blunder rate is much smaller, as that regular engine will hardly exploit tactical weakness of Lc0, being itself no better even in tactics.

I am curious: if we have many NN engines with similar MCTS search, will they as a group obey some Elo curve of their own? For a long time we have lived in a chess engine rating paradigm that obeys the logistic Elo curve, and it is used in many other games too.
Assuming the blunder rate is constant (as a first approximation) we could use it as a nuisance parameter in the elo model (like the draw ratio) and estimate it using maximum likelihood estimation... The test for such a model is if the computed elo differences are additive (your tests show that standard logistic elo is not additive when one of the engines involved is Leela).

First, I ran a third test against even weaker engines. Here are all three of my tests:

=====================
=====================

SF dev       : 3454
Houdini 6.03 : 3364
Komodo 12.1  : 3323

==========
Average: 3380

Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                 -58      46      72   41.7%   66.7%
   
   1 SF dev                        120      75      24   66.7%   66.7%
   2 Houdini 6.03                   73      74      24   60.4%   70.8%
   3 Komodo 12.1.1                 -14      87      24   47.9%   62.5%

72 of 72 games finished.

Performance: 3322 +/- 46 Elo points 2SD



==========

SF DD      : 3064
Houdini 3  : 3080
Komodo 7a  : 3062

==========
Average: 3069

Rank Name                          Elo     +/-   Games   Score   Draws

   0 lc0_v17 11261                  68      52      72   59.7%   58.3%
   
   1 Komodo 7a                     -29      92      24   45.8%   58.3%
   2 Houdini 3                     -73      74      24   39.6%   70.8%
   3 SF DD                        -104     107      24   35.4%   45.8%

72 of 72 games finished.

Performance: 3137 +/- 52 Elo points 2SD



==========

Hannibal 1.7  : 2910
Arasan 21     : 2845
Naum 4.6      : 2809

==========
Average: 2855


Rank Name                          Elo     +/-   Games   Score   Draws
   0 lc0_v17 11261                 154      63      72   70.8%   41.7%
   
   1 Hannibal 1.7                  -89      90      24   37.5%   58.3%
   2 Naum 4.6                     -191     114      24   25.0%   41.7%
   3 Arasan 21                    -191     144      24   25.0%   25.0%

72 of 72 games finished.

Performance: 3009 +/- 63 Elo points 2SD

I tried Rao-Kupper (BayesElo) to fit my results, without much success, but Davidson (Ordo) seems to work very well. If I am not wrong, this would mean that Lc0's performance distribution in a game also obeys a quasi-normal distribution, or more precisely the derivative of the logistic, which is close to normal but has longer tails. The curious thing is that the best fit comes from simply replacing "400" in the logistic Elo distribution with "1000". With this logistic, 1/(1 + 10^(-Elo_dif/1000)), I get a "true" and stable performance:

Against an average of 3380 FGRL Elo rating: 3335
Against an average of 3069 FGRL Elo rating: 3339
Against an average of 2855 FGRL Elo rating: 3340

Very stable and well within error margins. With this "true" Elo of about 3340 on FGRL, Lc0 ID 11261 on GTX 1060 is roughly equal to SF8 on 4 i7 cores in FGRL Ordo Elo points, a result I got previously in direct matches. So, if Lc0 is shown by a normal rating calculator like Ordo as, say, 40 +/- 10 Elo points weaker than a regular engine, it comes out in the real Ordo rating as 100 +/- 25 Elo points weaker. If this property is valid for A0 against a regular engine too, then its advantage over SF8 is not 100 Elo points, but 250 Elo points (or to put it another way, A0 will be equal in strength to a regular engine 250 Elo points stronger than SF8).
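
To illustrate why the wider logistic stabilises the numbers, here is a minimal Python sketch that recomputes the performance against each pool with a configurable scale. It is a naive single-pool inversion of 1/(1 + 10^(-Elo_dif/scale)) using the pool averages and score fractions from the tables, not a full Ordo fit, so the absolute level it yields depends on the anchoring and should not be expected to match the quoted figures exactly. What it shows is the spread between pools.

```python
import math

# (average opponent rating, Lc0 score fraction) from the three gauntlets above
POOLS = [(3380, 0.417), (3069, 0.597), (2855, 0.708)]

def performance(avg_opp, score, scale):
    """Invert score = 1 / (1 + 10 ** (-diff / scale)) for the rating."""
    return avg_opp + scale * math.log10(score / (1.0 - score))

def spread(scale):
    """Gap between the highest and lowest per-pool performance."""
    perfs = [performance(opp, s, scale) for opp, s in POOLS]
    return max(perfs) - min(perfs)

spread_400 = spread(400.0)    # standard Elo scale
spread_1000 = spread(1000.0)  # scale proposed in the post

print(f"spread with 400:  {spread_400:.0f} Elo")
print(f"spread with 1000: {spread_1000:.0f} Elo")
```

With the standard scale the three performances are spread over more than 300 Elo points; with 1000 they agree to within a few points, which is the stability described above.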

I am now running a test to check the validity of this modified scale for logistic, with the Davidson model satisfied, against very weak engines of average FGRL rating of 2274 Elo points. If my model stands, Lc0 should perform 380 +/- 80 Elo points above that pool of 3 very weak regular engines.

Laskos · Post by **Laskos** » Thu Sep 27, 2018 9:58 pm

Laskos wrote: ↑Thu Sep 27, 2018 5:00 pm
I am now running a test to check the validity of this modified scale for logistic, with the Davidson model satisfied, against very weak engines of average FGRL rating of 2274 Elo points. If my model stands, Lc0 should perform 380 +/- 80 Elo points above that pool of 3 very weak regular engines.

Yes, the model seems to stand well against much weaker engines too, although the error margins are larger than I expected.

============

Cheese 1.9.2  : 2328
RuyDos 1.1.6  : 2278
Ethereal 8.16 : 2215

============
Average: 2274


Rank Name                          Elo     +/-   Games   Score   Draws
   0 lc0_v17 11261                 433     138      72   92.4%   15.3%
   
   1 Ethereal 8.16                -417     264      24    8.3%   16.7%
   2 Cheese 1.9.2                 -417     264      24    8.3%   16.7%
   3 RuyDos 1.1.6                 -470     nan      24    6.3%   12.5%

72 of 72 games finished.

So, I expected 380 +/- 80 Elo points against engines 1000 or so FGRL Elo weaker, and it came out as 433 +/- 138 Elo points. Well within error margins, but to check more accurately one would need more games here.

All in all, one can talk of "Leela Elos" against regular engines, and the conversion to the normal Elos of regular engines is to multiply "Leela Elos" by a factor of 2.5 in order to get a rating in a list of regular engines.
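
The conversion can be written down as a tiny Python sketch. It assumes the factor is simply the ratio of the two logistic scales discussed in this thread (1000 / 400 = 2.5) and that error margins scale by the same factor. The helper names are made up for illustration.

```python
LEELA_SCALE = 1000.0     # scale that fits Lc0 against regular engines
STANDARD_SCALE = 400.0   # usual logistic Elo scale
FACTOR = LEELA_SCALE / STANDARD_SCALE  # 2.5

def to_regular(leela_diff, margin=0.0):
    """Convert a measured 'Leela Elo' difference (and its margin)
    into a difference on the regular engines' list."""
    return leela_diff * FACTOR, margin * FACTOR

def within_margin(expected, observed, margin):
    """True if the observed value lies within `margin` of the expectation."""
    return abs(observed - expected) <= margin

print(to_regular(40, 10))             # the 40 +/- 10 example
print(to_regular(100)[0])             # the A0 vs SF8 example
print(within_margin(380, 433, 138))   # last gauntlet vs the prediction
```

The three calls reuse the numbers from this thread: 40 +/- 10 becomes 100 +/- 25, 100 becomes 250, and the observed 433 lies within 138 of the predicted 380.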
