Things are not that dramatic. A0 (Lc0) in a pool of regular engines does not obey the Elo model. Regular engines in a pool of regular engines obey the Elo model fairly well. But, at least from my experience with Lc0, things are not as dramatic as in your Formula 1 case. For example, Lc0 with a Leela Ratio of 2-3 (I have a strong GPU) beats SF8 heavily, but at some time controls loses to SF10.
Thomas A. Anderson wrote: ↑Wed Dec 12, 2018 2:57 pm
Laskos, your post made me smile a little, because it looks like a good example of how hard it is to adopt the "new kind of math" that AZ brought us. I know you as a good analytical worker in this forum, and in this thread too you came up with this (compressed-Elo) model that tries to make AZ vs. SF match results fit into the standard Elo model. Reading your post, you state at the beginning that you don't have high confidence SF10 would beat AZ, because AZ isn't "sensitive" to regular opponents. Two sentences later you conclude that, because your model wouldn't explain Elo anomalies of more than a factor of 2, you are fairly confident AZ must be at SF10's strength level, with a possible slight edge for SF10. Now that Matthew has told us the number of games in the matches against ~SF9/Brainfish/etc. was "more than high enough that the result is statistically significant", how can your compressed-Elo model "explain" that SF9 and SF8 lost against AZ by the same margin (while SF9 is ~30 Elo ahead of SF8)? If the compression model didn't work for the SF8-SF9-AZ trio, why believe it will fit any kind of Elo math in the SF9-SF10-AZ relation? Wouldn't it be more likely that the model doesn't fit the observations/results and therefore has to be dropped or revised?
Laskos wrote: ↑Tue Dec 11, 2018 7:09 pm
In fact, the result I mainly take as reliable for A0 from varied openings is the TCEC openings match against SF8:
A0 vs SF8
+17 =75 -8
+31 Elo points
Now, at first glance one can almost surely say that SF10 would have performed better, like:
SF10 vs SF8
+22 =73 -5
+60 Elo points
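For anyone who wants to check the two Elo figures above, here is a minimal Python sketch. It assumes the usual logistic Elo formula, and the function name `elo_diff` is just an illustrative choice, not something from the original post.

```python
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score, using the logistic Elo model."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games  # fraction of points scored
    return 400.0 * math.log10(score / (1.0 - score))

# Match results quoted above, as +wins =draws -losses
a0_vs_sf8 = elo_diff(17, 75, 8)
sf10_vs_sf8 = elo_diff(22, 73, 5)

print(f"A0 vs SF8:   {a0_vs_sf8:+.1f} Elo")
print(f"SF10 vs SF8: {sf10_vs_sf8:+.1f} Elo")
```

Both values round to the +31 and +60 quoted above. The formula only turns a score into a number; it says nothing about whether that number transfers to a different opponent.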
But it still doesn't mean I have very high confidence that SF10 would beat A0 in 100 games from TCEC openings under their conditions. A0 (and Lc0) is not that "sensitive" to the regular opponent, be it SF8 or SF10, when it is the superior side. A0 vs. an inferior regular engine shows a compressed Elo difference (I showed a model plot in another thread). But in that model, the Elo compression is hardly above a factor of 2 or so. So, I would be fairly confident that A0 and SF10 are quite closely matched playing from TCEC openings, maybe with a slight advantage for SF10.
But as Matthew said, this was the first version of A0, I don't know what they have in hand by now.
I think, to be meaningful, the Elo system needs a certain kind of "transitivity" regarding the strength of the contenders (it's been a long time since my math classes and I might be using the wrong term here). In the case of AZ, this prerequisite is missing. When I need to explain the results, I think of the contenders in a Formula One race: Ferrari is constantly working on improving their cars, season by season, as do McLaren etc. The 2018 Ferrari is better than the 2017 model, which was better than the 2016 type, and so on. Thinking of the Ferrari as SF and the McLarens as Komodo and so on, everything in the Elo world is fine: transitivity, rule of three and comparisons between cars across season boundaries might fit more or less well.
Now the new "kid on the block" comes in, a car constructed by Gyro Gearloose or Christopher Lloyd. When the car finishes a race, the traditional contenders have no chance by any means. As expected, knowing the constructors, the new car with its "jet propulsion" doesn't manage to finish more than 60% of the races. Now, unlike in Formula One, think of one-on-one matches between the cars, and try to establish a performance rating that gives any meaningful number for the rocket car. What would you think about a calculation like this: if the rocket car is a 60-40 favorite against the 2018 Ferrari, and I build a 2019 Ferrari that beats the old model by a higher margin, then I have high confidence that the 2019 Ferrari will beat the rocket car. I believe I'm preaching to the choir, as I remember you also stating things like this, and it comes down to the fact that AZ has certain weaknesses that can be exposed by traditional AB engines. But the rate of exposure (resulting in AZ losses) isn't very proportional to the Elo strength of the AB engines. I wouldn't be surprised if engines (Elo-)rated much lower than Stockfish got better results against AZ, because they expose its weaknesses better.
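The rocket-car argument can be put into numbers with a small Python sketch. The 60% finish rate is from the post; the 65% score of the 2019 Ferrari against the 2018 model is a made-up figure for illustration only, and the helper names are hypothetical.

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def score_from_elo(diff):
    """Expected score implied by an Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

finish_rate = 0.60       # from the post: rocket car wins whenever it finishes
ferrari19_vs_18 = 0.65   # ASSUMPTION: illustrative score, not from the post

# The rocket car's score depends only on whether it finishes,
# so it is the same against either Ferrari.
rocket_vs_18 = finish_rate
rocket_vs_19_actual = finish_rate

# What Elo arithmetic would predict by chaining the two ratings through the 2018 car
predicted_diff = elo_from_score(rocket_vs_18) - elo_from_score(ferrari19_vs_18)
rocket_vs_19_predicted = score_from_elo(predicted_diff)

print(f"Elo-predicted rocket score vs 2019 Ferrari: {rocket_vs_19_predicted:.3f}")
print(f"Actual rocket score vs 2019 Ferrari:        {rocket_vs_19_actual:.3f}")
```

Under these assumptions the chained Elo prediction makes the 2019 Ferrari the favorite, while in the toy setup the rocket car keeps its 60-40 edge. That is the failure of transitivity described above, in its most extreme form.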
All those matches, aside from the TCEC openings match against SF8, were fairly deterministic from the SF side, and A0 was also fairly deterministic and close to its most trained lines. The BrainFish match was SF8 + Cerebellum "best move opening", not truly diversified openings and the full BrainFish engine; the diversification came from A0, again playing into A0's hands. "Number of games" relates to statistical errors, not systematic errors. Suppose you have completely deterministic engines and 1 starting position; A0 wins as White and draws as Black. So, you have a 1.5/2 performance for A0 in 2 games and a 1500/2000 performance in 2000 games. What would you mean here by "statistically significant"? This is the bad practice of introducing a huge systematic error, which overwhelms the statistical error after even 5 games. Also, these somewhat deterministic "high statistical significance" matches usually play into A0's hands, as A0 is close to its optimum in these not very diversified games.
Matthew told us the number of games in the matches against ~SF9/Brainfish/etc. has been "more than high enough that the result is statistically significant"
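The deterministic two-game example above can be sketched in Python. The snippet just repeats the same White win and Black draw and computes the naive standard error of the mean score, treating every game as an independent sample (which is exactly the assumption that fails here). The function name is illustrative.

```python
import math

def naive_standard_error(results):
    """Mean score and its standard error, wrongly assuming independent games."""
    n = len(results)
    mean = sum(results) / n
    variance = sum((r - mean) ** 2 for r in results) / (n - 1)
    return mean, math.sqrt(variance / n)

# One deterministic pair: A0 wins as White (1.0), draws as Black (0.5)
pair = [1.0, 0.5]

for n_pairs in (1, 10, 1000):
    results = pair * n_pairs  # the same two games, replayed over and over
    mean, se = naive_standard_error(results)
    print(f"{len(results):5d} games: score {mean:.3f}, naive std. error {se:.4f}")
```

The score stays at 0.750 while the naive error shrinks roughly as 1/sqrt(N), so the result looks ever more "significant" although only two distinct games were ever played. The shrinking number measures repetition, not information about other openings.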
See the result I posted in another thread:
Initial Board position:
Score of lc0_v19_11261 vs SF8: 18 - 0 - 22 [0.725] 40
Elo difference: 168.40 +/- 67.96
Lc0 seems unbeatable here (similar to their results). But from
Adam Hair's opening 4-mover PGN
Score of lc0_v19_11261 vs SF8: 16 - 7 - 17 [0.613] 40
Elo difference: 79.53 +/- 84.63
Suddenly, from really diverse openings, Lc0 stops being unbeatable, and its rating drops by almost 100 Elo points (although "Elo" here is just a number).
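As a rough cross-check of the two score lines above, here is a Python sketch that recomputes the Elo difference and an approximate 95% margin from wins, losses and draws. It uses a plain normal approximation on the per-game score; that is an assumption on my part and not necessarily the exact method the match tool uses, so the margins come out close to, but not exactly at, the reported values.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(wins, losses, draws, z=1.96):
    """Elo difference and approximate 95% half-width (normal approximation)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / games
    std_err = math.sqrt(variance / games)
    low = elo_from_score(score - z * std_err)
    high = elo_from_score(score + z * std_err)
    return elo_from_score(score), (high - low) / 2.0

# Score lines above are wins - losses - draws
for label, wld in (("Initial position", (18, 0, 22)),
                   ("4-mover openings", (16, 7, 17))):
    diff, margin = elo_with_margin(*wld)
    print(f"{label}: {diff:.2f} +/- {margin:.2f} Elo")
```

Note that these margins only describe the statistical error within each test condition. The gap between the two conditions is the systematic part, and no number of games from a single starting position reduces it.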
Therefore, from 1 Initial Board position, I have no high confidence in their results against either SF8 or SF9, nor in the difference between them, even if they are "statistically significant" (meaning a large number of games). They have a large systematic error.
DeepMind would say that A0 is not taught to play from 4-mover books and that Chess really is played from the Initial Board position. But even in that case, just by using a diversified polyglot book in Cutechess-Cli for SF8 and SF9, not even one of very high quality, they could have come up with more relevant results. They did come up with a relevant result from TCEC openings, and as you see, the result against SF8 is a bit different. I do stand by my gut feeling (also from experience with Lc0), and somehow even by my simplistic model, that from TCEC openings A0 and SF10 are fairly evenly matched, maybe (I said maybe) with a slight advantage for SF10. I don't know why you cannot consider my opinions a bit more thoughtfully.