Alphazero news

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

glennsamuel32
Posts: 136
Joined: Sat Dec 04, 2010 5:31 pm
Location: 223

Re: Alphazero news

Post by glennsamuel32 »

A small excerpt from the AlphaGo documentary...

In my mind, the simplest indicator of what AI can do :)

https://www.youtube.com/watch?v=xGar1k7 ... e=youtu.be
Judge without bias, or don't judge at all...
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Alphazero news

Post by matthewlai »

Thomas A. Anderson wrote: Wed Dec 12, 2018 5:53 pm Sounds like a very reasonable explanation. But "being good" seems to be an attribute of a position that is much more subjective than I thought. BF plays the book-resulting positions successfully against a non-book SF; that is its purpose, the reason it exists. This superiority is, as far as I know, confirmed by every match against the "usual suspects", meaning the crowd of AB engines. Now it seems that this is reversed when using the book against AZ. Of course, you can build books specifically against certain opponents: SF handles KID positions better than Engine A but plays them worse than Engine B. Therefore a book that forces SF into the KID might work well against Engine A but fail against Engine B. But we are talking about the starting position and a complete book that certainly wasn't tuned against only some narrow opening lines, because the AZ version used was playing with diversity activated. How big was the diversity of those games?
That is a very reasonable explanation, too. We do find that SF and AZ win and lose for very different reasons (for example, AZ often loses to crazy and amazing tactics that SF finds, which AZ just doesn't have enough NPS to see, even though it can still search deep), so although the strengths are in the same ballpark, there are certainly positions where one does much better than the other, and vice versa.

It's surprisingly difficult to quantify diversity. While it's obvious that if two games are exactly the same there is a lack of diversity, once we go beyond that it's very difficult to quantify, and we don't usually get identical games. For example, there are transpositions, or games that are substantially similar except for a few irrelevant pieces on different squares, etc. We didn't look too much into this because there are just too many possibilities and it's not part of the main results. We really only did it because people said they wanted to see it, but I don't think there's much scientific value in it.
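To make concrete why naive metrics fail, here is a toy sketch (an illustrative metric only, not anything we actually used): it scores diversity by shared opening prefixes, which is exactly what transpositions break.

Code: Select all

# Toy diversity metric: games as SAN move lists, similarity = length
# of the shared opening prefix. Transpositions defeat this entirely,
# which is the point.
from itertools import combinations

def prefix_similarity(game_a, game_b):
    # fraction of the shorter game that is an identical prefix
    shared = 0
    for move_a, move_b in zip(game_a, game_b):
        if move_a != move_b:
            break
        shared += 1
    return shared / max(1, min(len(game_a), len(game_b)))

def mean_pairwise_similarity(games):
    pairs = list(combinations(games, 2))
    return sum(prefix_similarity(a, b) for a, b in pairs) / max(1, len(pairs))

games = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5"],  # Ruy Lopez
    ["e4", "e5", "Nf3", "Nc6", "Bc4"],  # Italian, same 4-move prefix
    ["d4", "Nf6", "c4", "e6"],          # differs from move 1
]
print(mean_pairwise_similarity(games))  # lower = "more diverse", naively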
I would assume that AZ was playing different moves from move 1 on, because there should be some moves within the 1% range already. We would need the games to answer the question definitively, but my gut feeling is that something is hidden here that we can learn a lot from. The most "zero-ish" opening book we have shifts the match score of SF playing the white pieces against AZ playing black from roughly 1-95-4% to 9-73-18% (both are rough values derived from the published graphs; format: SF wins-draws-AZ wins). Another interesting fact: the BF book works well for SF when it plays the black pieces and fails only as white. This evens out and leads to the statement in the paper that the usage of the opening book didn't have a significant impact on the total match score.
Yeah, many moves at move 1 (and the next few moves) have very similar values. It's possible that with diversity it's just taking SF out of book earlier, or something like that. This is pure speculation.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Alphazero news

Post by Laskos »

Thomas A. Anderson wrote: Wed Dec 12, 2018 3:57 pm
Laskos wrote: Tue Dec 11, 2018 8:09 pm
In fact, the main result I take as reliable for A0 from varied openings is the TCEC openings match against SF8:

A0 vs SF8
+17 =75 -8
+31 Elo points

Now, at first glance one can almost surely say that SF10 would have performed better, like:

SF10 vs SF8
+22 =73 -5
+60 Elo points

But it still doesn't mean I have very high confidence that SF10 would beat A0 in 100 games from TCEC openings under their conditions. A0 (and Lc0) is not that "sensitive" to the regular opponent, be it SF8 or SF10, when it is superior. A0 vs an inferior regular engine shows a compressed Elo difference (I showed a model plot in another thread). But in that model, the Elo compression is hardly above a factor of 2 or so. So, I would be fairly confident that A0 and SF10 are quite closely matched playing from TCEC openings, maybe with a slight advantage for SF10.
But as Matthew said, this was the first version of A0; I don't know what they have in hand by now.
Laskos, your post made me smile a little, because it looks like a good example of how hard it is to adopt the "new kind of math" that AZ brought us. I know you as a good analytical mind in this forum, and in this thread too you came up with this compressed-Elo model that tries to make AZ vs. SF match results fit the standard Elo model. Reading your post, you stated at the beginning that you don't have high confidence SF10 would beat AZ, because AZ isn't "sensitive" to regular opponents. Two sentences later, you conclude that, because your model wouldn't explain Elo anomalies of more than a factor of 2, you are fairly confident AZ must be at SF10's strength level, with a possible slight edge for SF10 :) Now that Matthew told us the number of games in the matches against ~SF9/Brainfish/etc. was "more than high enough that the result is statistically significant", how can your compressed-Elo model "explain" that SF9 and SF8 lost against AZ by the same margin (while SF9 is ~30 Elo ahead of SF8)? If the compression model didn't work for the SF8-SF9-AZ trio, why believe it will fit any kind of Elo math in the SF9-SF10-AZ relation? Wouldn't it be more likely that the model doesn't fit the observations/results and therefore has to be dropped/revised?
I think that, to be meaningful, the Elo system needs a certain kind of "transitivity" regarding the strength of the contenders (it's been a long time since my math classes and I might be using the wrong term here). In the case of AZ, this prerequisite is missing. When I need to explain the results, I think of the contenders in a Formula One race: Ferrari is constantly working on improving their cars, season by season, as McLaren etc. do. The 2018 Ferrari is better than the 2017 model, which was better than the 2016 type, and so on. Thinking of the Ferrari as SF and the McLarens as Komodo and so on, everything in the Elo world is fine; transitivity, the rule of three, and comparisons between cars across season boundaries fit more or less well.
Now a new "kid on the block" enters: a car constructed by Gyro Gearloose or Christopher Lloyd. When the car finishes a race, the traditional contenders have no chance by any means. As expected, knowing the constructors, the new car with its "jet propulsion" doesn't manage to finish more than 60% of the races. Now, unlike in Formula One, think of one-on-one matches between the cars, and try to establish a performance rating where you get any meaningful number for the rocket car. What would you think of a calculation like: if the rocket car is a 60-40 favorite against the 2018 Ferrari, and I build a 2019 Ferrari that beats the old model by a higher margin, then I have high confidence that the 2019 Ferrari will beat the rocket car? I believe I'm preaching to the choir, as I remember you stating similar things, and it comes down to the fact that AZ has certain weaknesses that can be exposed by traditional AB engines. But the rate of exposures (resulting in AZ losses) isn't very proportional to the Elo strength of the AB engines. I wouldn't be surprised if engines (Elo-)rated much lower than Stockfish got better results against AZ, because they expose its weaknesses better.
Things are not as dramatic. A0 (Lc0) in a pool of regular engines does not obey the Elo model. Regular engines in a pool of regular engines obey the Elo model fairly well. But, at least from my experience with Lc0, things are not as dramatic as your Formula One case. For example, Lc0 at a Leela Ratio of 2-3 (I have a strong GPU) beats SF8 heavily but loses to SF10 at some time controls.
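To be clear about what I mean by compression, here is a minimal sketch of that kind of model (the factor k=2 is illustrative only; it is not the exact model from the other thread):

Code: Select all

import math

def expected_score(elo_diff):
    # standard logistic Elo expectation
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def compressed_score(elo_diff, k=2.0):
    # same curve, but the effective difference is divided by a
    # compression factor k: the NN engine's edge over weaker regular
    # engines grows more slowly than plain Elo predicts
    return expected_score(elo_diff / k)

print(expected_score(200))    # ~0.76: what Elo predicts for a 200-point gap
print(compressed_score(200))  # ~0.64: what the compressed model predicts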
Matthew told us the number of games in the matches against ~SF9/Brainfish/etc. was "more than high enough that the result is statistically significant"
All those matches apart from the TCEC openings match against SF8 were fairly deterministic from SF's side, and A0 was also fairly deterministic and close to its most-trained lines. The BrainFish match used SF8 + the Cerebellum "best move" opening, not truly diversified openings and the full BrainFish engine; the diversification came from A0, again playing into A0's hand. "Number of games" relates to statistical errors, not systematic errors. Suppose you have completely deterministic engines and 1 starting position, and A0 wins as White and draws as Black. Then you have a 1.5/2 performance of A0 in 2 games and a 1500/2000 performance in 2000 games. What would "statistically significant" mean here? This is the bad practice of introducing a huge systematic error, which overwhelms the statistical error after even 5 games. Also, these somewhat deterministic "highly statistically significant" matches usually play into A0's hand, as A0 is close to its optimum in these not very diversified games.
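A quick sketch of why repeating near-deterministic games is misleading (toy numbers, standard binomial-style error):

Code: Select all

import math

def score_and_stderr(wins, draws, losses):
    # match score and the naive standard error that "number of games"
    # shrinks; it says nothing about systematic error
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2
           + losses * s ** 2) / n
    return s, math.sqrt(var / n)

# deterministic engines, 1 opening: A0 wins as White, draws as Black;
# replaying the identical pair 1000 times changes nothing real, but
# the reported error bar collapses
print(score_and_stderr(1, 1, 0))        # (0.75, ~0.18)
print(score_and_stderr(1000, 1000, 0))  # (0.75, ~0.006)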

Here is the result I posted in another thread:

Initial Board position:
Score of lc0_v19_11261 vs SF8: 18 - 0 - 22 [0.725] 40
Elo difference: 168.40 +/- 67.96
Finished match

Lc0 seems unbeatable here (similar to their results). But from

Adam Hair's 4-mover opening PGN:
Score of lc0_v19_11261 vs SF8: 16 - 7 - 17 [0.613] 40
Elo difference: 79.53 +/- 84.63
Finished match

Suddenly, from real diverse openings, Lc0 stops being unbeatable, and its rating drops by almost 100 Elo points (although "Elo" here is just a number).
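For reference, the "Elo difference" lines above can be reproduced from the W - L - D counts roughly like this (a sketch; Cutechess-Cli's exact interval formula may differ slightly):

Code: Select all

import math

def elo_from_wld(wins, losses, draws):
    # logistic Elo difference and a ~95% margin from a W - L - D line
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / s - 1.0)
    # per-game variance of the score, 95% interval on the score,
    # mapped through the same logistic curve
    var = (wins * (1 - s) ** 2 + losses * s ** 2
           + draws * (0.5 - s) ** 2) / n
    dev = 1.96 * math.sqrt(var / n)
    lo = -400.0 * math.log10(1.0 / (s - dev) - 1.0)
    hi = -400.0 * math.log10(1.0 / (s + dev) - 1.0)
    return elo, (hi - lo) / 2.0

print(elo_from_wld(18, 0, 22))  # ~(168, 68)
print(elo_from_wld(16, 7, 17))  # ~(80, 84)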

Therefore, from the 1 initial board position, I have no high confidence in their results against either SF8 or SF9, or in the difference between them, be they "statistically significant" (meaning a large number of games). They have a large systematic error.

DeepMind would say that A0 was not taught to play from 4-mover books and that chess really starts from the initial board position. But even in that case, just by using a diversified polyglot book in Cutechess-Cli for SF8 and SF9, not even one of very high quality, they could have come up with more relevant results. They did come up with a relevant result from the TCEC openings, and you see, the result against SF8 is a bit different. I stand by my gut feeling (also from experience with Lc0), and even by my simplistic model, that from TCEC openings A0 and SF10 are fairly evenly matched, maybe (I said maybe) with a slight advantage for SF10. I don't know why you cannot take my opinions a bit more thoughtfully.
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: Alphazero news

Post by noobpwnftw »

Laskos:
I admire your patience in describing the significance of the bias introduced into the test results. In fact, it comes from an inherent "feature" of NNs: when they generalize things, you cannot take that out, and it's also why people have invented things like BrainFish and have it probe the book during search. Whether this knowledge is transferable could be found out by having them play in a new domain like FRC or from deeper opening lines.

Just for the sake of measurement, I have started to create a considerably large book by brute force. The idea is simple: wherever there is a chance something can be "generalized", it is also likely to exist in the book; when it is not, I'll take my chances.
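Roughly this kind of loop, sketched here with python-chess and a UCI engine (a toy outline only; the engine path, search limits, and depth below are placeholders, not my actual pipeline):

Code: Select all

# Toy outline of brute-force book building: expand the opening tree
# breadth-first and store an engine's preferred move per position.
# Needs python-chess and any UCI engine binary.
import chess
import chess.engine

def build_book(engine_path="./stockfish", max_plies=4, nodes=100_000):
    book = {}  # EPD position -> best move in UCI notation
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    frontier = [chess.Board()]
    for _ in range(max_plies):  # beware: grows roughly 20x per ply
        next_frontier = []
        for board in frontier:
            key = board.epd()
            if key in book:
                continue  # transposition already analysed
            info = engine.analyse(board, chess.engine.Limit(nodes=nodes))
            book[key] = info["pv"][0].uci()
            for move in board.legal_moves:
                child = board.copy(stack=False)
                child.push(move)
                next_frontier.append(child)
        frontier = next_frontier
    engine.quit()
    return book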
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Alphazero news

Post by jp »

noobpwnftw wrote: Wed Dec 12, 2018 9:04 pm Just for the sake of measurement, I have started to create a considerably large book by brute force. The idea is simple: wherever there is a chance something can be "generalized", it is also likely to exist in the book; when it is not, I'll take my chances.
I don't understand what this sentence means. What do you mean by chance & chances?
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: Alphazero news

Post by noobpwnftw »

jp wrote: Wed Dec 12, 2018 10:55 pm
noobpwnftw wrote: Wed Dec 12, 2018 9:04 pm Just for the sake of measurement, I have started to create a considerably large book by brute force. The idea is simple: wherever there is a chance something can be "generalized", it is also likely to exist in the book; when it is not, I'll take my chances.
I don't understand what this sentence means. What do you mean by chance & chances?
The percentage of unknown knowledge that is not covered by hand-crafted evaluation (1), and the probability that it would make a difference (2).
AW~
Posts: 8
Joined: Thu Aug 16, 2018 10:56 pm
Full name: Adonis Wilson

Re: Alphazero news

Post by AW~ »

Anyone know what would happen to A0 if it kept on training? Assuming it was diagnosed as "done"... but what if it was left on for a week?
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Alphazero news

Post by jp »

AW~ wrote: Thu Dec 13, 2018 2:04 am Anyone know what would happen to A0 if it kept on training?
Look at Fig. 1a. Nothing happens except a big electricity bill.
Thomas A. Anderson
Posts: 27
Joined: Tue Feb 23, 2016 6:57 pm

Re: Alphazero news

Post by Thomas A. Anderson »

Laskos wrote: "A0 (Lc0) in a pool of regular engines is not obeying the Elo model."
Exactly! That's why I didn't agree with your Elo math:
Laskos wrote: A0 vs SF8
+17 =75 -8
+31 Elo points
Now, at first glance one can almost surely say that SF10 would have performed better, like:
SF10 vs SF8
+22 =73 -5
+60 Elo points
Laskos wrote:
Thomas A. Anderson wrote: Matthew told us the number of games in the matches against ~SF9/Brainfish/etc. was "more than high enough that the result is statistically significant"
All those matches apart from the TCEC openings match against SF8 were fairly deterministic from SF's side, and A0 was also fairly deterministic and close to its most-trained lines. The BrainFish match used SF8 + the Cerebellum "best move" opening, not truly diversified openings and the full BrainFish engine; the diversification came from A0, again playing into A0's hand. "Number of games" relates to statistical errors, not systematic errors. Suppose you have completely deterministic engines and 1 starting position, and A0 wins as White and draws as Black. Then you have a 1.5/2 performance of A0 in 2 games and a 1500/2000 performance in 2000 games. What would "statistically significant" mean here? This is the bad practice of introducing a huge systematic error, which overwhelms the statistical error after even 5 games. Also, these somewhat deterministic "highly statistically significant" matches usually play into A0's hand, as A0 is close to its optimum in these not very diversified games.
You are right that SF didn't diversify, but to avoid your systematic-error scenario, isn't it sufficient if A0 diversifies? As Matthew told us, A0 starts varying from the first move on. Whether this leads to even more diversification than using a 4-move book and two engines with no/normal diversification is unknown. But at least it certainly avoids this kind of 1500/2000 scenario.
Laskos wrote: the diversification came from A0, again playing into A0's hand
You think that forcing A0 not to play its favorite move is disadvantageous for SF? :o It seems something in that direction was observed in the test, but I can hardly imagine that forcing an engine to play suboptimal moves is unfair to the opponent. If this really is a validated outcome from the tests, it's one more interesting observation that should make us think. I would rather think that increasing the variety by these means lowers AZ's strength in the first place. As a follow-up, AZ might profit from the fact that taking SF by surprise results in entering lines SF hasn't spent many search operations on. So maybe a search-algorithm advantage (MCTS being better than AB when it comes to leaving the main path) evens out and surpasses the evaluation part in this case? But I'm only speculating at this point.

Laskos wrote: Therefore, from the 1 initial board position, I have no high confidence in their results against either SF8 or SF9, or in the difference between them, be they "statistically significant" (meaning a large number of games). They have a large systematic error.

DeepMind would say that A0 was not taught to play from 4-mover books and that chess really starts from the initial board position. But even in that case, just by using a diversified polyglot book in Cutechess-Cli for SF8 and SF9, not even one of very high quality, they could have come up with more relevant results. They did come up with a relevant result from the TCEC openings, and you see, the result against SF8 is a bit different. I stand by my gut feeling (also from experience with Lc0), and even by my simplistic model, that from TCEC openings A0 and SF10 are fairly evenly matched, maybe (I said maybe) with a slight advantage for SF10. I don't know why you cannot take my opinions a bit more thoughtfully.
Well, I think DM anticipated/shared your complaints about the monocultural nature of testing with the initial position only. Therefore they did the tests with the TCEC and human-opening starting positions you mentioned.
Laskos wrote: They did come up with a relevant result from the TCEC openings, and you see, the result against SF8 is a bit different.
Let's look at what is "a bit different":
initial position: AZ scored ~57.5%
Human openings: AZ scored ~59.5%
TCEC openings: AZ scored ~57%

The results confirm that the tests from the initial position generalize, or even show a slight disadvantage for AZ there.
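Put on the usual logistic Elo scale (a quick sketch), those three scores land within about 20 Elo of each other:

Code: Select all

import math

for label, score in [("initial position", 0.575),
                     ("human openings", 0.595),
                     ("TCEC openings", 0.570)]:
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    print(label, round(elo))  # ~53, ~67, ~49 Elo for AZ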
Having done three tests addressing the initial-position "problem" (TCEC openings, human-opening positions, and BF opening-book usage), I am far from accusing them of having made a "systematic error".

I don't know what happened in your Lc0 tests; odd results. While 40 games/test might leave the door open for statistical errors, I would be curious about the games from the 18 - 0 - 22 run. A diversity issue? :wink:

Btw., while digging through the newly published games, I wonder whether we, by trying to use many different starting positions, are asking for a different kind of trouble: it might be possible that there are commonly "accepted" positions that are, against a sufficiently strong entity, already lost or very close to it. Playing that kind of position is probably really undesirable.
Last edited by Thomas A. Anderson on Thu Dec 13, 2018 4:03 pm, edited 3 times in total.
cu
Thomas A. Anderson
Posts: 27
Joined: Tue Feb 23, 2016 6:57 pm

Re: Alphazero news

Post by Thomas A. Anderson »

matthewlai wrote: Wed Dec 12, 2018 7:40 pm
Thomas A. Anderson wrote: Wed Dec 12, 2018 5:53 pm Sounds like a very reasonable explanation. But "being good" seems to be an attribute of a position that is much more subjective than I thought. BF plays the book-resulting positions successfully against a non-book SF; that is its purpose, the reason it exists. This superiority is, as far as I know, confirmed by every match against the "usual suspects", meaning the crowd of AB engines. Now it seems that this is reversed when using the book against AZ. Of course, you can build books specifically against certain opponents: SF handles KID positions better than Engine A but plays them worse than Engine B. Therefore a book that forces SF into the KID might work well against Engine A but fail against Engine B. But we are talking about the starting position and a complete book that certainly wasn't tuned against only some narrow opening lines, because the AZ version used was playing with diversity activated. How big was the diversity of those games?
That is a very reasonable explanation, too. We do find that SF and AZ win and lose for very different reasons (for example, AZ often loses to crazy and amazing tactics that SF finds, which AZ just doesn't have enough NPS to see, even though it can still search deep), so although the strengths are in the same ballpark, there are certainly positions where one does much better than the other, and vice versa.

It's surprisingly difficult to quantify diversity. While it's obvious that if two games are exactly the same there is a lack of diversity, once we go beyond that it's very difficult to quantify, and we don't usually get identical games. For example, there are transpositions, or games that are substantially similar except for a few irrelevant pieces on different squares, etc. We didn't look too much into this because there are just too many possibilities and it's not part of the main results. We really only did it because people said they wanted to see it, but I don't think there's much scientific value in it.
I would assume that AZ was playing different moves from move 1 on, because there should be some moves within the 1% range already. We would need the games to answer the question definitively, but my gut feeling is that something is hidden here that we can learn a lot from. The most "zero-ish" opening book we have shifts the match score of SF playing the white pieces against AZ playing black from roughly 1-95-4% to 9-73-18% (both are rough values derived from the published graphs; format: SF wins-draws-AZ wins). Another interesting fact: the BF book works well for SF when it plays the black pieces and fails only as white. This evens out and leads to the statement in the paper that the usage of the opening book didn't have a significant impact on the total match score.
Yeah, many moves at move 1 (and the next few moves) have very similar values. It's possible that with diversity it's just taking SF out of book earlier, or something like that. This is pure speculation.
Matthew, as I see from our conversation (and even more in other threads/posts/forums), there is a lot of uncertainty/speculation around that could best be avoided/answered by having the games from all tests at hand (especially the SF9/BF/opening-position tests). Do you think there is any chance of getting them?
cu