Alphazero news

matthewlai · Post by **matthewlai** » Thu Dec 13, 2018 4:00 pm

Thomas A. Anderson wrote: ↑Thu Dec 13, 2018 3:47 pm
matthewlai wrote: ↑Wed Dec 12, 2018 7:40 pm
Thomas A. Anderson wrote: ↑Wed Dec 12, 2018 5:53 pm Sounds like a very reasonable explanation. But "being good" seems to be an attribute of a position that is much more subjective than I thought. BF plays the positions resulting from its book successfully against a non-book SF; that is its purpose, the reason why it exists. This superiority is, as far as I know, confirmed by every match against the "usual suspects", meaning the crowd of AB engines. Now it seems that this is reversed when the book is used against AZ. Of course, you can build books specifically against certain opponents: say SF handles KID positions better than Engine A but plays them worse than Engine B. Then a book that forces SF into the KID might work well against Engine A but fail against Engine B. But we are talking about the starting position and a complete book, which certainly wasn't tested only against a few narrow opening lines, because the AZ used was playing with diversity activated. How big was the diversity of those games?
That is a very reasonable explanation, too. We do find that SF and AZ win and lose for very different reasons (AZ often loses to crazy and amazing tactics SF finds for example, that AZ just doesn't have enough NPS to see, while still being able to search deep), so although the strengths are in the same ballpark, there are certainly positions where one does much better than the other, and vice versa.

It's surprisingly difficult to quantify diversity. While it's obvious that if two games are exactly the same there is a lack of diversity, once we go beyond that it's very difficult to quantify, and we don't usually get identical games. For example, there are transpositions, or games that are substantially similar except for a few irrelevant pieces at different places, etc. We didn't look too much into this because there are just too many possibilities and it's not part of the main results. We really only did it because people said they wanted to see it, but I don't think there's really much scientific value.
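As a rough illustration of why this is hard, here is a minimal Python sketch of one crude diversity measure: count the distinct opening prefixes in a set of games and compute the Shannon entropy of their distribution. The function name `opening_diversity`, the `plies` cutoff, and the move-list input format are assumptions made for illustration, not anything taken from the paper. The sketch treats transpositions as different openings, which is exactly the weakness described above.

```python
import math
from collections import Counter

def opening_diversity(games, plies=16):
    """Return (distinct_prefixes, entropy_in_bits) over the first `plies` plies.

    `games` is a list of move lists in any consistent notation (hypothetical input format).
    """
    prefixes = Counter(tuple(game[:plies]) for game in games)
    total = sum(prefixes.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in prefixes.values())
    return len(prefixes), entropy

games = [
    ["e4", "e5"],
    ["e4", "e5"],  # exact repeat of the first game
    ["d4", "d5"],
    ["c4", "e5"],
]
distinct, entropy = opening_diversity(games, plies=2)

# Limitation: the same position reached by a different move order counts as a new opening.
transposed = [["Nf3", "d5", "d4"], ["d4", "d5", "Nf3"]]
distinct_t, _ = opening_diversity(transposed, plies=3)
```

A measure that handled transpositions would have to compare positions rather than move lists, and "substantially similar" games would need some distance function on top of that.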

I would assume that AZ was playing different moves from move 1 on, because there should be several of them within the 1% range already. We would need the games to answer the question conclusively, but my gut feeling is that there is something hidden here that we can learn a lot from. The most "zero-ish" opening book we have shifts the match score of SF playing white against AZ playing black from a 1-95-4 % shape towards a 9-73-18 % shape (both are rough values derived from the published graphs; format: SF wins-draws-AZ wins). Another interesting fact: the BF book works well for SF when it is playing the black pieces and fails only as white. This evens out and leads to the statement in the paper that the use of the opening book didn't have a significant impact on the total match score.
Yeah many moves at move 1 (and the next few moves) have very similar values. It's possible that with diversity it's just taking SF out of book earlier or something like that. This is pure speculation.
Matthew, as I see from our conversation (and even more in other threads/posts/forums), there is a lot of uncertainty and speculation around that could best be resolved by having the games from all tests at hand (especially the SF9/BF/opening-position tests). Do you think there is any chance to get them?

Hi Thomas, for supplementary data for Science (and any other reputable journal), we had to select what to include based on what we (and reviewers) thought were scientifically valuable and relevant to our claims in the paper. That's why only data supporting our main claims have been included. Unfortunately it's above my pay grade to decide to release data that hasn't been released, but hopefully Lc0 will catch up soon since we have at this point released more or less all relevant details of our algorithms (and I am happy to continue clarifying any points of confusion in our pseudo-code), and then there will be a publicly-accessible way to generate those data!

Does SF do better against Lc0 with the Brainfish opening book (using whatever setting people think is optimal) at long time control, with diversity ensured with TCEC openings for example? I think the result of that would answer most of the questions here. At short time control I am pretty sure the opening book will help, but at long time control I am much less certain.

Laskos · Post by **Laskos** » Thu Dec 13, 2018 6:52 pm

I will reply just briefly, as I am a bit bored of nailing down every detail of what I posted, and I have a dinner out in less than an hour.

Thomas A. Anderson wrote: ↑Thu Dec 13, 2018 3:38 pm
Laskos wrote: "A0 (Lc0) in a pool of regular engines is not obeying the Elo model."
Exactly! That's why I didn't agree with your Elo math:

Laskos wrote: A0 vs SF8
+17 =75 -8
+31 Elo points
Now, at first glance one can almost surely say that SF10 would have performed better, like:
SF10 vs SF8
+22 =73 -5
+60 Elo points

What math or model is here? The first result is the one A0 got against SF8 from TCEC openings under their conditions. The second result is my fiction, not any model or math. From my experience with TCEC superfinals, SF10 against SF8 would "score something like that" in the TCEC superfinal. In other words, better than what A0 got. But I then stated (read again) that it doesn't mean that SF10 would beat A0 from TCEC positions, because A0 does not obey the Elo model. If instead of A0 it were Komodo 12, I would have said that SF10 would most likely beat Komodo 12 in the TCEC superfinal just by seeing these results, because most strong regular engines obey the Elo model.
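For readers who want to check the "+31" and "+60" figures quoted above, they follow from the standard logistic conversion of a match score into an Elo difference. The sketch below is only that arithmetic; the function name `elo_from_wdl` is made up for illustration, and the second result is, as stated, a guess at a score rather than a measured one.

```python
import math

def elo_from_wdl(wins, draws, losses):
    """Elo difference implied by a win/draw/loss record (logistic model)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return 400.0 * math.log10(score / (1.0 - score))

a0_vs_sf8 = elo_from_wdl(17, 75, 8)     # published TCEC-openings result
sf10_vs_sf8 = elo_from_wdl(22, 73, 5)   # the hypothetical result from the post
```

Rounded to whole points, these give the two Elo values quoted in the post.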

Laskos wrote:
Thomas A. Anderson wrote: Matthew told us the number of games in the matches against ~SF9/Brainfish/etc. has been "more than high enough that the result is statistically significant"
All those matches, aside from the TCEC openings match against SF8, were fairly deterministic from the SF side, and A0 was also fairly deterministic and close to its most trained lines. The BrainFish match was SF8 + Cerebellum "best move opening", not truly diversified openings with the full BrainFish engine, the diversification coming from A0, playing again into A0 hand. "Number of games" relates to statistical errors, not systematic errors. Suppose you have completely deterministic engines and 1 starting position, and A0 wins as White and draws as Black. Then A0 scores 1.5/2 in 2 games and 1500/2000 in 2000 games. What would you mean here by "statistically significant"? This is the bad practice of introducing a huge systematic error, which overwhelms the statistical error after even 5 games. Also, these somewhat deterministic "high statistical significance" matches usually play into A0's hands, as A0 is close to its optimum in these not very diversified games.
You are right that SF didn't diversify, but to avoid your systematic-error scenario, isn't it sufficient if A0 diversifies? As Matthew told us, A0 starts varying from the first move on. Whether this leads to even more diversification than using a 4-move book and two engines with no or normal diversification is unknown. But at least it certainly avoids this kind of 1500/2000 scenario.

If you don't like that 4-mover suite, allow SF8 or SF9 or whatever to use a diversified polyglot book, with A0 playing its best moves right from the start. Anything against that?
The diversity achieved by A0 diversification is not very high and is "A0 diversity".
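The 1.5/2 versus 1500/2000 example above can be put in numbers. The sketch below computes the naive standard error of the match score as if every game were an independent sample; the function name `score_and_naive_se` is an assumption for illustration. With fully deterministic engines the 2000 games are the same two games replayed 1000 times, so the shrinking error bar says nothing about other openings.

```python
import math

def score_and_naive_se(wins, draws, losses):
    """Return (mean score, standard error) treating every game as independent."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mu) ** 2
           + draws * (0.5 - mu) ** 2
           + losses * (0.0 - mu) ** 2) / n
    return mu, math.sqrt(var / n)

mu_2, se_2 = score_and_naive_se(1, 1, 0)              # one win, one draw
mu_2000, se_2000 = score_and_naive_se(1000, 1000, 0)  # same pair replayed 1000 times
```

The mean stays at 75% in both cases while the naive error bar shrinks by a factor of about 30, even though no new information was collected. That gap between the reported and the real uncertainty is the systematic error being argued about.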

Laskos wrote: the diversification coming from A0, playing again into A0 hand
You think that forcing A0 not to play its favorite move is disadvantageous for SF? It seems that something in that direction was observed in the test, but I can hardly imagine that forcing an engine to play suboptimal moves is unfair to the opponent. If this really is a validated outcome of the tests, it's one more interesting observation that should make us think. I would rather expect that increasing the variety by that means lowers the strength of AZ in the first place. As a follow-up effect, AZ might benefit from taking SF by surprise, entering lines that SF hasn't spent much search effort on. So maybe a search algorithm advantage (MCTS being better than AB when it comes to leaving the main path) might even out and surpass the evaluation part in this case? But I'm only speculating at this point.

Variety should not be achieved by what A0 thinks is close to ITS optimal. This is skewed variety, favoring A0.

Laskos wrote: Therefore, from 1 Initial Board position, I have no high confidence in their results either against SF8 or against SF9, or in the difference between them, be they "statistically significant" (meaning a large number of games) or not. They have a large systematic error.

DeepMind would say that A0 is not taught to play from 4-mover books and that chess really starts from the Initial Board position. But even in that case, just by using a polyglot book, diversified in Cutechess-Cli, for SF8 and SF9, not even of very high quality, they could have come up with more relevant results. They did come up with a relevant result from TCEC openings, and you see, the result against SF8 is a bit different. I do stand by my gut feeling (also from experience with Lc0), and to some extent by my simplistic model, that from TCEC openings A0 and SF10 are fairly matched, maybe (I said maybe) with a slight advantage for SF10. I don't know why you cannot take my opinions a bit more thoughtfully.
Well, I think DM has anticipated/shared your complaints about the monocultural nature of testing with the initial position only. Therefore they did the tests with TCEC- and Human-opening starting positions you mentioned.
Laskos wrote: They did come up with a relevant result from TCEC openings, and you see, the result against SF8 is a bit different.
Let's look at what is "a bit different":
initial position: AZ scored ~57.5%
Human openings: AZ scored ~59.5%
TCEC openings: AZ scored ~57%

The results confirm that the tests from the initial position generalize, or even show a slight disadvantage for AZ in starting from it.
Since they did three tests addressing the initial-position "problem" (TCEC opening positions, human opening positions, and BF opening book usage), I am far from accusing them of having made a "systematic error".

Cerebellum best book lines give even worse diversification to SF8 than SF8 itself on 44 threads. I have no confidence in that result. The "Human Openings" given in the A0 preprint (those 12) heavily favored A0, and again I have no confidence in that result. I don't even know how many openings they chose; the same 12?

And if you cannot see the difference between

+155 -6 =839 from Initial Board position and
+17 -8 =75 from TCEC positions (the only reliable diversified openings result)

with math involved like Normalized Elo or Wilo, I cannot help you in 2 minutes.
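The two measures named above are not defined in the post, so the following sketch rests on assumptions: "Wilo" is taken to be the Elo difference computed from decisive games only (draws ignored), and "normalized Elo" is taken to be the score advantage divided by the per-game standard deviation of the result. Other definitions exist, some with an extra scaling constant, and the function names are invented for illustration.

```python
import math

def wilo(wins, losses):
    """Elo difference from decisive games only (assumed definition of Wilo)."""
    return 400.0 * math.log10(wins / losses)

def normalized_elo(wins, draws, losses):
    """(score - 0.5) / per-game standard deviation (assumed, unscaled definition)."""
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mu) ** 2
           + draws * (0.5 - mu) ** 2
           + losses * (0.0 - mu) ** 2) / n
    return (mu - 0.5) / math.sqrt(var)

initial_wilo = wilo(155, 6)                   # +155 -6 =839 from the initial position
tcec_wilo = wilo(17, 8)                       # +17 -8 =75 from TCEC openings
initial_norm = normalized_elo(155, 839, 6)
tcec_norm = normalized_elo(17, 75, 8)
```

Under these assumed definitions the initial-position match comes out several times stronger for A0 on the win/loss ratio and roughly twice as strong on the normalized measure. The plain score percentages hide that gap.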

I don't know what happened in your LC0 tests; odd results. While 40 games per test might leave the door open for statistical errors, I would be curious about the games from the 18 - 0 - 22 run. A diversity issue?

Sure a diversity issue.

Btw., while digging through the newly published games, I wonder whether, by trying to use many different starting positions, we are asking for a different kind of trouble: it might be that there are commonly "accepted" positions that are, when playing against a sufficiently strong entity, already lost or very close to it. Playing that kind of position is probably really undesirable.

Again, if A0 is so smart, let it alone to play its best moves right from the beginning against a full (diversified openings) BrainFish of early 2018, and let's see the result. Matthew is right that the opening books help less with increased TC, but allow for "BrainFish diversity", as A0 all alone by itself is so smart that it sees that any BrainFish diversity results in lost games for Brainfish.

matthewlai · Post by **matthewlai** » Thu Dec 13, 2018 7:42 pm

Laskos wrote: ↑Thu Dec 13, 2018 6:52 pm If you don't like that 4-mover suite, allow SF8 or SF9 or whatever to use a diversified polyglot book, with A0 playing its best moves right from the start. Anything against that?

If we did that, the criticism we would be getting now would be something like "those books aren't very strong, and SF would have played different moves in some of those positions that are probably better". Like you said, that wouldn't be fair for SF, and that surely would have been picked out if we did that and won. I hope it's obvious why we couldn't have done that for the paper. We weakened AZ for diversity instead.

Yes, there are always other things we can do, other tests we can run, other ways we can evaluate.

Variety should not be achieved by what A0 thinks is close to ITS optimal. This is skewed variety, favoring A0.

This doesn't make any sense. We have one player playing what it thinks is the best move all the time, and another player that doesn't. This favours SF for the same reason that you think the setup above favours AZ (1 player forced to diversify from what it thinks is best). Obviously in chess both players will try to steer the game into a position that they think is good for themselves. That's the whole point of the game. The Brainfish book will be trying to steer the game into a position that's good for SF. We forced AZ to do worse at that.

Cerebellum best book lines give even worse diversification to SF8 than SF8 itself on 44 threads. I have no confidence in that result. The "Human Openings" given in the A0 preprint (those 12) heavily favored A0, and again I have no confidence in that result. I don't even know how many openings they chose; the same 12?

Yes, the same 12. The openings are the most commonly played openings from a large set of high-level human games (I don't remember which set off the top of my head). There is no cherry-picking. They were the top 12. If they favour AZ, that just means AZ plays common human openings better.

Yes, the score is more equal with TCEC openings, because there are many TCEC openings where both SF and AZ agree that one side is significantly ahead out of the opening. In the extreme case, if every opening starts with 95% winrate for one side, you'll get 0 Elo between any two reasonably strong players, even if one is 500 Elo stronger than the other. We also explained that in the paper. For someone with your statistical background, it really shouldn't be difficult to see.
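A toy model makes the extreme case above concrete. It assumes that a fraction `p_decided` of openings is won by the favoured colour no matter which engine plays it, that each opening is played twice with colours reversed, and that the remaining openings follow the usual logistic Elo expectation. The function name, the parameter names, and the all-or-nothing treatment of "decided" openings (a simplification of the 95% winrate case) are assumptions for illustration only.

```python
import math

def measured_elo(true_elo_diff, p_decided):
    """Elo difference a match would show when some openings are already decided."""
    # Expected score of the stronger engine in a balanced opening.
    fair_score = 1.0 / (1.0 + 10.0 ** (-true_elo_diff / 400.0))
    # Decided openings split 1-1 over the colour-reversed pair, i.e. 50% each.
    score = p_decided * 0.5 + (1.0 - p_decided) * fair_score
    return 400.0 * math.log10(score / (1.0 - score))

all_decided = measured_elo(500, 1.0)   # every opening decided
none_decided = measured_elo(500, 0.0)  # no opening decided
half_decided = measured_elo(500, 0.5)  # somewhere in between
```

In this model the measured difference falls from the true 500 Elo to zero as `p_decided` goes from 0 to 1, so lopsided openings compress the observed gap without changing the engines' real strength.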

Again, if A0 is so smart, let it alone to play its best moves right from the beginning against a full (diversified openings) BrainFish of early 2018, and let's see the result. Matthew is right that the opening books help less with increased TC, but allow for "BrainFish diversity", as A0 all alone by itself is so smart that it sees that any BrainFish diversity results in lost games for Brainfish.

The result will be people (rightly) saying it's not a fair match and doesn't mean anything.

noobpwnftw · Post by **noobpwnftw** » Thu Dec 13, 2018 8:14 pm

Laskos wrote: ↑Thu Dec 13, 2018 6:52 pm I will reply just briefly, as I am a bit bored of nailing down every detail of what I posted, and I have a dinner out in less than an hour.

Don't you see that it is virtually impossible to make it "fair" when one is the direct result of a statistical fitter over a defined set while the other is not, and you have them play on the same defined set? It's like training a dog to bite, and it bites well; now you judge a man and a dog and argue that whichever bites better is the one more capable of learning. So this is going nowhere: we cannot even decide whether this particular NN is the man or the dog (note that the NN "knowledge" is not likely to be transferable).

noobpwnftw · Post by **noobpwnftw** » Thu Dec 13, 2018 10:01 pm

There is no hard proof that introducing diversity to its opening moves "weakens" the engine in any way, that's like saying Berlin is inferior to French, one could also make an argument saying that they are exactly equivalent and if some statistics are favoring one over the other, then those statistics are flawed, which both can be verified only if chess is solved.

I also find it interesting that people tend to link "zero" to ground truth, but it turns out that it is based on statistics, and statistics are often far off; maybe that's why nobody is interested in measuring its performance against known solved games like checkers, or where chess has its tablebases.

hgm · Post by **hgm** » Thu Dec 13, 2018 10:40 pm

noobpwnftw wrote: ↑Thu Dec 13, 2018 10:01 pm that's like saying Berlin is inferior to French, one could also make an argument saying that they are exactly equivalent and if some statistics are favoring one over the other, then those statistics are flawed, which both can be verified only if chess is solved.

No. Solving Chess would only make things worse. It would classify a large range of openings, varying from very poor to very good, as draws, completely unable to make the distinction between 'nearly lost' and 'nearly won'. Just like playing KBPPKB from a 6-men EGT would not make the engine think it does anything bad when it blunders away a Bishop and two Pawns in the first 3 moves, when the initial position was a very-hard-to-hold draw.

Laskos · Post by **Laskos** » Thu Dec 13, 2018 11:11 pm

matthewlai wrote: ↑Thu Dec 13, 2018 7:42 pm
Laskos wrote: ↑Thu Dec 13, 2018 6:52 pm If you don't like that 4-mover suite, allow SF8 or SF9 or whatever to use a diversified polyglot book, with A0 playing its best moves right from the start. Anything against that?
If we did that, the criticism we would be getting now would be something like "those books aren't very strong, and SF would have played different moves in some of those positions that are probably better". Like you said, that wouldn't be fair for SF, and that surely would have been picked out if we did that and won. I hope it's obvious why we couldn't have done that for the paper. We weakened AZ for diversity instead.

Yes, there are always other things we can do, other tests we can run, other ways we can evaluate.

Variety should not be achieved by what A0 thinks is close to ITS optimal. This is skewed variety, favoring A0.
This doesn't make any sense. We have one player playing what it thinks is the best move all the time, and another player that doesn't. This favours SF for the same reason that you think the setup above favours AZ (1 player forced to diversify from what it thinks is best). Obviously in chess both players will try to steer the game into a position that they think is good for themselves. That's the whole point of the game. The Brainfish book will be trying to steer the game into a position that's good for SF. We forced AZ to do worse at that.

Cerebellum best book lines give even worse diversification to SF8 than SF8 itself on 44 threads. I have no confidence in that result. The "Human Openings" given in the A0 preprint (those 12) heavily favored A0, and again I have no confidence in that result. I don't even know how many openings they chose; the same 12?
Yes, the same 12. The openings are the most commonly played openings from a large set of high-level human games (I don't remember which set off the top of my head). There is no cherry-picking. They were the top 12. If they favour AZ, that just means AZ plays common human openings better.

Yes, the score is more equal with TCEC openings, because there are many TCEC openings where both SF and AZ agree that one side is significantly ahead out of the opening. In the extreme case, if every opening starts with 95% winrate for one side, you'll get 0 Elo between any two reasonably strong players, even if one is 500 Elo stronger than the other. We also explained that in the paper. For someone with your statistical background, it really shouldn't be difficult to see.

As I have very limited options to answer, I just want to remark that previous TCEC superfinals, between Elo-model-obeying engines separated by nowhere near 500 Elo points, had even more skewed results from these "outrageous" openings.

Again, if A0 is so smart, let it alone to play its best moves right from the beginning against a full (diversified openings) BrainFish of early 2018, and let's see the result. Matthew is right that the opening books help less with increased TC, but allow for "BrainFish diversity", as A0 all alone by itself is so smart that it sees that any BrainFish diversity results in lost games for Brainfish.
The result will be people (rightly) saying it's not a fair match and doesn't mean anything.

noobpwnftw · Post by **noobpwnftw** » Thu Dec 13, 2018 11:43 pm

hgm wrote: ↑Thu Dec 13, 2018 10:40 pm
noobpwnftw wrote: ↑Thu Dec 13, 2018 10:01 pm that's like saying Berlin is inferior to French, one could also make an argument saying that they are exactly equivalent and if some statistics are favoring one over the other, then those statistics are flawed, which both can be verified only if chess is solved.
No. Solving Chess would only make things worse. It would classify a large range of openings, varying from very poor to very good, as draws, completely unable to make the distinction between 'nearly lost' and 'nearly won'. Just like playing KBPPKB from a 6-men EGT would not make the engine think it does anything bad when it blunders away a Bishop and two Pawns in the first 3 moves, when the initial position was a very-hard-to-hold draw.

Well, the "nearly" here is a form of compromise with the truth, because if we have perfect knowledge then the chance that "nearly" happens is zero.

We accept that everyone is using an approximation of truth based on statistics (hand-crafted evaluations also have their roots in statistics). Now, if people apply a naive uniformity over the samples and that makes some approximations fail more often (aka "weakening"), then it tells us something about the approximations, not that we should in turn use biased samples to make them look better.

Laskos · Post by **Laskos** » Fri Dec 14, 2018 12:34 am

Laskos wrote: ↑Thu Dec 13, 2018 11:11 pm
matthewlai wrote: ↑Thu Dec 13, 2018 7:42 pm
Yes, the same 12. The openings are the most commonly played openings from a large set of high-level human games (I don't remember which set off the top of my head). There is no cherry-picking. They were the top 12. If they favour AZ, that just means AZ plays common human openings better.

Yes, the score is more equal with TCEC openings, because there are many TCEC openings where both SF and AZ agree that one side is significantly ahead out of the opening. In the extreme case, if every opening starts with 95% winrate for one side, you'll get 0 Elo between any two reasonably strong players, even if one is 500 Elo stronger than the other. We also explained that in the paper. For someone with your statistical background, it really shouldn't be difficult to see.
As I have very limited options to answer, I just want to remark that previous TCEC superfinals, between Elo-model-obeying engines separated by nowhere near 500 Elo points, had even more skewed results from these "outrageous" openings.

Now that I am back home, about to go to sleep after a crazy evening, I have to say a bit:

From those "outrageous" TCEC 2016 (Season 9) openings, which even an "expert on statistics" (I am not an expert on statistics, I just know what I need) should understand are useless, Stockfish 8 won the superfinal against Houdini 5 with exactly the same score by which A0 beat SF8:

SF8 vs Houdini 5
+17 -8 =75

A0 vs SF8
+17 -8 =75

So, SF8, rated only some 30-40 Elo points above Houdini 5 on several rating lists, with engines obeying the Elo model, somehow from these "outrageous" TCEC openings, managed to beat convincingly (LOS=96%) Houdini 5 in 100 LTC games. It was either an accident, or these TCEC openings do have some significant resolution power.
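The LOS (likelihood of superiority) figure above can be reproduced with the normal approximation commonly used in engine testing, in which draws drop out and only wins and losses matter. The post does not say which formula was used, so this is a sketch under that assumption, and the function name `los` is chosen for illustration.

```python
import math

def los(wins, losses):
    """Likelihood of superiority from wins and losses (normal approximation)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

superfinal_los = los(17, 8)  # the +17 -8 =75 result; draws do not enter
```

For +17 -8 this gives a value of about 0.96, in line with the LOS=96% quoted above.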

That you used the same 12 "human openings" is bad. Everyone who has used them with Lc0 knows that Lc0 overperforms on them by almost 100 Elo points. I am not claiming that you did it intentionally, but as Javier Ross said, most lead to closed positions with few tactics, very favorable to Lc0 (A0).

All in all, I take your "outrageous TCEC openings" result as the most reliable (you introduce a huge systematic error in all the other results), and am of the opinion that from the usual openings used by me or other testers, or from TCEC openings, SF10 and A0 are pretty closely matched. And I would bet 1:1 on SF10 to beat A0 (that one, not the improved one) from TCEC 2016 Season 9 openings in your conditions.

matthewlai · Post by **matthewlai** » Fri Dec 14, 2018 12:52 am

Laskos wrote: ↑Fri Dec 14, 2018 12:34 am
Laskos wrote: ↑Thu Dec 13, 2018 11:11 pm
matthewlai wrote: ↑Thu Dec 13, 2018 7:42 pm
Yes, the same 12. The openings are the most commonly played openings from a large set of high-level human games (I don't remember which set off the top of my head). There is no cherry-picking. They were the top 12. If they favour AZ, that just means AZ plays common human openings better.

Yes, the score is more equal with TCEC openings, because there are many TCEC openings where both SF and AZ agree that one side is significantly ahead out of the opening. In the extreme case, if every opening starts with 95% winrate for one side, you'll get 0 Elo between any two reasonably strong players, even if one is 500 Elo stronger than the other. We also explained that in the paper. For someone with your statistical background, it really shouldn't be difficult to see.
As I have very limited options to answer, I just want to remark that previous TCEC superfinals, between Elo-model-obeying engines separated by nowhere near 500 Elo points, had even more skewed results from these "outrageous" openings.

Now that I am back home, about to go to sleep after a crazy evening, I have to say a bit:

From those "outrageous" TCEC 2016 (Season 9) openings, which even an "expert on statistics" (I am not an expert on statistics, I just know what I need) should understand are useless, Stockfish 8 won the superfinal against Houdini 5 with exactly the same score by which A0 beat SF8:

SF8 vs Houdini 5
+17 -8 =75

A0 vs SF8
+17 -8 =75

So, SF8, rated only some 30-40 Elo points above Houdini 5 on several rating lists, with engines obeying the Elo model, somehow from these "outrageous" TCEC openings, managed to beat convincingly (LOS=96%) Houdini 5 in 100 LTC games. It was either an accident, or these TCEC openings do have some significant resolution power.

That you used the same 12 "human openings" is bad. Everyone who has used them with Lc0 knows that Lc0 overperforms on them by almost 100 Elo points. I am not claiming that you did it intentionally, but as Javier Ross said, most lead to closed positions with few tactics, very favorable to Lc0 (A0).

All in all, I take your "outrageous TCEC openings" result as the most reliable (you introduce a huge systematic error in all the other results), and am of the opinion that from the usual openings used by me or other testers, or from TCEC openings, SF10 and A0 are pretty closely matched. And I would bet 1:1 on SF10 to beat A0 (that one, not the improved one) from TCEC 2016 Season 9 openings in your conditions.

If you say those human openings are bad, what are you comparing them to? Clearly not the start position, but another set of openings?

So if you have a set of openings that SF performs better on, and a set of openings that AZ performs better on, why is it unfair to use the set AZ performs better on, but not the set that SF performs better on? Everything is relative here. We are basically defining a new game that is almost exactly like chess, but starting from arbitrary positions. The positions are part of the rules of this new game, and obviously changing the rules will change how different engines do.

The TCEC openings are all open and tactical openings, favouring SF. Why do you say they are more reliable?

If AZ can always play into closed openings from start position no matter what the opponent does, why should its performance on open openings be reflected in its Elo rating?

The most fair way I can think of to modify the rules to introduce diversity is to let engines play from start position, but force them to diverge from games already played in the match, maybe in the first 10 moves or so. The details would still need to be worked out, but that way engines still have a lot of power to choose openings they want to play (just like humans), but there's enough diversity to get low error on Elo. For example, maybe neither engine is allowed to repeat the same first 8 move sequence from its side, unless the opponent diverges first. With swapping colours this would be fair. So if both players repeat a past game, white loses on move 8.
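The rule floated above could be sketched as follows. The post says the details would still need to be worked out, so this is only one possible reading: a game is compared ply by ply with earlier games in the match, and if the full sequence up to White's 8th move matches an earlier game (so Black never diverged either), White is the one who forfeits. The function name `repetition_loser`, the move-list format, and the choice to count every earlier game regardless of which engine had which colour are assumptions for illustration.

```python
def repetition_loser(past_games, new_game, n_moves=8):
    """Return "white" if new_game repeats an earlier opening, else None.

    White's n-th move is ply 2n - 1, so White is the first side that can
    complete a repeated n-move sequence without the opponent having diverged.
    """
    plies = 2 * n_moves - 1
    if len(new_game) < plies:
        return None  # game has not reached the checkpoint yet
    prefix = tuple(new_game[:plies])
    for game in past_games:
        if tuple(game[:plies]) == prefix:
            return "white"
    return None

# Placeholder move labels; real games would use UCI or SAN strings.
game1 = [f"m{i}" for i in range(20)]
same_opening = list(game1)
black_diverges = game1[:3] + ["x"] + game1[4:]     # Black's 2nd move differs
white_diverges = game1[:14] + ["y"] + game1[15:]   # White's 8th move differs
```

With colours swapped between games, both engines face the same obligation to diverge as White, which is what makes the rule symmetric in the sense described above.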

Or just make it part of the game. Engines are allowed access to past games in the match, and if they decide to repeat games, the loser will just keep on losing. Just like human games.
