I can't believe that so many people don't get it!

trulses · Post by **trulses** » Tue Dec 19, 2017 3:31 am

Michael Sherwin wrote:
Ras wrote:
Michael Sherwin wrote:But is that correct?
That is how an NN works. The only common factor with RomiChess is that there is some way of reinforcement learning, but the rest has nothing in common. That's probably what so many people got.

Are there enough neurons to remember millions of these stats
That's not now how an NN works. Memorising is one technique that we humans can do with our brains, but actually, it's the least powerful way even we humans learn. It's about pattern recognition without precise position match, which I guess is exactly what RomiChess does not perform.
And here are some quotes by some that seem to be in the know.

Truls Edvard Stokke
"Hey Michael, very interesting stuff, this seems like a table-based monte carlo policy evaluation. Impressive that you would independently discover such a thing on your own." " However this is indeed a first step towards the policy evaluation used in A0. " Then in his simulation of Ao on a pc he publishes a chart of a search tree with backed up values. And then in other subjects it is mentioned by more than one that A0 stores wins, losses, draws and a winning percentage and you guys don't argue against it. It can't store all that data in the NN. It has to be storing w,l,d,p data somewhere either in memory or on a hard drive. And to say NN does not work that way is ridiculous. NN can analyze stored data. I might not be 100% correct but what you guys are saying is, it is like those that tell me God does not work like that. Well I got news for you, God can work anyway he likes and so can NN. You might be right but don't say stupid things like NN does not work that way, lol. Is there an emoji for frustration?

Hey Michael, I offer you my sincerest apologies if I have misled you in any way. When I said that the policy evaluation you used was a first step towards the learning I truly meant a first step. It's similar in the sense you both use the MC/episode return to train your policy evaluators. The tree from the image I showed is a search from the root position with a certain number of simulations, it's not kept between games and is only a means for the neural network to improve itself.

Keeping the tree between games won't improve the performance, in fact it's likely to make it much worse. This is due to MCTS producing action-value estimates by taking the mean of every evaluation in the subtree corresponding to that action. If you then start mixing in old trees produced by a bad or perhaps even a randomly initialized neural network you're going to be influenced by the poor decisions and misevaluations infinitely far into the past.

However this is not a problem in your learning algorithm, since you (to my knowledge at least) update the table by a fixed centipawn amount. This fixed learning rate means that old values are eventually purged from the table, which is necessary since your policy keeps moving as it takes the table into account from one game to the next. Similarly the training of the neural network in A0 also uses a fixed learning rate (although it was lowered by a fixed amount at certain steps, to help the weights converge).

In A0, the tree is only there to produce labels for the policy and to help you select good moves during a game. It makes sense to keep the sub-tree you visit during a game since for any one game the network is essentially frozen, no updates to the weights are made so you're not measuring a moving target. The key insight to AG0 and A0 is that MCTS can be used a policy improvement operator, MCTS takes in one policy and out comes an improved policy. Couple that with the generalization of a properly trained neural network and apparently you have yourself a killer algorithm.

Michael Sherwin · Post by **Michael Sherwin** » Tue Dec 19, 2017 3:40 am

trulses wrote:
Michael Sherwin wrote:
Ras wrote:
Michael Sherwin wrote:But is that correct?
That is how an NN works. The only common factor with RomiChess is that there is some way of reinforcement learning, but the rest has nothing in common. That's probably what so many people got.

Are there enough neurons to remember millions of these stats
That's not now how an NN works. Memorising is one technique that we humans can do with our brains, but actually, it's the least powerful way even we humans learn. It's about pattern recognition without precise position match, which I guess is exactly what RomiChess does not perform.
And here are some quotes by some that seem to be in the know.

Truls Edvard Stokke

"Hey Michael, very interesting stuff, this seems like a table-based monte carlo policy evaluation. Impressive that you would independently discover such a thing on your own." " However this is indeed a first step towards the policy evaluation used in A0. " Then in his simulation of Ao on a pc he publishes a chart of a search tree with backed up values. And then in other subjects it is mentioned by more than one that A0 stores wins, losses, draws and a winning percentage and you guys don't argue against it. It can't store all that data in the NN. It has to be storing w,l,d,p data somewhere either in memory or on a hard drive. And to say NN does not work that way is ridiculous. NN can analyze stored data. I might not be 100% correct but what you guys are saying is, it is like those that tell me God does not work like that. Well I got news for you, God can work anyway he likes and so can NN. You might be right but don't say stupid things like NN does not work that way, lol. Is there an emoji for frustration?
Hey Michael, I offer you my sincerest apologies if I have misled you in any way. When I said that the policy evaluation you used was a first step towards the learning I truly meant a first step. It's similar in the sense you both use the MC/episode return to train your policy evaluators. The tree from the image I showed is a search from the root position with a certain number of simulations, it's not kept between games and is only a means for the neural network to improve itself.

Keeping the tree between games won't improve the performance, in fact it's likely to make it much worse. This is due to MCTS producing action-value estimates by taking the mean of every evaluation in the subtree corresponding to that action. If you then start mixing in old trees produced by a bad or perhaps even a randomly initialized neural network you're going to be influenced by the poor decisions and misevaluations infinitely far into the past.

However this is not a problem in your learning algorithm, since you (to my knowledge at least) update the table by a fixed centipawn amount. This fixed learning rate means that old values are eventually purged from the table, which is necessary since your policy keeps moving as it takes the table into account from one game to the next. Similarly the training of the neural network in A0 also uses a fixed learning rate (although it was lowered by a fixed amount at certain steps, to help the weights converge).

In A0, the tree is only there to produce labels for the policy and to help you select good moves during a game. It makes sense to keep the sub-tree you visit during a game since for any one game the network is essentially frozen, no updates to the weights are made so you're not measuring a moving target. The key insight to AG0 and A0 is that MCTS can be used a policy improvement operator, MCTS takes in one policy and out comes an improved policy. Couple that with the generalization of a properly trained neural network and apparently you have yourself a killer algorithm.

Thank you for the explanation I can accept that. Still there seems to be similarities so it does not seem fair to say that the two methods are totally disparate and unrelated?

Michael Sherwin · Post by **Michael Sherwin** » Tue Dec 19, 2017 4:56 am

trulses wrote:
Michael Sherwin wrote:
Ras wrote:
Michael Sherwin wrote:But is that correct?
That is how an NN works. The only common factor with RomiChess is that there is some way of reinforcement learning, but the rest has nothing in common. That's probably what so many people got.

Are there enough neurons to remember millions of these stats
That's not now how an NN works. Memorising is one technique that we humans can do with our brains, but actually, it's the least powerful way even we humans learn. It's about pattern recognition without precise position match, which I guess is exactly what RomiChess does not perform.
And here are some quotes by some that seem to be in the know.

Truls Edvard Stokke
"Hey Michael, very interesting stuff, this seems like a table-based monte carlo policy evaluation. Impressive that you would independently discover such a thing on your own." " However this is indeed a first step towards the policy evaluation used in A0. " Then in his simulation of Ao on a pc he publishes a chart of a search tree with backed up values. And then in other subjects it is mentioned by more than one that A0 stores wins, losses, draws and a winning percentage and you guys don't argue against it. It can't store all that data in the NN. It has to be storing w,l,d,p data somewhere either in memory or on a hard drive. And to say NN does not work that way is ridiculous. NN can analyze stored data. I might not be 100% correct but what you guys are saying is, it is like those that tell me God does not work like that. Well I got news for you, God can work anyway he likes and so can NN. You might be right but don't say stupid things like NN does not work that way, lol. Is there an emoji for frustration?
Hey Michael, I offer you my sincerest apologies if I have misled you in any way. When I said that the policy evaluation you used was a first step towards the learning I truly meant a first step. It's similar in the sense you both use the MC/episode return to train your policy evaluators. The tree from the image I showed is a search from the root position with a certain number of simulations, it's not kept between games and is only a means for the neural network to improve itself.

Keeping the tree between games won't improve the performance, in fact it's likely to make it much worse. This is due to MCTS producing action-value estimates by taking the mean of every evaluation in the subtree corresponding to that action. If you then start mixing in old trees produced by a bad or perhaps even a randomly initialized neural network you're going to be influenced by the poor decisions and misevaluations infinitely far into the past.

However this is not a problem in your learning algorithm, since you (to my knowledge at least) update the table by a fixed centipawn amount. This fixed learning rate means that old values are eventually purged from the table, which is necessary since your policy keeps moving as it takes the table into account from one game to the next. Similarly the training of the neural network in A0 also uses a fixed learning rate (although it was lowered by a fixed amount at certain steps, to help the weights converge).

In A0, the tree is only there to produce labels for the policy and to help you select good moves during a game. It makes sense to keep the sub-tree you visit during a game since for any one game the network is essentially frozen, no updates to the weights are made so you're not measuring a moving target. The key insight to AG0 and A0 is that MCTS can be used a policy improvement operator, MCTS takes in one policy and out comes an improved policy. Couple that with the generalization of a properly trained neural network and apparently you have yourself a killer algorithm.

Mr Stokke, I'm left with some questions then I will quietly go off into the night ... because I just ordered a pizza and have to pick it up.

Your opinion is appreciated.
1.) People are calling what I did in RomiChess a cheap book trick and is frowned upon and maybe not allowed in some contest.

2.) News reports are calling RL new. What significance is there in RomiChess demonstrating it 12 years ago January?

3.) Romi does not use fixed centipawn values. They are higher towards the leaves and quite small towards the root. For a more gentile but consistent influence.

4.) When did the MC policy terminology come to be? Was it before January 2006 or after? Did I invent something that did not exist yet? Or did I reinvent something?

5.) Romi is a relatively weak engine without learning enabled. And yet Romi shows large gains in performance when enabled. What are the expected gains if a top AB searcher like SF adopted similar learning? I see at least a 1000 elo increase when trained up. So what are the implications if that were true and can it be true?

6.) Any other related thoughts that you might have.

Thanks! Of course there is no pressure to answer. Just if you want too!

Now I'm going to hunt down my late supper. It looks doubtful that I'm going to get pizza with the K-team as an honorary member. lol

hgm · Post by **hgm** » Tue Dec 19, 2017 8:42 am

Michael Sherwin wrote:That is not crystal clear. Romi only loads the subtree discarding the rest. That does not mean that the rest is trashed. That just means that the hash only loads the subtree. In AG0 they may just be omitting that the whole tree is stored elsewhere.

You are making things up. The papers explain very concisely what they do. Short of claiming that they are die-hard liars, there is no room for any of the things you seem to suspect. And even if they are lying through their teeth, that still would not qualify you to tell us what they did instead. You were not there, looking over their shoulder.

Rebel · Post by **Rebel** » Tue Dec 19, 2017 11:05 am

Daniel Shawul wrote:
Rebel wrote: Maybe you underestimate what can be done by simple hashing. A couple of years ago I created an opening book of 150 million positions (1.6 Gb) made from CCRL/CEGT games (max 30 moves) and analysed positions by Dann Corbit and got a 102 ELO improvement [link].
No, I don't underestimate the value of book learning especially with a deterministic opponent. What I am opposed to is the claiming I did what alphago did, when the only denominator is "learning". Book learning as everybody knows it (whether it is stored in the book or hash_table) is specific to a certain position -- with a 64 bit hash key.

There are enough signs to think that's indeed what happened, learning how to beat SF8 in all variations either from the start position (as HGM suggest it happened) which makes the learning even more easy or from predefined openings to avoid doubles.

Daniel Shawul wrote:AlphaGo's learning (NN training) is learning a general evaluation function. This can be compared to automatic parameter tuning done in chess programs, with the only difference being the NN actually constructs the important features while we have to code in the passed_pawns & king safety features our selves.

I know what the paper says, its 3400+ elo strength comes from 44 million self-play games. How believable is that? It also means, you, me and everybody else can do the same. The paper describes how it is done.

When I take a look at the games (have you?) I see SF8 slaughtered in Morphy style but... without making any calculation mistake which if there are calculation holes in AZ a program like SF8 immediately would punish. And so AZ comes across as the perfect Murphy. And I don't buy it.

Rebel · Post by **Rebel** » Tue Dec 19, 2017 11:20 am

hgm wrote:
Michael Sherwin wrote:That is not crystal clear. Romi only loads the subtree discarding the rest. That does not mean that the rest is trashed. That just means that the hash only loads the subtree. In AG0 they may just be omitting that the whole tree is stored elsewhere.
You are making things up. The papers explain very concisely what they do. Short of claiming that they are die-hard liars, there is no room for any of the things you seem to suspect. And even if they are lying through their teeth, that still would not qualify you to tell us what they did instead. You were not there, looking over their shoulder.

All of the document can be true, except that a paragraph of how AZ learned SF8 first was left out.

Rebel · Post by **Rebel** » Tue Dec 19, 2017 12:24 pm

Guenther wrote:
Rebel wrote:
hgm wrote:
Rebel wrote:
hgm wrote:The 100 games all started from the normal start position.
Nothing of that in the document.
Well, it should have been if they started from non-standard positions. The 10 games they published from that match all started from the standard position.
Nope.

Game-4 AZ-SF 1. d4 e6
Game-5 AZ-SF 1. d4 Nf6
And what?? Of course the MP search randomizes at 64 threads, especially if the search is always terminated after a fixed time...

I don't deny that as a possibile assumption, it's clear as mud until we have all 100 games, if ever. On the other hand, how scientific is such an approach? Makes AZ's life much easier, unfair.

Daniel Shawul · Post by **Daniel Shawul** » Tue Dec 19, 2017 1:36 pm

Rebel wrote:
Daniel Shawul wrote:
Rebel wrote: Maybe you underestimate what can be done by simple hashing. A couple of years ago I created an opening book of 150 million positions (1.6 Gb) made from CCRL/CEGT games (max 30 moves) and analysed positions by Dann Corbit and got a 102 ELO improvement [link].
No, I don't underestimate the value of book learning especially with a deterministic opponent. What I am opposed to is the claiming I did what alphago did, when the only denominator is "learning". Book learning as everybody knows it (whether it is stored in the book or hash_table) is specific to a certain position -- with a 64 bit hash key.
There are enough signs to think that's indeed what happened, learning how to beat SF8 in all variations either from the start position (as HGM suggest it happened) which makes the learning even more easy or from predefined openings to avoid doubles.

Daniel Shawul wrote:AlphaGo's learning (NN training) is learning a general evaluation function. This can be compared to automatic parameter tuning done in chess programs, with the only difference being the NN actually constructs the important features while we have to code in the passed_pawns & king safety features our selves.
I know what the paper says, its 3400+ elo strength comes from 44 million self-play games. How believable is that? It also means, you, me and everybody else can do the same. The paper describes how it is done.

When I take a look at the games (have you?) I see SF8 slaughtered in Morphy style but... without making any calculation mistake which if there are calculation holes in AZ a program like SF8 immediately would punish. And so AZ comes across as the perfect Murphy. And I don't buy it.

Come on Ed, it is not so shocking if you look at how they got to this stage. AlphaGo->AlphaGoZero->AlphaZero. If you don't trust Google, believe Gian Carlo Pascutoo (our chess programmer colleague) who is trying to reproduce AlphaGoZero's success with leelazero and he is already reaching 2000 elos with it last time I checked. Ofcourse he doesn't have the luxury of 5000 TPUs for training, so he is using distributed computing to get to that level in 1 month.

In the first AlphaGo, which mixed supervised learning, it was shown clearly supervised learning contributes less to its strength than reinforcement learning. Everybody was asking why they used human games for training anyway, and indeed in their next paper AlphaGoZero showed the human games were what is holding it back from reaching higher levels! So the second achievement that you are vary suspicious of was not that surprizing for those who follow developments closely.

The third one (with its application to chess and don't forget shogi (which i believe they did to shut up doubters)) demonstrates the generality of the approach to different games -- especially those full of tactics which is the weak point of MCTS. Remeber Go has tactics too like ladders -- a lot of people asked how they solved that during the Lee Sedol match. Then its application to chess was predicted by many in this forum. I am still not so convinced with MCTS for chess because of my little stint with it, but I already learned something that greatly improved MCTS for chess. Even with all my question, I am convinced their approach could work for chess too and would no way think they are lying.

Think about it, lying and lying through these three papers to get attention (if you believe that is what they are striving for), or simple Occam's razor -- what they got and published in 2 nature papers is actually true. I would go for the latter. To scream conspiracy on everything that could be done better is not so productive IMO.

Daniel

hgm · Post by **hgm** » Tue Dec 19, 2017 2:48 pm

Rebel wrote:All of the document can be true, except that a paragraph of how AZ learned SF8 first was left out.

That would make them die-hard liars. Lying by omission is still lying. It would be considered gross scientific fraud.

I was told that at the Free University (or was it UvA) only two thesis defenses in all of the history of the university had not resulted in granting the Ph.D. degree. In one of them the student appeared stone drunk. The other was for a thesis that discussed an experimental treatment of a certain kind of cancer, which by the 10 case studies treated in the thesis looked very good. And then during questioning, it turned out that the fact that 90 other patients submitted to this same treatment had died had been omitted...

trulses · Post by **trulses** » Tue Dec 19, 2017 3:49 pm

I'll put my comments inline.

Michael Sherwin wrote:1.) People are calling what I did in RomiChess a cheap book trick and is frowned upon and maybe not allowed in some contest.

What defines a "cheap book trick" I guess is up to each person. Since you mix this evaluation into your policy what you're doing is a form of policy iteration, so personally I'd say it's more than that.

I can understand the book argument though, at the end of "training" you can look up statistics and values from the table, similar to a book. So for some definitions of "book" it's a book method. There's also no generalization to new states beyond that of direct transpositions.

That said I think what people took issue with here is the way you described A0, MCTS and the NN. There are some significant inaccuracies there and that got the whole thread started off in a bad direction.

As for being allowed in contests I can't say anything beyond that this has to be up to the organizer. I have no clue what kind of planning and rules go into computer chess competitions.

Michael Sherwin wrote:2.) News reports are calling RL new. What significance is there in RomiChess demonstrating it 12 years ago January?

I wouldn't pay much attention to what the news have to say on the topic, in my experience they frequently get technical things wrong.

Michael Sherwin wrote:3.) Romi does not use fixed centipawn values. They are higher towards the leaves and quite small towards the root. For a more gentile but consistent influence.

This is very interesting, if I understand this correctly then you also discovered the value of having a discount factor. Common to RL is to use the gamma symbol for this term, you'll see it pop up in RL equations from time to time. The intuition behind it is that the most recent states are more to "blame" for a reward and therefore receive a stronger signal.

Michael Sherwin wrote:4.) When did the MC policy terminology come to be? Was it before January 2006 or after? Did I invent something that did not exist yet? Or did I reinvent something?

According to Sutton's book, first visit MC policy evaluation dates back to the 1940s. I wouldn't get too hung up on the date though, you weren't aware of it so it might as well have been invented yesterday or at the time of the roman empire and things would've turned out the same.

From what I can tell you independently re-discovered a lot of simple but effective RL. It's very impressive. I'm not 100% on all the details so there might be some novelties in there that haven't been done before.

Michael Sherwin wrote:5.) Romi is a relatively weak engine without learning enabled. And yet Romi shows large gains in performance when enabled. What are the expected gains if a top AB searcher like SF adopted similar learning? I see at least a 1000 elo increase when trained up. So what are the implications if that were true and can it be true?

I know enough that I couldn't possibly give any elo estimates without just pulling numbers out of my behind. I'll say this though, it's in the nature of optimization to get diminishing returns so whatever the gains are they would probably be a lot less than in RomiChess. After all it's easier to gain rating the lower rated you are.

I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!

Re: I can't believe that so many people don't get it!