Hey Michael, I offer you my sincerest apologies if I have misled you in any way. When I said that the policy evaluation you used was a first step towards the learning, I truly meant a first step. It's similar in the sense that you both use the MC/episode return to train your policy evaluators. The tree from the image I showed is a search from the root position with a certain number of simulations; it's not kept between games and is only a means for the neural network to improve itself.

Michael Sherwin wrote:
And here are some quotes by some that seem to be in the know.

Ras wrote:
That is how an NN works. The only common factor with RomiChess is that there is some way of reinforcement learning, but the rest has nothing in common. That's probably what so many people got.

Ras wrote:
That's not how an NN works. Memorising is one technique that we humans can do with our brains, but actually it's the least powerful way even we humans learn. It's about pattern recognition without a precise position match, which I guess is exactly what RomiChess does not perform. Are there enough neurons to remember millions of these stats?

Michael Sherwin wrote:
But is that correct?
Michael Sherwin wrote:
Truls Edvard Stokke wrote:
"Hey Michael, very interesting stuff, this seems like a table-based Monte Carlo policy evaluation. Impressive that you would independently discover such a thing on your own." "However, this is indeed a first step towards the policy evaluation used in A0."
Then in his simulation of A0 on a PC he publishes a chart of a search tree with backed-up values. And then in other threads it is mentioned by more than one person that A0 stores wins, losses, draws and a winning percentage, and you guys don't argue against it. It can't store all that data in the NN. It has to be storing W/L/D/P data somewhere, either in memory or on a hard drive. And to say an NN does not work that way is ridiculous. An NN can analyze stored data. I might not be 100% correct, but what you guys are saying is like those that tell me God does not work like that. Well, I've got news for you: God can work any way he likes, and so can an NN. You might be right, but don't say stupid things like "an NN does not work that way", lol. Is there an emoji for frustration?
Keeping the tree between games won't improve the performance; in fact, it's likely to make it much worse. This is because MCTS produces its action-value estimates by taking the mean of every evaluation in the subtree corresponding to that action. If you then start mixing in old trees produced by a bad, or perhaps even a randomly initialized, neural network, you're going to be influenced by those poor decisions and misevaluations infinitely far into the past.
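To put a number on that, here is a minimal toy sketch in Python (my own code, not AlphaZero's) of the mean-backup rule described above: once stale evaluations are in the sum they never decay, they can only be outvoted.

```python
# Minimal sketch of mean backup: an edge's action-value estimate Q is the plain
# mean of every evaluation ever backed up through it, i.e. Q = W / N where W is
# the running value sum and N the visit count.

class Edge:
    def __init__(self):
        self.N = 0      # number of simulations that passed through this edge
        self.W = 0.0    # sum of all leaf evaluations backed up through it

    @property
    def Q(self):
        # Mean evaluation of the subtree below this edge.
        return self.W / self.N if self.N > 0 else 0.0

    def backup(self, value):
        self.N += 1
        self.W += value

# If the tree (and with it N and W) were kept between games, evaluations produced
# by an early, poorly trained network would stay in W forever and keep dragging
# down the mean:
edge = Edge()
for _ in range(1000):
    edge.backup(-0.9)   # stale evaluations from a bad or random network
for _ in range(1000):
    edge.backup(+0.5)   # fresh evaluations from the improved network
print(edge.Q)           # roughly -0.2: the old misevaluations still dominate
```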
However, this is not a problem in your learning algorithm, since you (to my knowledge at least) update the table by a fixed centipawn amount. This fixed learning rate means that old values are eventually purged from the table, which is necessary since your policy keeps moving as it takes the table into account from one game to the next. Similarly, the training of the neural network in A0 also uses a fixed learning rate (although it was dropped at a few fixed points during training to help the weights converge).
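And here is a toy illustration of the contrast (again my own sketch; the 2-centipawn step and the alpha value are made-up illustrative constants, not the actual numbers used by RomiChess or by A0's training schedule): a fixed step size, whether applied to a table entry or to network weights, forgets old games geometrically instead of averaging them in forever.

```python
# Toy sketch of why a fixed step size forgets old data (illustrative constants).

def fixed_step_update(bonus_cp, won, step_cp=2):
    """Adjust a move's stored centipawn bonus by a fixed amount per game result."""
    return bonus_cp + step_cp if won else bonus_cp - step_cp

def constant_alpha_update(q, episode_return, alpha=0.05):
    """Incremental update with a constant step size: Q <- Q + alpha * (G - Q).
    Every further update scales an old return's contribution by another factor
    of (1 - alpha), so old data is washed out geometrically -- unlike the plain
    running mean in the previous sketch."""
    return q + alpha * (episode_return - q)

# A table value dragged down by 50 early losses is fully recovered after 50 wins:
bonus = 0.0
for _ in range(50):
    bonus = fixed_step_update(bonus, won=False)   # -100 cp after the bad early games
for _ in range(50):
    bonus = fixed_step_update(bonus, won=True)    # back to 0 cp: old games purged
print(bonus)
```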
In A0, the tree is only there to produce labels for the policy and to help select good moves during a game. It makes sense to keep the sub-tree you visit during a game, since for any one game the network is essentially frozen: no updates to the weights are made, so you're not measuring a moving target. The key insight of AG0 and A0 is that MCTS can be used as a policy improvement operator: MCTS takes in one policy, and out comes an improved policy. Couple that with the generalization of a properly trained neural network, and apparently you have yourself a killer algorithm.
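To spell out what "policy improvement operator" means in practice, here is a small sketch (the search itself is omitted; the function name and the example numbers are mine, not from the paper): the network's prior guides the simulations, and the root visit counts, renormalised, become the sharper policy that is used both to choose the move and as the training label for the policy head.

```python
import numpy as np

def improved_policy(visit_counts, temperature=1.0):
    """Turn root visit counts N(s, a) into policy targets pi(a|s) proportional
    to N(s, a)^(1/temperature), as in the AG0/A0 training setup."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / temperature)
    return counts / counts.sum()

# Example with made-up numbers: the raw prior liked move 0, but the search spent
# most of its simulations on move 2, so the improved policy shifts its mass there.
prior = np.array([0.50, 0.30, 0.15, 0.05])   # network output before the search
visits = [120, 40, 600, 40]                  # simulations per root move
print("prior:", prior)
print("pi:   ", improved_policy(visits))     # ~[0.15, 0.05, 0.75, 0.05]
```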