YUFe wrote: ↑Thu Jan 16, 2020 10:00 am
They represent the entire game, not just the current state.
Mu0 was tested vs A0 with a search budget of 800 (lol) nodes per decision. I doubt Mu0 would be competent at large search depths without modifications, as it only encodes the root and then searches in the latent space.
However, if it were modified to roll out a number of predicted 6-ply sequences on an actual board, encode the resulting positions, and only then search their subtrees in the latent space, repeating the procedure 6, 12, etc. plies from the root, it would have to encode somewhat fewer positions than if it encoded every node in the tree. The savings wouldn't be that big, though, because there would be relatively fewer transposition-table (TT) hits (something like the sketch below).
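For what it's worth, here's roughly how I picture the control flow (hand-wavy Python; Board and its methods, encode, predict and pick_moves are placeholders I'm making up, not anything from the MuZero paper):

[code]
# Sketch of the "re-ground every K plies" idea; everything named here
# (Board, encode, predict, pick_moves) is a made-up placeholder.
K = 6              # re-encode on a real board every K plies
MAX_DEPTH = 18     # total lookahead in plies
tt = {}            # transposition table keyed by real-board hashes

def search(board, depth=0):
    """Search the subtree below a *real* board position."""
    if depth >= MAX_DEPTH:
        return predict(encode(board))            # value of the leaf's latent state

    key = board.hash()
    if key in tt:                                # TT hits are only possible at the
        return tt[key]                           # re-grounded depths 0, K, 2K, ...

    latent = encode(board)                       # one encoder call per grounded node
    best = None
    for seq in pick_moves(latent, plies=K):      # a few K-ply sequences found by
        child = board.play(seq)                  # searching in latent space, then
        v = search(child, depth + K)             # replayed on the real board
        best = v if best is None else max(best, v)   # side-to-move bookkeeping omitted

    tt[key] = best
    return best
[/code]

The point being that the encoder and the TT only ever see the positions at depths 0, 6, 12, ..., which is where both the savings and the lost TT hits come from.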
As Mu0's value prediction network would be very deep and have a ton of parameters anyway, adding policy prediction heads to the output layer increased the computational cost per latent state only slightly, so it made sense for DeepMind to add those heads to guide the search.
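To illustrate the marginal-cost point (toy PyTorch with made-up sizes, nothing to do with DeepMind's actual architecture): the shared trunk does essentially all the work, and each head is one extra linear layer.

[code]
# Toy illustration of a shared trunk with cheap value/policy heads; sizes invented.
import torch.nn as nn

LATENT, HIDDEN, N_MOVES = 256, 1024, 362

layers = [nn.Linear(LATENT, HIDDEN), nn.ReLU()]
for _ in range(19):                              # deep, expensive shared trunk
    layers += [nn.Linear(HIDDEN, HIDDEN), nn.ReLU()]
trunk = nn.Sequential(*layers)

value_head  = nn.Linear(HIDDEN, 1)               # scalar value
policy_head = nn.Linear(HIDDEN, N_MOVES)         # move prior: ~2% extra parameters here

def predict(latent_state):
    h = trunk(latent_state)
    return value_head(h), policy_head(h).softmax(dim=-1)
[/code]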
On the other hand, in an algorithm with a lightweight value prediction function, it probably wouldn't make sense to call a separate policy predictor; guiding the search with some simple move ordering heuristic and then with the value predictions for the children would likely be enough. In that case (as far as I understand, this is what you had in mind) an encoder would only be used to map a discrete board state to a lower-dimensional real-valued vector that is easier to feed into a kernel/NN/etc. to predict the value, and that also makes training easier.
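i.e. something in the spirit of the following (again just a made-up sketch; the Board methods, heuristic_order and the linear value model are all placeholders, not a reference to any actual engine):

[code]
# Lightweight alternative: small encoder + cheap value model, no policy net.
# The Board methods, heuristic_order and the linear model are placeholders;
# any regressor (kernel, small NN, ...) could sit in the value model's place.
import numpy as np

def encode_board(board):
    """Map a discrete position to a small real-valued feature vector."""
    return np.asarray(board.features(), dtype=np.float32)   # material, mobility, ...

def predict_value(vec, weights):
    """Cheap value predictor: here just a linear model on the encoded vector."""
    return float(weights @ vec)

def order_moves(board, weights):
    """Order moves by a crude heuristic, then by the children's predicted values."""
    moves = heuristic_order(board.legal_moves())             # e.g. captures/checks first
    scored = [(predict_value(encode_board(board.play(m)), weights), m) for m in moves]
    scored.sort(key=lambda t: t[0])   # child values are from the opponent's side,
    return [m for _, m in scored]     # so ascending order puts the best moves first
[/code]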
I'm not an expert, so take my ramblings with a grain of salt