Daniel Shawul wrote: ↑Sat Nov 23, 2019 7:46 pm
> chrisw wrote: ↑Sat Nov 23, 2019 7:07 pm
> > Well, it does get the rules of the game told to it, but not in a conventional manner. And it doesn't "learn" the rules in any sort of conventional manner either. It doesn't really learn chess, it creates an internal model of chess, where the rules are probabilistic, not firm as in normal chess.
> The ply=0 is an exception for the sake of complying with game-play rules, not something they actually needed.

That's not how I read it; this seems to be a quite specific masking off of illegals.
From the paper:

> Actions available. AlphaZero used the set of legal actions obtained from the simulator to mask the prior produced by the network everywhere in the search tree. MuZero only masks legal actions at the root of the search tree where the environment can be queried, but does not perform any masking within the search tree. This is possible because the network rapidly learns not to predict actions that never occur in the trajectories it is trained on.
It does get the game rules, but in a different form from being told "bishops move on diagonals" and so on. At the root it is told which moves are legal and which are not: "MuZero only masks legal actions at the root of the search tree where the environment can be queried."
Within the tree it can play any move, including Kh1-d8 if it wants, but since it "learns" from the games it produces, Kh1-d8 won't occur and "the network rapidly learns not to predict actions that never occur in the trajectories it is trained on".
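The root-only masking the paper describes can be sketched like this (a minimal illustration with made-up numbers and a hypothetical helper, not the actual implementation):

```python
def masked_root_priors(priors, legal_mask):
    # Zero out illegal actions and renormalise. This is done only at the
    # root, where the real environment can report which moves are legal.
    # (Hypothetical helper for illustration, not from the paper's code.)
    masked = [p * m for p, m in zip(priors, legal_mask)]
    total = sum(masked)
    return [p / total for p in masked]

# Toy action space of 8 "moves", only 3 of them legal at the root.
priors = [0.2, 0.1, 0.05, 0.05, 0.3, 0.1, 0.1, 0.1]
legal  = [1, 0, 0, 1, 0, 0, 1, 0]
root_priors = masked_root_priors(priors, legal)
# Inside the tree no mask is applied: the raw network priors are used as-is,
# so "Kh1-d8" style actions are only suppressed insofar as the network has
# learned never to predict them.
```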
It creates a probabilistic model of chess with probable rules, and the model is hidden in network weights, as per usual. It ends up being able to simulate playing chess. Which is fine. Why not? Especially if it works and plays strong. It will never actually play illegal moves (because illegal moves are masked away at ply zero), but it will play all manner of illegal things in the search tree. From the paper:

> MuZero does not give special treatment to terminal nodes and always uses the value predicted by the network. Inside the tree, the search can proceed past a terminal node - in this case the network is expected to always predict the same value.
Presumably then, in the tree, states can exist where one or both sides have no king, for example.
I have no idea what this means:

> This is achieved by treating terminal states as absorbing states during training.
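For what it's worth, one common reading of "absorbing state" is that once a game ends, training unrolls that run past the end just keep repeating the terminal state and its value, so the targets remain well defined. A toy sketch of that padding (my interpretation and my own function name, not MuZero's code):

```python
def training_targets(values, terminal_index, unroll_steps):
    # Build value targets for an unroll that may run past the terminal state.
    # Past the end, clamp to the terminal state: it "absorbs" all later steps,
    # so the network learns to keep predicting the same value there.
    targets = []
    for k in range(unroll_steps):
        targets.append(values[min(k, terminal_index)])
    return targets

# Game ends at step 3 with value +1; unrolling 6 steps just repeats the +1.
print(training_targets([0.1, -0.2, 0.4, 1.0], 3, 6))
# [0.1, -0.2, 0.4, 1.0, 1.0, 1.0]
```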
So it is not quite true to say MuZero doesn't get told the rules of the game: it is told at ply zero. The games it generates in self-play will all be legal games (because each move can only be selected from the legal-move list known at ply zero). Each self-play game will also terminate legally, so it will get to understand termination rules via that ply-zero move knowledge.
I'm not sure if it gets told the game score; I'd guess so. That would be more rules information.
For example, in Shogi they could have let it play illegal moves at ply=0 too, AND the rules of Shogi allow it: the rule is that if you make an illegal move you immediately lose the game. They mention in the paper that the way pieces move and legal-move generation are learned quickly, so it could be making illegal moves 1 in 1000 for all we know.
> MuZero only masks legal actions at the root of the search tree where the environment can be queried.

so it will never play an illegal move (at ply zero) and never be presented with an illegal PGN. I think there's quite a difference between "never seeing an illegal trajectory" and seeing illegal trajectories and learning bad scores for each of them. There are, what, 4096 from-to action slots against only 30 or 40 legal moves, so in the latter "learn bad scores" scenario only about 1 in 100 learns is a useful one. Big advantage to the former "never see an illegal trajectory" approach, which comes from knowing the ply-zero move rules.
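Spelling out that back-of-envelope arithmetic with the post's own toy numbers (rough figures, not measured data):

```python
# Back-of-envelope for the legal/illegal ratio (toy numbers from the post):
action_slots = 4096   # 64 x 64 from-to squares
legal_moves = 35      # typical branching factor in a middlegame position
illegal = action_slots - legal_moves
useful_share = legal_moves / action_slots
print(illegal)                       # 4061 illegal action slots
print(round(useful_share * 100, 1))  # ~0.9% of "learn a bad score" updates useful
```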
I’m not trying to trash the MuZero concept here, btw, just trying to understand it. But there’s a big philosophical difference between learning the rules entirely by observation, and being helped.