Training using 1 playout instead of 800

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Training using 1 playout instead of 800

Post by Daniel Shawul »

I was curious how training a network without using MCTS would perform, mostly because it is much faster than 800 playouts.
Though it seems a ridiculous idea at first, it is not at all. For starters, AlphaGo actually used this approach before going full zero.
They trained a policy network in a supervised manner and then used a policy gradient method for reinforcement learning.
I implemented 1-node training and here are the things I noticed:

a) It needs multi-threaded batching capability to get an 800x speedup compared to the 800-playout version.
This also has the additional benefit that each thread generates games with a batch size of 1, i.e. a single-threaded search,
which could be more selective and probably produce stronger games.

b) I train the policy head to match the actual game result. This is quite different from the AlphaZero algorithm, where
the policy head is trained by imitation regardless of the outcome. So I do not train the policy head on moves made by
the losing side. W/D/L are weighted by 1, 0.5 and 0, but I guess one could use different weights like 1, 0.5, 0.2 etc.
My guess is they dropped the policy gradient method for A0 because the 800-playout search gives stronger values even for the losing side,
but is that also the case for 1-playout training?
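In code the weighting amounts to something like this (a minimal numpy sketch with made-up shapes and names, not my actual training code):

Code:

import numpy as np

def weighted_policy_loss(logits, played_move, outcome, weights=(1.0, 0.5, 0.0)):
    # outcome: +1 win, 0 draw, -1 loss, from the point of view of the side that moved
    w = {1: weights[0], 0: weights[1], -1: weights[2]}[outcome]
    p = np.exp(logits - logits.max())            # softmax over the move logits
    p /= p.sum()
    return -w * np.log(p[played_move] + 1e-12)   # weighted cross-entropy on the played move

# A move played by the losing side gets weight 0 and contributes nothing:
logits = np.random.randn(64)                     # hypothetical move-logit vector
print(weighted_policy_loss(logits, played_move=12, outcome=-1))   # -> 0.0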

c) A naive implementation of 1-playout training results in too many draws (by repetition, the fifty-move rule, etc.); after a while 99% of the games become draws. This could be because I don't have the repetition/fifty-move counts as input planes, but I have also noticed similar behavior in lc0 nets, for instance.
My solution was to generate the next player's moves at the root, bias the policy for mates/stalemates/repetitions/fifty-move draws, and add a small bonus
depending on the fifty-move count for pawn/capture moves. This works well.
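The shape of that fix is roughly the following (all flags and constants here are made up for illustration, not the real implementation):

Code:

import numpy as np

def bias_root_policy(priors, draws_now, gives_mate, pawn_or_capture, fifty_count):
    # priors: normalized policy over the legal root moves
    # draws_now / gives_mate / pawn_or_capture: boolean masks over those moves
    priors = priors.copy()
    priors[gives_mate] *= 10.0                       # prefer immediate mates
    priors[draws_now] *= 0.1                         # discourage repetition / fifty-move draws
    priors[pawn_or_capture] += 0.001 * fifty_count   # bonus for moves that reset the counter
    return priors / priors.sum()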

d) I add Dirichlet noise to the policy head and do random sampling based on the policy values. The root moves are not evaluated by the NN, so you only have the policy and the root node's evaluation. For training I use the game outcome (z) and a one-hot encoding of the move made; I think A0 uses pi (the move probabilities obtained from MCTS). I am not sure whether the value and policy networks are coupled well enough for efficient training.
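The noise-and-sample step is basically this (numpy sketch; iirc A0 used alpha = 0.3 and eps = 0.25 for chess):

Code:

import numpy as np

def sample_root_move(priors, alpha=0.3, eps=0.25, rng=None):
    # priors: normalized policy over the legal root moves
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    mixed = (1.0 - eps) * priors + eps * noise
    return rng.choice(len(priors), p=mixed)      # index of the move to play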

I am able to train a net, and it seems to capture the importance of the queen after a while, but the process seems too noisy.
What are your thoughts?

Daniel
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Training using 1 playout instead of 800

Post by trulses »

Daniel Shawul wrote: Fri Apr 26, 2019 8:41 pm
...

b) I train the policy head to match the actual game result. This is quite different from the AlphaZero algorithm, where
the policy head is trained by imitation regardless of the outcome. So I do not train the policy head on moves made by
the losing side. W/D/L are weighted by 1, 0.5 and 0, but I guess one could use different weights like 1, 0.5, 0.2 etc.
My guess is they dropped the policy gradient method for A0 because the 800-playout search gives stronger values even for the losing side,
but is that also the case for 1-playout training?
It would probably benefit you to look at actor-critic methods for this. It is common to subtract a "baseline" from the reward (or outcome, in this case) to reduce variance. In the original AlphaGo paper I think they used 1, 0 and -1 as weights for their REINFORCE gradient. This has the effect of ignoring draws, discouraging actions that led to a loss, and encouraging actions that led to a win.
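As a toy sketch of what that gradient looks like for a softmax policy (with z in {+1, 0, -1} and a baseline of 0, i.e. a draw):

Code:

import numpy as np

def reinforce_logit_grad(logits, action, z, baseline=0.0):
    # Gradient (w.r.t. the move logits) of the loss  -(z - baseline) * log pi(action)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_log_pi = -p
    grad_log_pi[action] += 1.0           # d log pi(a)/d logits = one_hot(a) - pi
    return -(z - baseline) * grad_log_pi

With that convention the draw term vanishes on its own and losing moves get pushed down, rather than simply being dropped.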

...

d) I add Dirichlet noise to the policy head and do random sampling based on the policy values. The root moves are not evaluated by the NN, so you only have the policy and the root node's evaluation. For training I use the game outcome (z) and a one-hot encoding of the move made; I think A0 uses pi (the move probabilities obtained from MCTS). I am not sure whether the value and policy networks are coupled well enough for efficient training.
What parameters do you use for the Dirichlet noise? For an agent that does no search, this could easily be far too much noise to inject. Perhaps annealing the mixing value (epsilon in the paper) to 0 or some small value is something you want to try here.
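Even a simple linear schedule would do (numbers purely illustrative):

Code:

def epsilon_schedule(step, start=0.25, end=0.0, anneal_steps=100_000):
    # Linearly anneal the Dirichlet mixing weight from `start` to `end`.
    t = min(step / anneal_steps, 1.0)
    return (1.0 - t) * start + t * end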

I would encourage you to look into entropy regularization as a replacement in this specific circumstance.
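That is, add a -beta * H(pi) term to the policy loss so the policy keeps some spread on its own instead of relying on injected noise (beta is a small hyperparameter, value made up here):

Code:

import numpy as np

def entropy_penalty(logits, beta=0.01):
    # Term to add to the loss being minimized: -beta * H(pi).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return -beta * entropy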
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Training using 1 playout instead of 800

Post by chrisw »

Daniel Shawul wrote: Fri Apr 26, 2019 8:41 pm
...
I wasted a lot of time on this problem! My conclusion was that policy and value are two very different things. The former is basically spatial, about moving; the latter is basically mass, how much wood, which then translates into win/loss.
If we train on one-ply games, then the policy and the value we teach are, imo, not disconnected enough. The former is too easily mapped onto the latter with just one ply of difference, and if it is too easily mapped then there is not much information difference between the two.
I can't think of another way to make the mapping between the two more complex, or to differentiate them, other than with the search.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Training using 1 playout instead of 800

Post by Daniel Shawul »

trulses wrote: Fri Apr 26, 2019 9:29 pm
What parameters do you use for the Dirichlet noise? For an agent that does no search, this could easily be far too much noise to inject. Perhaps annealing the mixing value (epsilon in the paper) to 0 or some small value is something you want to try here.
Good point. I thought I had lowered it enough when reducing it from 0.25 to 0.15, but setting it to 0 (turning it off) already seems to be better overall.
The noise is there to encourage finding bad-looking moves that turn out to be good later, but I guess that can wait until the basic stuff is learned first.
If that noise is disabled, there is no randomness left after the first 30 half-moves are played.
I would encourage you to look into entropy regularization as a replacement in this specific circumstance.
Thanks, will look into it.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Training using 1 playout instead of 800

Post by Rémi Coulom »

Daniel Shawul wrote: Fri Apr 26, 2019 8:41 pm b) I train the policy head to match the actual game result. This is quite different from the AlphaZero algorithm, where
the policy head is trained by imitation regardless of the outcome. So I do not train the policy head on moves made by
the losing side. W/D/L are weighted by 1, 0.5 and 0, but I guess one could use different weights like 1, 0.5, 0.2 etc.
My guess is they dropped the policy gradient method for A0 because the 800-playout search gives stronger values even for the losing side,
but is that also the case for 1-playout training?
Why not use a proper policy gradient? With the data you generate, you can do policy gradient at the same cost, and I expect it has a better chance of improving the policy.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Training using 1 playout instead of 800

Post by Rémi Coulom »

Rémi Coulom wrote: Sat Apr 27, 2019 2:12 pm
Daniel Shawul wrote: Fri Apr 26, 2019 8:41 pm
...
Why not use a proper policy gradient? With the data you generate, you can do policy gradient at the same cost, and I expect it has a better chance of improving the policy.
Oh, in fact what you do is policy gradient, but your baseline is surprising. As trulses wrote, using a draw as a baseline looks like a better approach.

I have not had very good experience with policy gradient. It tends to converge to bad local optima, and does not mix well with exploration.
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Training using 1 playout instead of 800

Post by Henk »

How much time does it take to compute one training game on your computer?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Training using 1 playout instead of 800

Post by Daniel Shawul »

Oh, in fact what you do is policy gradient, but your baseline is surprising. As trulses wrote, using a draw as a baseline looks like a better approach.
I don't think using a draw as the baseline matters. The 'correct' baseline is to use the current position's evaluation, giving Advantage = Action-value - Position-value. This actually lowers the variance, but I think the former just shifts the reference value.
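In standard policy-gradient notation (nothing specific to my code), any action-independent baseline b(s) leaves the expected gradient unchanged:

\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,(z - b(s))\big],
\qquad
\mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big] = b(s)\,\nabla_\theta \textstyle\sum_a \pi_\theta(a \mid s) = 0.

So subtracting a draw (a constant) versus subtracting V(s) gives the same expected gradient; the latter just reduces the variance more when the value head is decent.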
I have not had very good experience with policy gradient. It tends to converge to bad local optima, and does not mix well with exploration.
Btw, the original AlphaGo implementation of policy gradient did not have random sampling or Dirichlet noise, so there is a tendency to converge to local optima.
I think they used different opponents from the pool of players at the time, and maybe only discovered random sampling and noise later.
How much time does it take to compute one training game on your computer?
With 1 playout, what used to take me a day now takes 30 seconds.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Training using 1 playout instead of 800

Post by Rémi Coulom »

Daniel Shawul wrote: Sun Apr 28, 2019 1:56 pm Btw, the original AlphaGo implementation of policy gradient did not have random sampling or Dirichlet noise, so there is a tendency to converge to local optima.
I think they used different opponents from the pool of players at the time, and maybe only discovered random sampling and noise later.
The mathematical derivation of the policy-gradient formula requires that the game moves are sampled according to the current policy. If they are not, then the gradient estimation becomes incorrect. So, in theory, policy gradient cannot be combined with exploration. It may work a little in practice, but I don't expect it to work well. Dirichlet exploration will more often cause a blunder than an improvement, and it makes it likely that the final result of the game is a consequence of a Dirichlet blunder more than a consequence of the policy.
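For reference, the estimator in question is the standard REINFORCE one,

\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, R\big],

and if moves are actually drawn from a noised distribution \mu \neq \pi_\theta, each term needs an importance weight \pi_\theta(a \mid s)/\mu(a \mid s) to remain unbiased; omitting that weight is exactly the bias I am describing.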