Training using 1 playout instead of 800

trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Training using 1 playout instead of 800

Post by trulses »

Daniel:
Daniel Shawul wrote: Sat Apr 27, 2019 12:09 am
...

Good point. I thought I lowered it enough when reducing from 0.25 to 0.15, but setting it to 0 (turning it off) already seems to be better overall.
The noise is there to encourage finding bad-looking moves that turn out to be good later, but I guess that can wait until the basic stuff is learned first.
If that noise is disabled, there will be no randomness after the first 30 half-moves are played.
What's the value of the alpha parameter in the Dirichlet noise? Do you use a softmax temperature of 0 after 30 half-moves? I'm no expert, but it seems like you could easily wipe out a winning signal from a position in chess with just a tiny percentage of random moves.
I don't think using draw as baseline matters. ... just shifts the reference value.
If you look at the classical policy gradient, you will see it's just like a supervised classification gradient with the sampled move as the "correct" label. The only difference between the two is that the gradient is scaled by the reward signal (z in the paper). With this in mind, what you're doing is increasing the probability of the moves that led to a draw, but to a lesser extent than those that led to a win; effectively they are being encouraged with half the learning rate of the winning moves.

You're right that it theoretically makes no difference, but when your games overwhelmingly become draws you'll just end up encouraging these moves, eventually wiping out your win and loss signal. There are many ways you can address this; the easiest is to just center draws at 0 so these draw-based gradients don't perform any updates. The alternative is to look into other techniques designed to address imbalanced datasets, like making sure your batches are evenly balanced in terms of moves that led to losses/draws/wins.
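
Roughly, in numpy terms (a minimal sketch of the argument, not anyone's actual training code; names like logits and outcome are illustrative):

import numpy as np

def reinforce_update(logits, sampled_move, outcome, lr=0.01):
    # logits: raw policy-head outputs for one position, shape (num_moves,)
    # sampled_move: index of the move that was actually played
    # outcome: game result from the mover's point of view
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # cross-entropy gradient with the sampled move as the "correct" label ...
    grad = probs.copy()
    grad[sampled_move] -= 1.0

    # ... scaled by the reward signal z.  With draws mapped to 0 they contribute
    # nothing; with {win: 1, draw: 0.5, loss: 0} every draw still pushes its
    # moves up, just at half the learning rate of the winning moves.
    return logits - lr * outcome * grad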

There are also other implications, like slowly shifting your logits; while softmax is technically invariant to a constant offset in the logits, you could eventually run into numerical issues on this front. I have no idea whether that would actually be the case here, since it depends on architecture and hyper-parameters.

Zero-centering your rewards is just a good idea in general. Some implementations go the extra step and normalize the rewards per batch, although I wouldn't recommend this here.
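
For reference only (not a recommendation here, as said), the per-batch version is typically just:

import numpy as np

def normalize_returns(returns, eps=1e-8):
    # zero-center and rescale a batch of returns/rewards
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
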
The 'correct' baseline is to use the current position's evaluation to calculate Advantage = Action-value - Position-value
Estimating the action advantage is interesting, have you tried it? Also, have you tried rewards other than the Monte Carlo return? Have you tried using a discount factor? There are plenty of ways you can include the value net here, like TD(0), TD(lambda), etc.
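
For concreteness, here is a sketch of a generalized-advantage-style estimate that interpolates between TD(0) and the Monte Carlo return (the reward layout and value array are assumptions, not anything from Scorpio, and rewards/values are taken from one fixed side's point of view):

import numpy as np

def advantages(rewards, values, gamma=1.0, lam=0.95):
    # rewards: per-ply rewards for one finished game (all zero except the final result)
    # values:  value-net estimates v(s_t) for every position, with v(terminal) = 0 appended
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD(0) error
        gae = delta + gamma * lam * gae                         # lambda-weighted mixing
        adv[t] = gae
    # lam = 0 gives pure TD(0); lam = 1 gives the Monte Carlo return minus v(s_t)
    return adv
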
Btw, AG's original implementation of policy gradient did not have random sampling or Dirichlet noise, so there is a tendency to converge to local optima
I think they actually did use sampling in the original implementation. Just no softmax temperature or added noise, just like classical REINFORCE. You're also right about the pool of opponents.

Rémi:

The whole exploration issue is why I think entropy regularization is so important; you can indirectly control how uniform your policy is, so you don't converge to a stratified policy too early.
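
A sketch of what that looks like as an extra loss term (the coefficient is purely illustrative):

import numpy as np

def policy_loss_with_entropy(logits, sampled_move, advantage, entropy_coef=0.01):
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    log_probs = np.log(probs + 1e-12)

    pg_loss = -advantage * log_probs[sampled_move]  # REINFORCE / actor term
    entropy = -np.sum(probs * log_probs)            # largest for a uniform policy
    # subtracting the entropy bonus keeps the policy from collapsing too early
    return pg_loss - entropy_coef * entropy
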
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Training using 1 playout instead of 800

Post by Daniel Shawul »

trulses wrote: Sun Apr 28, 2019 5:08 pm What's the value of the alpha parameter in the Dirichlet noise? Do you use a softmax temperature of 0 after 30 half-moves? I'm no expert, but it seems like you could easily wipe out a winning signal from a position in chess with just a tiny percentage of random moves.
alpha=0.3 and beta=1.0. After 30 moves, the best move is played without temperature.
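
For reference, the AlphaZero-style mixing at the root looks like this (a sketch; whether beta here plays the role of the mixing fraction epsilon is an assumption about the naming, not the actual code):

import numpy as np

def add_root_noise(priors, alpha=0.3, epsilon=0.25):
    # priors:  NN policy over the legal root moves (sums to 1)
    # alpha:   Dirichlet concentration (0.3 for chess in AlphaZero)
    # epsilon: fraction of the prior replaced by noise (0.25 in AlphaZero)
    noise = np.random.dirichlet([alpha] * len(priors))
    return (1.0 - epsilon) * np.asarray(priors) + epsilon * noise
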
If you look at the classical policy gradient, you will see it's just like a supervised classification gradient with the sampled move as the "correct" label. The only difference between the two is that the gradient is scaled by the reward signal (z in the paper). With this in mind, what you're doing is increasing the probability of the moves that led to a draw, but to a lesser extent than those that led to a win; effectively they are being encouraged with half the learning rate of the winning moves.

You're right that it theoretically makes no difference, but when your games overwhelmingly become draws you'll just end up encouraging these moves, eventually wiping out your win and loss signal. There are many ways you can address this; the easiest is to just center draws at 0 so these draw-based gradients don't perform any updates. The alternative is to look into other techniques designed to address imbalanced datasets, like making sure your batches are evenly balanced in terms of moves that led to losses/draws/wins.
I actually decided to try policy gradient when I realized that training the policy head with the losing side's moves just doesn't make sense. So I set weight_of_loss = 0, weight_of_wins = 1.0, and weight_of_draws = 0.5, giving the current form of my policy gradient.
Estimating the action advantage is interesting, have you tried it? Also, have you tried rewards other than the Monte Carlo return? Have you tried using a discount factor? There are plenty of ways you can include the value net here, like TD(0), TD(lambda), etc.
I have now implemented actor-critic after saving the MCTS score along with the training data. Actor-critic with advantage is not affected by the proportion of draws and wins. Scores with a bigger deviation from the position evaluation (in other words, things the NN has trouble understanding) are weighted higher.
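
In sketch form (variable names are just illustrative, not the actual Scorpio code):

import numpy as np

def mcts_advantages(mcts_scores, value_estimates):
    # mcts_scores:     search score saved with each training position
    # value_estimates: the value head's evaluation of the same position
    # The difference is the advantage: positions the net already "understands"
    # get small updates, surprising ones get large (positive or negative) updates,
    # and the win/draw imbalance of the data drops out.
    return np.asarray(mcts_scores) - np.asarray(value_estimates)
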
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Training using 1 playout instead of 800

Post by trulses »

Daniel Shawul wrote: Tue Apr 30, 2019 12:40 am ...

I actually decided to try policy gradient when I realized that training the policy head with the losing side's moves just doesn't make sense.
I think losses carry a lot more signal than a draw does. If you just play randomly, a draw is much more likely than any other outcome. Going back to the gradient-scaling argument, if you give moves that led to a loss a negative weight, you will end up lowering the probability of those moves instead of increasing it.
I have now implemented actor-critic after saving the MCTS score along with the training data. Scores with a bigger deviation from the position evaluation (in other words, things the NN has trouble understanding) are weighted higher.
This seems best. If you don't already, I would recommend monitoring the mean and variance of the advantages on a per-batch basis to make sure the training procedure is healthy (mean roughly 0 and variance not too close to 0).
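
Something as simple as this per-batch check is enough (thresholds purely illustrative):

import numpy as np

def check_advantages(advantages, max_abs_mean=0.1, min_std=1e-3):
    adv = np.asarray(advantages)
    mean, std = adv.mean(), adv.std()
    if abs(mean) > max_abs_mean or std < min_std:
        print(f"warning: advantage batch looks unhealthy (mean={mean:.4f}, std={std:.4f})")
    return mean, std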