What's the value of the alpha parameter in the Dirichlet noise? Do you use a softmax temperature of 0 after 30 half-moves? I'm no expert, but it seems like you could easily wipe out a winning signal from a position in chess with just a tiny percentage of random moves.

Daniel Shawul wrote: ↑Fri Apr 26, 2019 10:09 pm
Good point. I thought I had lowered it enough when reducing it from 0.25 to 0.15, but setting it to 0 (turning it off) seems to be better overall.
The noise is there to encourage finding bad-looking moves that turn out to be good later, but I guess that can wait until the basic stuff is learned first.
If that noise is disabled, there is no randomness at all once the first 30 half-moves have been played.
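For reference, the root-noise scheme being discussed follows the AlphaZero recipe of mixing a Dirichlet sample into the root priors. A minimal sketch — the function name is mine, and alpha=0.3, eps=0.25 are the commonly cited AlphaZero chess defaults, not necessarily what either poster used:

```python
import numpy as np

def add_root_noise(priors, alpha=0.3, eps=0.25, rng=None):
    """Mix Dirichlet noise into the root move priors (AlphaZero-style).

    priors : 1-D array of policy probabilities over the legal moves.
    alpha  : Dirichlet concentration; smaller values give spikier noise.
    eps    : mixing weight; eps=0 disables the noise entirely,
             which is the "turning it off" setting mentioned above.
    """
    rng = rng or np.random.default_rng()
    noise = rng.dirichlet([alpha] * len(priors))
    return (1.0 - eps) * np.asarray(priors) + eps * noise

# With eps=0 the priors come back unchanged.
p = add_root_noise([0.7, 0.2, 0.1], eps=0.0)
```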
If you look at the classical policy gradient, you will see it's just like a supervised classification gradient with the sampled move as the "correct" label. The only difference between the two is that the gradient is scaled by the reward signal (z in the paper). With this in mind, what you're doing is increasing the probability of the moves that led to a draw, but to a lesser extent than those that led to a win; effectively they are being encouraged with half the learning rate of the winning moves.

I don't think using draw as baseline matters. ... just shifts the reference value.
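The claim that the policy gradient is just a z-scaled cross-entropy gradient can be checked numerically for a softmax policy. A sketch in plain numpy (function names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(logits, action, z):
    """Gradient of the REINFORCE loss  -z * log pi(action)  w.r.t. the logits.

    For a softmax policy this is  z * (pi - onehot(action)):
    exactly the supervised cross-entropy gradient with the sampled
    move as the label, scaled by the return z.
    """
    pi = softmax(logits)
    onehot = np.zeros_like(pi)
    onehot[action] = 1.0
    return z * (pi - onehot)

logits = np.array([0.5, 1.0, -0.3])
# A move that led to a win (z=1) gets exactly twice the update of
# one that led to a draw encoded as z=0.5 -- the "half the learning
# rate" effect described above.
g_win  = reinforce_grad(logits, action=1, z=1.0)
g_draw = reinforce_grad(logits, action=1, z=0.5)
```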
You're right that it theoretically makes no difference, but when your games overwhelmingly become draws you'll just keep encouraging these drawing moves, eventually wiping out your win and loss signal. There are many ways to address this; the easiest is simply to center draws at 0 so that draw-based gradients perform no updates. The alternative is to look into other techniques designed for imbalanced datasets, such as making sure your batches are evenly balanced in terms of moves that led to losses, draws, and wins.
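Centering draws at 0 makes the fix concrete: with outcomes encoded as {win: +1, draw: 0, loss: -1}, a draw contributes a zero gradient and a flood of draws can no longer drown out the win/loss signal. A sketch, reusing the same softmax-policy gradient as above (names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_grad(logits, action, z):
    # Gradient of -z * log pi(action) for a softmax policy.
    pi = softmax(logits)
    onehot = np.zeros_like(pi)
    onehot[action] = 1.0
    return z * (pi - onehot)

logits = np.array([0.2, -0.1, 0.4])
# Draw centered at 0: no parameter update at all.
g_draw = reinforce_grad(logits, action=0, z=0.0)
# Loss encoded as z=-1: gradient descent now actively lowers the
# sampled move's logit instead of weakly encouraging it.
g_loss = reinforce_grad(logits, action=0, z=-1.0)
```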
There are also other implications, like slowly shifting your logits. While softmax is technically invariant to a constant offset in the logits, you could eventually run into numerical issues on this front. I have no idea whether that would actually happen here, since it depends on the architecture and hyper-parameters.
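The offset point is easy to verify: adding a constant to every logit leaves the softmax output unchanged, but ever-growing logit magnitudes will overflow exp() in a naive implementation — which is why the standard trick subtracts the max first. A small demonstration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # max-subtraction keeps exp() in range
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
shifted = softmax(x + 1000.0)   # identical probabilities to softmax(x)

# A naive softmax without max-subtraction overflows on the same input:
naive = np.exp(x + 1000.0)      # -> inf, and inf/inf gives nan probabilities
```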
Zero-centering your rewards is just a good idea in general. Some implementations go a step further and normalize the rewards per batch, although I wouldn't recommend that here.
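For completeness, the per-batch normalization mentioned (and not recommended) here is usually a mean-subtract-and-divide-by-std over the batch returns. A sketch of why it's dubious for chess outcomes (function name is mine):

```python
import numpy as np

def normalize_rewards(z, eps=1e-8):
    """Per-batch reward normalization: zero mean, unit variance.

    Common in generic policy-gradient code. With chess outcomes in
    {-1, 0, +1} it mostly just rescales the step size, and a batch
    that is nearly all draws has std near zero, so eps is doing a
    lot of work to avoid blowing up the update.
    """
    z = np.asarray(z, dtype=float)
    return (z - z.mean()) / (z.std() + eps)

batch = normalize_rewards([1, 1, 0, 0, 0, -1])
```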
The 'correct' baseline is to use the current position's evaluation as the baseline, giving Advantage = Action-value - Position-value.

Estimating the action advantage is interesting; have you tried it? Also, have you tried rewards other than the Monte Carlo return? Have you tried using a discount factor? There are plenty of ways you can include the value net here, like TD(0), TD(lambda), etc.
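The alternatives listed can be sketched side by side. This assumes a value net v(s) is available; the function names and the reward convention (reward 0 until the game ends) are mine:

```python
def advantage(q, v):
    """A(s, a) = Q(s, a) - V(s): positive means the move did better
    than the position's own evaluation predicted."""
    return q - v

def td0_target(r, v_next, gamma=1.0):
    """TD(0) target  r + gamma * V(s'), used in place of the full
    Monte Carlo return z."""
    return r + gamma * v_next

def td_lambda_return(rewards, values, gamma=1.0, lam=0.9):
    """Lambda-returns over one game, computed backwards via
    G_t = r_t + gamma * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1}).

    rewards[t] is the reward after move t (0 until the final result),
    values[t] is the value-net estimate after move t.
    lam=1 recovers the Monte Carlo return; lam=0 recovers TD(0).
    """
    g = 0.0
    returns = []
    for t in reversed(range(len(rewards))):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        g = rewards[t] + gamma * ((1 - lam) * v_next + lam * g)
        returns.append(g)
    return returns[::-1]
```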
Btw the original AG implementation of policy gradient did not have random sampling or dirichlet noise, so there is a tendency to converge to local optima.

I think they actually did use sampling in the original implementation — just no softmax temperature or added noise, like classical REINFORCE. You're also right about the pool of opponents.
The whole exploration issue is why I think entropy regularization is so important: you can indirectly control how uniform your policy is, so you don't converge to a stratified policy too early.
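The entropy bonus is usually implemented by subtracting a scaled entropy term from the loss, so a sharper (lower-entropy) policy pays a larger penalty. A minimal sketch — beta=0.01 is a typical starting value, not a recommendation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(logits):
    """Shannon entropy of the softmax policy (higher = more uniform)."""
    pi = softmax(logits)
    return -np.sum(pi * np.log(pi + 1e-12))

def loss_with_entropy_bonus(policy_loss, logits, beta=0.01):
    """Subtracting beta * H(pi) rewards staying spread out, which
    delays premature convergence to a near-deterministic policy."""
    return policy_loss - beta * entropy(logits)

# A sharp policy has lower entropy than a uniform one, so the bonus
# pushes back against collapsing onto a single move too early.
h_sharp   = entropy(np.array([5.0, 0.0, 0.0]))
h_uniform = entropy(np.array([0.0, 0.0, 0.0]))
```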