A0 subtlety


Daniel Shawul

Re: A0 subtlety

Post by Daniel Shawul »

jackd wrote: Fri Feb 15, 2019 10:34 pm @Daniel Shawul

How strong is your policy network on its own?
It is weak tactically, so it can be exploited very easily. But its positional play is awesome; in fact it can win games against TSCP from time to time
using only the policy network (1-node MCTS with a 20x256 net).
I don't know its strength, but it adds a lot of positional knowledge to the engine. The Facebook guys also found out, accidentally, that using only the value network gets you to amateur dan level at most in Go.
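
For concreteness, here is a minimal sketch of what playing with the policy head alone might look like: mask out illegal moves and pick the most probable legal one. The policy_net, encode, legal_moves and move_index helpers are hypothetical placeholders for illustration, not Scorpio's actual code.

Code: Select all

import numpy as np

def pick_move_policy_only(position, policy_net, encode, legal_moves, move_index):
    # policy_net returns one logit per slot of a fixed move encoding;
    # encode/legal_moves/move_index are assumed helpers, not real APIs.
    logits = np.asarray(policy_net(encode(position)), dtype=float)
    mask = np.full_like(logits, -np.inf)
    for mv in legal_moves(position):
        mask[move_index(mv)] = 0.0          # keep only legal moves
    masked = logits + mask
    probs = np.exp(masked - masked.max())   # softmax over legal moves only
    probs /= probs.sum()
    return max(legal_moves(position), key=lambda mv: probs[move_index(mv)])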
jackd

Re: A0 subtlety

Post by jackd »

Daniel Shawul wrote: Sat Feb 16, 2019 3:26 pm
Did you train on the ccrl dataset?
http://blog.lczero.org/2018/09/a-standard-dataset.html

I just started training on it last night. Not sure what to expect.
Daniel Shawul

Re: A0 subtlety

Post by Daniel Shawul »

jackd wrote: Sat Feb 16, 2019 4:38 pm
I trained on ccrl+cegt+millionbase+kingbase+chessdb+ssdf etc., for a total of 20 million games. Even after that the network is not that good.
But I suspect the problem is that I haven't lowered the learning rate -- I need to do multiple epochs with a smaller learning rate.
I did the first pass with the Adam optimizer and its default learning rate of 1e-3. Now I am doing the second pass with 1e-4 and already seeing some improvement. Good luck, and don't be disheartened if the ccrl-trained net is not that strong.
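
A rough PyTorch sketch of that two-pass recipe (Adam at its default 1e-3, then the same data again at 1e-4); net, loader and loss_fn stand in for the real training setup and are not Scorpio's actual code.

Code: Select all

import torch

def train_two_passes(net, loader, loss_fn):
    # Pass 1: Adam at its default learning rate of 1e-3.
    # Pass 2: the same data again with the learning rate cut to 1e-4.
    for lr in (1e-3, 1e-4):
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        for inputs, targets in loader:
            opt.zero_grad()
            loss = loss_fn(net(inputs), targets)
            loss.backward()
            opt.step()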
chrisw

Re: A0 subtlety

Post by chrisw »

Daniel Shawul wrote: Sat Feb 16, 2019 5:34 pm
Did you filter for Elo and duplicate games?
jackd

Re: A0 subtlety

Post by jackd »

Daniel Shawul wrote: Sat Feb 16, 2019 5:34 pm
How many positions do you sample per epoch? Also, I won't be disheartened as long as I can get a policy net that makes reasonable-seeming chess moves (hopefully around 1000 human Elo).
Daniel Shawul

Re: A0 subtlety

Post by Daniel Shawul »

I did not filter, but about 80% of the games should be from players rated above 2000 Elo. I am having difficulty getting a super strong net with only supervised learning so far, though I still need to try a few things:

a) Dropping the learning rate after each epoch -- which already seems to help a bit.
b) Ordering games by Elo. Reinforcement learning gives you stronger and stronger games over time, as opposed to the unordered games in supervised learning.
c) Switching optimizers. I am currently using Adam, which adapts the base learning rate for each parameter; A0 used plain SGD with a learning-rate schedule (a sketch follows below).
d) Only using the highest-quality games (maybe 5 million) and doing multiple epochs over them while dropping the learning rate.

The funny thing is that reinforcement-learning nets are trained from games whose moves are searched with only 800 playouts, which should produce games inferior to, for instance, CCRL 40/40. But ccrl games may lack deep positional understanding, so in that regard training only on human games may be better even if tactical mistakes are made. So I think the quality of a game should be measured by its positional content. The policy network does make tactical mistakes, but you can tell the moves have good positional motives. Anyway, everything about these NN engines is geared toward overwhelming the opponent positionally -- which is probably why the Elo rating system does not work well with them.
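
For point c), roughly what the A0-style alternative looks like in PyTorch: plain SGD with momentum and a stepped learning-rate schedule. The 0.2 starting rate and 10x drops follow the AlphaZero paper in spirit, but the milestones here are placeholders, as are net, loader and loss_fn.

Code: Select all

import torch

def train_a0_style(net, loader, loss_fn):
    opt = torch.optim.SGD(net.parameters(), lr=0.2, momentum=0.9, weight_decay=1e-4)
    # Cut the learning rate by 10x at fixed step counts (placeholder milestones,
    # roughly in the spirit of the A0 schedule).
    sched = torch.optim.lr_scheduler.MultiStepLR(
        opt, milestones=[100_000, 300_000, 500_000], gamma=0.1)
    for inputs, targets in loader:
        opt.zero_grad()
        loss = loss_fn(net(inputs), targets)
        loss.backward()
        opt.step()
        sched.step()   # advance the schedule once per optimizer step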
jackd

Re: A0 subtlety

Post by jackd »

Daniel Shawul wrote: Sun Feb 17, 2019 2:23 am
I am very interested in how strong Leela's and A0's policy networks are. I am about to restart training with the following params:

BaseNetwork = 10 X 64

ValueNetwork = convolutional layer with 30 3x3 filters, followed by a dense layer and tanh

PolicyNetwork = convolutional layer with 30 3x3 filters, followed by a dense layer and a masked softmax

batch size = 1000

batches per epoch = 50000

learning algorithm: can't decide between momentum and RMSProp. EDIT: RMSProp it is
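
Reading "10 X 64" as ten blocks of 64 filters, the two heads above might look roughly like this in PyTorch. The 8x8 board, the 64 input channels and the 1858-slot move space are assumptions for illustration, not jackd's actual dimensions.

Code: Select all

import torch
import torch.nn as nn
import torch.nn.functional as F

class Heads(nn.Module):
    # Value head: 30-filter 3x3 conv -> dense -> tanh.
    # Policy head: 30-filter 3x3 conv -> dense -> masked softmax.
    def __init__(self, channels=64, moves=1858):
        super().__init__()
        self.value_conv = nn.Conv2d(channels, 30, 3, padding=1)
        self.value_fc = nn.Linear(30 * 8 * 8, 1)
        self.policy_conv = nn.Conv2d(channels, 30, 3, padding=1)
        self.policy_fc = nn.Linear(30 * 8 * 8, moves)

    def forward(self, x, legal_mask):
        v = torch.tanh(self.value_fc(F.relu(self.value_conv(x)).flatten(1)))
        logits = self.policy_fc(F.relu(self.policy_conv(x)).flatten(1))
        # Masked softmax: illegal move slots get -inf before normalising.
        logits = logits.masked_fill(~legal_mask, float("-inf"))
        return v, F.softmax(logits, dim=1)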
jackd

Re: A0 subtlety

Post by jackd »

It's worth mentioning that I wrote that I was considering momentum because I saw it was used by A0, not because I was getting good results with it. After realizing that batch normalization is what allowed A0 to use the hyperparameters it did, I concluded that the right choice for my network is RMSProp, at least until I add support for batch normalization via cuDNN.
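
In PyTorch terms, the two options being weighed look something like this: a conv block with batch normalization (the ingredient that makes A0's plain-momentum settings workable), versus falling back to RMSProp while BN is missing. Shapes and rates are illustrative only.

Code: Select all

import torch
import torch.nn as nn

# With batch normalization, plain SGD + momentum (A0's choice) is viable.
block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
opt_momentum = torch.optim.SGD(block.parameters(), lr=0.2, momentum=0.9)

# Without batch normalization, an adaptive optimizer like RMSProp is a safer fallback.
opt_rmsprop = torch.optim.RMSprop(block.parameters(), lr=1e-3)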
chrisw

Re: A0 subtlety

Post by chrisw »

Daniel Shawul wrote: Sun Feb 17, 2019 2:23 am
I think the problem with SL is that you have only one target, namely to represent the knowledge encoded in the training set. Low-grade knowledge in the set will be represented too, and general noise makes the process more difficult. Obviously, increasing the training set size with lower-grade, noisier games has pluses and minuses. It's a limited target: once you've got it, that's it.

RL, on the other hand, has a moving target: itself. Theoretically the sky is the limit. Every RL game that 'finds' something finds it because of temperature, which means the net was already close enough to find that 'thing' with a bit of randomness thrown in. The consequent weight adjustment is therefore already close, and the net gets nudged closer. That's a moving and continuously improving target; the limit, I guess, is when and if the net gets stuck in some local minimum.
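
Concretely, "finds something because of temperature" refers to sampling self-play moves from visit counts raised to 1/T rather than always playing the most-visited move, so near-misses still get tried. A generic sketch of that idea, not LC0's actual code:

Code: Select all

import numpy as np

def sample_move(moves, visit_counts, temperature=1.0):
    # Sample in proportion to visit_counts ** (1/temperature);
    # lower temperature sharpens toward the most-visited move.
    counts = np.asarray(visit_counts, dtype=float)
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return moves[np.random.choice(len(moves), p=probs)]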

Thought experiment: suppose one were to download all the recent LC0 net training games and do SL on them. Would one produce an equivalent net? I think not, because LC0 training games are tuned to the state of the LC0 net at the time of training, so LC0 benefits from them "more" than a net with a different weight set would. SL would get the game-knowledge benefit each time, but not closely enough to affect its decisions.

That's the reason RL works on what must be relatively poor-quality 800-rollout games: the nudges it gets are in regions where it is already close. Effective nudges, even if the game quality is low. Maybe generally higher-quality SL games can compensate; I am just guessing.
Daniel Shawul

Re: A0 subtlety

Post by Daniel Shawul »

I think if you take the lc0 self-play games in the same order as they were produced (i.e. from poor games that don't know material values to strong ones) and "supervised-train" a new network, you should be able to get the same network. That might not be the case if the order of the games is changed.
The order is important because mini-batch gradient descent is used for optimization. In batch gradient descent, however, the order does not matter, because the update is done only after the gradients over all 40 million games are accumulated. I am not sure batch gradient descent would produce strong nets because, as you mentioned, there is a lot of noise in the data and the algorithm tries to fit all samples with equal weight.

The algorithm lc0 uses has a replay buffer that keeps the last 1 million games for the purpose of randomly sampling mini-batches. Someone even claimed that lc0 nets are trained from only 1 million games because of that, but I think that is a mistake: the replay buffer is just a tool to shuffle positions and minimize the effect of correlation between positions in the same game. However, I think there is a "forgetting" effect, where earlier games played with no idea of piece values are forgotten. That is why I think batch gradient descent over the 40 million games, with multiple passes, would not give the same net.
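
A minimal sketch of the replay-buffer idea described above: a sliding window over the most recent games, with each mini-batch drawn from shuffled positions across the whole window so that positions from the same game rarely appear together. The window and batch sizes are illustrative, not lc0's actual settings.

Code: Select all

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, max_games=1_000_000):
        # Old games fall out of the window -- the "forgetting" effect above.
        self.games = deque(maxlen=max_games)

    def add_game(self, positions):
        self.games.append(positions)

    def sample_batch(self, batch_size=1000):
        # Flatten and sample across games to break within-game correlation.
        flat = [pos for game in self.games for pos in game]
        return random.sample(flat, batch_size)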