Daniel Shawul wrote: Sun Feb 17, 2019 2:23 am
I did not filter, but 80% of the games should be above 2000 Elo. I am having difficulty getting a super strong net with only supervised learning so far, though I still need to try a few things:
a) Dropping the learning rate after each epoch -- which already seems to help a bit.
b) Ordering games by Elo. Reinforcement learning gives you stronger and stronger games over time, as opposed to the unordered games of supervised learning.
c) Changing the optimizer. I am currently using Adam, which adapts the base learning rate per parameter; A0 used plain SGD with a learning rate schedule.
d) Using only the highest-quality games (maybe 5 million) and doing multiple epochs over them while dropping the learning rate.
The funny thing is that reinforcement learning nets are trained on games whose moves are searched with only 800 playouts, which should produce games inferior to, for instance, CCRL 40/40. But CCRL games may lack deep positional understanding, so in that regard training only on human games may be better even if tactical mistakes are made. So I think the quality of a game should be measured by its positional content. The policy network does make tactical mistakes, but you can tell its moves have good positional motives. Anyway, everything about the NN engine is geared to overwhelm the opponent positionally -- which is probably why the Elo rating system does not work well with these engines.
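For concreteness, points (a) and (c) of the quote together look roughly like the following PyTorch sketch -- a toy model and random batches standing in for the real net and game positions, not anyone's actual training code:

```python
import torch

# Toy stand-ins: a linear "net" and random batches; only the
# optimizer/schedule pattern is the point here.
model = torch.nn.Linear(64, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# A0-style: plain SGD with a scheduled learning rate.
# Here the rate is multiplied by 0.1 after every epoch, i.e. point (a).
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.1)

for epoch in range(4):
    for _ in range(100):  # batches per epoch
        x, y = torch.randn(32, 64), torch.randn(32, 1)
        opt.zero_grad()
        torch.nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    sched.step()  # drop the base learning rate once per epoch
    print(epoch, sched.get_last_lr())
```

Swapping torch.optim.Adam in for SGD gives the per-parameter step-size adaptation mentioned in (c), which can interact awkwardly with a hand-set schedule.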
I think the problem with SL is that you have only one target, namely to represent the knowledge encoded in the training set. Low-grade knowledge in the set will be represented too, and general noise will make the process more difficult. Obviously, growing the training set with lower-grade, noisier games has pluses and minuses. It’s a limited target: once you’ve got it, that’s it.
RL on the other hand has a moving target: itself. Theoretically the sky is the limit. Every RL game that ‘finds’ something finds it because of temperature, which means the net was already close enough to reach that ‘thing’ with a bit of randomness thrown in. Thus the consequent weight adjustment starts from nearby, and the net gets nudged closer. That’s a moving, continuously improving target. The limit, I guess, is when and if the net gets stuck in some local minimum.
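To make "because of temperature" concrete: in AlphaZero-style self-play the move actually played is sampled from the root visit counts raised to the power 1/T. A simplified sketch, not LC0's actual code:

```python
import numpy as np

def sample_move(visit_counts, temperature=1.0):
    """Pick a move index in proportion to N^(1/T) over root visit counts.

    T = 1 keeps the search's own proportions; T -> 0 collapses to argmax.
    """
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature < 1e-3:
        return int(counts.argmax())          # effectively greedy
    p = counts ** (1.0 / temperature)
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))

# e.g. root visits [700, 90, 10]: at T=1 the second move is still
# played ~11% of the time, so a near-miss the net already half-sees
# keeps getting explored; a move with ~0 visits almost never is.
```

This is why the exploration only ever lands near what the net already rates highly: the ‘discoveries’ are small corrections, not leaps.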
Thought experiment: suppose one were to download all the recent LC0 training games and do SL on them. Would one produce an equivalent net? I think not, because LC0’s training games are tuned to the state of the LC0 net at the time they were generated, so LC0 benefits from them “more” than a net with a different weight set would. SL would still extract the knowledge in the games each time, but the nudges would not land close enough to change the net’s decisions.
That’s the reason RL works on what must be relatively poor-quality 800-rollout games: the nudges it gets fall in regions where the net is already close. Effective nudges, even if the game quality is low. Maybe generally higher-quality SL games can compensate; I am just guessing.