MCTS beginner questions


brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: MCTS beginner questions

Post by brianr »

Leela Chess is benefiting from much of the work done for Leela Go.
Net2net is one of the techniques used by Leela Go, and I think it has been used for Chess as well.
The paper is here:
https://arxiv.org/abs/1511.05641

It looks like it uses the current smaller net's weights to bootstrap the new larger net.
I think a training step on sample data is still needed afterwards, but the net learns faster than it would starting from random weights.
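
As a rough illustration of the bootstrap, here is a minimal NumPy sketch of the paper's Net2DeeperNet operation: the inserted layer starts out as the identity (plus small noise), so the deeper net initially computes roughly the same function as the smaller one. The function name and dimensions are illustrative, not taken from the Leela code.

import numpy as np

def identity_layer(dim, noise_std=1e-3):
    # Net2DeeperNet (arXiv:1511.05641): initialize the new layer's weight
    # matrix to the identity plus a little noise, and its bias to zero, so
    # inserting the layer barely changes the network's output. This relies
    # on the activation f satisfying f(f(x)) = f(x), which holds for ReLU.
    W = np.eye(dim) + noise_std * np.random.randn(dim, dim)
    b = np.zeros(dim)
    return W, b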

You can search the GitHub repos.
If you search the Leela Go repo for net2net, there is much more there than in the Leela Chess repo.

For more info than I could hope to provide, you could ask in the Leela Chess forum (URL tags don't seem to work for this one):
https://groups.google.com/forum/#!forum/lczero
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: MCTS beginner questions

Post by Gian-Carlo Pascutto »

We've used both net2net and retraining on all data for upgrading the network size. In theory net2net should work better, or at least faster.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: MCTS beginner questions

Post by jp »

Gian-Carlo Pascutto wrote: We've used both net2net and retraining on all data for upgrading the network size. In theory net2net should work better, or at least faster.
Thanks. How long does retraining on all the data take? I guess that means the vast majority of the time goes into creating the data (games), not into the training once you have the data. Is that generally the case?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: MCTS beginner questions

Post by Gian-Carlo Pascutto »

jp wrote: Thanks. How long does retraining on all the data take? I guess that means the vast majority of the time goes into creating the data (games), not into the training once you have the data. Is that generally the case?
I'm not sure what you mean. If you're upgrading the network, you already have the games. For 192x15 it took me about a week on a GTX 1070. At this point the training hasn't fully converged, but the new network is stronger than the smaller one by some margin, and that's good enough to inject it into the cycle.

The whole Zero process is totally bottlenecked on producing the games. That's why it's possible to do it in a distributed manner.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: MCTS beginner questions

Post by jp »

Gian-Carlo Pascutto wrote:...
I'm not sure what you mean. If you're upgrading the network, you already have the games. For 192x15 it took me about a week on a GTX 1070. At this point the training hasn't fully converged, but the new network is stronger than the smaller one by some margin, and that's good enough to inject it into the cycle.

The whole Zero process is totally bottlenecked on producing the games. That's why it's possible to do it in a distributed manner.
Yes, that's what I was asking for a numerical figure on (though maybe that's not so easy to say and not so important). I mean, the week you took for 192x15 came after the distributed computing had generated all the games, right? I just wondered how much time (in GTX 1070-equivalent units) it took to generate those games (e.g. 999 GTX 1070-weeks would mean a 99.9% bottleneck).

In general, I'm just trying to get an idea of the final performance after training, upgrading, and repeating over and over vs. starting with the largest NN size (for the same total compute time). Does the former really give a better final performance? Of course, you get more games more quickly to begin with, but in some sense those games might be "lower quality" (or maybe not, as long as the small NN is not saturated?).

One advantage I see of starting with a small NN and upgrading is that debugging should be quicker & you'll get a bug-free product sooner.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: MCTS beginner questions

Post by Gian-Carlo Pascutto »

jp wrote: Does the former really give a better final performance?
I don't worry too much about the final performance being different, but it's true that in the end you may need more resources to reach that final performance.

But you have a better chance of getting there, not only due to faster debugging, but also because it's easier to get people to join the project if you are making good progress.

Also, I consider it somewhat of a no-brainer that the first batch of random games might as well be generated with a smaller network, so surely the optimum in compute terms must lie somewhere in the middle, right?
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: MCTS beginner questions

Post by jp »

Yes, that sounds reasonable. There presumably exists some unknown optimal NN upgrade schedule.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: MCTS beginner questions

Post by jp »

brianr wrote: Leela Chess is benefiting from much of the work done for Leela Go.
Net2net is one of the techniques used by Leela Go, and I think it has been used for Chess as well.
The paper is here:
https://arxiv.org/abs/1511.05641

It looks like it uses the current smaller net's weights to bootstrap the new larger net.
I think a training step on sample data is still needed afterwards, but the net learns faster than it would starting from random weights.
...
Thanks for the reference.

Alexander Lyashuk also wrote the following (re. net2net):

"Increasing number of blocks:
by inserting blocks which are initialized in such way that they do nothing (except small noise), so input = output.

Increasing number of filters in a block:
(roughly speaking) by splitting existing nodes into several. E.g. of x = 5y was one node, it's replaced with x = 2y + 3y (two nodes).
(this is overly simplified and may be wrong)."
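
To make the second operation concrete, here is a minimal NumPy sketch of the widening step as the Net2Net paper (arXiv:1511.05641) describes it. Note the paper divides the outgoing weights evenly among the copies rather than using an uneven split like 2y + 3y; the function name and shapes here are illustrative, not taken from the lc0 training code.

import numpy as np

def widen_layer(W1, b1, W2, new_width):
    # Net2WiderNet: grow a hidden layer by copying randomly chosen units.
    # Incoming weights (W1, b1) are duplicated; outgoing weights (W2) are
    # divided by each unit's replication count, so the widened network
    # computes the same function as before.
    # Shapes: W1 (in_dim, old_width), b1 (old_width,), W2 (old_width, out_dim).
    old_width = W1.shape[1]
    extra = np.random.randint(0, old_width, size=new_width - old_width)
    g = np.concatenate([np.arange(old_width), extra])  # new unit j copies old unit g[j]
    counts = np.bincount(g, minlength=old_width)       # copies made of each old unit
    return W1[:, g], b1[g], W2[g, :] / counts[g][:, None]

As the paper notes, a little noise added to the duplicated weights helps the copies differentiate during the training step that follows.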