Math question

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math question

Post by Laskos »

jp wrote: Sat Jul 20, 2019 8:13 pm
Laskos wrote: Sat Jul 20, 2019 7:25 pm Well, in Lc0 they do introduce a non-zero temperature for some number of moves (maybe 20?) during the training, some sort of noise obeying IIRC the Dirichlet distribution with some K. We can randomize similarly Lc0 via the UCI settings. I do this when I have few or single starting position, for say 4 moves. It can actually produce a good opening tree with well chosen parameters. But I haven't even checked if Lc0 with no noise on one of thread isn't deterministic to fixed nodes. With default settings (0 temperature and 2 threads), it appers to be no less deterministic than SF or Komodo on two threads.

I am not sure why in AB engines it's hard to introduce a simple noise like Rybka did years ago or Komodo does nowadays. It seems not to be computationally intensive.
Can you give a rough rule of thumb for how much non-determinism there is (for SF, etc.) as a function of number of threads and number of nodes per move?

In the other thread, Uri believes there is a lot of variety, while I worry it has almost none (for purposes where we want a lot).
e.g. viewtopic.php?p=805353#p805353
Here is a thread in which I gave some examples (for 1 and 2 threads, I think). I think I wrote some broader rules of thumb here somewhere later, but I couldn't find them quickly.
viewtopic.php?f=2&t=61462
I tend to agree with Uri that a multi-threaded SF is somewhat more random than Leela on 2 threads (temperature=0).
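
For reference, the temperature and Dirichlet-noise mechanisms discussed above can be sketched roughly as follows. This is a generic AlphaZero-style illustration, not Lc0's actual code: the function names are made up, and alpha = 0.3 / fraction = 0.25 are the AlphaZero paper's values for chess, which Lc0's defaults may not match.

Code: Select all

import numpy as np

def select_root_move(visit_counts, temperature=1.0, rng=None):
    # AlphaZero-style root move selection: with temperature == 0 the
    # most-visited move is chosen deterministically; with temperature > 0
    # moves are sampled with probability proportional to N(a)^(1/T).
    rng = rng or np.random.default_rng()
    counts = np.asarray(visit_counts, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(counts))
    probs = counts ** (1.0 / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(counts), p=probs))

def add_dirichlet_noise(priors, alpha=0.3, fraction=0.25, rng=None):
    # Mix Dirichlet noise into the root priors, as done during
    # AlphaZero/Lc0-style training to diversify self-play openings.
    rng = rng or np.random.default_rng()
    priors = np.asarray(priors, dtype=np.float64)
    noise = rng.dirichlet([alpha] * len(priors))
    return (1.0 - fraction) * priors + fraction * noise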
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Math question

Post by chrisw »

Laskos wrote: Sat Jul 20, 2019 7:25 pm
chrisw wrote: Sat Jul 20, 2019 6:07 pm
Laskos wrote: Sat Jul 20, 2019 2:12 pm
chrisw wrote: Sat Jul 20, 2019 1:39 pm
hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
This is true. Although books have another function, which is to introduce variance for engines which are either not good at ranking ply-one moves, or else have to consume time to select from a ranked list. NNs currently using MCTS build a ranking anyway, so they really can operate without a book. ABs require computational expenditure for ranking, hence they tend to use books instead.
How does an MCTS NN as usually used (say in Leela) by necessity introduce variance in move choices? It's not really MC anymore, more a convergent fast approximation to minimax, right? It builds a ranking indeed, but it could be a deterministic one as well, right? What a book surely saves, even in the real MCTS case, is time.
Well, MCTS does produce a ranking order based on visit count or win rate and so on, and it's a straightforward programming task of negligible computational time to take that ordering and the value range and select one of the better-ranked moves with a bit of added randomness. Contrast with AB, where the search needs relaunching with an increased window; that's computationally expensive.
With more effort, some self-play, and some analysis, the MCTS chooser could even optimise itself.
How does LCZero add variance? I haven't looked at it for some time, but I was under the impression they did indeed randomize a bit on the top few moves?
Well, in Lc0 they do introduce a non-zero temperature for some number of moves (maybe 20?) during the training, and some sort of noise obeying, IIRC, the Dirichlet distribution with some K. We can randomize Lc0 similarly via the UCI settings. I do this when I have few or a single starting position, for say 4 moves. It can actually produce a good opening tree with well-chosen parameters. But I haven't even checked whether Lc0 with no noise on one thread isn't deterministic to fixed nodes. With default settings (0 temperature and 2 threads), it appears to be no less deterministic than SF or Komodo on two threads.

I am not sure why in AB engines it's hard to introduce a simple noise like Rybka did years ago or Komodo does nowadays. It seems not to be computationally intensive.
I can't tell you "how much" non-determinism is built into LCZero, but the MCTS policy move chooser has randomisation built into the algorithm, such that the first move from an unseen position is (usually) chosen at random (or was, last time I looked). And this random first choice will have a definite randomising effect on further choices from the same node later, and a random effect on the backed-up score for nodes back to the root, and thus on the shape of the tree and the decision made at the root by move visit count. How much is not possible to say; you'd need to run experiments to find out.

As for AB, I guess engine designers like to rely on opening book stats, rather than adding noise to the noise that is already there.
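
As a rough sketch of the negligible-cost chooser described above: given a ranking by visit count or win rate, pick at random among the moves close to the best one. The window parameter is illustrative, not taken from any engine.

Code: Select all

import random

def pick_from_ranking(ranked_moves, window=0.05, rng=random):
    # ranked_moves: list of (move, score) sorted best-first, with score
    # e.g. a visit fraction or win rate normalized to [0, 1].
    # Any move whose score is within `window` of the best is a candidate.
    best_score = ranked_moves[0][1]
    candidates = [move for move, score in ranked_moves
                  if best_score - score <= window]
    return rng.choice(candidates)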
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Math question

Post by Rebel »

Ajedrecista wrote: Fri Jul 19, 2019 7:24 pm Glad to see that it might be useful!
Your formula is working like a charm.

Steps:
1) I created a quality PGN database of games between human players of at least 2500 Elo.
2) Made a Polyglot book of it.
3) Created the WDL statistics.
4) Started to tune the book with your formula "a", which I have baptized the WDL-Factor.
5) Tested the book against 2 other books to see how it goes.

Code: Select all

   Results of the Elo-2500 book

   WDL Factor  Games   vs ProDeo book    vs Performance book
                       Perc     Elo      Perc     Elo
      None      5000   50.7%     +5      50.5%     +3
      0.25      5000   45.4%    -33      42.4%    -53
      0.50      5000   47.6%    -16      46.7%    -23
      0.75      5000   49.9%     -1      48.7%     -9
      1.00      5000   51.3%     +9      50.4%     +2
      1.50      5000   52.7%    +18      52.8%    +19
      2.00      5000   54.3%    +30      54.1%    +28
      2.50      5000   55.1%    +35      54.2%    +29
      3.00      5000   54.2%    +29      54.5%    +31
Both books show a remarkably similar progression path, and the optimal WDL-Factor for this book seems to be around 2.50.

Weight distribution example of the used WDL-Factor - http://rebel13.nl/text/elo-2500.txt
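
Ajedrecista's formula "a" itself isn't quoted in this thread, so purely as an illustration of the kind of reweighting involved: scale each book move's raw (game-count) weight by its WDL performance raised to the WDL-Factor. This is a hypothetical sketch, not ProDeo's actual code, and the field names are made up.

Code: Select all

def wdl_performance(wins, draws, losses):
    # Score of a book move from the mover's perspective: (W + D/2) / games.
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games if games else 0.0

def reweight(moves, wdl_factor):
    # moves: list of dicts with keys 'weight', 'w', 'd', 'l'.
    # wdl_factor = 0 leaves the pure game-count weighting untouched;
    # larger values increasingly favour moves that scored well.
    out = []
    for m in moves:
        perf = wdl_performance(m['w'], m['d'], m['l'])
        out.append({**m, 'weight': m['weight'] * perf ** wdl_factor})
    return out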

Next step, make it available.
90% of coding is debugging, the other 10% is writing bugs.
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: Math question

Post by crem »

chrisw wrote: Sun Jul 21, 2019 11:30 am I can't tell you "how much" non-determinism is built into LCZero, but the MCTS policy move chooser has randomisation built into the algorithm, such that the first move from an unseen position is (usually) chosen at random (or was, last time I looked). And this random first choice will have a definite randomising effect on further choices from the same node later, and a random effect on the backed-up score for nodes back to the root, and thus on the shape of the tree and the decision made at the root by move visit count. How much is not possible to say; you'd need to run experiments to find out.

As for AB, I guess engine designers like to rely on opening book stats, rather than adding noise to the noise that is already there.
In Lc0 there's no need to pick a random move on the first extension of a node, as the priors for every move are known, and you pick the move with the highest prior.

With temperature and noise off, the only nondeterminism comes from the fact that batches have size larger than 1. When more than one thread evaluates the NN at the same time, it's roughly the same as if one thread evaluated one larger combined batch, as while gathering batches, threads cannot yet see the results of the other threads' evaluations.

So running multiple threads is roughly equivalent to randomizing the batch size of every iteration.

With batches of size more than 1, it sometimes happens that the wrong node is chosen (i.e. a node which wouldn't be chosen if all batches contained just a single node). While it can happen, it's not that frequent in reality. And when it happens, most of the time it doesn't lead to divergence in the PV.

Divergence happens when, e.g., there are two possible moves from a given position, both of which lead to positions that have a larger eval than the parent. In this case the move which got the visit first will attract even more visits.

So all in all, Lc0 is pretty deterministic.
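
A minimal sketch of the batch-gathering mechanism described above, assuming standard PUCT selection with virtual loss. Lc0's real implementation differs in detail, and the NN evaluation plus the backup step that later replaces the virtual losses are omitted here.

Code: Select all

import math

C_PUCT = 1.5  # exploration constant (illustrative value)

class Node:
    def __init__(self, prior):
        self.prior = prior        # P(a) from the NN policy head
        self.visits = 0
        self.value_sum = 0.0
        self.children = []

    @property
    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def puct_select(parent):
    # Standard PUCT: pick the child maximising Q + U.
    def score(child):
        u = C_PUCT * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
        return child.q + u
    return max(parent.children, key=score)

def gather_batch(root, batch_size):
    # Collect batch_size leaves for a single NN evaluation. Each selected
    # path gets a virtual loss (a visit counted with a pessimistic value)
    # so the next selection in the same batch tends to branch elsewhere.
    # No pick can see the evals of the other leaves in the same batch:
    # with batch_size == 1 every pick would see all previous results,
    # which is why larger (and, with several threads, effectively
    # random-sized) batches make the search nondeterministic.
    leaves = []
    for _ in range(batch_size):
        node = root
        while True:
            node.visits += 1       # virtual loss: count the visit...
            node.value_sum -= 1.0  # ...with a losing value for now
            if not node.children:
                break
            node = puct_select(node)
        leaves.append(node)
    return leaves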
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Math question

Post by chrisw »

crem wrote: Thu Jul 25, 2019 10:12 am
chrisw wrote: Sun Jul 21, 2019 11:30 am I can't tell you "how much" non-determinism is built into LCZero, but the MCTS policy move chooser has randomisation built into the algorithm, such that the first move from an unseen position is (usually) chosen at random (or was, last time I looked). And this random first choice will have a definite randomising effect on further choices from the same node later, and a random effect on the backed-up score for nodes back to the root, and thus on the shape of the tree and the decision made at the root by move visit count. How much is not possible to say; you'd need to run experiments to find out.

As for AB, I guess engine designers like to rely on opening book stats, rather than adding noise to the noise that is already there.
In Lc0 there's no need to pick a random move on the first extension of a node, as the priors for every move are known, and you pick the move with the highest prior.

With temperature and noise off, the only nondeterminism comes from the fact that batches have size larger than 1. When more than one thread evaluates the NN at the same time, it's roughly the same as if one thread evaluated one larger combined batch, as while gathering batches, threads cannot yet see the results of the other threads' evaluations.

So running multiple threads is roughly equivalent to randomizing the batch size of every iteration.

With batches of size more than 1, it sometimes happens that the wrong node is chosen (i.e. a node which wouldn't be chosen if all batches contained just a single node). While it can happen, it's not that frequent in reality. And when it happens, most of the time it doesn't lead to divergence in the PV.

Divergence happens when, e.g., there are two possible moves from a given position, both of which lead to positions that have a larger eval than the parent. In this case the move which got the visit first will attract even more visits.

So all in all, Lc0 is pretty deterministic.
Ah. Maybe this is different from what I thought. Which was: when the search is at a position it hasn't seen before, LC0 generates a move list of what comes next, but those moves are not yet ordered, and we can't look up in the NN what to do next, because, well, we've not been there before. You can set batch to 1, of course, and ask every time, but batch is not set to 1, so what to do? I thought LC0 just added the position to the batch and chose a continuation at random, setting a virtual loss on the continuation to minimise repeats; then, when the batch was full, the NN was asked to provide data for the batch, and the data got backed up into all those random continuations that had built up.
Maybe it is me who is random here?!