Math question

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

User avatar
Ajedrecista
Posts: 1966
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Math question.

Post by Ajedrecista »

Hello:
Robert Pope wrote: Fri Jul 19, 2019 6:57 pm
Rebel wrote: Fri Jul 19, 2019 9:52 am
Ajedrecista wrote: Thu Jul 18, 2019 8:10 pm Given the big difference of played games of the 'Big Four' (43.18%, 36.48%, 10.57% and 8.08%), I obtain plausible weights IMHO although the math is not so simple.

You can tweak the parameters z and a. z ~ 1.96 or z = 2 are pretty standard, but a is more subjective and susceptible of being tested.

Regards from Spain.

Ajedrecista.
I am missing the dedinion of [z] in your post or I am in need for new glasses.
z ~ 1.959963985 (95% confidence in a normal distribution; you can change z)
It's in one of the very early posts.
Thanks Robert! I was about answering that all the info was not in the same post.

z is the z-score in a normal distribution, so you can pick z from a wide variety, though z = 2 looks both reasonable (confidence ~ 95.45%) and an easy value to remember.

------------
Rebel wrote:
Ajedrecista wrote:As you can see, a = 1 gives your original weights. a =< 0 gives weird weights (like giveaway, worst is best) and a -> +infinity awards the most played move regardless its score.
I like the formula especially because it's so easily to tune with a parameter [a], I think I am going to use it, so thanks.
Glad to see that it might be useful! It would be interesting to know if the method can hold extreme cases like very few games with scores near 0% or 100%; or overwhelmingly played moves with very poor scores, since n^a intends to provide a kind of 'smooth' (not the right word probably) in the frequency of moves (that is, to provide more variety on similar good-performing moves like e4 and c4 in your example, with very different number of games between them).

Using the lower bound of Wilson score interval in the performance instead of the performance itself tries to not bias in favour of moves with high performances but low number of games (like the following example: which is more significant? A comment with 1 'like' and 0 'dislikes' or other one with 99 'likes' and 1 'dislike'?).

It looks like continuity correction lowers the weights of moves with very low percentage of moves while raise a little the weights of top played moves, in comparison with not using continuity correction. The formulas are there:

https://en.wikipedia.org/wiki/Binomial_ ... e_interval

https://en.wikipedia.org/wiki/Binomial_ ... correction

It is up to you to choose either solution.

Regards from Spain.

Ajedrecista.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Math question

Post by hgm »

Assigning weights just based on WDL for the position is unsound anyway; one really should minimax the opening tree. Otherwise you can get in a situation where a move X looks very good, because it is played 10,000 times, with WDL = 8000/1000/1000 (so 85% score), while in fact all the points were scored in the 9000 games where the opponent replied with move A, while the 1000 times he chose move B ended all as a loss. So in fact move X will score 0% against an opponent that knows his theory, not 85%. (Which of course was the reason while people stopped playing X alltogether after refutation B was found, so that the stats for it will never catch up.)

Minimaxing must still be done a bit subtle, though; you must stop well before the leaves, because otherwise you will 'zoom in' on statistical flukes like they are really good.

I once developed a method how to do this, but it needs several parameters. In the first place, one should ask oneself why one would ever play a move that is suboptimal. (I.e. why not all weights should be 0 except one.) The answer here is that unpredictability is worth something, and that a prepared opponent will be stronger than a non-prepared one. If by playing a move that objectively is 7 Elo waker (i.e. loses 1% more often) you make your opponent 20 Elo weaker, because he was not prepared for it, it is a gain. Not always playing the best move also means you should not really minimax, but assign a weight based on the move depending on the average (weighted!) score in the daughter, and then take the weighted average of the result of these moves to get the node score.

A second point is how much value you want to attach by the playing frequency of the moves (without considering their results). If you analyse a set of GM games, the fact that GMs play move A and B equally often might be more significant than the actual scores obtained with these moves, especially is statistics is still poor. (E.g. both are played 10 times, move A scored 60%, move B 40%. Should we really consider A better than B based on these flimsy stats, or should we trust the GM's judgement that they are equal, and was the extra win or loss just luck? In the latter limit we should treat the moves as one, and use 10-out-of-20 score for them.) This again obviously depends on the rating of the players.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Math question

Post by Dann Corbit »

Another issue is that we only have statistics and statistical analysis for the moves that have been played.
And the data is only trustworthy for moves that have been played a lot.
What about the 10% of the time when the best move has never been played?

So I think that the mini-max tool needs a live component (an active engine which searches during the mini-max operation).
Unfortunately, that sort of thing will make filling out the tree orders of magnitude slower than a pure alpha-beta search of existing data.

Ideally, it could be a cooperative thing where thousands of computers participate.
I think Uri suggested something like this once.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Math question

Post by hgm »

The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Math question

Post by chrisw »

hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
This is true. Although books have another function which is to introduce variance for engines which are either not good at ranking ply one moves, or else have to consume time to select from a ranked list. NNs currently using MCTS build a ranking anyway, so they really can operate without book. ABs require computational expenditure for ranking, hence they tend to use books instead
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math question

Post by Laskos »

chrisw wrote: Sat Jul 20, 2019 1:39 pm
hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
This is true. Although books have another function which is to introduce variance for engines which are either not good at ranking ply one moves, or else have to consume time to select from a ranked list. NNs currently using MCTS build a ranking anyway, so they really can operate without book. ABs require computational expenditure for ranking, hence they tend to use books instead
How does MCTS NN as usually used (say in Leela) by necessity introduces variance on move choices? It's not really MC anymore, more a convergent fast approximation to minimax, right? It builds a ranking indeed, but it could be a deterministic one as well, right? What a book surely saves, even in the real MCTS case, is time.
User avatar
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: Math question

Post by Ovyron »

hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
An important aspect of a book is that it allows one to play instantly to save time. Sometimes it doesn't even matter if the engine would have played better moves without book, if the time disadvantage of the opponent is significant (say, you get them out of book at move 3, and you remain in it to move 20, those are 17 moves where the opponent had to think and lose time on the clock) it'll make up for suboptimal move choices.
Your beliefs create your reality, so be careful what you wish for.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Math question

Post by chrisw »

Laskos wrote: Sat Jul 20, 2019 2:12 pm
chrisw wrote: Sat Jul 20, 2019 1:39 pm
hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
This is true. Although books have another function which is to introduce variance for engines which are either not good at ranking ply one moves, or else have to consume time to select from a ranked list. NNs currently using MCTS build a ranking anyway, so they really can operate without book. ABs require computational expenditure for ranking, hence they tend to use books instead
How does MCTS NN as usually used (say in Leela) by necessity introduces variance on move choices? It's not really MC anymore, more a convergent fast approximation to minimax, right? It builds a ranking indeed, but it could be a deterministic one as well, right? What a book surely saves, even in the real MCTS case, is time.
Well, MCTS does produce a ranking order based on visit count or win rate and so on, and it’s a straightforward programming task of negligable computational time to take that ordering and the value range and select one if the better ranked moves with a bit of added random. Contrast with AB where search needs relaunching with increased window, that's computationally expensive.
With more effort some self play and some analysis, the MCTS chooser could optimism itself even.
How does LCZero add variance, not looked at it for some time but I was under the impression they did indeed randomize a bit on the top few moves?
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Math question

Post by Laskos »

chrisw wrote: Sat Jul 20, 2019 6:07 pm
Laskos wrote: Sat Jul 20, 2019 2:12 pm
chrisw wrote: Sat Jul 20, 2019 1:39 pm
hgm wrote: Sat Jul 20, 2019 10:08 am The reason for making a book based on games is usually that you don't trust engines to recognize the strategic aspects of the positions. Otherwise you might as well not bother, and let the engine play without book.
This is true. Although books have another function which is to introduce variance for engines which are either not good at ranking ply one moves, or else have to consume time to select from a ranked list. NNs currently using MCTS build a ranking anyway, so they really can operate without book. ABs require computational expenditure for ranking, hence they tend to use books instead
How does MCTS NN as usually used (say in Leela) by necessity introduces variance on move choices? It's not really MC anymore, more a convergent fast approximation to minimax, right? It builds a ranking indeed, but it could be a deterministic one as well, right? What a book surely saves, even in the real MCTS case, is time.
Well, MCTS does produce a ranking order based on visit count or win rate and so on, and it’s a straightforward programming task of negligable computational time to take that ordering and the value range and select one if the better ranked moves with a bit of added random. Contrast with AB where search needs relaunching with increased window, that's computationally expensive.
With more effort some self play and some analysis, the MCTS chooser could optimism itself even.
How does LCZero add variance, not looked at it for some time but I was under the impression they did indeed randomize a bit on the top few moves?
Well, in Lc0 they do introduce a non-zero temperature for some number of moves (maybe 20?) during the training, some sort of noise obeying IIRC the Dirichlet distribution with some K. We can randomize similarly Lc0 via the UCI settings. I do this when I have few or single starting position, for say 4 moves. It can actually produce a good opening tree with well chosen parameters. But I haven't even checked if Lc0 with no noise on one of thread isn't deterministic to fixed nodes. With default settings (0 temperature and 2 threads), it appers to be no less deterministic than SF or Komodo on two threads.

I am not sure why in AB engines it's hard to introduce a simple noise like Rybka did yers ago or Komodo nowadays. It seems no to be computationally intensive.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Math question

Post by jp »

Laskos wrote: Sat Jul 20, 2019 7:25 pm Well, in Lc0 they do introduce a non-zero temperature for some number of moves (maybe 20?) during the training, some sort of noise obeying IIRC the Dirichlet distribution with some K. We can randomize similarly Lc0 via the UCI settings. I do this when I have few or single starting position, for say 4 moves. It can actually produce a good opening tree with well chosen parameters. But I haven't even checked if Lc0 with no noise on one of thread isn't deterministic to fixed nodes. With default settings (0 temperature and 2 threads), it appers to be no less deterministic than SF or Komodo on two threads.

I am not sure why in AB engines it's hard to introduce a simple noise like Rybka did yers ago or Komodo nowadays. It seems no to be computationally intensive.
Can you give a rough rule of thumb for how much non-determinism there is (for SF, etc.) as a function of number of threads and number of nodes per move?

In the other thread, Uri believes there is a lot of variety, while I worry it has almost none (for purposes where we want a lot).
e.g. viewtopic.php?p=805353#p805353