### book move selection logic

Posted:

**Sat Jun 16, 2018 5:40 pm**Arasan has had its own opening book since its inception. Currently I select moves from the book based on a combination of frequency and result information, with some manual tweaking to avoid moves that are actually bad but frequent (usually older lines that have been refuted).

I have been looking more recently at a more formal approach and connecting this with some kind of learning system (I used to have a form of book learning but it is not in current versions and was never very effective).

If you just look at results and not frequency, then of course each position that comes from a game has a set of win/loss/draw counts associated with it. This also has a result score, which you can compute as (wins + (0.5+contempt factor)*draws)/total count.

If you have this info for each move from the position (or at least all moves known to the book, may be a subset of possible moves), then of course a simple approach is to take the move with the best result score. However, there is some uncertainty about the scores: what you really have is an estimate of future scores based on past scores, and another move that now has a lower score might actually give the best score in future.

This can be thought of as an example of the "multi-armed bandit" problem, about which there is a large literature. (Note: technically it might be a case of the "adversarial bandit" problem, since you are playing an opponent who wants to minimize your score; but I assuming for right now game outcomes can just be though of in terms of random probabilities).

One simple and effective approach is based on what is called Thompson sampling. You model the distribution of results (in this case it is a gamma distribution) for each move, and you take a random sample from each distribution. Then you compute the scores for the samples, and you take the move whose sample has the highest score. (This is one algorithm: the other that is frequently cited is called UCB1).

This introduces some variability. You will tend to get the highest-scoring moves played more often, but once in a while the sample from another move's distribution will score higher, and you will select that move. Initially you get more variability but over time, if you keep doing this, the process will zero in on the move that really has the best scoring percentage. There are some nice convergence proofs that show this is an effective approach.

There are a couple of problems though in the chess context. One is that the score is independent of move frequency. If you just turn this algorithm loose and let it run, it may pick something like 1. e3 almost as often as 1. e4, because they might have similar stats and because there won't be a lot of 1. e3 games to go on and that distribution will have a large standard deviation. Maybe not a problem, but if have knowledge of prior games from strong players then move frequency does tend to be an indicator, even if imperfect, of move quality. So maybe you want to add at least a frequency filter when doing real games and drop very infrequent moves. For offline learning you could turn the filter off.

The other problem is as mentioned, eventually the algorithm tends to converge on one move. Suppose 1. e4 scores 55% for White and 1. d4 scores 53%. If enough games are available, the standard deviation of the two distributions will be very small and so the algorithm will start to play 1. e4 almost all the time. This is theoretically optimal, but in practical terms the score is scarcely different between the moves, and for variety and to increase the chances of tripping up an opponent, you might still want to play 1. d4 sometimes, even if this is not theoretically optimal.

To deal with this you could establish a tolerance level for the result difference, and play moves whose difference from the best move is within the tolerance limit, with decreasing frequency as the limit is approached. If the tolerance was zero you'd have pure Thompson sampling and with a non-zero tolerance you'd have a generalization of it with more exploration. (I have not seen this sort of thing described in the literature).

I am not sure how to handle this tolerance statistically. You would want to reduce the frequency of moves played as their (result mean-best result mean)/tolerance moves farther away from zero, but taking into account the standard deviation of the difference. Maybe a Monte Carlo approach?

The difference of results from two moves is distributed as the difference of two gamma distributions and that is hard to compute, although there are approximations to it (for example http://www.ucs.louisiana.edu/~kxk4695/GammaR2.pdf).

Comments appreciated. This is just a dump of my thinking on the problem so far.

--Jon

I have been looking more recently at a more formal approach and connecting this with some kind of learning system (I used to have a form of book learning but it is not in current versions and was never very effective).

If you just look at results and not frequency, then of course each position that comes from a game has a set of win/loss/draw counts associated with it. This also has a result score, which you can compute as (wins + (0.5+contempt factor)*draws)/total count.

If you have this info for each move from the position (or at least all moves known to the book, may be a subset of possible moves), then of course a simple approach is to take the move with the best result score. However, there is some uncertainty about the scores: what you really have is an estimate of future scores based on past scores, and another move that now has a lower score might actually give the best score in future.

This can be thought of as an example of the "multi-armed bandit" problem, about which there is a large literature. (Note: technically it might be a case of the "adversarial bandit" problem, since you are playing an opponent who wants to minimize your score; but I assuming for right now game outcomes can just be though of in terms of random probabilities).

One simple and effective approach is based on what is called Thompson sampling. You model the distribution of results (in this case it is a gamma distribution) for each move, and you take a random sample from each distribution. Then you compute the scores for the samples, and you take the move whose sample has the highest score. (This is one algorithm: the other that is frequently cited is called UCB1).

This introduces some variability. You will tend to get the highest-scoring moves played more often, but once in a while the sample from another move's distribution will score higher, and you will select that move. Initially you get more variability but over time, if you keep doing this, the process will zero in on the move that really has the best scoring percentage. There are some nice convergence proofs that show this is an effective approach.

There are a couple of problems though in the chess context. One is that the score is independent of move frequency. If you just turn this algorithm loose and let it run, it may pick something like 1. e3 almost as often as 1. e4, because they might have similar stats and because there won't be a lot of 1. e3 games to go on and that distribution will have a large standard deviation. Maybe not a problem, but if have knowledge of prior games from strong players then move frequency does tend to be an indicator, even if imperfect, of move quality. So maybe you want to add at least a frequency filter when doing real games and drop very infrequent moves. For offline learning you could turn the filter off.

The other problem is as mentioned, eventually the algorithm tends to converge on one move. Suppose 1. e4 scores 55% for White and 1. d4 scores 53%. If enough games are available, the standard deviation of the two distributions will be very small and so the algorithm will start to play 1. e4 almost all the time. This is theoretically optimal, but in practical terms the score is scarcely different between the moves, and for variety and to increase the chances of tripping up an opponent, you might still want to play 1. d4 sometimes, even if this is not theoretically optimal.

To deal with this you could establish a tolerance level for the result difference, and play moves whose difference from the best move is within the tolerance limit, with decreasing frequency as the limit is approached. If the tolerance was zero you'd have pure Thompson sampling and with a non-zero tolerance you'd have a generalization of it with more exploration. (I have not seen this sort of thing described in the literature).

I am not sure how to handle this tolerance statistically. You would want to reduce the frequency of moves played as their (result mean-best result mean)/tolerance moves farther away from zero, but taking into account the standard deviation of the difference. Maybe a Monte Carlo approach?

The difference of results from two moves is distributed as the difference of two gamma distributions and that is hard to compute, although there are approximations to it (for example http://www.ucs.louisiana.edu/~kxk4695/GammaR2.pdf).

Comments appreciated. This is just a dump of my thinking on the problem so far.

--Jon