
My new book

Posted: Thu Jan 02, 2014 2:46 am
by Daniel Shawul
It is not yet implemented, but the idea is conceptually elegant enough to deserve a new thread :). The idea is that a persistent UCT_tree = book. For games like Go this becomes obvious, because the engine already uses an MCTS search, so saving/reloading the tree part of MCTS is equivalent to having a book. For chess we use alpha-beta, so it needs new code; the idea, however, is equally workable.
-------------------
Edit: In fact, the idea of book learning seems to be exactly the same as the procedure for growing and expanding a UCT tree in MCTS. The only difference is that the book is on disk. So we use UCB to select based on the WDL (win/draw/loss) counts at each node, expand along the most promising lines, prune the tree if we have a limit on book size, etc.
We start with a PGN file from which we accumulate WDL counts at each node, from White's perspective. Then we limit the book (or UCT tree) to, say, 20 plies while constructing it. This requires us to add all siblings of the best move at internal nodes, even though they are never visited. Maybe there is a way to avoid this (for instance, adding just a couple of moves whose heuristic evals from short searches are close), but for simplicity let's add all siblings. Once the PGN file is added, new games the engine plays itself are handled just like the old ones from the PGN file! That is it.
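
A minimal sketch of that accumulation step (Python; the names and the 20-ply cap are illustrative, and the sibling-seeding described above is left out):

Code: Select all

class BookNode:
    """One book position: WDL counts from White's perspective, plus children."""
    def __init__(self):
        self.wins = self.draws = self.losses = 0
        self.children = {}  # move string (e.g. "e2e4") -> BookNode

def add_game(root, moves, result, max_ply=20):
    # moves: the game's move list; result: 1.0 White win, 0.5 draw, 0.0 White loss
    node = root
    for ply, move in enumerate(moves):
        if ply >= max_ply:
            break
        node = node.children.setdefault(move, BookNode())
        if result == 1.0:
            node.wins += 1
        elif result == 0.5:
            node.draws += 1
        else:
            node.losses += 1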

Then we select book moves using the UCB formula, which has an exploration coefficient to control the amount of exploration:

Code: Select all

UCB = mean + C * sqrt(log(parent_visits) / visits)
where C is the exploration coefficient. One can set a smaller C for tournament play, for instance. So there you have it: a robust book with reinforcement learning! :) I think I will implement this for Nebiyu, which already has the necessary infrastructure. It doesn't even need an external PGN, because it can automatically build and extend the book from its own games. Note that neither the heuristic eval() nor minimax is used in this algorithm. Maybe we can somehow use eval() to seed siblings in the initial phase, but that is just about it. So persistent UCT_tree = book. I wonder if someone else has seen this connection and implemented his book using UCT?
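
A sketch of that selection step (illustrative Python, not Scorpio's or Nebiyu's actual code; scores are assumed to be stored from the side to move's perspective):

Code: Select all

import math

def ucb_select(children, C=0.44):
    """children: dict move -> (wins, draws, losses) from the side
    to move's perspective. Returns the UCB-maximizing move."""
    parent_visits = sum(w + d + l for w, d, l in children.values())
    best_move, best_score = None, -1.0
    for move, (w, d, l) in children.items():
        visits = w + d + l
        if visits == 0:
            return move  # bare UCB tries every unvisited child once first
        mean = (w + 0.5 * d) / visits  # draws count half
        score = mean + C * math.sqrt(math.log(parent_visits) / visits)
        if score > best_score:
            best_move, best_score = move, score
    return best_move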

Re: My new book

Posted: Thu Jan 02, 2014 4:00 pm
by hgm
I never really studied MC-UCT search, but WinBoard already implements something very similar with -mcBookMode true:

It defines a relation between move performance (% score) and relative playing frequency (i.e. relative to other moves from the same position). Currently I use a 4th power for that (so that a move with a 1% worse performance will be chosen about 4% less often). Move choice in the book is based on these desired playing frequencies, and picks the move for which the actual number of plays lags furthest behind the desired number of plays. If all moves are played as much as they should be (which I define as the most under-played move having been played more than N - 0.5*sqrt(N) times, where N is the number of times it deserved to be played), a book miss is generated, so the engine can freely explore a move of its choice (to be added to the book if it wasn't already there).

When used with a randomizing engine, or a group of equally strong deterministic engines, this will build an opening tree.

To prevent rash rejection of moves based on insufficient statistics, I add 5 wins and 5 losses to each move before calculating its desired number of plays.
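
If I read this right, the selection amounts to something like the following (a Python sketch under my own reading; the field names and the book-miss signalling are illustrative, not WinBoard's actual code):

Code: Select all

import math

def select_move(moves):
    """moves: list of dicts with 'move', 'plays', 'wins', 'draws', 'losses'.
    Returns the chosen move, or None for a book miss."""
    total_plays = sum(m['plays'] for m in moves)
    def perf(m):  # % score with 5 virtual wins and 5 virtual losses added
        return (m['wins'] + 0.5 * m['draws'] + 5) / (m['plays'] + 10)
    weights = [perf(m) ** 4 for m in moves]  # desired frequency ~ perf^4
    wsum = sum(weights)
    # find the move whose actual plays lag furthest behind its deserved share
    deficits = [(total_plays * w / wsum - m['plays'], total_plays * w / wsum, m)
                for m, w in zip(moves, weights)]
    deficit, deserved, best = max(deficits, key=lambda t: t[0])
    # book miss if even the most under-played move got (nearly) its share
    if best['plays'] > deserved - 0.5 * math.sqrt(deserved):
        return None
    return best['move']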

Re: My new book

Posted: Thu Jan 02, 2014 6:35 pm
by Daniel Shawul
hgm wrote:I never really studied MC-UCT search, but WinBoard already implements something very similar with -mcBookMode true:

It defines a relation between move performance (% score) and relative playing frequency (i.e. relative to other moves from the same position). Currently I use a 4th power for that (so that a move with a 1% worse performance will be chosen about 4% less often). Move choice in the book is based on these desired playing frequencies, and picks the move for which the actual number of plays lags furthest behind the desired number of plays. If all moves are played as much as they should be (which I define as the most under-played move having been played more than N - 0.5*sqrt(N) times, where N is the number of times it deserved to be played), a book miss is generated, so the engine can freely explore a move of its choice (to be added to the book if it wasn't already there).
This is a probability-matching variant in which the frequency of moves is based on a function of the mean reward (a 4th power in your case). UCB is better because it also takes into account the uncertainty we have in the rewards. Also, its total regret is known to grow as O(log n), which matches the established lower bound, so it solves the multi-armed bandit (MAB) problem. Epsilon-greedy and softmax selection also achieve that, but UCB has its success in computer Go on its side.

One problem is that both algorithms are deterministic. So if one good move happens to have been played much less than it should have been, it will be picked constantly until the distribution is established! This is bound to happen with a PGN of games, which will certainly not have a distribution in accordance with the selection procedure. Once the distribution is established, a different move will be picked in every game, so determinism may be OK after that. I think we can at least turn the probability-matching variants non-deterministic using roulette-wheel random selection, but UCB can't be converted in such a way. Note that your algorithm suffers from this problem too, since it deterministically picks the move that lags furthest behind its N - 0.5*sqrt(N) target.

My first impression about the code required was wrong. I think most books, including Scorpio's, can be turned to use UCB quite easily. Disk-based lookup of child moves, which also handles transpositions well, is fine as it is, so there is no need for a tree data structure.
hgm wrote:When used with a randomizing engine, or a group of equally strong deterministic engines, this will build an opening tree.

To prevent rash rejection of moves based on insufficient statistics, I add 5 wins and 5 losses to each move before calculating its desired number of plays.
UCB visits all children once before it uses its formula, which is not a good idea for book selection in particular, because the first thing it would do is play the non-book (never-visited) moves. So initializing all of them and using the UCB formula right away is a good idea.
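
For instance, borrowing your virtual-game trick, each new child could be seeded before its first real game (a sketch; the 5/5 counts are arbitrary):

Code: Select all

def seed_child(child, virtual_wins=5, virtual_losses=5):
    # Virtual results give the child a nonzero visit count and a 50%
    # prior score, so the UCB formula applies immediately instead of
    # forcing one visit to every unplayed move first.
    child.wins += virtual_wins
    child.losses += virtual_losses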

Re: My new book

Posted: Thu Jan 02, 2014 7:31 pm
by hgm
Well, note that in the move-selection process I only consider moves that have already been played by an engine; WinBoard will never explore moves of its own accord. As soon as one engine suggests a move, when the existing moves were played approximately with the frequency they deserved, it is added to the repertoire, however, and treated as if it had (say) 5 wins and 6 losses (when the game ended in a loss). So it gets a result of 5/11 = 45%, meaning that it should be played about 0.68 times as often as moves that scored 50%. This means it will now be chosen from this position quite a number of times if such other 50%-scoring moves already had, say, 20 plays, because then it would deserve 13 plays. Of course, after the next play the score % will change; if it continues to score many losses, the desired number of plays will quickly be pushed below the actual number of plays, so that it won't be chosen anymore, at least not until all the other moves also have more plays. But if it was a reasonable move (scoring close to 50%), it will of course sooner rather than later correct its score towards 50% by scoring wins as well, driving up the desired number of plays, so that it will be chosen for a much longer time (giving the sub-tree below it the chance to converge to reasonable play).

Re: My new book

Posted: Thu Jan 02, 2014 9:05 pm
by Daniel Shawul
The motive for exploring other book moves may not be strong enough. The reason is that hand-selected book moves already imply certainty about the best move choice, so we are better off exploiting that choice! However, we should also not abandon a move after just one loss. A better idea, commonly used for learning in repeated games, is a discount factor that values recent rewards much more than past ones... So if e2-e4 results in a loss for us, we temporarily switch to other alternatives until the danger is gone (same opponent); then the effect of that loss starts to decay again.
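
A sketch of such a discounted update (Python; gamma is a hypothetical decay constant, not a value from any actual engine):

Code: Select all

class MoveStats:
    def __init__(self):
        self.value = 0.0   # discounted sum of results
        self.weight = 0.0  # discounted count of games

def update(stats, result, gamma=0.9):
    # result: 1.0 win, 0.5 draw, 0.0 loss for the book player.
    # Every new game decays the past by gamma, so a recent loss
    # pulls the mean down sharply and then fades as newer results
    # accumulate.
    stats.value = gamma * stats.value + result
    stats.weight = gamma * stats.weight + 1.0

def discounted_mean(stats):
    return stats.value / stats.weight if stats.weight else 0.5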

Here is how the UCB selection and your version would look. I generated a book for the initial position from a collection of PGN games:

Code: Select all

 book
  Move   Played    Score     W        D        L     Learn   Sortv
 e2-e4     5351   57.17%   32.91%   48.51%   18.58%    100   59.01
 d2-d4     4515   57.46%   31.63%   51.67%   16.70%    100   59.47
Ng1-f3      963   55.76%   29.08%   53.37%   17.55%    100   60.10
 c2-c4      651   57.68%   32.10%   51.15%   16.74%    100   62.96
 g2-g3       84   59.52%   35.71%   47.62%   16.67%    100   74.21
 b2-b3       13   61.54%   38.46%   46.15%   15.38%    100   98.87
 f2-f4        5   60.00%   40.00%   40.00%   20.00%    100  120.19
 b2-b4        2   25.00%    0.00%   50.00%   50.00%    100  120.17
 c2-c3        1    0.00%    0.00%    0.00%  100.00%    100  134.60
Nb1-c3        1   50.00%    0.00%  100.00%    0.00%    100  184.60
Look at the last column, Sortv (i.e. sort value), calculated according to the UCB formula with exploration coefficient C = 0.44. Clearly it prefers to explore, so even c2-c3 (which never won) gets much more preference than the reliable e2-e4 or d2-d4. All that this implies is that C should be lowered. If C is set to 0, selection is proportioned based on the average score alone (3rd column), which also implies c2-c3 is never played (again bad). Even with C = 0.05, the moves that are rarely played get a higher Sortv:

Code: Select all

  Move   Played    Score     W        D        L     Learn   Sortv
 e2-e4     5351   57.17%   32.91%   48.51%   18.58%    100   57.38
 d2-d4     4515   57.46%   31.63%   51.67%   16.70%    100   57.69
Ng1-f3      963   55.76%   29.08%   53.37%   17.55%    100   56.26
 c2-c4      651   57.68%   32.10%   51.15%   16.74%    100   58.28
 g2-g3       84   59.52%   35.71%   47.62%   16.67%    100   61.19
 b2-b3       13   61.54%   38.46%   46.15%   15.38%    100   65.78
 f2-f4        5   60.00%   40.00%   40.00%   20.00%    100   66.84
 b2-b4        2   25.00%    0.00%   50.00%   50.00%    100   35.82
 c2-c3        1    0.00%    0.00%    0.00%  100.00%    100   15.30
Nb1-c3        1   50.00%    0.00%  100.00%    0.00%    100   65.30
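
The Sortv values can be reproduced from the UCB formula given earlier, with both the mean score and the exploration term expressed in percent, and the parent's visit count taken as the sum of the Played column (11586). A quick check (Python; the function name is mine):

Code: Select all

import math

TOTAL_PLAYS = 11586  # sum of the Played column = parent visits

def sortv(score_pct, played, C):
    return score_pct + C * math.sqrt(math.log(TOTAL_PLAYS) / played) * 100

print(sortv(57.17, 5351, 0.05))  # e2-e4 -> 57.38, as in the table above
print(sortv(0.00, 1, 0.44))      # c2-c3 -> 134.60 in the C = 0.44 table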

Re: My new book

Posted: Thu Jan 02, 2014 10:39 pm
by jdart
Many times the outcome of a game is not determined by the opening. So tuning by wins/losses is problematic IMO, at least with relatively small sample sizes.

I also have some automated book tuning that is based on scores a few moves out of book. I have result learning too, but moves that lead to low scores are de-emphasized.

--Jon

Re: My new book

Posted: Thu Jan 02, 2014 10:56 pm
by hgm
Couldn't this be mimicked by score-based adjudication? WinBoard has the possibility to adjudicate a win when both engines agree for 3 moves that the score is above a certain limit. You could set that limit to 10 cP, if that is your window of acceptability. And you could turn the 50-move rule into (say) a 4-move rule, so that a draw is adjudicated when the out-of-book score does not get above 10 cP. (I might have to change that adjudication in WinBoard for this, so that it is not applied while the reversible-move counter is still below (say) 30 or while you are still in the GUI book, even when the user set the number of moves lower than that.)

Re: My new book

Posted: Thu Jan 02, 2014 11:00 pm
by Daniel Shawul
I just found some discussion here where Don Dailey suggested something along these lines:
I guess the bottom line here is that no methods is really going to work that well unless it is strictly controlled. If humans don't like the best move or it's out of fashion, it won't get represented like it should. Same with computer programs. So I think it should be structured similar to Monte Carlo Tree Search where the games and frequency of play for any given move are controlled by a higher level function that adjusts this frequency as more statistical evidence comes in.
http://www.talkchess.com/forum/viewtopi ... 34&t=41529

The problem with book construction is that the frequency of book moves is equally as important as the calculated score. So one has to mix those two and not just use the WDL score. A UCB with a very low exploration coefficient seems to achieve that, but exploration seems less important, especially given that the frequency of book moves is itself an indication of quality. I am sure there are books that play based on frequency alone, plus randomization...
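
For comparison, frequency-based selection with randomization is straightforward roulette-wheel sampling (a sketch; such a book would follow the source games' move distribution exactly, with no learning):

Code: Select all

import random

def pick_by_frequency(moves):
    """moves: list of (move, played_count) pairs for one position.
    Each move is chosen with probability proportional to how often
    it was played in the source games."""
    total = sum(count for _, count in moves)
    r = random.uniform(0, total)
    acc = 0
    for move, count in moves:
        acc += count
        if r <= acc:
            return move
    return moves[-1][0]  # guard against floating-point round-off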

Re: My new book

Posted: Fri Jan 03, 2014 1:11 am
by Daniel Shawul
jdart wrote:Many times the outcome of a game is not determined by the opening. So tuning by wins/losses is problematic IMO, at least with relatively small sample sizes.
Well, you need some sort of feedback to learn, so we have to use either the result of the game or a terminal search/eval() if we are afraid that the opening does not have much effect on the end result. Another problem is that book makers expect us to use 'supervised learning' (with them as supervisors), playing moves based on the frequency distribution established by the original PGN; the model does not change afterwards. For example, e2e4 has been played 5000 times while c2c3 just once, so, assuming both have equal rewards, the latter has a lot of catching up to do if we want to adapt the model. OTOH the book makers are indirectly telling us not to play this move frequently.

Reinforcement learning changes the frequencies or adds new variations based on our own game results, and has an exploration/exploitation dilemma that was absent before; the former expects us to exploit only the suggested move-frequency distribution. For this reason it may be better to stick with one of them alone: either don't update the original book in any way, or let the engine build its book all by itself from pure reinforcement learning, so that the move-frequency distributions follow whatever selection algorithm it is using for learning. We can still use the original book to seed values for the latter, but not in such a way that it takes far too long for some moves to lower their standard deviation, so that we end up playing bad book moves for the sake of reducing SD (i.e. the e2e4/c2c3 case).

Re: My new book

Posted: Fri Jan 03, 2014 3:55 pm
by jdart
I think Don is right in that you can generally trust win/loss/draw statistics if you have a big enough sample, and that generally means moves nearer the root.

On the other hand, even frequently played moves have the problem that there can be a recent refutation. You will see this a lot in chess databases. Just look for moves that dropped off the play list for GMs within the last year or two. Usually this is not fashion. It is because there is either an outright refutation, or at least a safe line for the opponent that makes the move less attractive.

And near the leaves this is a very serious problem. If you see +8 -4 =2 in the statistics, that does not necessarily mean the move is good. That could mean several players were successful with it but their opponents did not know the right way to counter it. And when their opponents did know, or discovered, the right way to proceed, the move was losing or drawing at best. Sometimes only a single game has the "best" move, and you might easily discard it, but it is the critical line.

Unfortunately I think there is no substitute for having an accurate evaluation at the leaves, and this is the whole crux of the problem. And there is no way to get that evaluation easily. The best way is probably to match multiple strong programs from a given starting position or a set of related positions, using learning over many games. If you did this over all the leaves in a book you would eat up years of CPU time, but you would finally wind up with a book that wasn't just guesswork in many of the nodes.

I am not saying book construction and book learning can't be done without this, and still be effective, but without this they are subject to error.

--Jon