Nebiyu-MCTS vs TSCP 1-0

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Michel »

It turned out not to be as great as I hoped. It is definitely much better than TSCP now, but its tactical blindness is still there. AlphaZero's trick may be that having a bunch of threads searching the MCTS tree with a low exploration coefficient eventually compensates for that tactical blindness. There may be ways to emulate the LMR effect by careful management of the exploration coefficient using heuristics, progressive widening (pruning), etc., but plain MCTS as in the AlphaZero paper seems to be very weak on one thread unless I am missing something.
I must confess these things are unclear in my mind. What would be the difference between A/B and UCT with low exploration? I guess A/B searches (in principle) all moves in ALL nodes and UCT has no such concept. But at low depth the difference should not be very big. Or am I misunderstanding this?
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Daniel Shawul »

Michel wrote:
It turned out not to be as great as I hoped. It is definitely much better than TSCP now, but its tactical blindness is still there. AlphaZero's trick may be that having a bunch of threads searching the MCTS tree with a low exploration coefficient eventually compensates for that tactical blindness. There may be ways to emulate the LMR effect by careful management of the exploration coefficient using heuristics, progressive widening (pruning), etc., but plain MCTS as in the AlphaZero paper seems to be very weak on one thread unless I am missing something.
I must confess these things are unclear in my mind. What would be the difference between A/B and UCT with low exploration? I guess A/B searches (in principle) all moves in ALL nodes and UCT has no such concept. But at low depth the difference should not be very big. Or am I misunderstanding this?
It may be better to compare MCTS with minimax instead, as there is a big difference in branching factors. If I use a low exploration coefficient the MCTS tree goes really deep, and if I don't (say, exploring all moves uniformly) it becomes minimax.
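For illustration, here is a minimal UCB1 selection sketch in Python (the Node fields and ucb_select name are made up, not Nebiyu's code) showing how the exploration coefficient trades off between a very deep greedy tree and a wide, almost uniform one:

Code: Select all

import math

class Node:
    def __init__(self, children):
        self.children = children   # list of child Nodes
        self.visits = 0            # N(s,a) as seen from the parent
        self.mean = 0.0            # running mean of rollout/eval results

def ucb_select(node, c):
    """Pick the child maximizing mean + c * sqrt(ln N / n).

    c -> 0  : purely greedy, the tree grows very deep along the best line.
    c large : visits are spread almost uniformly, the tree grows wide and
              shallow, which behaves much like plain minimax with averaging.
    """
    total = sum(ch.visits for ch in node.children) + 1
    def score(ch):
        if ch.visits == 0:
            return float('inf')    # always try unvisited moves once
        return ch.mean + c * math.sqrt(math.log(total) / ch.visits)
    return max(node.children, key=score)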

To complicate matters, AlphaZero's MCTS is not really MCTS, as it cuts off the rollouts with the NN. If I replace the NN with just qsearch(), I am not really looking far ahead like the rollouts do. What this means is that the UCB selection formula based on the mean action value is not that accurate, since the mean is accumulated from explorations of the move to different depths. Say I come up with a formula that ignores the mean action value altogether and chooses moves to explore (using heuristics) so as to get an LMR-shaped tree. If I have a good move ordering, I could assign prior probabilities so that earlier moves are selected more often without looking at the mean action value. This amounts to no exploitation (because we did not use the mean action value) and a static non-uniform exploration of moves (again, the confidence interval is not used). With that I have something equivalent to minimax+LMR, but to get alpha-beta+LMR I would have to find a way to propagate bounds between sibling moves. With standard MCTS with UCB, it may be possible to formulate the A/B bounds, based on the confidence interval, to be used for the search(d) at the leaves.
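A rough sketch of that "static non-uniform exploration" idea (hypothetical names, not Nebiyu's code): selection uses only move-ordering priors and visit counts, never the mean action value, so early moves get revisited far more often and the tree ends up LMR-shaped.

Code: Select all

from dataclasses import dataclass

@dataclass
class Child:
    prior: float     # from move ordering (or a policy), fixed before the search
    visits: int = 0  # how often this move has been selected so far

def ordering_priors(num_moves, decay=0.7):
    """Geometrically decaying priors by move-ordering rank, normalized to 1."""
    raw = [decay ** i for i in range(num_moves)]
    total = sum(raw)
    return [p / total for p in raw]

def prior_only_select(children):
    """Pick the child maximizing prior / (1 + visits).

    No exploitation term (the mean action value is never read) and a static
    non-uniform exploration: the first moves in the ordering are searched
    deeply, the late ones only shallowly."""
    return max(children, key=lambda ch: ch.prior / (1.0 + ch.visits))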

Daniel
abulmo2
Posts: 433
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Nebiyu-MCTS vs TSCP 1-0

Post by abulmo2 »

tmokonen wrote:Have you tried MCTS with Nebiyu reversi, and if so how successful was it for you?
I do not know the internals of NebiyuReversi, but a few years ago, when I made my Othello program Edax compatible with winboard, I ran a few tournaments between them. Compared to Edax, NebiyuReversi is very weak. Given 1 minute to play 40 moves, NebiyuReversi was weaker than Edax doing a fixed 1-ply search. Here is a summary of the results, with edax1 .. edax5 thinking instantly at fixed depths from 1 to 5:

Code: Select all

   # PLAYER                : RATING  ERROR   POINTS  PLAYED    (%)
   1 edax5                 : 2667.5  117.2    296.0     300   98.7%
   2 edax4                 : 2610.8   97.7    294.5     300   98.2%
   3 edax3                 : 2390.3   53.0    281.5     300   93.8%
   4 edax2                 : 2270.9   45.9    265.5     300   88.5%
   5 edax1                 : 1947.2   39.6    164.5     300   54.8%
   6 NebiyuReversi_1.42    : 1913.2   31.2    198.0    1500   13.2%
Richard Delorme
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Daniel Shawul »

I am more interested in whether Reversi is one of the games that are suitable for MCTS, and it seems that was the case. Reversi, along with Hex and Amazons, has MCTS as the default because I think it did better than alpha-beta for Nebiyu. You have a very strong engine -- probably with a very good eval -- that Nebiyu cannot compete with.

Chess was the worst for me at that time because the rollouts returned garbage.

Daniel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Michel »

It may be better to compare MCTS with minimax instead, as there is a big difference in branching factors. If I use a low exploration coefficient the MCTS tree goes really deep, and if I don't (say, exploring all moves uniformly) it becomes minimax.

To complicate matters, AlphaZero's MCTS is not really MCTS, as it cuts off the rollouts with the NN. If I replace the NN with just qsearch(), I am not really looking far ahead like the rollouts do. What this means is that the UCB selection formula based on the mean action value is not that accurate, since the mean is accumulated from explorations of the move to different depths. Say I come up with a formula that ignores the mean action value altogether and chooses moves to explore (using heuristics) so as to get an LMR-shaped tree. If I have a good move ordering, I could assign prior probabilities so that earlier moves are selected more often without looking at the mean action value. This amounts to no exploitation (because we did not use the mean action value) and a static non-uniform exploration of moves (again, the confidence interval is not used). With that I have something equivalent to minimax+LMR, but to get alpha-beta+LMR I would have to find a way to propagate bounds between sibling moves. With standard MCTS with UCB, it may be possible to formulate the A/B bounds, based on the confidence interval, to be used for the search(d) at the leaves.
Well, I am more and more convinced that it is the NN that makes the difference in A0. UCT is just a tree search algorithm. It is extremely hard to believe that UCT could be so much better than A/B that it can offset a 1000-fold difference in evaluation speed. The authors claim that in A/B search evaluation errors propagate more easily to the root, but this does not seem to be observed in practice. A/B averages as well, although the mechanism through which this occurs is less clear to me.

So while I am quite skeptical about what can be achieved by a different search algorithm, I can easily believe that a NN can yield an infinitely better evaluation function than the standard "linear combination of features" used in almost all chess engines today. NNs are vastly more flexible, but with this flexibility of course also comes complexity, and the real innovation in A0 may be the fact that the authors found a way to harness this complexity.

A distinguishing feature of A0 is the fact that the NN outputs a prior for moves. This is probably extremely important during training since in this way a single game generates much more data than just its outcome.

However, I wonder if it is very important in actual play. The move ordering used by state-of-the-art chess engines is already pretty good (most of the time a cutoff, if it exists, is given by the first move, even in weak engines such as gnuchess).

Perhaps one aspect that is important for move ordering is that the NN is also fed the last 8 positions. This likely helps it pursue common tactical patterns, like Bxh7, Ng5+, Qh5. It is a bit like "followup" history in SF, except that in SF's case the followup history is generated during search and in A0 it might be generated during training.
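For readers unfamiliar with it, here is a much-simplified sketch of such a followup-history table (illustrative names, not Stockfish's actual data structures): it is indexed by our previous move and the current move, and bumped whenever that pair produces a cutoff.

Code: Select all

from collections import defaultdict

followup_history = defaultdict(int)   # (our previous move, current move) -> score

def update_on_cutoff(prev_own_move, move, depth):
    """Reward a move that caused a cutoff two plies after our previous move."""
    followup_history[(prev_own_move, move)] += depth * depth

def followup_bonus(prev_own_move, move):
    """Move-ordering bonus: prefer replies that historically followed up well."""
    return followup_history[(prev_own_move, move)]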
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: Nebiyu-MCTS vs TSCP 1-0

Post by trulses »

Michel wrote: The authors claim that in A/B search evaluation errors propagate more easily to the root, but this does not seem to be observed in practice. A/B averages as well, although the mechanism through which this occurs is less clear to me.
I think this comment was made with neural network eval in mind. Deep models with non-linearities suffer from spurious misevaluations; this isn't really an issue with the current linear eval models used in most AB-based engines.

Here's the output of an AB search on a perfectly ordered minimax game (max nodes are white and min nodes are grey):
https://i.imgur.com/sJB5gXr.png

Imagine the leftmost leaf node being changed: it would change the value of the entire tree. Imagine changing some of the other leaf nodes and you'll notice how sensitive this tree is to leaf node values, especially along the principal variation.

In comparison, the most extreme misevaluation in MCTS can only introduce an error of |max - min| / N(s, a) to the action-value estimate, where N(s, a) is the visit count along the edge a from the state s. Since you select the actions with the highest visit count, the principal variation is also the branch least sensitive to errors.
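A tiny numeric illustration of that bound (made-up numbers): with values in [-1, +1], one maximally wrong backup can shift a well-visited edge's mean by at most |max - min| / (N(s, a) + 1).

Code: Select all

def mean_after_one_bad_backup(old_mean, n_visits, wrong_value):
    """Running mean after one maximally wrong backup.

    The shift is (wrong_value - old_mean) / (n_visits + 1), bounded by
    |max - min| / (n_visits + 1), so a well-visited edge barely moves."""
    return old_mean + (wrong_value - old_mean) / (n_visits + 1)

# Edge visited 1000 times with mean +0.30; one backup returns the worst value -1.
print(mean_after_one_bad_backup(0.30, 1000, -1.0))  # ~0.2987, a shift of ~0.0013
# The same wrong leaf in a minimax/alpha-beta tree can flip the root outright
# if it happens to sit on the principal variation.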
Michel wrote:A distinguishing feature of A0 is the fact that the NN outputs a prior for moves. This is probably extremely important during training since in this way a single game generates much more data than just its outcome.

However, I wonder if it is very important in actual play. The move ordering used by state-of-the-art chess engines is already pretty good (most of the time a cutoff, if it exists, is given by the first move, even in weak engines such as gnuchess).
There's a lot of strength in the move priors; you can see this in the original AG0 paper, where they stripped out the search and just played the most probable move, therefore ignoring the eval head entirely. This naked move-probability network from AG0 was slightly weaker than the original AlphaGo Fan running a full search on 176 GPUs.

There's also more information than just move ordering in the move priors: between them they encode the relative scale of each subtree. I think Matthew Lai used move probabilities in a similar manner in Giraffe.

(On a side note, it would be interesting to take the network from A0, drop it into the probability-based AB search from Giraffe, and see what comes out. Since they mention the interaction between neural nets and AB search, this might've been on their radar.)
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Daniel Shawul »

Michel wrote: Well, I am more and more convinced that it is the NN that makes the difference in A0. UCT is just a tree search algorithm. It is extremely hard to believe that UCT could be so much better than A/B that it can offset a 1000-fold difference in evaluation speed. The authors claim that in A/B search evaluation errors propagate more easily to the root, but this does not seem to be observed in practice. A/B averages as well, although the mechanism through which this occurs is less clear to me.
Things are not clear to me, and probably to many others. I asked a question on the computer go mailing list about MCTS and tactics, and it seems nobody knows how tactics such as ladders (we have a lot of tactics in chess & shogi) are handled in AlphaZero. In AlphaGo Lee/Master it turned out they had explicitly programmed ladders (shown in extended table 2), which basically means a 1-ply lookahead can resolve full-board ladders. How they did it in AlphaZero nobody knows. It is mentioned that ladders were the last of the features to be learned by AlphaGo Zero -- where it is not clear which (the NN or the MCTS) learned them. In chess and shogi you have way more tactics, and you will suffer from that even if you take out the rollouts. It will be very interesting to see the pure policy network player's Elo for chess and shogi. In AlphaGo Zero it is worth 3000 Elo, but I doubt that would be the case for chess and shogi. We and the reviewers should demand such a plot before publication :) -- it would settle the question of the NN vs. the MCTS.
So while I am quite skeptical about what can be achieved by a different search algorithm, I can easily believe that a NN can yield an infinitely better evaluation function than the standard "linear combination of features" used in almost all chess engines today. NNs are vastly more flexible, but with this flexibility of course also comes complexity, and the real innovation in A0 may be the fact that the authors found a way to harness this complexity.
Yes, a good test would be to use their NN with Stockfish's search and match it against A0. They were not motivated to do that in the A0 paper because they don't want to mix in domain-specific knowledge.
A distinguishing feature of A0 is the fact that the NN outputs a prior for moves. This is probably extremely important during training since in this way a single game generates much more data than just its outcome.

However, I wonder if it is very important in actual play. The move ordering used by state-of-the-art chess engines is already pretty good (most of the time a cutoff, if it exists, is given by the first move, even in weak engines such as gnuchess).
They use the priors to guide the MCTS selection -- which is its strength and its weakness at the same time. A move with a low prior is visited less often and its subtree examined less, which may result in "search traps" (combinations that are missed because one of the moves looks bad according to the prior). The trick could be to avoid that kind of situation as much as possible and exploit what is potentially an infinitely better NN eval.
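For concreteness, a sketch of that prior-guided selection in the shape of AlphaZero's PUCT formula (simplified and single-threaded; Edge, puct_select and the cpuct value are illustrative):

Code: Select all

import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float            # P(s,a) from the policy network
    visits: int = 0         # N(s,a)
    value_sum: float = 0.0  # sum of values backed up through this edge

def puct_select(edges, cpuct=1.5):
    """Pick argmax of Q(s,a) + cpuct * P(s,a) * sqrt(N(s)) / (1 + N(s,a)).

    An edge with a tiny prior gets almost no exploration bonus, so its
    subtree is rarely examined -- which is how a refutation the policy
    dislikes can stay hidden (a "search trap")."""
    n_parent = sum(e.visits for e in edges)
    def score(e):
        q = e.value_sum / e.visits if e.visits else 0.0
        # +1 under the sqrt so an unvisited parent still gives a prior-weighted bonus
        u = cpuct * e.prior * math.sqrt(n_parent + 1) / (1 + e.visits)
        return q + u
    return max(edges, key=score)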
Perhaps one aspect that is important for move ordering is that the NN is also fed the last 8 positions. This likely helps it pursue common tactical patterns, like Bxh7, Ng5+, Qh5. It is a bit like "followup" history in SF, except that in SF's case the followup history is generated during search and in A0 it might be generated during training.
OK, I see, learning the "last best reply" (counter/continuation moves) could help in the selection phase, I guess.

Daniel
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Nebiyu-MCTS vs TSCP 1-0

Post by Daniel Shawul »

Perhaps one aspect that is important for move ordering is that the NN is also fed the last 8 positions. This likely helps it pursue common tactical patterns, like Bxh7, Ng5+, Qh5. It is a bit like "followup" history in SF, except that in SF's case the followup history is generated during search and in A0 it might be generated during training.
We may have underestimated the importance of the policy network. One would think the value network would suffice (i.e. doing a 1-ply lookahead and probing the value network should give a prior probability distribution over the moves, i.e. the policy network). But they chose to train a separate policy network that independently evaluates the goodness of moves -- as you pointed out, also looking at the last 8 positions (and probably figuring out from there where the action is concentrated).
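To make the comparison concrete, here is a rough sketch of such an "implied policy" from a 1-ply value lookahead (value_net, position.make_move and moves are hypothetical stand-ins): it costs one network evaluation per legal move, whereas a trained policy head returns the whole distribution in a single forward pass of the current position plus its history planes.

Code: Select all

import math

def implied_policy_from_value(position, moves, value_net, temperature=1.0):
    """Derive move probabilities by a 1-ply lookahead over a value network.

    Each legal move costs one network evaluation; the child value is negated
    because it is scored from the opponent's point of view."""
    scores = [-value_net(position.make_move(m)) for m in moves]
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return {m: e / total for m, e in zip(moves, exps)}

# A separate policy head instead returns this whole distribution from one
# forward pass (position + history planes) and can be trained directly on
# the search's visit counts.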