Q1: Why do we encode move history into the neural nets, as opposed to encoding just the current position?
There is a very clear explanation from David Wu, author of KataGo:
For alpha-zero-style training, input features that take into account recent history are often genuinely informative and valuable.
Why? Because if your data is taken from self play by an MCTS player, by giving the recent moves to the net, you are essentially telling the net:
"The MCTS search, which is vastly stronger than you the raw neural net, decided on the prior turn(s) that moves <recent history moves> were the likely best moves - please try to find the best next move conditional on the assumption that so far that the history so far is likely good play."
And that's a genuinely useful piece of information for making better predictions that help the search be better.
And why would it help the search, when the search will also consider many branches in which the history of play in those branches will contain terrible blunders and bad moves, violating the assumption implicit in the input to the net that the history is good play?
Well, in the abstract, in a game tree search it's most important that the policy and value estimates be good on or near the main line (i.e. principal variation) - and on the main line the history will be good play, so the assumption implicit in the input to the net is correct. By contrast, in explorative lines where one or both sides have blundered, it's not as important to avoid further mistakes - the winning side only needs to play well enough to refute the losing side and prove the line is not the main line, while the losing side blundering doesn't matter because the losing side was losing and won't be choosing this line anyways. (This is very similar to the principle behind alpha-beta pruning).
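As a concrete illustration of what "giving the recent moves to the net" looks like, here is a minimal Python sketch of stacking the last few positions into the input planes. The plane layout, history length, and function name are illustrative assumptions, not Lc0's or KataGo's exact input format.

```python
import numpy as np

HISTORY_LEN = 8  # illustrative; the real engines pick their own history length and layout

def encode_input(position_planes_history):
    """Stack the current position with the last few positions so the net
    can see what the (much stronger) search chose on recent turns.

    position_planes_history: list of arrays, most recent first, each of
    shape (P, 8, 8) holding piece-occupancy planes for one position.
    """
    planes = []
    for i in range(HISTORY_LEN):
        if i < len(position_planes_history):
            planes.append(position_planes_history[i])
        else:
            # pad with zeros when fewer than HISTORY_LEN positions exist (near game start)
            planes.append(np.zeros_like(position_planes_history[0]))
    return np.concatenate(planes, axis=0)  # shape: (HISTORY_LEN * P, 8, 8)
```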
Another way to approach it is with the idea of consistency. For example, consider a toy 1-player game where a bot starts on one of the N tiles of an N-tile hallway; on the leftmost tile is a lever, and on the rightmost tile is a lever. The actions are to walk one tile left, or right, or pull a lever on the current tile if there is one. One lever is good, one is bad. And there's also a small penalty for wasting time just walking back and forth not pulling a lever.
Suppose the neural net can't accurately predict the value of pulling the levers because it's "tactically complex" for some reason, and it might as well be equally likely that either one is best. But suppose the MCTS search can evaluate accurately, once it actually simulates what happens with a lever pull.
Well, no matter which tile you start on, it's about 50-50 whether the optimal action is to walk left (and/or pull the left lever if you can), versus walking right (and/or pulling the right lever if you can).
However, suppose you know you didn't merely start on a tile, but rather that a strong MCTS bot decided on the prior turn to walk left to this tile, and that this bot had a search deep enough to reach the levers and simulate the results. That's pretty strong evidence that you should continue to walk/pull left. An AlphaZero-style policy net with a recent-history input will likely learn to bias toward going left if it sees a left move in its history input, and toward going right if it sees a right move.
And... that's exactly what you want in a search. In the branch of a search where you first try a left walk, you shouldn't waste time afterwards trying a right walk that simply undoes your movement and incurs a penalty. Conditional on left being optimal on the first step, left is optimal again, and conditional on right being optimal, right is optimal again. You would prefer to linearly search the 2 straight walks to the levers, and prune all the 2^n zigzagging paths that are provably non-optimal conditional on the first move. So, in principle, recent-history inputs can help AlphaZero-trained nets avoid suggesting moves that cannot be part of correct lines given earlier moves (i.e. are inconsistent with earlier moves), making search more efficient.
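To make the counting argument concrete, here is a tiny Python sketch of the hallway game's walk sequences. The step count is made up for illustration; it just compares how many left/right walk sequences a naive search could consider against the two "straight" walks that a history-consistent policy keeps.

```python
from itertools import product

MAX_STEPS = 4  # walk steps a shallow search might consider (illustrative)

# Every possible left/right walk sequence the search could try.
all_walks = list(product(("L", "R"), repeat=MAX_STEPS))

# A history-consistent policy keeps only walks that continue in the same
# direction as the first step: conditional on "left" being optimal on the
# first step, it is optimal again on every later step (likewise for "right").
straight_walks = [w for w in all_walks if len(set(w)) == 1]

print(len(all_walks), "possible walks;", len(straight_walks), "remain after pruning zigzags")
# -> 16 possible walks; 2 remain after pruning zigzags
```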
Q2: Is the lack of history the reason that LC0 does not analyze given positions? (in the build I downloaded it is unsupported)
No. This is likely user error.
Q3: Why is the policy output head not simply from->to (64x64)? Is it only because of e.p. and promotion, or are there other reasons?
Attention policy was probably better motivated when Lc0 used CNNs - from the mouth of the Seer author:
for a CNN base network, this policy encoding is very natural as it takes advantage of spatial locality (local information about the to square and the from square can be collected separately and combined via the dot product -> distant information doesn't need to be propagated to learn a reasonable policy)
...in chess, information about both the to square and the from square is required to determine if a move is good. Those squares can be very distant in chess + CNNs can struggle to propagate distant spatial information even in the case of very deep networks with total receptive field coverage.
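To show the shape of the encoding the quote describes, here is a rough numpy sketch of a dot-product policy head: each square gets a "from" embedding and a "to" embedding, and the logit for a move is their dot product. The dimensions and random weights are placeholders, not Lc0's actual head.

```python
import numpy as np

def dot_product_policy(square_features, d_key=32, seed=0):
    """square_features: (64, C) per-square features from the network body.

    Each square is projected to a "from" embedding and a "to" embedding;
    the logit for move (from_sq, to_sq) is their dot product, giving a
    64x64 map. Distant from/to information is combined here at the head
    rather than having to meet inside the convolutional trunk.
    Weights are random placeholders standing in for learned projections.
    """
    rng = np.random.default_rng(seed)
    C = square_features.shape[1]
    W_from = rng.standard_normal((C, d_key)) * 0.01
    W_to = rng.standard_normal((C, d_key)) * 0.01
    q = square_features @ W_from   # (64, d_key) "from" embeddings
    k = square_features @ W_to     # (64, d_key) "to" embeddings
    return q @ k.T                 # (64, 64) move logits

logits = dot_product_policy(np.random.default_rng(1).standard_normal((64, 128)))
print(logits.shape)  # (64, 64); illegal moves would be masked before the softmax
```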
Q4: Is it a requirement that the value and policy head are based on the same core network? Or is this only done for efficiency?
It is for efficiency. Things the network learns for predicting value are also useful for predicting moves, and vice versa. Using a fused network body allows computation to be shared and likely improves the learning dynamics. Other MCTS engines like Monty use separate networks for value and policy, and this works fine.
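A minimal sketch of the shared-body idea, in plain numpy with made-up layer sizes: the position is run through one trunk, and both heads read the same features, so the expensive part is computed once per evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative shared body: one hidden layer whose features feed both heads.
W_body = rng.standard_normal((8 * 8 * 12, 256)) * 0.01
W_policy = rng.standard_normal((256, 64 * 64)) * 0.01   # move-logits head
W_value = rng.standard_normal((256, 1)) * 0.01          # scalar value head

def forward(board_planes):
    x = board_planes.reshape(-1)   # flatten 12 piece planes of an 8x8 board
    h = relu(x @ W_body)           # shared features: computed once,
    policy_logits = h @ W_policy   # ...read by the policy head
    value = np.tanh(h @ W_value)   # ...and by the value head
    return policy_logits, value

p, v = forward(rng.standard_normal((12, 8, 8)))
print(p.shape, v.shape)  # (4096,) (1,)
```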
Q5: What is the logic in deciding when to stop searching a particular path and returning a value? I suppose we are not calculating all the way to a mate.
Each iteration descends to a leaf node of the in-memory tree, evaluates that leaf with the neural net (the value head supplies the score, so nothing is calculated out to mate unless the position is actually terminal), expands it, and back-propagates the result up the tree; then the next iteration repeats this.
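A rough Python sketch of one such iteration, assuming a hypothetical `net.evaluate(state)` that returns (move priors, value) and a hypothetical `state.play(move)`; the PUCT constant and bookkeeping are illustrative, not Lc0's exact formula or code.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # policy prior from the net
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}        # move -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

C_PUCT = 1.5  # exploration constant, illustrative

def select_child(node):
    """PUCT-style selection: balance a child's average value against its
    prior and how rarely it has been visited."""
    total = sum(c.visit_count for c in node.children.values())
    def score(child):
        u = C_PUCT * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.value() + u
    return max(node.children.items(), key=lambda mc: score(mc[1]))

def run_iteration(root, state, net):
    """One iteration: walk to a leaf, evaluate it with the net (no search
    to mate), expand it, and back up the result."""
    node, path = root, [root]
    while node.children:                      # 1. select down to a leaf
        move, node = select_child(node)
        state = state.play(move)
        path.append(node)
    priors, value = net.evaluate(state)       # 2. evaluate the leaf with the net
    for move, p in priors.items():            # 3. expand the leaf
        node.children[move] = Node(prior=p)
    for n in reversed(path):                  # 4. back-propagate the value
        n.visit_count += 1
        n.value_sum += value
        value = -value                        # flip perspective each ply (sketch convention)
```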