I’m a bit stuck on seeing the conceptual difference between TD (as described in the various TD-Gammon write-ups) and the RL used in AZ, LC0 and NNUE-SF, unless it comes down to how the target value of each training position is derived:
NNUE-SF uses a lambda mix of the game result and the SF eval at N ply (as far as I can tell).
LC0 uses the game result (IIRC).
TD-Gammon uses the game result, modified by move number. E.g. the final move in a won game scores 1.0, the first move 0.5(?), and the remaining moves are interpolated. Or am I misreading?
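The lambda mix mentioned for NNUE-SF can be sketched in a few lines. This is a minimal illustration, assuming the game result and the search eval are scaled to the same range; the function name and signature are my own, not Stockfish's:

```python
# Sketch of a lambda-mixed training target: interpolate between the final
# game result and a shallow-search eval for the position.
# lam = 1.0 -> pure game result; lam = 0.0 -> pure search eval.
def mixed_target(game_result: float, search_eval: float, lam: float) -> float:
    return lam * game_result + (1.0 - lam) * search_eval
```

So with lam = 0.5, a won game (result 1.0) and a search eval of 0.2 would give a target of 0.6.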
TD learning by self play (TD-Gammon)
- Full name: Srdja Matovic
Re: TD learning by self play (TD-Gammon)
Dunno about the others, but KnightCap and Giraffe for example used TDLeaf:
https://www.chessprogramming.org/Tempor ... ing#TDLeaf
https://www.chessprogramming.org/KnightCap
https://www.chessprogramming.org/Giraffe
--
Srdja
- Full name: Jacek Dermont
Re: TD learning by self play (TD-Gammon)
NNUE-SF uses labeled positions; the labels come from a shallow search and/or game results. This can be seen more as (self-)supervised learning from samples labeled by an 'expert'.
A0/LC0 is similar in that regard: it generates positions labeled by its own game-tree search (value and policy) into a buffer, and in the learning stage it trains on that buffer.
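The buffer-based loop described above can be sketched roughly like this. All names here are illustrative, not the actual A0/LC0 code: self-play pushes labeled samples into a bounded buffer, and the training stage draws random minibatches from it:

```python
import random
from collections import deque

# Bounded replay buffer: old self-play samples fall out as new ones arrive.
buffer = deque(maxlen=100_000)

def store(position, value_target, policy_target):
    """Self-play stage: append one (position, value, policy) sample."""
    buffer.append((position, value_target, policy_target))

def sample_minibatch(batch_size=32):
    """Learning stage: draw a random minibatch from the buffer."""
    return random.sample(list(buffer), min(batch_size, len(buffer)))
```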
Temporal difference learning learns after every move: it adjusts the parameters so that this position's value is closer to the next position's value. That may seem counterintuitive, especially when the weights start out randomly initialized, but near the endgame, at the very least, positions receive a real value. With more time and training those real values propagate to earlier and earlier points in the game. AFAIK TD-Gammon used TD(0), so it adjusted the weights just one step ahead. There are variants that propagate the update into earlier positions as well, with some decay that 'interpolates' the results; I think that's called eligibility traces.
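The one-step update described above looks like this in the simplest (tabular) setting. A minimal sketch, not TD-Gammon's actual network update; the real thing adjusts neural-net weights along the gradient rather than table entries:

```python
# Tabular TD(0): nudge V(state) toward V(next_state), or toward the real
# game result when the next position is terminal.
def td0_update(values, state, next_state, alpha=0.1, terminal_result=None):
    v = values.get(state, 0.0)
    if terminal_result is not None:
        target = terminal_result          # real value at the end of the game
    else:
        target = values.get(next_state, 0.0)
    values[state] = v + alpha * (target - v)
    return values[state]
```

Running this over many games shows the propagation the post describes: the terminal result first pulls up the last position's value, and on later passes that value pulls up the positions before it.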
And yes, there are TD variants that use a minimax tree, where you adjust the parameters toward the leaf of the principal variation, like TD-Leaf or RootStrap/TreeStrap: https://www.chessprogramming.org/KnightCap https://www.chessprogramming.org/Meep
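The TD-Leaf idea can be sketched as follows: compare the evaluations at the leaves of successive principal variations, and push the earlier leaf's eval toward the later one. The linear eval and its gradient here are illustrative stand-ins, assuming a dot-product evaluation rather than a real engine's:

```python
# Sketch of one TD-Leaf step with a linear evaluation (dot product of
# weights and features). For a linear eval, the gradient w.r.t. the
# weights is simply the feature vector of the PV leaf.
def tdleaf_step(weights, leaf_features_t, leaf_features_t1, alpha=0.01):
    eval_t = sum(w * f for w, f in zip(weights, leaf_features_t))
    eval_t1 = sum(w * f for w, f in zip(weights, leaf_features_t1))
    delta = eval_t1 - eval_t  # temporal difference between successive PV leaves
    return [w + alpha * delta * f for w, f in zip(weights, leaf_features_t)]
```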