I’m a bit stuck on seeing the conceptual difference between TD (as described in the various TD-Gammon write-ups) and the RL used in AZ, LC0 and NNUE-SF, unless it comes down to how the target value of each training position is derived:
NNUE-SF uses a lambda mix of the game result and the SF eval at N ply (as far as I can tell).
LC0 uses the game result (IIRC).
TD-Gammon uses the game result, modified by move number. E.g. the final move in a won game scores 1.0, the first move 0.5(?), and the remaining moves are interpolated. Or am I misreading?
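The lambda mix mentioned for NNUE-SF can be sketched in a few lines. This is a minimal illustration, assuming the game result and the search eval are scaled to the same range; the function name and signature are my own, not Stockfish's:

```python
# Sketch of a lambda-mixed training target: interpolate between the final
# game result and a shallow-search eval for the position.
# lam = 1.0 -> pure game result; lam = 0.0 -> pure search eval.
def mixed_target(game_result: float, search_eval: float, lam: float) -> float:
    return lam * game_result + (1.0 - lam) * search_eval
```

So with lam = 0.5, a won game (result 1.0) and a search eval of 0.2 would give a target of 0.6.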
TD learning by self play (TD-Gammon)
- Full name: Srdja Matovic
Re: TD learning by self play (TD-Gammon)
Dunno about the others, but KnightCap and Giraffe for example used TDLeaf:
https://www.chessprogramming.org/Tempor ... ing#TDLeaf
https://www.chessprogramming.org/KnightCap
https://www.chessprogramming.org/Giraffe
--
Srdja
- Full name: Jacek Dermont
Re: TD learning by self play (TD-Gammon)
NNUE-SF uses labeled positions; the labels come from a shallow search and/or game results. This can be seen more as (self-)supervised learning from samples labeled by an 'expert'.
A0/LC0 is similar in that regard: it generates positions labeled by its own game-tree search (value and policy) into a buffer, and in the learning stage it trains on that buffer.
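The buffer-based loop described above can be sketched roughly like this. All names here are illustrative, not the actual A0/LC0 code: self-play pushes labeled samples into a bounded buffer, and the training stage draws random minibatches from it:

```python
import random
from collections import deque

# Bounded replay buffer: old self-play samples fall out as new ones arrive.
buffer = deque(maxlen=100_000)

def store(position, value_target, policy_target):
    """Self-play stage: append one (position, value, policy) sample."""
    buffer.append((position, value_target, policy_target))

def sample_minibatch(batch_size=32):
    """Learning stage: draw a random minibatch from the buffer."""
    return random.sample(list(buffer), min(batch_size, len(buffer)))
```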
Temporal difference learning learns after every move: it adjusts the parameters so that this position's value is closer to the next position's value. That may seem counterintuitive, especially when the weights start out randomly initialized, but near the endgame, at the very least, positions receive a real value. With more time and training those real values propagate to earlier and earlier points in the game. AFAIK TD-Gammon used TD(0), so it adjusted the weights just one step ahead. There are variants that propagate the update into earlier positions as well, with some decay that 'interpolates' the results; I think that's called eligibility traces.
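The one-step update described above looks like this in the simplest (tabular) setting. A minimal sketch, not TD-Gammon's actual network update; the real thing adjusts neural-net weights along the gradient rather than table entries:

```python
# Tabular TD(0): nudge V(state) toward V(next_state), or toward the real
# game result when the next position is terminal.
def td0_update(values, state, next_state, alpha=0.1, terminal_result=None):
    v = values.get(state, 0.0)
    if terminal_result is not None:
        target = terminal_result          # real value at the end of the game
    else:
        target = values.get(next_state, 0.0)
    values[state] = v + alpha * (target - v)
    return values[state]
```

Running this over many games shows the propagation the post describes: the terminal result first pulls up the last position's value, and on later passes that value pulls up the positions before it.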
And yes, there are TD variants that use a minimax tree, where you adjust the parameters toward the leaf of the principal variation, like TD-Leaf or RootStrap/TreeStrap: https://www.chessprogramming.org/KnightCap https://www.chessprogramming.org/Meep
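The TD-Leaf idea can be sketched as follows: compare the evaluations at the leaves of successive principal variations, and push the earlier leaf's eval toward the later one. The linear eval and its gradient here are illustrative stand-ins, assuming a dot-product evaluation rather than a real engine's:

```python
# Sketch of one TD-Leaf step with a linear evaluation (dot product of
# weights and features). For a linear eval, the gradient w.r.t. the
# weights is simply the feature vector of the PV leaf.
def tdleaf_step(weights, leaf_features_t, leaf_features_t1, alpha=0.01):
    eval_t = sum(w * f for w, f in zip(weights, leaf_features_t))
    eval_t1 = sum(w * f for w, f in zip(weights, leaf_features_t1))
    delta = eval_t1 - eval_t  # temporal difference between successive PV leaves
    return [w + alpha * delta * f for w, f in zip(weights, leaf_features_t)]
```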