TD learning by self play (TD-Gammon)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

TD learning by self play (TD-Gammon)

Post by chrisw »

I’m a bit stuck on seeing the conceptual difference between TD (as described in various TD-Gammon write-ups) and RL as used in AZ, LC0 and NNUE-SF, unless it’s the derivation of the position target values used to train on.

NNUE-SF uses a lambda mix of the game result and the SF eval at N ply (as far as I can tell).
LC0 uses the game result (iirc).
TD-Gammon uses the game result, modified by move number. E.g. the final move in a won game scores 1.0, the first move 0.5?, and the remaining moves are interpolated. Or am I misreading?
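In pseudo-code, my reading of the NNUE-SF target would be something like the sketch below (the function name mix_target and the centipawn-to-probability scaling constant are my own guesses, not the actual trainer code):

    import math

    def mix_target(game_result, search_eval_cp, lam, scale=400.0):
        """Blend the final game result with a shallow-search eval.

        game_result:    1.0 win, 0.5 draw, 0.0 loss, from the side to move
        search_eval_cp: SF eval in centipawns at N ply
        lam:            1.0 -> pure eval, 0.0 -> pure game result
        scale:          centipawn-to-probability constant (my guess)
        """
        # squash the centipawn eval into the same [0, 1] range as the result
        eval_prob = 1.0 / (1.0 + math.exp(-search_eval_cp / scale))
        return lam * eval_prob + (1.0 - lam) * game_result

    print(mix_target(1.0, 150, lam=0.75))  # e.g. a won game with a +1.50 eval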

derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: TD learning by self play (TD-Gammon)

Post by derjack »

NNUE-SF uses labeled positions, where the labels come from a shallow search and/or game results. This could be seen more as (self-)supervised learning on samples labeled by an 'expert'.

A0/LC0 is similar in that regard: it generates positions labeled by itself from game tree search (value and policy) into some buffer, and in the learning stage it learns from that buffer.
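Schematically, the buffer part might look like the toy below (all names, sizes and the random stand-ins are mine; the real pipeline is of course much more involved):

    import random
    from collections import deque

    REPLAY = deque(maxlen=100_000)  # bounded buffer of (features, policy, value)

    def play_one_game():
        """Toy stand-in for one self-play game: random 'positions', a uniform
        'policy' and a random result. In the real thing the tree search
        supplies the policy target and the finished game supplies the result."""
        positions = [[random.random() for _ in range(8)] for _ in range(40)]
        policies = [[0.25] * 4 for _ in positions]
        result = random.choice([0.0, 0.5, 1.0])
        # every position in the game is labeled with the final game result
        return [(pos, pi, result) for pos, pi in zip(positions, policies)]

    def next_batch(batch_size=32):
        REPLAY.extend(play_one_game())
        return random.sample(list(REPLAY), min(batch_size, len(REPLAY)))

    batch = next_batch()  # hand this to the trainer's gradient step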

Temporal Difference Learning learns after every move, that is, it adjusts the parameters so that this position's value is closer to the next position's value. It may seem counterintuitive, especially when the weights are randomly initialized at the beginning, but at the very least near the endgame it will receive the real value. Then, with more time and training, those real values propagate to earlier positions in the game. AFAIK TD-Gammon used TD(0), so it would adjust the weights just one step ahead. There exist variants that propagate the updates back to earlier positions, with some decay that 'interpolates' the results; I think that's called eligibility traces.
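For concreteness, here is a minimal TD(0) sketch under the assumption of a linear eval squashed by a sigmoid (the feature layout, learning rate and function names are placeholders of mine, not Tesauro's actual network):

    import math, random

    def value(w, x):
        """Linear eval squashed to (0, 1) with a sigmoid."""
        s = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-s))

    def td0_update(w, x, x_next, alpha=0.01, result=None):
        """One TD(0) step: move V(x) toward V(x_next), or toward the real
        game result when the game just ended. The reward on non-terminal
        moves is zero, as in backgammon/chess where only the end counts."""
        v = value(w, x)
        target = result if result is not None else value(w, x_next)
        delta = target - v                                # TD error
        for i in range(len(w)):
            w[i] += alpha * delta * v * (1.0 - v) * x[i]  # sigmoid gradient
        return delta

    # toy usage: 8 features per 'position', randomly initialized weights
    w = [random.uniform(-0.1, 0.1) for _ in range(8)]
    pos = [random.random() for _ in range(8)]
    nxt = [random.random() for _ in range(8)]
    td0_update(w, pos, nxt)               # mid-game: bootstrap from next position
    td0_update(w, nxt, None, result=1.0)  # final move of a won game: real value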

And yes, there are TD variants with a minimax tree, where you adjust the parameters toward the leaf of the principal variation, like TD-Leaf or RootStrap/TreeStrap: https://www.chessprogramming.org/KnightCap https://www.chessprogramming.org/Meep
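A rough sketch of the TD-Leaf(lambda) idea, under the same linear-sigmoid-eval assumption as above (lambda, alpha and the helper names are placeholder choices of mine, not the KnightCap values):

    import math, random

    def value(w, x):
        """Linear eval squashed to (0, 1) with a sigmoid."""
        s = sum(wi * xi for wi, xi in zip(w, x))
        return 1.0 / (1.0 + math.exp(-s))

    def td_leaf_update(w, leaves, result, lam=0.7, alpha=0.01):
        """TD-Leaf sketch: 'leaves' are the PV-leaf feature vectors of the
        successive searches in one game, 'result' is the final outcome.
        Each leaf value gets the lambda-decayed sum of the later one-step
        TD errors as its update signal (eligibility-trace style credit)."""
        vals = [value(w, x) for x in leaves] + [result]  # terminal target = result
        n = len(leaves)
        deltas = [vals[k + 1] - vals[k] for k in range(n)]  # one-step TD errors
        for t in range(n):
            credit = sum((lam ** (k - t)) * deltas[k] for k in range(t, n))
            v = vals[t]
            for i in range(len(w)):
                w[i] += alpha * credit * v * (1.0 - v) * leaves[t][i]

    # toy usage: a 5-move game that was eventually won
    w = [random.uniform(-0.1, 0.1) for _ in range(8)]
    leaves = [[random.random() for _ in range(8)] for _ in range(5)]
    td_leaf_update(w, leaves, result=1.0)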