
td-leaf

Posted: Tue Oct 06, 2015 10:42 pm
by brtzsnr
Hello,

I'm trying to tune my engine using the td-leaf algorithm.

The formula I use is similar to KnightCap's. The evaluation function is V(w, x) = tanh(β · (w · x)), where w is the weight vector, x is the feature vector, w · x is their dot product, and β is a scaling coefficient. That's very much the standard way implemented by most engines, except for the tanh, which is used in all papers discussing temporal differences.
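In Go this evaluation and its gradient look roughly like the following (a toy sketch; β, the weights, and the features below are made-up values, not from my engine):

```go
package main

import (
	"fmt"
	"math"
)

// eval computes V = tanh(beta * w·x) and its gradient dV/dw_j = beta*(1 - V^2)*x_j.
// The inputs in main are illustrative toy values only.
func eval(beta float64, w, x []float64) (v float64, grad []float64) {
	dot := 0.0
	for i := range w {
		dot += w[i] * x[i]
	}
	v = math.Tanh(beta * dot)
	grad = make([]float64, len(w))
	for j := range w {
		// Chain rule: d tanh(u)/du = 1 - tanh(u)^2, and du/dw_j = beta * x_j.
		grad[j] = beta * (1 - v*v) * x[j]
	}
	return v, grad
}

func main() {
	v, g := eval(0.0001, []float64{10000, 30000}, []float64{1, 2})
	fmt.Println(v, g)
}
```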

The training is similar to Giraffe's: I pick a random position, self-play 12 moves, compute the scores, then apply the classic formula:

\Delta w = \frac{\partial V}{\partial w} \sum_i \lambda^i \, (d_{i+1} - d_i)

The whole code is just a few lines:

Code: Select all

	v := play(pos)
	λ, td := 1.0, 0.0
	for i := 0; i < len(v)-1; i++ {
		diff := v[i+1] - v[i]
		td += λ * diff
		λ *= 0.7
	}

	phase := float64(engine.Phase(pos)) / 256
	for j := range engine.Weights[0] {
		deltas[0][j] = 1000000 * (1 - v[0]*v[0]) * β * float64(e.Values[j]) * td * phase
		deltas[1][j] = 1000000 * (1 - v[0]*v[0]) * β * float64(e.Values[j]) * td * (1 - phase)
	}
The large coefficient 1,000,000 brings the deltas into a normal range, since my β is very small because the weights are scaled (1 pawn = 10000).

I think my logic and code are correct but I don't seem to get good values.

These are some weights after one day of training:
midgame: figure:[0 17244 38038 41685 54338 91321 0] mobility:[0 8367 22145 28807 26014 12671 8287]
endgame: figure:[0 15452 39275 43510 53788 91286 0] mobility:[0 11554 25354 27431 15930 18104 5879]

I.e., knight mobility is worth about 1.5 pawns. The mobility weights don't actually decrease; they keep increasing no matter how long I train, without converging. The following graph shows the total error for a fixed set of 2000 positions.

[Image: total error on the fixed 2000-position set]


Does anyone have experience with temporal difference learning and know why the weights don't converge?


References:

KnightCap: https://chessprogramming.wikispaces.com/KnightCap
TD-Leaf: http://arxiv.org/abs/cs/9901001


Re: td-leaf

Posted: Thu Oct 08, 2015 4:13 pm
by jdart
You could possibly try regularization, i.e. penalizing large weights.

There is some literature on this: see for example http://www.cs.cmu.edu/~zkolter/pubs/kol ... b-full.pdf.

--Jon
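A sketch of such a penalty in Go (L2 weight decay; `lambdaReg` and the slice names here are hypothetical, not from the code in this thread):

```go
package main

import "fmt"

// l2Regularize applies an L2 (weight-decay) penalty: each delta is reduced by
// 2*lambdaReg*w[j], which pulls large weights back toward zero during training.
// lambdaReg is a hypothetical tuning constant that would need experimentation.
func l2Regularize(deltas, w []float64, lambdaReg float64) {
	for j := range deltas {
		deltas[j] -= 2 * lambdaReg * w[j]
	}
}

func main() {
	w := []float64{10000, -20000}
	deltas := []float64{5, 5}
	l2Regularize(deltas, w, 0.0001)
	// A large positive weight gets a negative push, and vice versa.
	fmt.Println(deltas)
}
```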

Re: td-leaf

Posted: Thu Oct 08, 2015 6:14 pm
by matthewlai
brtzsnr wrote: [...] I think my logic and code are correct but I don't seem to get good values. The weights for mobility actually don't decrease, they keep increasing no matter how much I train without converging. Does anyone have experience with temporal difference learning and know why the weights don't converge?
When I implemented it I had some trouble with signs and other silly stuff like that. I found it very helpful to print out intermediate results for just the first starting position, and make sure all the signs and magnitudes look right.

You can look at the scores from the self-play moves and compute as much of the td-leaf values as possible by hand, and make sure they match what the program computes.

Beyond that, it could just be something like the learning rate being too high.

Re: td-leaf

Posted: Fri Oct 09, 2015 8:04 pm
by brtzsnr
matthewlai wrote: [...] I found it very helpful to print out intermediate results for just the first starting position, and make sure all the signs and magnitudes look right. [...]
After reading several other papers and watching some tutorials on YouTube I wonder if I got the formulas right:

https://www.overleaf.com/read/rffvmyznwrmt

In this paper https://www.cs.princeton.edu/courses/ar ... ess-RL.pdf the authors start without using tanh but later (section 4.1) they modify the d_t. Do they actually modify J, too? I.e. is my derivative calculation ok in the above doc?

Re: td-leaf

Posted: Fri Oct 09, 2015 9:05 pm
by jdart
An excellent suggestion I got from here:

http://cilvr.cs.nyu.edu/diglib/lsml/bot ... s-2012.pdf

is to verify the derivatives by approximating them via finite differences.

--Jon

Re: td-leaf

Posted: Sat Oct 10, 2015 9:02 am
by matthewlai
brtzsnr wrote: [...] I wonder if I got the formulas right: https://www.overleaf.com/read/rffvmyznwrmt

In this paper https://www.cs.princeton.edu/courses/ar ... ess-RL.pdf the authors start without using tanh but later (section 4.1) they modify the d_t. Do they actually modify J, too? I.e. is my derivative calculation ok in the above doc?
Shouldn't there be beta inside tanh^2() in the expression for the derivative as well?

I personally would just get rid of beta altogether, since it can be trained into w anyway.

Otherwise the expression looks right to me.

Re: td-leaf

Posted: Sun Oct 11, 2015 9:15 pm
by brtzsnr
I think I finally found my mistake. I was adding the derivative for only one position, but I have to add it for all positions up to the end of the game.
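The corrected accumulation looks roughly like this (the `grad` callback here is a hypothetical stand-in for the per-position gradient, not my actual engine code):

```go
package main

import "fmt"

// tdLeafDeltas accumulates the update over every position in the game:
// delta += grad(t) * sum over i>=t of lambda^(i-t) * (v[i+1] - v[i]),
// instead of using only the first position's gradient.
func tdLeafDeltas(v []float64, grad func(t int) float64, decay float64) float64 {
	delta := 0.0
	for t := 0; t < len(v)-1; t++ {
		lambda, td := 1.0, 0.0
		for i := t; i < len(v)-1; i++ {
			td += lambda * (v[i+1] - v[i])
			lambda *= decay
		}
		delta += grad(t) * td
	}
	return delta
}

func main() {
	v := []float64{0.0, 0.1, -0.1}
	// Constant gradient of 1 for illustration; a real engine would use
	// the evaluation gradient at each position's leaf.
	d := tdLeafDeltas(v, func(t int) float64 { return 1.0 }, 0.7)
	fmt.Println(d)
}
```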

The new graph looks much better, and the score increases when adding new features.

Code: Select all

Rank Name                          ELO   Games   Score   Draws
   1 full4                         101     439     64%     14%
   2 full3                          26     437     54%     13%
   3 full2                          -4     437     49%     16%
   4 mobility                     -126     437     33%      7%
[Image: total error graph after the fix]


So far I'm still ~450 ELO behind master, but I do have some features to add back.

Code: Select all

Rank Name                          ELO   Games   Score   Draws
   1 master                        308     534     85%      7%
   2 full4                        -112     533     34%     16%
   3 full3                        -147     533     30%     16%