td-leaf

brtzsnr · Post by **brtzsnr** » Tue Oct 06, 2015 10:42 pm

Hello,

I'm trying to tune my engine using the td-leaf algorithm.

The formula I use is similar to Knight-Cap. The evaluation function is V(beta * w, x) = tanh(w . x) where w are the weights, x is the feature vector, w . x is the crossproduct of w and x and beta is a coefficient. That's very much the standard way implemented by most engines, except for tanh which is used in all paper discussing temporal differences.

The training is similar to Giraffe: I pick a random position, self play 12 moves and compute the scores the I apply the classic formula:

\delta w = \delta V / \delta w * \sum(\lamda^i * (d_{i+1} - d_{i})

The whole code is just a few lines

Code: Select all

       v &#58;= play&#40;pos&#41;
        &#955;, td &#58;= 1.0, 0.0
        for i &#58;= 0; i < len&#40;v&#41;-1; i++ &#123;
                diff &#58;= v&#91;i+1&#93; - v&#91;i&#93;
                td += &#955; * diff
                &#955; *= 0.7 
        &#125;                
                         
        phase &#58;= float64&#40;engine.Phase&#40;pos&#41;) / 256
        for j &#58;= range engine.Weights&#91;0&#93; &#123;
                deltas&#91;0&#93;&#91;j&#93; = 1000000 * &#40;1 - v&#91;0&#93;*v&#91;0&#93;) * &#946; * float64&#40;e.Values&#91;j&#93;) * td * phase
                deltas&#91;1&#93;&#91;j&#93; = 1000000 * &#40;1 - v&#91;0&#93;*v&#91;0&#93;) * &#946; * float64&#40;e.Values&#91;j&#93;) * td * &#40;1 - phase&#41;
        &#125;

The large coefficient 1,000,000 is to bring deltas into a normal range since my beta is very small because the weights are scaled (1 pawn = 10000).

I think my logic and code are correct but I don't seem to get good values.

These are some weights I after one day of training:
midgame: figure:[0 17244 38038 41685 54338 91321 0] mobility:[0 8367 22145 28807 26014 12671 8287]
endgame: figure:[0 15452 39275 43510 53788 91286 0] mobility:[0 11554 25354 27431 15930 18104 5879]

I.e, the mobility of knight is worth 1.5 pawns. The weights for mobility actually don't decrease, they keep increasing no matter how much I train without converging. The following graphs shows the total error for a fixed set of 2000 positions.

Does anyone has experience with temporal difference learning and knows why weights don't converge?

References:

Knight cap https://chessprogramming.wikispaces.com/KnightCap
TD leaf http://arxiv.org/abs/cs/9901001

[/img]

jdart · Post by **jdart** » Thu Oct 08, 2015 4:13 pm

You could possibly try regularization - i.e. penalizing large scores.

There is some literature on this: see for example http://www.cs.cmu.edu/~zkolter/pubs/kol ... b-full.pdf.

--Jon

matthewlai · Post by **matthewlai** » Thu Oct 08, 2015 6:14 pm

brtzsnr wrote:Hello,

I'm trying to tune my engine using the td-leaf algorithm.

The formula I use is similar to Knight-Cap. The evaluation function is V(beta * w, x) = tanh(w . x) where w are the weights, x is the feature vector, w . x is the crossproduct of w and x and beta is a coefficient. That's very much the standard way implemented by most engines, except for tanh which is used in all paper discussing temporal differences.

The training is similar to Giraffe: I pick a random position, self play 12 moves and compute the scores the I apply the classic formula:

\delta w = \delta V / \delta w * \sum(\lamda^i * (d_{i+1} - d_{i})

The whole code is just a few lines
Code: Select all
       v &#58;= play&#40;pos&#41;
        &#955;, td &#58;= 1.0, 0.0
        for i &#58;= 0; i < len&#40;v&#41;-1; i++ &#123;
                diff &#58;= v&#91;i+1&#93; - v&#91;i&#93;
                td += &#955; * diff
                &#955; *= 0.7 
        &#125;                
                         
        phase &#58;= float64&#40;engine.Phase&#40;pos&#41;) / 256
        for j &#58;= range engine.Weights&#91;0&#93; &#123;
                deltas&#91;0&#93;&#91;j&#93; = 1000000 * &#40;1 - v&#91;0&#93;*v&#91;0&#93;) * &#946; * float64&#40;e.Values&#91;j&#93;) * td * phase
                deltas&#91;1&#93;&#91;j&#93; = 1000000 * &#40;1 - v&#91;0&#93;*v&#91;0&#93;) * &#946; * float64&#40;e.Values&#91;j&#93;) * td * &#40;1 - phase&#41;
        &#125;                                 
The large coefficient 1,000,000 is to bring deltas into a normal range since my beta is very small because the weights are scaled (1 pawn = 10000).

I think my logic and code are correct but I don't seem to get good values.

These are some weights I after one day of training:
midgame: figure:[0 17244 38038 41685 54338 91321 0] mobility:[0 8367 22145 28807 26014 12671 8287]
endgame: figure:[0 15452 39275 43510 53788 91286 0] mobility:[0 11554 25354 27431 15930 18104 5879]

I.e, the mobility of knight is worth 1.5 pawns. The weights for mobility actually don't decrease, they keep increasing no matter how much I train without converging. The following graphs shows the total error for a fixed set of 2000 positions.

Does anyone has experience with temporal difference learning and knows why weights don't converge?

References:

Knight cap https://chessprogramming.wikispaces.com/KnightCap
TD leaf http://arxiv.org/abs/cs/9901001

[/img]

When I implemented it I had some trouble with signs and other silly stuff like that. I found it very helpful to print out intermediate results for just the first starting position, and make sure all the signs and magnitudes look right.

You can look at the scores from the self-play moves and compute as much of the td-leaf values as possible by hand, and make sure they match what the program computes.

Beyond that, it could just be something like the learning rate being too high.

brtzsnr · Post by **brtzsnr** » Fri Oct 09, 2015 8:04 pm

matthewlai wrote: When I implemented it I had some trouble with signs and other silly stuff like that. I found it very helpful to print out intermediate results for just the first starting position, and make sure all the signs and magnitudes look right.

You can look at the scores from the self-play moves and compute as much of the td-leaf values as possible by hand, and make sure they match what the program computes.

Beyond that, it could just be something like the learning rate being too high.

After reading several other papers and watching some tutorials on YouTube I wonder if I got the formulas right:

https://www.overleaf.com/read/rffvmyznwrmt

In this paper https://www.cs.princeton.edu/courses/ar ... ess-RL.pdf the authors start without using tanh but later (section 4.1) they modify the d_t. Do they actually modify J, too? I.e. is my derivative calculation ok in the above doc?

jdart · Post by **jdart** » Fri Oct 09, 2015 9:05 pm

An excellent suggestion I got from here:

http://cilvr.cs.nyu.edu/diglib/lsml/bot ... s-2012.pdf

is to verify the derivatives by approximating them via finite differences.

--Jon

matthewlai · Post by **matthewlai** » Sat Oct 10, 2015 9:02 am

brtzsnr wrote:
matthewlai wrote: When I implemented it I had some trouble with signs and other silly stuff like that. I found it very helpful to print out intermediate results for just the first starting position, and make sure all the signs and magnitudes look right.

You can look at the scores from the self-play moves and compute as much of the td-leaf values as possible by hand, and make sure they match what the program computes.

Beyond that, it could just be something like the learning rate being too high.
After reading several other papers and watching some tutorials on YouTube I wonder if I got the formulas right:

https://www.overleaf.com/read/rffvmyznwrmt

In this paper https://www.cs.princeton.edu/courses/ar ... ess-RL.pdf the authors start without using tanh but later (section 4.1) they modify the d_t. Do they actually modify J, too? I.e. is my derivative calculation ok in the above doc?

Shouldn't there be beta inside tanh^2() in the expression for the derivative as well?

I personally would just get rid of beta altogether, since it can be trained into w anyways.

Otherwise the expression looks right to me.

brtzsnr · Post by **brtzsnr** » Sun Oct 11, 2015 9:15 pm

I think I finally found my mistake. I was adding the derivite for only one position, but I have to add all of them till the end of the game.

The new graph looks much better and the score increases when adding new features. S

Code: Select all

   1 full4                         101     439     64%     14%
   2 full3                          26     437     54%     13%
   3 full2                          -4     437     49%     16%
   4 mobility                     -126     437     33%      7%

So far I'm still ~450 ELO behind master, but I do have some features to add back.

Code: Select all

Rank Name                          ELO   Games   Score   Draws
   1 master                        308     534     85%      7%
   2 full4                        -112     533     34%     16%
   3 full3                        -147     533     30%     16%

td-leaf

td-leaf

Re: td-leaf

Re: td-leaf

Re: td-leaf

Re: td-leaf

Re: td-leaf

Re: td-leaf