improved evaluation function

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Post Reply
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 3:02 pm
Contact:

improved evaluation function

Post by brtzsnr » Sat Mar 11, 2017 11:01 am

Hi, all!

In previous posts I explained how Zurichess uses Texel Tuning method for it's evaluation. To recap, the evaluation function is something very standard:

E(x, p) = (1-p)*x*w_m + p*x*w_e

where x is the vector of features (normally white - black features), w_m & w_e are column vector of midgame and endgame weights, p is the phase between 0 and 1. Texel Tuning Method [1] find weights which minimizes the following function. One can add some L1 or L2 regularization (for Zurichess I added L1).

\sum_{i=1..n}{y_i - sigmoid(E(x_i, p))}^2 / n

This works fine and you can build strong engines. Zurichess has ~2750 on CCRL and it's not even the strongest engine using this method. I train a set of quiet positions that I published before [2].

For people into machine learning this is very similar to a logistic regression, except for the phase thing. Of course, there are lots of experiments trying to build a deeper network (e.g. Giraffe, DeepChess) and I also want Zurichess improve its network. After some experiments I found that the following network actually works better.


E(x, p) = CReLU(x * w_i) * w_h

where x is the input vector and has shape 1*n, w_i is input weights and has shape n*k, CReLU is a concatenated rectifier linear unit (similar to a ReLU but keeps both negative and positive weights), w_h is outer layer weights and has shape [2*k, 1]. The final result has shape [1, 1]. One modification of the inputs is that now it includes the total number of pieces so the network has to figure out the phase by itself.

I chose k = 4, resulting in 2x more weights versus the previous network. How good is this network?

Old network: // Network has train error 0.05731946 and validation error 0.05754193.
New network: // Network has train error 0.05587117 and validation error 0.05608533.

As you see, it actually evaluates a deal better.

This is a set of positions on which the new network does worse

Code: Select all

+0.48092   +306 8/2K5/5p2/1p5p/p2k3P/6P1/8/5N2 b - - c9 "0-1"; ce 216;
+0.49426   +311 5R1r/1p4pk/p5qp/2P2p2/1PN5/2B1P3/4pP2/R4K2 w - - c9 "0-1"; ce 232;
+0.51940   +327 8/8/5p2/7p/1K4kP/p7/4NP2/8 b - - c9 "0-1"; ce 325;
+0.53780   +340 8/3k4/1B3p2/1p3Pp1/1p4P1/3p4/1P4K1/8 b - - c9 "0-1"; ce 327;
+0.57122   -402 1Q6/5pk1/2Q3pp/8/8/6PP/3q1P1K/2q5 w - - c9 "1-0"; ce -232;
+0.67389   -462 3Q4/8/8/6k1/4Q3/3K4/8/q6q b - - c9 "1-0"; ce -327;
+0.75176   -587 K7/8/8/3Q4/6qk/6q1/5Q2/8 w - - c9 "1-0"; ce -344;
+0.76482   -617 1Q6/5Q2/1K6/8/8/8/4q1k1/7q w - - c9 "1-0"; ce -349;
+0.77143   -587 7Q/2Q5/4qq2/8/3k2P1/7K/8/8 w - - c9 "1-0"; ce -377;
+0.77833   -577 QQ6/8/8/8/4k2p/1K3p2/3r2p1/6q1 b - - c9 "1-0"; ce -427;
This is a set of positions on which the new network does better

Code: Select all

-0.75222   -679 1Q4KQ/4q3/8/8/8/8/8/5rk1 b - - c9 "0-1"; ce -29;
-0.67275   -469 8/3k1p2/8/3p4/5nnP/8/p1Q5/N4K2 b - - c9 "0-1"; ce -46;
-0.60600   -559 2Q4Q/8/3k1p2/3p1p2/p1b5/6PK/5q2/8 b - - c9 "0-1"; ce 60;
-0.51627   +362 7r/4k3/1p6/p2Bp2N/P7/3P4/1P3P1p/7K w - - c9 "1-0"; ce 161;
-0.50598   +316 r4r1k/4n3/b1p1N2p/P1p1P3/2Pp4/1P5Q/3B1q1P/R2R2K1 w - - c9 "1-0"; ce 68;
-0.47179   -287 4k3/5N2/Q3p3/4P3/2Rqb2n/8/5P2/6K1 b - - c9 "0-1"; ce -20;
-0.45891   -302 6Q1/8/3k1p2/8/6KN/8/8/5q2 b - - c9 "0-1"; ce 60;
-0.44709   -270 5qk1/6b1/8/7R/8/4p2P/6PQ/6K1 b - - c9 "0-1"; ce -25;
-0.44571   -269 4B1k1/5ppb/7p/5P2/1P3q2/1Q2p3/8/4R1K1 b - - c9 "0-1"; ce -2;
-0.42739   -264 8/8/8/7r/4b3/3k4/Q7/3K4 b - - c9 "0-1"; ce -62;
Search from starting position

Code: Select all

./zurichess
zurichess ladon http://www.zurichess.xyz
build with devel +e458264aca Sat Feb 25 14:39:29 2017 +0000 at 2017-03-11 11:52:31, running on amd64
go
info depth 0 seldepth 0 multipv 1 score cp 21 nodes 2 time 0 nps 47067 pv
info depth 1 seldepth 1 multipv 1 score cp 101 nodes 47 time 0 nps 232028 pv e2e3
info depth 2 seldepth 2 multipv 1 score cp 21 nodes 180 time 0 nps 393806 pv e2e3 e7e6
info depth 3 seldepth 4 multipv 1 score cp 89 nodes 1348 time 1 nps 881850 pv g1f3 e7e6 d2d4
info depth 4 seldepth 4 multipv 1 score cp 21 nodes 2672 time 2 nps 1049758 pv g1f3 g8f6 d2d3 d7d6
info depth 5 seldepth 5 multipv 1 score cp 77 nodes 3865 time 3 nps 1170670 pv g1f3 g8f6 d2d3 d7d5 c1d2
info depth 6 seldepth 6 multipv 1 score cp 17 nodes 7702 time 6 nps 1147317 pv g1f3 g8f6 d2d3 d7d5 h2h3 b8c6
info depth 7 seldepth 7 multipv 1 score cp 66 nodes 13000 time 11 nps 1134197 pv g1f3 g8f6 d2d3 d7d5 h2h3 e7e6 c1d2
info depth 8 seldepth 10 multipv 1 score cp 45 nodes 28147 time 24 nps 1150378 pv e2e4 d7d5 e4d5 c7c6 d5c6 b8c6 b1c3 g8f6
info depth 9 seldepth 11 multipv 1 score cp 60 nodes 48191 time 39 nps 1221759 pv e2e4 d7d5 e4d5 d8d5 b1c3 d5e6 d1e2 e6d6 c3b5 d6c6
info depth 10 seldepth 13 multipv 1 score cp 29 nodes 207938 time 133 nps 1559738 pv g1f3 d7d5 d2d4 g8f6 e2e3 e7e6 c2c4 f8e7 c4d5 e6d5
info depth 11 seldepth 13 multipv 1 score cp 66 nodes 353603 time 210 nps 1678706 pv g1f3 g8f6 c2c4 c7c5 d2d3 d7d6 b1c3 b8c6 e2e4 e7e5 h2h3
info depth 12 seldepth 13 multipv 1 score cp 64 nodes 704204 time 401 nps 1755582 pv e2e4 e7e5 g1f3 b8c6 d2d4 e5d4 f3d4 c6d4 d1d4 d7d6 f1b5 c8d7 b5e2
info depth 13 seldepth 15 multipv 1 score cp 72 nodes 1038771 time 573 nps 1810385 pv e2e4 c7c5 d2d3 e7e6 g1f3 b8c6 f1e2 d7d5 e4d5 e6d5 e1g1 f8d6 c1g5
info depth 14 seldepth 15 multipv 1 score cp 64 nodes 1458980 time 793 nps 1839252 pv e2e4 c7c5 g1f3 b8c6 b1c3 e7e6 f1b5 g8e7 e1g1 e7g6 b5c6 d7c6 d2d3 f8d6
info depth 15 seldepth 16 multipv 1 score cp 76 nodes 2004627 time 1085 nps 1847471 pv e2e4 c7c5 g1f3 b8c6 b1c3 e7e6 f1b5 f8d6 e1g1 g8e7 d2d3 e8g8 c1e3 e7g6 d1e2
info depth 16 seldepth 17 multipv 1 score cp 50 nodes 3522429 time 1888 nps 1865161 pv e2e4 c7c5 g1f3 b8c6 f1b5 a7a6 b5c6 b7c6 e1g1 d7d5 e4d5 c6d5 d2d4 e7e6 f3e5 a8b8
Despite the fact that I believe it performs better, overall I'm seeing a ~80Elo drop. Some of it is from 15% slower search, another part is from worse futility weights.

If you feel like exploring this further let me know your successes.

[1] https://chessprogramming.wikispaces.com ... ing+Method
[2] https://bitbucket.org/zurichess/tuner/d ... s/tuner.7z
[3] https://www.tensorflow.org/api_docs/python/tf/nn/crelu

brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 3:02 pm
Contact:

Re: improved evaluation function

Post by brtzsnr » Sat Mar 11, 2017 1:05 pm

After a bit of cleanup and speed up the Elo drop is only ~30 Elo.

I also published this post here https://medium.com/@brtzsnr after cleaning it up a bit.

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: improved evaluation function

Post by jdart » Sat Mar 11, 2017 3:21 pm

It is expected that a more elaborate model would give you a lower evaluation error in training. But execution speed does matter.

--Jon

nionita
Posts: 161
Joined: Fri Oct 22, 2010 7:47 pm
Location: Austria

Re: improved evaluation function

Post by nionita » Sun Mar 12, 2017 7:59 pm

brtzsnr wrote:After a bit of cleanup and speed up the Elo drop is only ~30 Elo.

I also published this post here https://medium.com/@brtzsnr after cleaning it up a bit.
Thanks for the insight. Are you doing the network operations in int(64) or float?

There is a point with testing (or even training) only with quiet positions: when you stand pat in QS (beta cutoff) your position is not quiet. So this kind of positions should also be used for training (and testing). Not yet sure in which proportion it should be...

brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 3:02 pm
Contact:

Re: improved evaluation function

Post by brtzsnr » Mon Mar 13, 2017 8:12 pm

nionita wrote:Thanks for the insight. Are you doing the network operations in int(64) or float?
I train floats, but I multiply by 10^3 (or 10^4) so the engine uses int32 internall for speed.
nionita wrote:There is a point with testing (or even training) only with quiet positions: when you stand pat in QS (beta cutoff) your position is not quiet. So this kind of positions should also be used for training (and testing). Not yet sure in which proportion it should be...
I actually tried to teach the evaluation function to understand violent positions but I wonly ended up with reimplementing QS inside the evaluation function in a very weird, buggy way. Here are the comments

Code: Select all

2555ae3 engine: Decrease futility margin.
c5799ca material: Don't include hanging pieces in unprotected.
0b7a699 material: Drop PawnThreat.
e17b22d material: Consider hanging if it's attacked by pawns.
acaca9d material: Do not consider king for hanging pieces.
f5baaa7 material: Retrain with 438K positions.
0934caf material: Skip second piece for the other side to move.
1f02e21 material: Ignore hanging for first piece of the side to move.
7b33dd2 material: Handle attacks on last rank passed pawn.
654b3b4 engine: Handle hanging pieces.
As you can see most of the commits where just to handle various threats that are easy for QS because of the minimax search. Even after these changes, there were still lots of positions on which QS did a lot better with a quiet eval. Overall handling violent positions made the evaluation function a lot slower for no Elo gain (the best I could get was about 60Elo weaker than the master branch). Said that, my training data is far from perfect, and if you have working suggestions please let me know.

nionita
Posts: 161
Joined: Fri Oct 22, 2010 7:47 pm
Location: Austria

Re: improved evaluation function

Post by nionita » Mon Mar 13, 2017 10:33 pm

brtzsnr wrote:Said that, my training data is far from perfect, and if you have working suggestions please let me know.
This was not meant as a critique of your data, just a remark.

When I collected a few million quiet positions from QS and tried SGD on them to improve the eval in Barbarossa I got exclusively ELO drops, and looking of what could be the reason for this I came to the insight that QS is deciding a lot in unquiet positions (in beta or delta cuts). So I tought this must be the reason, and collected ~11 million positions from all those situations (i.e., no capture, beta cut, delta cut, exactly in the frequence the appeared in they games). This still did not bring better evaluations. But the fact that QS decides in all those kind of positions remains. (I did not use the Texel method tough. Maybe this is the main reason for my insuccess.)

Post Reply