Troubles with Texel Tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Troubles with Texel Tuning

Post by AlvaroBegue »

Daniel Shawul wrote:
AlvaroBegue wrote:
Daniel Shawul wrote:
@Alvaro, it seems about 3% of the positions (about 30000 positions) in your database are <= mate_in_1. There are even some stalemate/mated positions with no moves to make. I think these should be removed since they have nothing to do with the material but are just a lucky placement.
There should be no positions where the king is in check. I'm not sure if the others are a problem: The evaluation function does encounter positions like these, apparently. So I don't know on what grounds they should be removed.
Stalemates (a draw = 0.5) with inferior material count should be bad for the tuner. The first 10 are

1. No moves in Position 26472 Fen: 8/8/8/4bk2/5p1p/7P/1p3q2/7K w - -
2. No moves in Position 27385 Fen: 8/7k/5Q1N/8/5PP1/6K1/6P1/8 b - -
3. No moves in Position 51566 Fen: 8/8/5Q2/7k/8/6K1/8/8 b - -
4. No moves in Position 60333 Fen: 8/1p1b2pk/p6p/P6P/7K/8/6r1/8 w - -
5. No moves in Position 66519 Fen: 8/6p1/6k1/7p/5p1P/3p1P1K/4r3/8 w - -
6. No moves in Position 66578 Fen: 8/8/5q2/7K/8/5k2/8/8 w - -
7. No moves in Position 72318 Fen: 8/p5p1/1p1k4/3P4/7p/7P/r5PK/5q2 w - -
8. No moves in Position 73913 Fen: 8/2P5/P2R3p/7k/5Pp1/P5B1/3K3P/8 b - -
9. No moves in Position 76410 Fen: 6R1/8/8/8/4K3/8/8/6Bk b - -


[board]8/8/8/4bk2/5p1p/7P/1p3q2/7K w - -[/board]
Hmmm... I have to think about this a bit more, but I think you are right.

In principle I want the tuner to learn to separate positions by the result of the game, no matter how many moves away that result is. But the nature of stalemate is probably such that under most circumstances the side with more material could have played slightly differently and avoided the stalemate. The exceptions are some well-known endgames (like KQ vs KP with a rook pawn or a knight pawn on the 7th rank) where the stalemate rule really makes the position a draw. But those should be covered by special cases or EGTBs.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Troubles with Texel Tuning

Post by jdart »

You have a point but the MSE over all positions is still a smooth, differentiable function. In fact my tuning code computes its gradient explicitly. With the closed-form gradient you know which direction to move in and can make a good guess of how far, so convergence is fast. Black-box optimizers generally have to do some exploration of the whole search space so they can build a model, and then start converging to the answer, updating the model along the way. Still, I am not saying that approach wouldn't work, it just does not seem the best fit for this problem. I am getting convergence with at most a couple of hundred evaluations of the MSE, and that is with 900 or so variables being tuned: most global methods would require a large multiple of N (the dimension) evaluations and may not work at all with 900 variables.
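
For readers who want to see what an explicit gradient looks like, here is a minimal sketch (my illustration, not jdart's actual code), assuming an evaluation that is linear in the parameters and the usual Texel sigmoid; the feature matrix layout and the scaling constant K are assumptions.

Code: Select all

import numpy as np

# Minimal sketch (illustration only): MSE and its closed-form gradient for a
# linear evaluation score_i = features[i] . theta, mapped to an expected
# result with sigma(s) = 1 / (1 + 10**(-K*s/400)).

K = 1.0  # scaling constant; engine-specific

def mse_and_gradient(theta, features, results):
    """features: (N, P) array of per-position feature counts,
       results:  (N,) game results in {0, 0.5, 1} from White's point of view."""
    scores = features @ theta
    sig = 1.0 / (1.0 + 10.0 ** (-K * scores / 400.0))
    err = sig - results
    mse = np.mean(err ** 2)
    # d(sigma)/d(score) = (K * ln 10 / 400) * sigma * (1 - sigma)
    dsig = (K * np.log(10.0) / 400.0) * sig * (1.0 - sig)
    grad = 2.0 * (features.T @ (err * dsig)) / len(results)
    return mse, grad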

--Jon
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Troubles with Texel Tuning

Post by Daniel Shawul »

I am now getting reasonable piece values by starting from 0 in a few iterations. I used a local search method with a big and small step [100 and 10].

Code: Select all

QUEEN_MG 550
QUEEN_EG 610
ROOK_MG 330
ROOK_EG 410
BISHOP_MG 280
BISHOP_EG 240
KNIGHT_MG 250
KNIGHT_EG 200
PAWN_MG 60
PAWN_EG 80
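
For context, the coarse-plus-fine local search described above (big step 100, small step 10) could be sketched roughly like this; mse() and the parameter layout are placeholders, not Daniel's actual code.

Code: Select all

# Hypothetical sketch of a two-step local search: sweep each parameter with a
# coarse step until nothing improves, then repeat with a fine step.
# mse(params) is a placeholder for the tuner's error function.

def local_search(params, mse, steps=(100, 10)):
    best = mse(params)
    for step in steps:
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for delta in (step, -step):
                    params[i] += delta
                    err = mse(params)
                    if err < best:
                        best = err
                        improved = True
                    else:
                        params[i] -= delta  # revert the change
    return params, best
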
The gradient methods, on the other hand, get stuck at 0 for some reason... I am sampling a fraction of the 1M positions in each call, so maybe this stochastic nature is causing a problem for the optimizer. Also, I had to round off the numbers to use scipy's BFGS, so that could be a problem too.
In principle I want the tuner to learn to separate positions by the result of the game, no matter how many moves away that result is. But the nature of stalemate is probably such that under most circumstances the side with more material could have played slightly differently and avoided the stalemate.
The stalemates are bad but very few in the database. I am not sure if the mate_in_1s (3% of the database) should be removed, as it seems that the material balance is OK there -- probably because they were extracted from real games.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Troubles with Texel Tuning

Post by jdart »

I am sampling a fraction of the 1M positions in each call so maybe this stochastic nature is causing a problem for the optimizer.
I have read a recent article claiming that batching is not really effective, because you are taking more iterations to get convergence, even though each iteration is faster.

Not sure why you wouldn't move off zero, but you need to be sure you are correctly computing the gradient. In addition, it is helpful if you have floating-point score values during the tuning process, otherwise rounding could be a problem.
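
One simple way to check the gradient (my suggestion for illustration, not something from Arasan) is to compare it against a central finite-difference estimate on a handful of parameters; mse_and_gradient(theta) returning (mse, grad) is a placeholder here.

Code: Select all

import numpy as np

# Illustrative gradient check: analytic gradient vs. a central
# finite-difference approximation on a few randomly chosen parameters.
# mse_and_gradient(theta) -> (mse, grad) is a placeholder.

def check_gradient(theta, mse_and_gradient, eps=1e-5, n_checks=10):
    _, grad = mse_and_gradient(theta)
    rng = np.random.default_rng(0)
    for i in rng.choice(len(theta), size=n_checks, replace=False):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        numeric = (mse_and_gradient(plus)[0] - mse_and_gradient(minus)[0]) / (2 * eps)
        print(f"param {i}: analytic={grad[i]:.6g}  numeric={numeric:.6g}")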

--Jon
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Troubles with Texel Tuning

Post by AlvaroBegue »

jdart wrote:
I am sampling a fraction of the 1M positions in each call so maybe this stochastic nature is causing a problem for the optimizer.
I have read a recent article claiming that batching is not really effective, because you are taking more iterations to get convergence, even though each iteration is faster.
I would be interested in that reference, but that advice goes against the generalized opinion of the ML community. Stochastic gradient descent with minibatches is the standard tool these days, and there are some very good reasons for this.
Not sure why you wouldn't move off zero, but you need to be sure you are correctly computing the gradient. In addition, it is helpful if you have floating-point score values during the tuning process, otherwise rounding could be a problem.
You should probably use small random numbers to start from. In neural networks the details of the initialization of the weights can be very important, particularly if you have many hidden layers.

You should definitely use floating-point numbers during training. I use double-precision numbers in RuyTune, but that's probably overkill. I am sure 32-bit floats would do just fine.
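
Purely as an illustration of the two suggestions above (small random starting values, 32-bit floats), and not RuyTune's actual code, the initialization could be as simple as:

Code: Select all

import numpy as np

# Illustration only: start from small random values instead of exact zeros,
# stored as 32-bit floats. 900 is the parameter count mentioned earlier in
# the thread; the 10-centipawn scale is an arbitrary choice.
rng = np.random.default_rng(42)
theta = rng.normal(0.0, 10.0, size=900).astype(np.float32)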
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Troubles with Texel Tuning

Post by Daniel Shawul »

I am now getting good results after making BOTH changes.

Code: Select all

QUEEN_MG 1013.24755982
QUEEN_EG 1013.24755982
ROOK_MG 906.734146706
ROOK_EG 906.734146541
BISHOP_MG 612.106614937
BISHOP_EG 612.107242748
KNIGHT_MG 496.290143726
KNIGHT_EG 496.290143755
PAWN_MG 196.442975231
PAWN_EG 196.442972439
Note that the previous test with local search used my engine's actual eval, which returns an int; for this test, however, I used the toy eval from ruy_tune with double-precision scores. The thing that surprised me the most is that the stochastic sampling also breaks BFGS. It looks like the iterations start from a very small delta of about 1e-6, so truncating piece values or doing stochastic sampling breaks it. What I do now is still sample a minibatch, but keep that batch the same for the next call by using a fixed random seed of srand(0).
Maybe there is a way to prescribe a bigger starting delta for BFGS, which would help get the iterations started and might remove the need for double precision or non-stochastic sampling.
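
A minimal sketch of that fixed-minibatch idea on the Python side (the array names, mse_and_gradient(), and the 1% fraction are assumptions, not Daniel's code) could look like this; with a fixed seed, scipy sees a deterministic objective.

Code: Select all

import numpy as np
from scipy.optimize import minimize

# Sketch: draw one minibatch up front with a fixed seed, then minimize a
# deterministic objective over that same subsample. all_features, all_results
# and mse_and_gradient() are placeholders.
rng = np.random.default_rng(0)                     # fixed seed, same batch every run
batch = rng.choice(len(all_results), size=len(all_results) // 100, replace=False)
feat_b, res_b = all_features[batch], all_results[batch]

def objective(theta):
    return mse_and_gradient(theta, feat_b, res_b)  # returns (mse, gradient)

result = minimize(objective, x0=np.zeros(feat_b.shape[1]),
                  jac=True, method="BFGS")
print(result.x, result.fun)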
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Troubles with Texel Tuning

Post by AlvaroBegue »

Daniel Shawul wrote: I am now getting good results after making BOTH changes.

Code: Select all

QUEEN_MG 1013.24755982
QUEEN_EG 1013.24755982
ROOK_MG 906.734146706
ROOK_EG 906.734146541
BISHOP_MG 612.106614937
BISHOP_EG 612.107242748
KNIGHT_MG 496.290143726
KNIGHT_EG 496.290143755
PAWN_MG 196.442975231
PAWN_EG 196.442972439
Those don't look like good results to me. Why should the values be nearly identical for midgame and endgame? I am pretty sure that is a mistake. EDIT: Also, queen and rook have values that are too close.
Note that the previous test with local search used my engine's actual eval, which returns an int; for this test, however, I used the toy eval from ruy_tune with double-precision scores. The thing that surprised me the most is that the stochastic sampling also breaks BFGS. It looks like the iterations start from a very small delta of about 1e-6, so truncating piece values or doing stochastic sampling breaks it. What I do now is still sample a minibatch, but keep that batch the same for the next call by using a fixed random seed of srand(0).
Maybe there is a way to prescribe a bigger starting delta for BFGS, which would help get the iterations started and might remove the need for double precision or non-stochastic sampling.
Stochastic sampling introduces noise in the evaluation of the loss function. BFGS tries to construct an approximation to the Hessian (matrix of second derivatives) by looking at how the gradient changes. Adding noise is likely to seriously mess up that computation.

With stochastic samples you should use first-order learning algorithms, like pure SGD, or something with momentum, or a slightly more sophisticated one, like Adam.
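
To make the suggestion concrete, a bare-bones minibatch SGD loop (illustrative, not RuyTune's code) could look like the following; mse_and_gradient() and the data arrays are the same kind of placeholders as above, and the learning rate would need tuning for the centipawn scale.

Code: Select all

import numpy as np

# Illustrative minibatch SGD: one noisy gradient step per minibatch.
# mse_and_gradient(theta, features, results) -> (mse, grad) is a placeholder;
# theta is a numpy float array.

def sgd(theta, features, results, lr=1000.0, batch_size=4096, epochs=10):
    rng = np.random.default_rng(0)
    n = len(results)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            _, grad = mse_and_gradient(theta, features[idx], results[idx])
            theta -= lr * grad          # step against the minibatch gradient
    return theta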
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Troubles with Texel Tuning

Post by Daniel Shawul »

AlvaroBegue wrote:
Those don't look like good results to me. Why should the values be nearly identical for midgame and endgame? I am pretty sure that is a mistake. EDIT: Also, queen and rook have values that are too close.
The mid/end game distinction is not present in the toy material eval I used for this test; the other one used my engine's actual eval. As a side note, the MG and EG values are perfectly positively correlated and the optimizer understood that. Algorithms like Nelder-Mead seem to have a problem with that, though.

The queen-rook value problem is most likely because I am using 1% of the database to compute the MSE for the sake of testing. I would probably get close to your values if I did it on the whole database.
Stochastic sampling introduces noise in the evaluation of the loss function. BFGS tries to construct an approximation to the Hessian (matrix of second derivatives) by looking at how the gradient changes. Adding noise is likely to seriously mess up that computation.

With stochastic samples you should use first-order learning algorithms, like pure SGD, or something with momentum, or a slightly more sophisticated one, like Adam.
Using a first-order method such as conjugate gradient seems to suffer from the same problem. I think there is something in the way minibatches are processed in SGD that I am not doing here...

Daniel
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Troubles with Texel Tuning

Post by AlvaroBegue »

Daniel Shawul wrote: Using a first-order method such as conjugate gradient seems to suffer from the same problem. I think there is something in the way minibatches are processed in SGD that I am not doing here...
I think of conjugate-gradient as a second order method. I am talking about simpler algorithms than that. Adam is very popular these days (perhaps because it's the default learning algorithm in TensorFlow), but even plain gradient descent works well.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Troubles with Texel Tuning

Post by Daniel Shawul »

It turns out the issue with the queen/rook values was that BFGS was not able to reach convergence, which I think is just bad luck, because the limited-memory version does converge quickly and gives pretty good results even with a fraction of the database. I tested CG too, which converges quickly, and L-BFGS-B with 1%, 10% and 100% of the database.
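
For reference, switching between these optimizers in scipy is just a different `method` argument; a small sketch reusing the placeholder objective from the earlier snippet (five MG piece values as the unknowns):

Code: Select all

import numpy as np
from scipy.optimize import minimize

# Sketch: try CG, BFGS and L-BFGS-B on the same (placeholder) objective.
for method in ("CG", "BFGS", "L-BFGS-B"):
    res = minimize(objective, x0=np.zeros(5), jac=True, method=method)
    print(method, res.fun, res.x)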

Conjugate gradient (1% of database)

Code: Select all

QUEEN_MG 1142.76602383
ROOK_MG 655.257826481
BISHOP_MG 426.493574055
KNIGHT_MG 393.499340413
PAWN_MG 135.282928453

Engine(1506876202) <<< mse 0.01 0 0.00575646273249
Engine(1506876202) >>> 0.0641004662947481
Optimization terminated successfully.
         Current function value: 0.064096
         Iterations: 7
         Function evaluations: 546
         Gradient evaluations: 78
BFGS (1% of database): did not converge before termination; the MSE is still at 0.0728, unlike the other tests.

Code: Select all

QUEEN_MG 790.336356246
ROOK_MG 708.923819364
BISHOP_MG 487.571887953
KNIGHT_MG 397.340874602
PAWN_MG 134.400750389
Engine(1506876435) <<< mse 0.01 0 0.00575646273249
Engine(1506876436) >>> 0.0728747827548153
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 0.072875
         Iterations: 33
         Function evaluations: 733
         Gradient evaluations: 103
L-BFGS-B (1 %)

Code: Select all

QUEEN_MG 1058.86352822
ROOK_MG 584.991952205
BISHOP_MG 405.884050744
KNIGHT_MG 368.371260557
PAWN_MG 121.1108663
Engine(1506876669) <<< mse 0.01 0 0.00575646273249
Engine(1506876669) >>> 0.0645460295281579
L-BFGS-B (10 %)

Code: Select all

QUEEN_MG 1346.81983429
ROOK_MG 701.899008353
BISHOP_MG 452.197100014
KNIGHT_MG 420.921981631
PAWN_MG 135.015669148
Engine(1506876851) <<< mse 0.1 0 0.00575646273249
Engine(1506876851) >>> 0.0629618034765828
L-BFGS-B (100 %)

Code: Select all

QUEEN_MG 1306.61747006
ROOK_MG 689.059711567
BISHOP_MG 466.333842663
KNIGHT_MG 430.103190607
PAWN_MG 140.821092241
Engine(1506877344) <<< mse 1 0 0.00575646273249
Engine(1506877349) >>> 0.0632888443450795
I am using a scaling constant of "ln(10)/400" for the logistic. If I scale mine down to your values, the L-BFGS-B (100%) result is very close to yours. You use tanh for the sigmoid, so there will be some difference.
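
For clarity, that constant just rewrites the usual base-10 Texel logistic in exponential form; a small illustrative snippet:

Code: Select all

import math

# The scaling constant mentioned above: with C = ln(10)/400,
#   1 / (1 + exp(-C * s))  ==  1 / (1 + 10 ** (-s / 400))
C = math.log(10) / 400.0

def win_probability(score_cp):
    return 1.0 / (1.0 + math.exp(-C * score_cp))

# Only C * score matters, so a whole set of tuned values can be rescaled by a
# constant factor (e.g. to put the pawn near a conventional 100) without
# changing the quality of the fit.
print(win_probability(100))   # roughly 0.64 for a one-pawn advantage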

I will try to figure out the issue with the stochastic MSE.

Daniel