Ab-initio evaluation tuning


Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

Evert wrote: From what I remember, simulated annealing performs poorly when tuning chess evaluation functions. I'll try to find where I read that.
I think it was this post by Rémy Coulom: http://www.talkchess.com/forum/viewtopi ... 611#400611
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Ab-initio evaluation tuning

Post by Henk »

Simulated annealing is usually terribly slow. Maybe finding a (global) maximum is bad too, since the evaluation must generalize well when it encounters unknown positions. So, strangely, bad tuning might work better.

Tuning must be fast. Maybe that's the only requirement: it must find a solution well above average, but not too close to the top, for otherwise it won't generalize.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

Gerd Isenberg wrote:Isn't it necessary to introduce at least a disjoint feature for most volatile advanced passers?
The answer, it turns out, is yes.
I added a fairly simple term: passers get a bonus dependent on their rank that increases quadratically and saturates at ~200 cp on the 7th rank (so the total value of the pawn is roughly that of a minor).
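For illustration, a minimal sketch of what such a term could look like, in C; the exact curve, the 0-7 rank convention and the constants are my assumptions about its shape, not the actual code:

Code:

/* Sketch of a quadratic, saturating passer bonus. Assumes ranks are
 * counted 0-7 from the pawn's own back rank, values in centipawns. */
int passer_bonus(int rank)
{
    int b = 200 * (rank - 1) * (rank - 1) / 25; /* grows quadratically  */
    return b > 200 ? 200 : b;                   /* saturates at ~200 cp */
}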
With this term added, I get

Code:

    MG     EG
P   0.72   0.94 
N   2.81   3.43 
B   2.85   3.44 
R   3.87   5.94 
Q   9.42  10.22
BB  0.13   0.24
NN  0.17  -0.12
RR -0.03  -0.16
The next big mystery, for me, is the low value of the Rook in MG positions. My guess is that it's low because N and B should get a sizeable boost from mobility, which would allow the base score of the Rook to be higher compared to the base score of the minors, but I don't know that for sure.

First up is improving the tuning algorithm though.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Ab-initio evaluation tuning

Post by hgm »

Perhaps an open-file bonus would help. I noticed from imbalance testing that orthogonal sliders of all ranges tend to test 25 cP below the value you would expect, as an opening value. That makes a Wazir hardly better than a Pawn. If you start the Wazir in front of the Pawn chain it is about 130 cP, though.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

hgm wrote:Perhaps an open-file bonus would help. I noticed from imbalance testing that orthogonal sliders of all ranges tend to test 25 cP below the value you would expect, as an opening value. That makes a Wazir hardly better than a Pawn. If you start the Wazir in front of the Pawn chain it is about 130 cP, though.
I thought about an open-file bonus, but I think it would go the wrong way. What's needed is an increase of the base value of the Rook relative to the minor pieces in middle-game positions. Giving a situational bonus to the Rook isn't going to do that (if anything, it will decrease the base value), but giving a situational bonus to the minors should. Of course at the end of the day an evaluation function would have all these things.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

Evert wrote: First up is improving the tuning algorithm though.
I used GSL again to build up the fitting code, but rather than using the minimisation family of functions (which are horrendously slow for this), I used the non-linear least-squares fitting routines (which seem to be much faster, depending on the algorithm). The Levenberg-Marquardt algorithm proves to be too slow, but what GSL calls the "Double dogleg" and "Two Dimensional Subspace" methods seem to perform well. I picked the latter as the default.
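For reference, selecting that solver through GSL's gsl_multifit_nlinear interface looks roughly like this (a minimal sketch: the residual callback eval_residuals and its data pointer are placeholders for whatever the tuner actually computes, and the tolerances are arbitrary):

Code:

#include <gsl/gsl_multifit_nlinear.h>

/* Placeholder: fills r[i] with the residual of position i for the
 * parameter vector x. */
extern int eval_residuals(const gsl_vector *x, void *data, gsl_vector *r);

void fit(gsl_vector *x0, size_t n, size_t p, void *data)
{
    const gsl_multifit_nlinear_type *T = gsl_multifit_nlinear_trust;
    gsl_multifit_nlinear_parameters par =
        gsl_multifit_nlinear_default_parameters();
    par.trs = gsl_multifit_nlinear_trs_subspace2D; /* or ..._trs_ddogleg */

    gsl_multifit_nlinear_workspace *w =
        gsl_multifit_nlinear_alloc(T, &par, n, p);

    gsl_multifit_nlinear_fdf fdf = { 0 };
    fdf.f = eval_residuals;
    fdf.df = NULL;          /* NULL -> finite-difference Jacobian */
    fdf.n = n;
    fdf.p = p;
    fdf.params = data;

    gsl_multifit_nlinear_init(x0, &fdf, w);

    int info;
    gsl_multifit_nlinear_driver(200, 1e-8, 1e-8, 1e-8, NULL, NULL, &info, w);
    /* fitted parameters are in gsl_multifit_nlinear_position(w) */
    gsl_multifit_nlinear_free(w);
}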
The downside of using the GSL routines is that they don't work with stochastic gradient descent. I tried fitting on a small batch and then again on another batch. Results were poor. What might work is first fitting a subset of the positions and then increasing the number of positions that are to be fitted once the result converges, but I haven't tried that yet. For now, I just feed the full set of positions to the tuner, which is still reasonably fast.
It can use two threads to calculate the evaluation for all positions, which gives a nice speedup (t1c/t2c = 1.72). I could extend that so it can use more threads on my desktop, but I'm not sure it's all that useful.
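Just to illustrate the idea (this is not my actual code), splitting the per-position evaluation over two threads is nearly a one-liner with OpenMP; Position and eval_position() stand in for the tuner's own types and functions:

Code:

typedef struct Position Position;               /* engine's position type */
extern double eval_position(const Position *p); /* placeholder evaluator  */

/* Evaluate every position on two threads; the static schedule splits
 * the array into two contiguous halves. Compile with -fopenmp. */
void eval_all(const Position *positions, double *resid, long n)
{
    #pragma omp parallel for num_threads(2) schedule(static)
    for (long i = 0; i < n; i++)
        resid[i] = eval_position(&positions[i]);
}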

The tuner now spits out

Code:

     0   0.76  3.11  3.05  3.78  9.18  0.93  3.12  3.25  5.90  10.28  0.10  0.22  0.06 -0.02  0.14 -0.16  0.75  3.26 1e+10 0.0598891
     1   0.73  3.08  3.02  3.72  8.78  0.93  3.18  3.32  5.98  10.63  0.13  0.22  0.06 -0.02  0.17 -0.20  0.65  3.33 100.432 0.0598639
     2   0.73  3.08  3.02  3.72  8.74  0.93  3.18  3.32  5.97  10.65  0.13  0.22  0.06 -0.02  0.17 -0.20  0.64  3.33 99.472 0.0598638
     3   0.73  3.08  3.02  3.72  8.74  0.93  3.18  3.32  5.97  10.65  0.13  0.22  0.06 -0.02  0.17 -0.20  0.64  3.33 99.342 0.0598634
     4   0.73  3.08  3.02  3.72  8.74  0.93  3.18  3.32  5.97  10.65  0.13  0.22  0.06 -0.02  0.17 -0.20  0.64  3.33 99.342 0.0598634
status = success
summary from method 'trust-region/2D-subspace'
number of iterations: 4
function evaluations: 89
Jacobian evaluations: 0
reason for stopping: small step size
initial |f(x)| = 182.124960
final   |f(x)| = 182.087293
chisq/dof = 0.0598654
VALUEL_P_MG          =   0.73 +/-   0.02 [186]
VALUEL_N_MG          =   3.08 +/-   0.10 [788]
VALUEL_B_MG          =   3.02 +/-   0.10 [774]
VALUEL_R_MG          =   3.72 +/-   0.21 [953]
VALUEL_Q_MG          =   8.74 +/-   0.46 [2238]
VALUEL_P_EG          =   0.93 +/-   0.02 [238]
VALUEL_N_EG          =   3.18 +/-   0.07 [814]
VALUEL_B_EG          =   3.32 +/-   0.07 [849]
VALUEL_R_EG          =   5.97 +/-   0.11 [1529]
VALUEL_Q_EG          =  10.65 +/-   0.32 [2726]
VALUEQ_BB_MG         =   0.13 +/-   0.04 [33]
VALUEQ_BB_EG         =   0.22 +/-   0.04 [57]
VALUEQ_NN_MG         =   0.06 +/-   0.03 [16]
VALUEQ_NN_EG         =  -0.02 +/-   0.04 [-6]
VALUEQ_RR_MG         =   0.17 +/-   0.10 [43]
VALUEQ_RR_EG         =  -0.20 +/-   0.06 [-51]
VALUEQ_PASS_MG       =   0.64 +/-   0.20 [164]
VALUEQ_PASS_EG       =   3.33 +/-   0.11 [852]
{  0.725672,  3.077,  3.02305,  3.72302,  8.74231,  0.930309,  3.17822,  3.31578,  5.9733,  10.6504,  0.13009,  0.222306,  0.0640046, -0.0233892,  0.166455, -0.200781,  0.639021,  3.32864,}
I like that GSL lets you calculate the formal error bars on the different parameters, and that it reports the chi-squared/dof. I'm not sure these are actually useful though (in particular the chi-squared, but that's because it's never going to report a good fit).
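For completeness, this is roughly how those error bars and the chi-squared/dof come out of GSL once the fit has converged (a sketch following the GSL manual; w is the workspace from the fit above, n and p the number of residuals and parameters):

Code:

#include <math.h>
#include <stdio.h>
#include <gsl/gsl_blas.h>
#include <gsl/gsl_multifit_nlinear.h>

/* Print value +/- formal error for each parameter, plus chisq/dof. */
void report_errors(gsl_multifit_nlinear_workspace *w, size_t n, size_t p)
{
    gsl_matrix *J = gsl_multifit_nlinear_jac(w);
    gsl_matrix *covar = gsl_matrix_alloc(p, p);
    gsl_multifit_nlinear_covar(J, 0.0, covar);

    gsl_vector *r = gsl_multifit_nlinear_residual(w);
    double chisq;
    gsl_blas_ddot(r, r, &chisq);              /* chisq = |f(x)|^2 */
    double c = sqrt(chisq / (double)(n - p)); /* error-bar scale  */

    const gsl_vector *x = gsl_multifit_nlinear_position(w);
    for (size_t i = 0; i < p; i++)
        printf("param %zu = %.4f +/- %.4f\n", i, gsl_vector_get(x, i),
               c * sqrt(gsl_matrix_get(covar, i, i)));
    printf("chisq/dof = %g\n", chisq / (double)(n - p));
    gsl_matrix_free(covar);
}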

Now to improve that rook value.

By the way, if there's interest, I'm more than happy to share the code (after cleaning it up a bit).
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Ab-initio evaluation tuning

Post by Michel »

Evert wrote: Now to improve that rook value.
How did you select your positions? From normal games?

Are you optimizing an evaluation function with only material terms?

In normal games a player would not give up an exchange without proper compensation. So if you only optimize for material values, the value of a rook may indeed appear to be close to that of a minor, but that would be an artifact of the bias in the selection of the positions and of the fact that the eval has no terms for measuring the compensation.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

Michel wrote:
Evert wrote: Now to improve that rook value.
How did you select your positions? From normal games?
They're positions taken from those sampled during gameplay. Positions that were not "quiet" were removed from the set; the result comes from a playout by Stockfish. See http://talkchess.com/forum/viewtopic.php?p=686204 for a description of the test positions.
I filtered these further by only using positions with unbalanced material (otherwise material doesn't matter anyway).
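A sketch of such a filter, assuming per-side piece-count arrays are available (wcount/bcount are placeholder names):

Code:

/* A position has "unbalanced material" if the piece counts differ for
 * any piece type (indices 0-4 = P, N, B, R, Q). */
int material_unbalanced(const int wcount[5], const int bcount[5])
{
    for (int i = 0; i < 5; i++)
        if (wcount[i] != bcount[i])
            return 1;
    return 0;
}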
Are you optimizing an evaluation function with only material terms?
For the moment. I'm in the stage of adding other positional terms, now that I'm reasonably satisfied that the code is working correctly.
In normal games a player would not give up an exchange without proper compensation. So if you only optimize for material values the value of a rook may indeed appear to be close to that of a minor, but that would be an artifact of the bias in the selection of the positions and the fact that there are no compensating terms in the eval for measuring the compensation.
Indeed. I did a quick trial by adding a simple mobility term for B and N, and this has the desired effect of increasing the value of the Rook compared to the minors. Unfortunately it does this by reducing the value of the minors to ~200 cp or so, well below their EG value (which is problematic).
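Roughly what I mean by a simple mobility term (a sketch, assuming bitboards and a GCC-style popcount; the attack set would come from the engine's own move generator):

Code:

#include <stdint.h>

/* Simple (uncentred) mobility: the number of squares a minor attacks
 * that are not occupied by friendly pieces. */
int minor_mobility(uint64_t attacks, uint64_t own_pieces)
{
    return __builtin_popcountll(attacks & ~own_pieces);
}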
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Ab-initio evaluation tuning

Post by Michel »

Evert wrote:Positions that were not "quiet" were removed from the set; the result comes from a playout by Stockfish. See http://talkchess.com/forum/viewtopic.php?p=686204 for a description of the test positions.
I wonder if these positions were sampled during game play or during search. If they were sampled during game play then they would be biased as the material balance is not independent of other positional factors. This would lead to problems for tuning a "material only" evaluation (possibly not for tuning a "well rounded" evaluation).

I think it is more reasonable to first play a few random moves from each sampled position before recording it for a playout.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Ab-initio evaluation tuning

Post by Evert »

Michel wrote:
Evert wrote:Positions that were not "quiet" were removed from the set; the result comes from a playout by Stockfish. See http://talkchess.com/forum/viewtopic.php?p=686204 for a description of the test positions.
I wonder if these positions were sampled during game play or during search. If they were sampled during game play then they would be biased as the material balance is not independent of other positional factors. This would lead to problems for tuning a "material only" evaluation (possibly not for tuning a "well rounded" evaluation).
To my understanding, they were sampled during search:
Alexandru Mosoi wrote:From each game 20 positions were sampled from the millions of positions evaluated by the engine during the game play.
This seems fair.

Adding mobility to the evaluation and trying to tune it without tuning the rest of the evaluation terms doesn't go so well, but I guess that may be because my treatment is too simple (for one thing, it is not centred, so the base value is the value of a piece with no moves; that's probably wrong). Anyway, the condition number of the Jacobian jumps up and the program aborts after a few iterations without assigning weights to the mobility. I guess I'll do a proper job of it and try again. At some point I should try the evaluation function out in actual gameplay too...
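For what it's worth, centring the term would look something like this, so that a piece with typical mobility contributes nothing and the tuned base value stays the piece value; the average mobilities below are illustrative guesses, not tuned numbers:

Code:

/* Centred mobility: avg_mob[] holds an assumed typical mobility per
 * piece type (N, B, R, Q); the entries are guesses for illustration. */
static const int avg_mob[4] = { 4, 6, 7, 13 };

int mobility_term(int piece, int mob, int weight_cp)
{
    return weight_cp * (mob - avg_mob[piece]);
}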