Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 2:18 pm
by Evert
I've been writing up a new evaluation function from scratch, and one of the things I've been looking into is tuning it from the beginning. Right now it only has material evaluation (a general quadratic function of the piece counts, so that it can handle material imbalances), and I'm trying to tune the piece values. The results confuse me a bit, however.
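
For concreteness, here is a minimal sketch of a quadratic material term of this kind (simplified, with hypothetical names and layout, not the actual code; note that the diagonal bishop coefficient is what produces a bishop-pair effect):

Code:

// Quadratic material evaluation:
// score = sum_i linear[i]*n[i] + sum_{i,j} quad[i][j]*n[i]*n[j],
// computed for each side and differenced. counts[c][p] is the number
// of pieces of type p (P, N, B, R, Q) for colour c (0 = white, 1 = black).
double material_eval(const int counts[2][5],
                     const double linear[5],    // first-order piece values
                     const double quad[5][5])   // imbalance coefficients
{
    double score = 0.0;
    for (int c = 0; c < 2; c++) {
        const double sign = (c == 0) ? 1.0 : -1.0;  // white minus black
        for (int i = 0; i < 5; i++) {
            score += sign * linear[i] * counts[c][i];
            for (int j = 0; j < 5; j++)
                score += sign * quad[i][j] * counts[c][i] * counts[c][j];
        }
    }
    return score;
}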

Some details:
1. By "tune" I mean that I filter the evaluation function through a logistic that maps the evaluation to a game result (0...1, with 0=black wins, 1=white wins, 0.5=draw). The idea is pretty standard. I use f(x) = 1/(1+exp(-k*x)), with x the evaluation in pawn units and k a scaling constant (which happens to be 1 according to a monte-carlo best parameter estimate).
2. I use the set of test positions compiled by Alexandru Mosoi (http://talkchess.com/forum/viewtopic.php?p=686204).
3. To fit the data, I use stochastic gradient descent with 1000 positions in each estimate of the gradient. I haven't tried anything fancier, but I did first try the simplex algorithm from GSL as an alternative, which also works ok.
4. In the evaluation, I have fixed the value of a pawn in the end game (VALUE_P_EG) at 256. This fixes the scale for the evaluation, which is otherwise arbitrary.
5. During tuning, evaluation parameters are treated as double-precision floating point numbers.
6. I started with 11 parameters to tune: piece values for N, B, R, Q in MG and EG, the pawn value in MG, and the bishop-pair bonus in MG and EG (the latter is in fact just a quadratic term in the material evaluation).
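
For reference, here is a minimal sketch of one tuning iteration (hypothetical names; the evaluation is written as a plain linear model in the features, whereas the real tapered MG/EG evaluation makes the gradient slightly more involved):

Code:

#include <cmath>
#include <cstdlib>
#include <vector>

struct Position { std::vector<double> features; double result; };  // result in {0, 0.5, 1}

double sigmoid(double x, double k) { return 1.0 / (1.0 + std::exp(-k * x)); }

// One SGD step on a random batch of 1000 positions, minimising the
// squared error between sigmoid(eval) and the game result.
void sgd_step(std::vector<double>& params, const std::vector<Position>& pool,
              double k, double learning_rate)
{
    std::vector<double> grad(params.size(), 0.0);
    const int batch = 1000;
    for (int n = 0; n < batch; n++) {
        const Position& pos = pool[std::rand() % pool.size()];
        double x = 0.0;                        // evaluation in pawn units
        for (size_t i = 0; i < params.size(); i++)
            x += params[i] * pos.features[i];
        const double p   = sigmoid(x, k);
        const double err = p - pos.result;
        // d/dparam_i of (p - r)^2 = 2*(p - r) * k*p*(1 - p) * feature_i
        for (size_t i = 0; i < params.size(); i++)
            grad[i] += 2.0 * err * k * p * (1.0 - p) * pos.features[i];
    }
    for (size_t i = 0; i < params.size(); i++)
        params[i] -= learning_rate * grad[i] / batch;
}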

The initial parameters are

Code:

   MG    EG
P  0.80  1.00
N  3.25  3.50
B  3.25  3.50
R  4.50  5.50
Q  9.00  9.75
BB 0.00  0.00

After tuning, I get

Code:

   MG    EG
P  0.96  1.00
N  3.46  2.39
B  3.37  2.54
R  4.40  4.70
Q  8.79  9.29
BB 0.17  0.21

What I find particularly odd is the lower end-game value of the minor pieces. Without playing any games (which is a while away yet), they look wrong to me. With these values, an engine that is a minor ahead would adopt a trade-avoiding strategy (because its minor would be devalued) unless it can get an extra pawn in the bargain. Conversely, an engine that is a minor behind will gladly exchange material.

This makes me wonder if there's something I'm overlooking in how I've implemented my optimiser. Has anyone else done a similar experiment, with similar results? In particular, I would like to know if anyone has tried to fit Alexandru Mosoi's dataset with material only and arrived at "correct" piece values that represent the data.

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 3:37 pm
by Gerd Isenberg
Isn't it necessary to introduce at least a disjoint feature for the most volatile advanced passers?

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 4:42 pm
by hgm
Well, 1000 positions does not sound like very much. How many of those would have been end-game positions where one of the sides was a minor ahead? And how well were these finally fitted?

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 4:53 pm
by D Sceviour
Evert wrote:What I find particularly odd is the lower end-game value of the minor pieces. Without playing any games (which is a while away yet), they look wrong to me. With these values, an engine that is a minor ahead would adopt a trade-avoiding strategy (because its minor would be devalued) unless it can get an extra pawn in the bargain. Conversely, an engine that is a minor behind will gladly exchange material.

This makes me wonder if there's something I'm overlooking in how I've implemented my optimiser. Has anyone else done a similar experiment, with similar results? In particular, I would like to know if anyone has tried to fit Alexandru Mosoi's dataset with material only and arrived at "correct" piece values that represent the data.
Examine the PSQT effect on material values. If the knight's PSQT averages 1.07 more in MG than in EG, then that alone could account for the difference (3.46 - 2.39 = 1.07). That is the type of difference I found. This can be cross-checked by turning the PSQT off and comparing the tuned values.
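
A minimal sketch of that cross-check (hypothetical layout): the effective piece value is the tuned base value plus the board average of the piece's PSQT, so a difference between the MG and EG averages shifts the tuned base values.

Code:

// Board average of a piece-square table. If the MG and EG averages
// differ, the tuned base piece values absorb that offset.
double psqt_average(const double psqt[64])
{
    double sum = 0.0;
    for (int sq = 0; sq < 64; sq++)
        sum += psqt[sq];
    return sum / 64.0;
}

// effective_mg = base_mg + psqt_average(knight_psqt_mg);  // likewise for EG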

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 5:39 pm
by jdart
Evert wrote: 1000 positions in each estimate
So I guess you are using batching? SGD generally makes most sense when you have noisy observations. And batching is not actually more performant, according to recent research.

Anyway, regardless of the optimization method, I think you need to code for special-case low-material endgames, such as Rook vs minor, which is generally drawn, as well as known draw positions such as King vs two Knights. Don't tune piece values for these endgames because the material balance does not predict the result well.

Consider also coding and tuning an explicit trade down bonus for those endgames that are not special-cased.

--Jon
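
A minimal sketch of such a special-case filter, using hypothetical per-side material counts (only two of the many known drawish cases are shown):

Code:

struct Material { int p, n, b, r, q; };  // piece counts for one side

bool drawish_special_case(const Material& w, const Material& b)
{
    auto minors = [](const Material& m) { return m.n + m.b; };
    auto rest   = [](const Material& m) { return m.p + m.r + m.q; };
    // KNN vs K: two knights cannot force mate.
    if (w.n == 2 && minors(w) == 2 && rest(w) == 0 && minors(b) + rest(b) == 0)
        return true;
    if (b.n == 2 && minors(b) == 2 && rest(b) == 0 && minors(w) + rest(w) == 0)
        return true;
    // Bare rook vs bare minor is generally drawn.
    if (w.r == 1 && w.p + w.q == 0 && minors(w) == 0 && minors(b) == 1 && rest(b) == 0)
        return true;
    if (b.r == 1 && b.p + b.q == 0 && minors(b) == 0 && minors(w) == 1 && rest(w) == 0)
        return true;
    return false;
}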

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 5:49 pm
by Evert
Gerd Isenberg wrote:Isn't it necessary to introduce at least a disjoint feature for the most volatile advanced passers?
Good question. I don't know. Worth investigating, but I'd like to keep the number of evaluation features down while I figure out if this works.

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 5:56 pm
by Evert
hgm wrote:Well, 1000 positions does not sound like very much. How many of those would have been end-game positions where one of the sides was a minor ahead? And how well were these finally fitted?
The 1000 positions are drawn at random from a pool of 725,000 positions (which again doesn't sound like a lot), so the positions used differ from one iteration to the next.
Testing for convergence is actually something I find remarkably tricky, because the gradient is very shallow for all parameters and the error estimate hardly changes (which is at least consistent). There is of course the constant in the logistic, but I don't think that should matter.
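
A possible convergence check is to track the mean error over the full pool (or a fixed held-out subset) every few hundred iterations and stop when it no longer decreases. A minimal sketch, reusing the Position struct and sigmoid() from the sketch in my first post:

Code:

double mean_error(const std::vector<double>& params,
                  const std::vector<Position>& pool, double k)
{
    double total = 0.0;
    for (const Position& pos : pool) {
        double x = 0.0;
        for (size_t i = 0; i < params.size(); i++)
            x += params[i] * pos.features[i];
        const double d = sigmoid(x, k) - pos.result;
        total += d * d;
    }
    return total / pool.size();
}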

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 5:57 pm
by Evert
D Sceviour wrote: Examine the PSQT effect on material values. If the knight's PSQT averages 1.07 more in MG than in EG, then that alone could account for the difference (3.46 - 2.39 = 1.07). That is the type of difference I found. This can be cross-checked by turning the PSQT off and comparing the tuned values.
This is just piece values, so without PST. I normally centre those anyway (so they average to 0 over the entire board).
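
Centring just means subtracting the board average, so that whatever average the table had ends up in the base piece value; a minimal sketch:

Code:

// Centre a piece-square table so it averages to zero over the board.
void centre_psqt(double psqt[64])
{
    double mean = 0.0;
    for (int sq = 0; sq < 64; sq++)
        mean += psqt[sq];
    mean /= 64.0;
    for (int sq = 0; sq < 64; sq++)
        psqt[sq] -= mean;
}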

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 6:03 pm
by Evert
jdart wrote: So I guess you are using batching? SGD generally makes most sense when you have noisy observations. And batching is not actually more performant, according to recent research.
Really? The page I read suggested otherwise, but it makes some sense that it wouldn't matter much in the end whether you average the gradients first or apply them one after the other.
Anyway, I did notice that the values become very unstable if I don't batch the positions at each iteration, but perhaps I shouldn't look at what happens from one iteration to the next, and only at where things end up.
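
For comparison with the batched step sketched earlier, the unbatched variant applies each position's gradient immediately (again reusing the hypothetical Position and sigmoid() from that sketch); the updates are much noisier, which matches the instability I see:

Code:

// Pure per-position SGD: one update per sampled position. Needs a
// smaller learning rate than the batched version to stay stable.
void sgd_step_single(std::vector<double>& params, const Position& pos,
                     double k, double lr)
{
    double x = 0.0;
    for (size_t i = 0; i < params.size(); i++)
        x += params[i] * pos.features[i];
    const double p = sigmoid(x, k);
    const double g = 2.0 * (p - pos.result) * k * p * (1.0 - p);
    for (size_t i = 0; i < params.size(); i++)
        params[i] -= lr * g * pos.features[i];
}
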
jdart wrote: Anyway, regardless of the optimization method, I think you need to code for special-case low-material endgames, such as Rook vs minor, which is generally drawn, as well as known draw positions such as King vs two Knights. Don't tune piece values for these endgames because the material balance does not predict the result well.
Indeed, I haven't done that yet. Such positions shouldn't be in the set anyway, but I'll double-check that.

Re: Ab-initio evaluation tuning

Posted: Wed Aug 30, 2017 7:54 pm
by AlvaroBegue
Evert wrote:
jdart wrote: Anyway, regardless of the optimization method, I think you need to code for special-case low-material endgames, such as Rook vs minor, which is generally drawn, as well as known draw positions such as King vs two Knights. Don't tune piece values for these endgames because the material balance does not predict the result well.
Indeed, I haven't done that yet. Such positions shouldn't be in the set anyway, but I'll double-check that.
I would discard any positions with 6 men or fewer, since we can get those from EGTBs. That should take care of most of the special cases.
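
A minimal sketch of that filter, counting the piece letters in the board field of a FEN string:

Code:

#include <cctype>
#include <string>

// Keep a position for tuning only if it has more than six men
// (kings included); anything smaller can be probed in EGTBs instead.
bool keep_for_tuning(const std::string& fen)
{
    int men = 0;
    for (char c : fen) {
        if (c == ' ')
            break;                      // end of the board field
        if (std::isalpha(static_cast<unsigned char>(c)))
            men++;
    }
    return men > 6;
}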

I would also be curious if you get the same results using my collection of positions: https://bitbucket.org/alonamaloh/ruy_tu ... th_results

These were collected from calls to the evaluation function in RuyDos. I labelled each one with a result by playing a very quick Stockfish-against-itself game from it. I then replaced each position with the position its quiescence-search score came from.