How Do You Automatically Tune Your Evaluation Tables

Discussion of chess software programming and technical issues.


Jan Brouwer
Posts: 201
Joined: Thu Mar 22, 2007 7:12 pm
Location: Netherlands

Re: The texel evaluation function optimization algorithm

Post by Jan Brouwer »

AlvaroBegue wrote:
tpetzke wrote:In my tests the derived values are often far away from intuitive values. I also repeated tuning runs that then derived very different parameter sets with almost equal strength.

So strange-looking values are not so bad; they might even give you a hint about which parameters are important, which are less important, and which are maybe even redundant.
There are regularization methods to handle this type of situation. For instance, you could replace your error function E with

E' = E + some_constant * L1(parameters)

Here L1(parameters) means the sum of the absolute values of all the parameters in your evaluation function. (For this to make sense, you need to somehow normalize things in advance; so if your parameters are simply coefficients in front of features, you should normalize your features so they have mean 0 and variance 1, or something like that.)

These L1-regularization methods seem to be very much in fashion these days. They have some very neat properties. If you have two features that are nearly identical, so that only the sum of their coefficients is determined, a little bit of L1 regularization will make the coefficients about equal. If you have a feature that is pure noise, it might very well get a coefficient of exactly 0, which is something that wouldn't normally happen with least-squares regression. You might be able to identify parts of your code that don't even need to run!

Has anyone tried anything like that for tuning evaluation functions in chess?
Thanks for mentioning this.
I was unaware of regularization, and it does indeed seem appropriate for the tuning of evaluation features, or even for the selection of evaluation features.
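A minimal sketch of what such an L1 penalty could look like on top of a Texel-style error (Python, purely illustrative; the sigmoid shape and the scores_fn hook are assumptions, not anyone's actual tuner):

Code: Select all

import numpy as np

def sigmoid(score, k=1.0):
    # map a centipawn score to an expected result in [0, 1] (assumed shape)
    return 1.0 / (1.0 + 10.0 ** (-k * score / 400.0))

def l1_regularized_error(params, scores_fn, positions, results, lam=0.01):
    # scores_fn(params, positions) is a hypothetical hook that returns the
    # static evaluation of every position under the given parameter vector
    e = np.mean((results - sigmoid(scores_fn(params, positions))) ** 2)
    return e + lam * np.sum(np.abs(params))  # E' = E + some_constant * L1(parameters)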
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Hi,

I tried it with iCE and used a genetic algorithm to optimize the evaluation error. I documented my results here:

http://macechess.blogspot.de/2014/03/th ... ng_10.html

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

tpetzke wrote:Hi,

I tried it with iCE and used a genetic algorithm to optimize the evaluation error. I documented my results here:

http://macechess.blogspot.de/2014/03/th ... ng_10.html

Thomas...
Your results seem a bit suspicious to me, but perhaps I just don't understand enough detail of what you did. In particular, why are your weights between 0 and 1?
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Hi,

It is not that the weights are between 0 and 1; rather, each weight is encoded as a bit sequence of 0s and 1s.

I'm using PBIL (Population Based Incremental Learning) as the optimizer. The whole solution space is transformed into a bit vector. Each bit in the vector carries a probability of being a "1" in the solution. At the beginning all probabilities are initialized to 50%.

In the course of the evolution, each individual of a generation is produced by setting every bit to 0 or 1 according to its probability. Then the fitness of all individuals is calculated. The fittest individual in a generation modifies the probabilities of the vector by a small amount (the learning rate) towards its own values.

Over the course of the evolution the bit probabilities converge from their initial 50% towards either 0% or 100%, some faster, some slower. In my earlier tests the amounts of 0s and 1s ended up about even; here the 0s dominated. And if all bits in a weight are 0, then the weight is 0 (meaning it is eliminated). This happened a lot here.
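For anyone who has not seen PBIL before, here is a bare-bones sketch of one run along the lines described above (Python, illustrative only; the fitness function is a dummy placeholder, not the chess evaluation error):

Code: Select all

import numpy as np

rng = np.random.default_rng(1)
N_BITS, POP_SIZE, LEARN_RATE = 64, 32, 0.05
prob = np.full(N_BITS, 0.5)  # every bit starts with a 50% chance of being a 1

def fitness(bits):
    # placeholder: a real tuner would decode the bits into evaluation weights
    # and return the negated evaluation error over the training positions
    return -float(np.sum(bits))

for generation in range(200):
    # sample a population by setting each bit according to its probability
    population = (rng.random((POP_SIZE, N_BITS)) < prob).astype(int)
    best = population[np.argmax([fitness(ind) for ind in population])]
    # nudge the probability vector a small amount towards the fittest individual
    prob = (1.0 - LEARN_RATE) * prob + LEARN_RATE * best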

So I think betting on a DRAW might be a good strategy to minimize E, if the heavy terms (material, mobility, passed pawns) don't decide the score.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

I had a second read of the Junior paper you referenced above, and it contains the statement:

Code: Select all

For example, any probability distribution that attaches equal probability to a win and a loss has expectation of 1/2. Consequently, the position evaluation alone cannot provide a result probability distribution.
I think this is exactly what I experience: the tuner encourages the evaluation to output drawish scores. A second run with a different set leads to a similar result. If my training set consisted of only drawn games, it would probably kill all my evaluation terms. I have 40% drawn games in my set; this is probably enough to set a trend in that direction.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

tpetzke wrote:I had a second read of the Junior paper you referenced above, and it contains the statement:

Code: Select all

For example, any probability distribution that attaches equal probability to a win and a loss has expectation of 1/2. Consequently, the position evaluation alone cannot provide a result probability distribution.
I didn't quite understand why that is relevant when I first read it, and I still can't. The position evaluation's job is to estimate the expected value of the distribution, not the distribution itself. There is no need to distinguish between a draw and a 50% chance of a win and a 50% chance of a loss.
I think this is exactly what I experience: the tuner encourages the evaluation to output drawish scores. A second run with a different set leads to a similar result. If my training set consisted of only drawn games, it would probably kill all my evaluation terms. I have 40% drawn games in my set; this is probably enough to set a trend in that direction.
I suspect what you are seeing is a result of the binary nature of your optimization. If you were to do a more traditional non-linear regression, where you compute a linear combination of features and then pass the result through a sigmoid function, the presence of more draws would simply make the coefficients [appropriately] smaller, but they wouldn't end up at zero.
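A sketch of the kind of regression described here (Python, illustrative; features would be a hypothetical positions-by-features matrix and results the game outcomes in {0, 0.5, 1}):

Code: Select all

import numpy as np

def predict(weights, features):
    # linear combination of features pushed through a logistic sigmoid
    return 1.0 / (1.0 + np.exp(-features @ weights))

def tune(features, results, steps=1000, lr=0.1):
    # plain gradient descent on the mean squared error; with many draws the
    # fitted weights shrink, but they do not usually land exactly on zero
    w = np.zeros(features.shape[1])
    for _ in range(steps):
        p = predict(w, features)
        grad = -2.0 * features.T @ ((results - p) * p * (1.0 - p)) / len(results)
        w -= lr * grad
    return w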
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: The texel evaluation function optimization algorithm

Post by jdart »

I have been doing some research in this area.

Another interesting algorithm is Particle Swarm Optimization (see http://en.wikipedia.org/wiki/Particle_s ... timization). It apparently copes well with noisy functions (chess evaluation is noisy because there is only ever a probability of getting the right result by evaluation).
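For reference, a compact sketch of the basic PSO loop (Python, illustrative only; error stands for whatever objective is being minimized, e.g. a Texel-style evaluation error):

Code: Select all

import numpy as np

def pso_minimize(error, dim, n_particles=30, iters=200,
                 w=0.7, c1=1.5, c2=1.5, lo=-100.0, hi=100.0):
    # each particle remembers its own best position; the whole swarm is also
    # pulled towards the best position any particle has found so far
    rng = np.random.default_rng(0)
    pos = rng.uniform(lo, hi, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([error(p) for p in pos])
    gbest = pbest[np.argmin(pbest_val)]
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([error(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[np.argmin(pbest_val)]
    return gbest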

I also mentioned Support Vector Machines (SVMs) earlier. An SVM takes a training set of feature values and associated outcomes and computes the set of weights that minimizes the error over the training set.

While very simple implementations of this take a lot of runtime, modern versions can easily handle hundreds of parameters and millions of observations. One I have tried is Budgeted SVM http://www.dabi.temple.edu/budgetedsvm/ which is very fast and the code seems solid.

The problem is, a simple approach assumes that relationships are linear, but this is often not true for chess. However running a linear model can be a first step and give you an idea of what is important and what is not.

You can use a nonlinear kernel to model relationships but then the weights that are output are hard to interpret.
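Not the BudgetedSVM package itself, but as an illustration of the "linear model as a first step" idea, here is a stand-in using scikit-learn's linear SVM regression on made-up data (the feature matrix and results are placeholders):

Code: Select all

import numpy as np
from sklearn.svm import LinearSVR

# one row per training position, one column per evaluation feature (fake data)
features = np.random.rand(1000, 20)
results = np.random.choice([0.0, 0.5, 1.0], size=1000)

model = LinearSVR(C=1.0, max_iter=10000).fit(features, results)
# with a linear model every coefficient belongs to exactly one feature,
# so the weights stay interpretable
print(model.coef_)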

--Jon
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Hi,

If you have a set with only draws, then obviously the evaluation that produces the lowest evaluation error (the global optimum) consists of a simple "return 0" statement.
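(With the usual Texel-style error E = (1/N) * sum of (result - sigmoid(score))^2, an all-draw set has every result equal to 1/2, and a constant score of 0 maps through the sigmoid to exactly 1/2, so "return 0" gives E = 0, which no other evaluation can beat.)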

The PBIL optimization will eventually dig that out, because it has a chance to leave local optima again and find the global one.

I don't know what you mean by the binary nature of my weights. I guess all chess engines that run on a classical computer represent their weights in some kind of binary form. My knight material weight is an integer consisting of 8 bits.

Code: Select all

    |                                           |    WEIGHT    |
 ID | WEIGHT NAME                               |   MG     EG  |   MAX
----+-------------------------------------------+--------------+-------
  0 | PAWN_MATERIAL                             |   100    100 |    127
  1 | KING_MATERIAL                             |   INF    INF |  32767
----+-------------------------------------------+--------------+-------
  2 | KNIGHT_MATERIAL                   MIN 300 |   460    386 |    555
  3 | BISHOP_MATERIAL                   MIN 300 |   477    380 |    555
  4 | ROOK_MATERIAL                     MIN 500 |   688    599 |    755
  5 | QUEEN_MATERIAL                    MIN 900 |  1400   1176 |   1411
  6 | BISHOP_PAIR                       MIN  40 |    70     63 |     71
I don't think this is much different to what other engines do.

It is just that during the optimization, when a population is produced, bits in the weights are flipped at random to produce a set of possible solutions that are then tested. The weights are encoded using a Gray code, so that neighboring values are always only 1 bit apart. It is as smooth as it can get.
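For reference, the standard binary-to-Gray conversion (and its inverse) that gives this one-bit-apart property; a small Python sketch, not the actual iCE code:

Code: Select all

def to_gray(n):
    # binary -> Gray: adjacent integers differ in exactly one bit
    return n ^ (n >> 1)

def from_gray(g):
    # Gray -> binary: XOR together all right-shifts of the Gray value
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

assert all(bin(to_gray(i) ^ to_gray(i + 1)).count("1") == 1 for i in range(255))
assert all(from_gray(to_gray(i)) == i for i in range(256))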

And actually the optimizer works the way it is designed to. It finds a parameter set with a much lower evaluation error than the original engine. Unfortunately this does not make the engine with that set play better. This is not the optimizer's fault; I just let it solve a problem I did not have. I don't really care about the evaluation error, I care about a strong engine, and a lower evaluation error does not necessarily mean a stronger engine. Too bad. It would have been nice.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

tpetzke wrote:It is just that during the optimization, when a population is produced, bits in the weights are flipped at random to produce a set of possible solutions that are then tested. The weights are encoded using a Gray code, so that neighboring values are always only 1 bit apart. It is as smooth as it can get.
Oh, I was wondering about how flipping from 0111 to 1000 was handled. This is a very elegant solution.
And actually the optimizer works the way it is designed to. It finds a parameter set with a much lower evaluation error than the original engine. Unfortunately this does not make the engine with that set play better. This is not the optimizer's fault; I just let it solve a problem I did not have. I don't really care about the evaluation error, I care about a strong engine, and a lower evaluation error does not necessarily mean a stronger engine. Too bad. It would have been nice.
It is certainly possible to get a lower evaluation error and a weaker engine, but I expect that to be a bit pathological.

Perhaps you don't want to spend any more time on this, but another possibility is that you are lowering the evaluation error for a particular set of positions, but in a way that doesn't reduce it for general positions (overfitting). You could set aside a test set of positions that you use only to compute an independent estimate of fitness of the evaluation function, and you can plot that as a function of the generation number. If that also looks like a nice steady improvement, perhaps you are right and this idea won't help your engine.
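A sketch of that hold-out idea (Python, illustrative; evaluation_error stands for whatever error the tuner already computes):

Code: Select all

import random

def split_positions(positions, test_fraction=0.2, seed=42):
    # shuffle once, then set aside a fixed test set the tuner never trains on
    positions = list(positions)
    random.Random(seed).shuffle(positions)
    cut = int(len(positions) * test_fraction)
    return positions[cut:], positions[:cut]  # (train, test)

# per generation (hypothetical hooks, not real engine calls):
#   train_err = evaluation_error(best_params, train_positions)
#   test_err  = evaluation_error(best_params, test_positions)
# if train_err keeps falling while test_err flattens or rises,
# the tuner is overfitting the training positions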
Steve Maughan
Posts: 1221
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

Re: The texel evaluation function optimization algorithm

Post by Steve Maughan »

Hi Thomas,

I see you're fixing the value of a pawn at 100 centipawns in both the middlegame and the endgame. While I don't think it's a big issue, I think the value of a pawn can vary, since you're actually optimizing on the objective scale of "probability of a win / draw / loss".

Robert Houdart has said Houdini's evaluation is optimized such that 100 centipawns corresponds to a 90% probability of a win. This seems reasonable, and it fixes the constant in the logistic function.
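(If one uses the common Texel-style mapping P(eval) = 1 / (1 + 10^(-K * eval / 400)), which is an assumption here, fixing P(100) = 0.9 pins the constant down: 10^(-K/4) = 1/9, so K = 4 * log10(9), roughly 3.8. Any anchor of that kind plays the same role.)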

In contrast, when you optimize based on game play, you do need an objective value to anchor the other values.

Steve
http://www.chessprogramming.net - Maverick Chess Engine