Ab-initio evaluation tuning

hgm · Post by **hgm** » Thu Aug 31, 2017 2:41 pm

Evert wrote:
AlvaroBegue wrote: This should not bother you at all. A probability is a continuous quantity that expresses our expectation of a binary result, and we use logistic regression to fit probabilities to discrete outcomes all the time. If you have time to play more games, you should probably play them from new positions so you get a more complete sample of situations on the board, instead of playing them from the same positions.
Maybe.
But how certain are you that the outcome is correct? If the result is 1-0, but a match over 10 games would result in 7-3, then we would be better off using 0.7 rather than 1.0 for that position. Worse, what if the result of a 10-game match is 1-9? Right now we treat positions that are won because you are a minor ahead as though they should give the same score as positions that are won because you are a queen ahead. They aren't, of course. Allowing for a difference there should reduce the noise in the fit. Of course adding more positions can also do that, but each individual position is then less important. On the other hand, reducing noise is not a goal per se.

The interesting positions here are not those where you are a piece ahead (or behind), of course, but those closer to the draw score, around the inflection point of the logistic function. I might try to make an estimate for how much time it would take to get a better estimate for some of those.

By playing the same position multiple times you are just creating more data points. The only difference is that you are creating multiple points for the same position, instead of for different (but similar) positions.

If a certain huge data set would be fitted by a certain function, a random selection of 10% of the data points should still be fitted by the same function, except with somewhat larger error bars for the fitted parameters. Having only a single result for each position can be seen as making a random selection of an N times larger set where each position was played N times.

So playing positions multiple times isn't particularly better than any other way for creating more data points. But a difference is that you would actually have to generate the games, while more data points from other positions can be obtained almost for free, by selecting from more games that were played anyway.

asanjuan · Post by **asanjuan** » Thu Aug 31, 2017 2:51 pm

The only drawback then is that you need to determine k again for the next run.

No.
Continue with the same K. There's no need to determine K. K is introduced to define a scale (Are you using centipawns, milipawns... quarter of a pawn?).
If you are simply starting a new evaluation function, forget K. Set K=1 and use it forever.
If you need (for example) to switch to milipawns, use K=0.1 and run the process. But once K is established, there's no need to recompute K.

Evert · Post by **Evert** » Thu Aug 31, 2017 5:04 pm

asanjuan wrote:
The only drawback then is that you need to determine k again for the next run.
No.
Continue with the same K. There's no need to determine K. K is introduced to define a scale (Are you using centipawns, milipawns... quarter of a pawn?).
If you are simply starting a new evaluation function, forget K. Set K=1 and use it forever.
If you need (for example) to switch to milipawns, use K=0.1 and run the process. But once K is established, there's no need to recompute K.

Depends. If the evaluation scale changes by a factor alpha, then k should be scaled to k/alpha to keep the result invariant. If you fix k, then the evaluation scale is implicitly fixed. If instead you fix the value of a pawn (say), then you should allow k to be varied. Suppose that the tuner want to decrease the value of the pawn. Because it's fixed what it needs to do instead is increase all other terms relative to the pawn. That changes the overall scale of the evaluation.
That was my point.

zenpawn · Post by **zenpawn** » Thu Aug 31, 2017 11:43 pm

Evert wrote: That's it in a nutshell. I hope that wasn't too technical. In principle you can carry the expansion further to cubic terms, which would let you include Q vs BBN and Q vs BNN. I think the only other important terms there are RR vs BBN, RR vs BNN and R vs BN (R vs BB and R vs NN are quadratic terms), but perhaps there are pawn terms that are important too, I don't know. I don't think anyone has tried to do this, but I might throw in some cubic terms once I get the quadratic stuff working as intended.

Very interesting. So, you create a mathematical function with a good number of the various piece combinations and let the tuning assign all the multipliers. Thanks for the explanation.

BTW, what makes Q vs BNN a cubic term? Is it just that at least one side of the comparison has three pieces?

Evert · Post by **Evert** » Fri Sep 01, 2017 12:25 am

zenpawn wrote:BTW, what makes Q vs BNN a cubic term? Is it just that at least one side of the comparison has three pieces?

There are three piece types involved.
If you put the number of all pieces in a vector, a quadratic term gives you all possible combinations of all piece types (all terms have the form Mij*Vi*Vj for indices i and j). A cubic term multiplies that again by the same vector, so all terms have the form Mijk*Vi*Vj*Vk.

Evert · Post by **Evert** » Fri Sep 01, 2017 4:46 pm

As a separate test to see if things work, I started with all material values set to 0, and decided to tune MG and EG values for all pieces, and pair bonuses for B, N and R (any Q-pair bonus is probably unmeasurable without explicitly selecting positions that have more than one Queen).

The method I use is still straight-up gradient descent, with one tweak: I throttle the step size as the number of iterations increases, so the algorithm can do a better job at zooming in on the actual optimum. Unfortunately, the only convergence criterion I have that seems to work so far is to eye-ball the results, which means I had to restart the run from the last result a couple of times (until I got bored with it, basically).

The first figure shows the residual of the fit:

As is evident from the image, convergence is slow. There is probably a lot to win here by being more clever about choosing the step size, or scaling the step-size for different variables. It's hard to tell, but it looks like the residual is still decreasing at the end.

The norm of the gradient itself is no help at all when it comes to deciding on whether things have converged or not, it just looks like random noise (which it isn't, or there would be no progress at all):

On to the interesting bits. The pawn value seems to converge well:

The EG value is probably inflated by passed pawns in the dataset, which have no corresponding evaluation term. What's funny is that the tuner seems to bounce around in a circle, as can be seen from the quasi-periodic pattern in the value of the pawn. This is not good.

The value of the Knight is also interesting:

The EG value seems to converge a bit faster than the MG value, which is not odd given that the positions in the tuning set are somewhat biased towards later game stages. Of note is that the EG value rises more quickly than the MG value, but MG value eventually becomes bigger.

The value of the Bishop,

does something similar: the EG value converges quickly, the MG value has not converged yet at the end of the run. Observe again the oscillations, particularly towards the end of the run. The same oscillation shows up in the EG and MG values, but in the opposite direction.

For the Rook,

the EG value again converges quickly. The MG value has not converged at all yet, and is pathologically small.

The Queen value looks the most smooth overall (which is in part due to the scale being larger and hiding the fluctuations):

The MG value is blatantly too low at the moment, and I expect the EG value will come down a bit if the run were left to continue.

The N-pair bonus is tiny, slightly positive in the MG and slightly negative in the EG.

The R-pair is more significant, but it looks like this will converge to a small number in the end. It seems to go down as the rook value increases during the iteration process.

The B-pair is noisy, but looks reasonable:

Conclusions:
My code seems to work and lead to correct results (eventually). I need a better choice for the step size, and/or a better algorithm than plain gradient descent.

Any other suggestions/comments/feedback?

Henk · Post by **Henk** » Fri Sep 01, 2017 6:03 pm

Use gradient descent with restart. Gradient descent only finds a local minimum. So to get a better result restart it many times with different initial value.

[Actually I only trust simulated annealing for getting a global maximum/minimum]

There are so many optimizing algorithms.

hgm · Post by **hgm** » Fri Sep 01, 2017 7:40 pm

It seems you are suffering from the 'river-valley problem'. The method becomes unstable (and just before that starts to oscillate) when you try to make steps that are larger than the distance to the closest minimum along a line in the direction of the gradient. So your maximum acceptable step size is determined by the width of the valley, making progress towards the sea exceedingly slow.

What I often used for optimizing functions like that is just make small steps in random directions, but add the moving average (e.g. exponentially decaying) of the latest accepted steps. This learns in which direction it should step, and will keep increasing the step size as long as that direction works. At all times keep statistics on recent acceptance rate, and if it drops below a certain value, half the moving average.

Evert · Post by **Evert** » Fri Sep 01, 2017 8:06 pm

Henk wrote:Use gradient descent with restart. Gradient descent only finds a local minimum. So to get a better result restart it many times with different initial value.

[Actually I only trust simulated annealing for getting a global maximum/minimum]

There are so many optimizing algorithms.

From what I remember, simulated annealing performs poorly when tuning chess evaluation functions. I'll try to find where I read that.
It's true that gradient descent can only find local optima, but that is true of any general optimisation algorithm, including simulated annealing (but that at least has a decent chance of getting kicked out of whatever valley it ends up being stuck in).

Evert · Post by **Evert** » Fri Sep 01, 2017 8:17 pm

hgm wrote:It seems you are suffering from the 'river-valley problem'. The method becomes unstable (and just before that starts to oscillate) when you try to make steps that are larger than the distance to the closest minimum along a line in the direction of the gradient. So your maximum acceptable step size is determined by the width of the valley, making progress towards the sea exceedingly slow.

What I often used for optimizing functions like that is just make small steps in random directions, but add the moving average (e.g. exponentially decaying) of the latest accepted steps. This learns in which direction it should step, and will keep increasing the step size as long as that direction works. At all times keep statistics on recent acceptance rate, and if it drops below a certain value, half the moving average.

What I used to do when calculating stellar structure models is monitor whether steps were still helping with convergence, and when they didn't cut back the step size. That was usually good enough to get the thing back on track, but it could still get stuck bouncing around a valley now and then. It would then try a restart with a smaller timestep (which sometimes exacerbated the problem).
The main difference is that I was trying to solve a system of equations, rather than find an optimum. I always thought of the algorithms to do either as being fairly similar, but I find this even more of a black art!

Your method sounds similar to methods described on http://ruder.io/optimizing-gradient-descent/index.html (which was posted here in another thread recently); I'll give it a try.

Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning

Re: Ab-initio evaluation tuning