Texel tuning method question

Desperado · Post by **Desperado** » Wed Jun 07, 2017 6:00 pm

hgm wrote:
Desperado wrote:If i come close with my understanding which would mean a static score from the pv leaf, i bet it provides the same noise as any other static score you compute in any other situation.
What you seem to miss is that the tuning method doesn't use only the score, but also the gradient of the score, i.e. the direction in parameter space in which the score increases fastest. It needs this to know the relative magnitude of the changes it has to apply to all the parameters to get closer to the optimum.

To know this, you have to know how the score of the individual test cases will change as a result of a change in each parameter. But you cannot see how much each parameter contributes to the score in the root. E.g. the root could have a Rook for the opponent, and none for you. Then the root score would get worse for you if the Rook value increases. But the PV might be NxR, PxN, so that the position at the end of the PV does not contain any Rooks at all, and the static eval of that position (and thus the root score) would be insensitive to the Rook value.

Isn't this still a matter of the raw data, as i tried to point out in my last post for you?

zenpawn · Post by **zenpawn** » Wed Jun 07, 2017 6:07 pm

This post http://talkchess.com/forum/viewtopic.php?t=61427 might help. It links to the EPD of 730k quiet positions Alexandru Mosoi, author of Zurichess, used in his Texel tuning.

jdart · Post by **jdart** » Wed Jun 07, 2017 6:15 pm

I think several posters here misunderstand the process.

You are trying to minimize the error between estimated game result (from position score) and the actual game results, over the training set.

If you have only quiescent positions, such as might occur at the end of the qsearch, in your training set, then you don't need to do any searching. You can use the static eval directly.

If your position is not quiescent then in general you cannot obtain even a grossly approximate value for the position from the static value. At the very least, you need to call the qsearch to get a value.

I do not assume the training set has only quiescent positions, so a search is necessary. Once this search is done you basically have a set of derived positions that are quiescent and from then on the static eval and its gradient can be computed for those positions. However, note the game result, which you are actually using in the gradient computation, belongs to the root (original) training position, not the end-of-PV position. So you are assuming that the quiescent position's eval is a proxy for evaluating the root position, which cannot be directly evaluated statically.
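As a rough illustration of the quantity being minimized, here is a minimal Python sketch. It assumes the commonly used logistic mapping from centipawn score to expected result; the constant `K`, the function names and the sample data are hypothetical and not taken from any engine discussed here. Each sample pairs the static eval of the quiescent end-of-PV position with the game result of the root position it was derived from.

```python
# Hypothetical scaling constant; in practice it is fitted to the engine's score units.
K = 1.0 / 400.0

def expected_result(score_cp, k=K):
    """Map a centipawn score (White's point of view) to an expected result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp))

def mean_squared_error(samples, k=K):
    """samples: list of (leaf_score_cp, game_result) pairs.

    leaf_score_cp is the static eval of the quiescent end-of-PV position;
    game_result (1.0, 0.5 or 0.0 for White) belongs to the root training position.
    """
    total = 0.0
    for leaf_score_cp, game_result in samples:
        total += (game_result - expected_result(leaf_score_cp, k)) ** 2
    return total / len(samples)

# Made-up samples, for illustration only.
samples = [(120, 1.0), (-35, 0.5), (-300, 0.0), (10, 0.0)]
print(mean_squared_error(samples))
```

The tuner changes the eval parameters, which changes each `leaf_score_cp`, and keeps the changes that lower this error.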

--Jon

hgm · Post by **hgm** » Wed Jun 07, 2017 6:27 pm

Desperado wrote:Isn't this still a matter of the raw data, as i tried to point out in my last post for you?

I don't know what you mean by 'raw data'. If it is the position at the end of the PV, then yes, the gradient follows from that. The gradient cannot be calculated from the position at the root. It could be calculated from as many root scores with different parameters as there are parameters (+1).

zenpawn · Post by **zenpawn** » Wed Jun 07, 2017 6:29 pm

jdart wrote:I think several posters here misunderstand the process.

You are trying to minimize the error between estimated game result (from position score) and the actual game results, over the training set.

If you have only quiescent positions, such as might occur at the end of the qsearch, in your training set, then you don't need to do any searching. You can use the static eval directly.

If your position is not quiescent then in general you cannot obtain even a grossly approximate value for the position from the static value. At the very least, you need to call the qsearch to get a value.

I do not assume the training set has only quiescent positions, so a search is necessary. Once this search is done you basically have a set of derived positions that are quiescent and from then on the static eval and its gradient can be computed for those positions. However, note the game result, which you are actually using in the gradient computation, belongs to the root (original) training position, not the end-of-PV position. So you are assuming that the quiescent position's eval is a proxy for evaluating the root position, which cannot be directly evaluated statically.

--Jon

The positions linked in the post above seek to address some of these issues by providing quiet positions (according to the readme, "From the set were removed all positions on which quiescence search found a wining capture. The remaining positions were stored in quiet.epd."). There's also a slightly smaller labeled set (725k) where the game result comes from Stockfish 080916 in self-play starting from the quiet position rather than something earlier.

petero2 · Post by **petero2** » Wed Jun 07, 2017 7:43 pm

AlvaroBegue wrote:
Desperado wrote:Maybe i should think about it twice, but the pv eval should be passed to the root as search result. So at first glance i don't know in what way the "eval at the end of the pv" is different to the search result score. :?: :!:
The trick is doing the gradient descent. While it would be possible to do it on the search function itself, it would be hard to make that efficient. So instead, you need to recover what position gave the eval that was propagated to the root, and then compute the gradient of the evaluation function at that node.

I will add some comments here that will hopefully help clear up some of the confusion.

The basic texel tuning method treats the evaluation function and the q-search function as black boxes. You put in a position and a set of parameter values, and you get out an evaluation score. How the score is computed is completely irrelevant for the tuning algorithm.

Without any assumptions about how the evaluation function works internally, you are restricted to quite primitive algorithms for finding a minimum in parameter space. The pseudo code on the CPW for example varies one parameter at a time, following the downwards direction. It stops when no smaller value can be found in any direction.
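A one-parameter-at-a-time search of this kind can be sketched as follows. This is only an illustration in the spirit of that pseudo code, not a copy of it; `local_search` and `toy_loss` are hypothetical names, and the toy loss stands in for a full pass over the training positions.

```python
def local_search(loss, params, step=1):
    """Vary one parameter at a time by +/- step, keeping any change that lowers the loss."""
    best_params = list(params)
    best_loss = loss(best_params)
    improved = True
    while improved:
        improved = False
        for i in range(len(best_params)):
            for delta in (step, -step):
                trial = list(best_params)
                trial[i] += delta
                trial_loss = loss(trial)
                if trial_loss < best_loss:
                    best_params, best_loss = trial, trial_loss
                    improved = True
                    break  # this direction helped; move on to the next parameter
    return best_params, best_loss

# Toy stand-in for the tuning error: a bowl with its minimum at (3, -2).
def toy_loss(p):
    return (p[0] - 3) ** 2 + (p[1] + 2) ** 2

print(local_search(toy_loss, [0, 0]))
```

Note that the loss function is only ever called, never inspected, which is what treating it as a black box means in practice.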

If we assume that the function to minimize is differentiable almost everywhere but still treat the function as a black box, we could use various gradient based optimization methods to speed up the search for a local minimum. Since the function is a black box it would not be possible to directly compute the required partial derivatives, so they would have to be approximated using finite differences instead. Typically something like

Code:

dE/dPi ~= E(pi+1) - E(pi)

or

Code:

dE/dPi ~= (E(pi+1) - E(pi-1))/2

If there are M parameters the first formula would require M+1 evaluations to compute the value and all partial derivatives. The second formula is more accurate but would require 2*M+1 evaluations.
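The evaluation counts can be checked with a small sketch. The helper names and the toy error function below are hypothetical; the step `h=1` matches the unit step used in the two formulas.

```python
def forward_difference_gradient(f, params, h=1):
    """M + 1 calls to f: one at the base point, one per parameter."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        shifted = list(params)
        shifted[i] += h
        grad.append((f(shifted) - base) / h)
    return base, grad

def central_difference_gradient(f, params, h=1):
    """2*M + 1 calls to f: the value itself plus two per parameter."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        up = list(params)
        up[i] += h
        down = list(params)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return base, grad

def counted(f):
    """Wrap f so that the number of calls can be inspected afterwards."""
    calls = {"n": 0}
    def wrapper(p):
        calls["n"] += 1
        return f(p)
    return wrapper, calls

# Toy error function with exact gradient (2*p0, 3).
def toy_error(p):
    return p[0] ** 2 + 3 * p[1]

f1, calls1 = counted(toy_error)
print(forward_difference_gradient(f1, [2, 5]), calls1["n"])
f2, calls2 = counted(toy_error)
print(central_difference_gradient(f2, [2, 5]), calls2["n"])
```

On this toy function the central difference recovers the exact gradient (4, 3) at the point (2, 5), while the forward difference overestimates the first component because of the curvature in `p[0]`.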

If we further assume that the evaluation function has a certain structure, so that the evaluation score is computed from the position and parameters using only a well-defined set of operations, and assume that the evaluation function is written in a language that supports generic types and operator overloading, it is possible to implement a framework that automatically computes the partial derivatives at the same time as the evaluation score is computed. See for example this article for an explanation of how this can be done.
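The operator-overloading idea can be illustrated in a few lines of Python. This is a forward-mode sketch under stated assumptions: the `Dual` class, `make_params` and the three-term material `evaluate` function are all hypothetical and are not the API of any framework mentioned in this thread. The point is only that the same evaluation code runs on plain numbers or on a type that carries derivatives along.

```python
class Dual:
    """A value together with its partial derivatives with respect to each parameter."""

    def __init__(self, value, grad):
        self.value = value
        self.grad = list(grad)

    @staticmethod
    def _lift(x, n):
        # Plain numbers are constants: all partial derivatives are zero.
        return x if isinstance(x, Dual) else Dual(x, [0.0] * n)

    def __add__(self, other):
        other = Dual._lift(other, len(self.grad))
        return Dual(self.value + other.value,
                    [a + b for a, b in zip(self.grad, other.grad)])
    __radd__ = __add__

    def __sub__(self, other):
        other = Dual._lift(other, len(self.grad))
        return Dual(self.value - other.value,
                    [a - b for a, b in zip(self.grad, other.grad)])

    def __rsub__(self, other):
        return Dual._lift(other, len(self.grad)) - self

    def __mul__(self, other):
        other = Dual._lift(other, len(self.grad))
        # Product rule: (u*v)' = u'*v + u*v'
        return Dual(self.value * other.value,
                    [a * other.value + self.value * b
                     for a, b in zip(self.grad, other.grad)])
    __rmul__ = __mul__

def make_params(values):
    """Parameter i gets derivative 1 with respect to itself and 0 otherwise."""
    n = len(values)
    return [Dual(v, [1.0 if i == j else 0.0 for j in range(n)])
            for i, v in enumerate(values)]

def evaluate(params, features):
    """Generic eval: works unchanged with plain numbers or with Dual parameters."""
    pawn, knight, bishop_pair = params
    return (pawn * features["pawn_diff"]
            + knight * features["knight_diff"]
            + bishop_pair * features["bishop_pair_diff"])

features = {"pawn_diff": 2, "knight_diff": -1, "bishop_pair_diff": 1}
plain = evaluate([100, 320, 30], features)
dual = evaluate(make_params([100, 320, 30]), features)
print(plain, dual.value, dual.grad)
```

In C++ the same effect is obtained by making the score type a template parameter, so that the eval can be instantiated once with an ordinary integer type for play and once with a derivative-carrying type for tuning.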

Álvaro has implemented such a framework, which is called ruy_tune. It is written in C++ and a requirement for it to work is that the evaluation function is converted to a template, where the score type is a template parameter. With such a modified evaluation function, the gradient can be computed much faster than if it were computed using finite differences. (At least I think it will be much faster, I have not actually tested this.)

However, for this to work you would have to find the position at the end of the PV and use that position to compute the evaluation score and the corresponding gradient. If you wanted to apply the automatic gradient computation technique to the q-search function, the q-search function would also have to be converted to a template, and the framework would have to be extended to overload also comparison operators in order to make the mini-maxing work.

AlvaroBegue · Post by **AlvaroBegue** » Wed Jun 07, 2017 8:08 pm

petero2 wrote:[...]
Álvaro has implemented such a framework, which is called ruy_tune. It is written in C++ and a requirement for it to work is that the evaluation function is converted to a template, where the score type is a template parameter. With such a modified evaluation function, the gradient can be computed much faster than if it were computed using finite differences. (At least I think it will be much faster, I have not actually tested this.)

However, for this to work you would have to find the position at the end of the PV and use that position to compute the evaluation score and the corresponding gradient. If you wanted to apply the automatic gradient computation technique to the q-search function, the q-search function would also have to be converted to a template, and the framework would have to be extended to overload also comparison operators in order to make the mini-maxing work.

Correct. Computing the gradient on the QS directly is a colossal waste of time, at least with the method I implemented. It is much faster to run the QS saving the PV and then compute the gradient using the end-of-PV position.

What I did with RuyTune is turn the original positions into quiet positions using this end-of-PV method using my existing evaluation function, and then not worry too much about the fact that tweaking the evaluation function could result in a different position being picked. I could rerun this periodically (as someone else has suggested in this thread), but I think it would make very little difference in practice.
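The end-of-PV idea can be sketched on a toy capture tree. Everything below is hypothetical (the position dictionaries, the two-parameter material eval, the function names) and is not RuyTune code; it only shows a quiescence search that returns the leaf that produced its score. The tree reproduces the NxR, PxN example from earlier in the thread: the root contains an enemy Rook, but the leaf does not, so the score does not depend on the Rook value.

```python
def static_eval(position, params):
    """Linear material eval from White's point of view."""
    return sum(params[name] * count for name, count in position["features"].items())

def qsearch(position, params, alpha=-10**9, beta=10**9):
    """Toy negamax quiescence search.

    Returns (score for the side to move, end-of-PV position that produced it).
    """
    sign = 1 if position["white_to_move"] else -1
    best_score = sign * static_eval(position, params)  # stand pat
    best_leaf = position
    if best_score >= beta:
        return best_score, best_leaf
    alpha = max(alpha, best_score)
    for child in position["captures"]:
        child_score, child_leaf = qsearch(child, params, -beta, -alpha)
        child_score = -child_score
        if child_score > best_score:
            best_score, best_leaf = child_score, child_leaf
            if best_score >= beta:
                break
            alpha = max(alpha, best_score)
    return best_score, best_leaf

# Feature counts are (White pieces - Black pieces).
after_pxn = {"white_to_move": True, "features": {"knight": 0, "rook": 0}, "captures": []}
after_nxr = {"white_to_move": False, "features": {"knight": 1, "rook": 0},
             "captures": [after_pxn]}
root = {"white_to_move": True, "features": {"knight": 1, "rook": -1},
        "captures": [after_nxr]}

score, leaf = qsearch(root, {"knight": 300, "rook": 500})
# For a linear eval the gradient at the leaf is simply its feature vector.
print(score, leaf["features"])
```

Raising the Rook value to 600 in this toy tree leaves both the score and the chosen leaf unchanged, which is the insensitivity described above. With larger parameter changes a different leaf could be selected, which is why periodically regenerating the leaf positions has been suggested.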

jdart · Post by **jdart** » Wed Jun 07, 2017 9:03 pm

Because my eval function is relatively simple I actually do a closed-form gradient computation. For me this entails some duplication of logic between the actual eval function and the tuning code. I also have some code that does the finite difference calculation and compares with the gradient computation and verifies that they are giving the same result within some small error margin.
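A check of this kind might look like the following sketch. It assumes a purely linear eval and a logistic squared-error loss; the constant `K`, the function names and the data are hypothetical and do not reflect any particular engine's tuning code.

```python
import math

K = 1.0 / 400.0  # hypothetical scaling constant

def sigmoid(score):
    return 1.0 / (1.0 + 10.0 ** (-K * score))

def loss(weights, data):
    total = 0.0
    for features, result in data:
        score = sum(w * f for w, f in zip(weights, features))
        total += (result - sigmoid(score)) ** 2
    return total / len(data)

def analytic_gradient(weights, data):
    """Closed-form gradient of the loss for a linear eval."""
    grad = [0.0] * len(weights)
    for features, result in data:
        score = sum(w * f for w, f in zip(weights, features))
        s = sigmoid(score)
        # d/dscore of (result - s)^2, using ds/dscore = K * ln(10) * s * (1 - s)
        dloss_dscore = -2.0 * (result - s) * K * math.log(10.0) * s * (1.0 - s)
        for j, f in enumerate(features):
            grad[j] += dloss_dscore * f / len(data)
    return grad

def numeric_gradient(weights, data, h=1e-3):
    """Central finite differences, used only as a cross-check."""
    grad = []
    for j in range(len(weights)):
        up = list(weights)
        up[j] += h
        down = list(weights)
        down[j] -= h
        grad.append((loss(up, data) - loss(down, data)) / (2.0 * h))
    return grad

def gradients_agree(weights, data, tolerance=1e-8):
    pairs = zip(analytic_gradient(weights, data), numeric_gradient(weights, data))
    return all(abs(a - n) <= tolerance for a, n in pairs)

# Made-up feature vectors (e.g. material differences) and game results.
data = [([2, -1, 1], 1.0), ([0, 1, -1], 0.5), ([-3, 0, 0], 0.0)]
weights = [100.0, 320.0, 30.0]
print(gradients_agree(weights, data))
```

Terms such as material scaling or a nonlinear king safety function would add further chain-rule factors to `analytic_gradient`, and the numeric check is a cheap way to catch mistakes in those.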

Computing the gradient is fairly straightforward but it is important to take account of material value scaling, if that is used. And I have a few bits, notably king safety computation, that are nonlinear and for which gradient computation is non-trivial, but still doable.

This is quite a bit more dev work than finite differences though.

--Jon

Cheney · Post by **Cheney** » Wed Jul 19, 2017 9:49 pm

Hi,

I have a question at this point in regards to speed and method for testing various parameters.

I am just testing the mechanics and speed of the tuning method on a single thread and get about 1M positions tested in just under 20 seconds. I am OK with this for now but as I think about it, and I see this has been discussed in other posts, if I expose a few dozen parameters to tune, this could take weeks or longer, right?

What I envision is this, for example:
- I have 10 parameters that I want to tune.
- I want to use a delta for each parameter to test, let's just say delta +/- 10. For each parameter, this is 21 different values from value-10 to value+10.

That is a lot of combinations for ~20 seconds per test.

I do not know if this is called a "local search" or not, but the more parameters I want to expose the more time that is needed.

Am I seeing this wrong? Maybe this is the simplest brute force and there are better ways. Can someone help clear this idea up and give me a nudge in the right direction?
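To put rough numbers on the scenario above, here is a small back-of-the-envelope calculation. It takes the figures from this post (10 parameters, 21 candidate values each, about 20 seconds per pass over the positions) and compares an exhaustive grid over all combinations with a single sweep that tries each candidate value of one parameter at a time. The variable names are illustrative only.

```python
SECONDS_PER_PASS = 20    # roughly 1M positions per pass, from the figures above
NUM_PARAMS = 10
VALUES_PER_PARAM = 21    # value-10 .. value+10

# Exhaustive grid: every combination of every candidate value.
grid_passes = VALUES_PER_PARAM ** NUM_PARAMS
grid_years = grid_passes * SECONDS_PER_PASS / (3600 * 24 * 365)

# One sweep, one parameter at a time: each candidate value of each parameter once.
sweep_passes = NUM_PARAMS * VALUES_PER_PARAM
sweep_minutes = sweep_passes * SECONDS_PER_PASS / 60

print(f"grid:  {grid_passes} passes, about {grid_years:.3g} years")
print(f"sweep: {sweep_passes} passes, about {sweep_minutes:.0f} minutes")
```

The full grid is about 1.7e13 passes, on the order of ten million years, while one sweep is 210 passes, about 70 minutes. A single sweep does not guarantee a minimum, since the parameters interact, so several sweeps are usually needed, but the cost grows linearly rather than exponentially with the number of parameters.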

AlvaroBegue · Post by **AlvaroBegue** » Wed Jul 19, 2017 11:28 pm

Cheney wrote:Hi,

I have a question at this point in regards to speed and method for testing various parameters.

I am just testing the mechanics and speed of the tuning method on a single thread and get about 1M positions tested in just under 20 seconds. I am OK with this for now but as I think about it, and I see this has been discussed in other posts, if I expose a few dozen parameters to tune, this could take weeks or longer, right?

What I envision is this, for example:
- I have 10 parameters that I want to tune.
- I want to use a delta for each parameter to test, let's just say delta +/- 10. For each parameter, this is 21 different values from value-10 to value+10.

That is a lot of combinations for ~20 seconds per test.

I do not know if this is called a "local search" or not, but the more parameters I want to expose the more time that is needed.

Am I seeing this wrong? Maybe this is the simplest brute force and there are better ways. Can someone help clear this idea up and give me a nudge in the right direction?

Let me see if I understand what you are saying. If we only had one parameter to tune, you could imagine computing the derivative of your loss function (the thing you are minimizing) with respect to your parameter by setting the parameter 10 points higher, then 10 points lower, and approximating the derivative like this:

(d Loss) / (d Param) ~= (Loss(Param+10) - Loss(Param-10)) / 20

If you want to do this with P parameters, you would need 2P evaluations of the loss function (i.e., 2P passes through all the data), which gets expensive quickly.

Enter automatic differentiation. You can actually compute all those derivatives in 2 or 3 times the cost of computing the loss function once, regardless of P. The method is called "reverse-mode automatic differentiation". Neural nets people call it "backpropagation". And people that want to point out how obvious it all is in retrospect call it "the chain rule".
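A minimal sketch of the reverse-mode idea follows, assuming a tiny hand-rolled computation graph. The `Var` class, `backward` and the one-sample loss are hypothetical and are not RuyTune's interface; the sigmoid here uses the natural exponential for brevity. One forward pass records the operations, and one backward pass then yields the derivative with respect to every parameter.

```python
import math

class Var:
    """A node in the computation graph: a value plus links to its inputs."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (input Var, local derivative) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))
    __rmul__ = __mul__

    def __sub__(self, other):
        return self + (-1.0) * other

    def __rsub__(self, other):
        return (-1.0) * self + other

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Var(s, ((x, s * (1.0 - s)),))

def backward(output):
    """Apply the chain rule from the output back to every input in one sweep."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # inputs are appended before the nodes that use them
    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # every node is handled after all of its users
        for parent, local in node.parents:
            parent.grad += node.grad * local

# One training sample: made-up features, game result 1.0.
weights = [Var(0.5), Var(-0.25), Var(0.1)]
features = [2, -1, 1]
result = 1.0

prediction = sigmoid(sum(w * f for w, f in zip(weights, features)))
error = result - prediction
loss = error * error
backward(loss)
print(loss.value, [w.grad for w in weights])
```

However many weights there are, the backward sweep visits each recorded operation once, which is why the total cost stays at a small multiple of one loss evaluation.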

Last year I made RuyTune available so people could do this kind of thing on their engines. Unfortunately, I don't think I managed to make it user-friendly enough. But if you are interested, I can try to help you to make use of it, or at least the automatic-differentiation piece of it. See here: https://bitbucket.org/alonamaloh/ruy_tune
