Tapered Evaluation and MSE (Texel Tuning)

Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Sven wrote: Wed Jan 20, 2021 6:45 pm
Desperado wrote: Wed Jan 20, 2021 5:45 pm
Pio wrote: Wed Jan 20, 2021 5:36 pm
Desperado wrote: Wed Jan 20, 2021 5:25 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
@Sven
It is not the tuner's goal to find the local minimum. This is merely the result of the fact that the algorithm cannot prevent it. In the course of this topic it could also be shown that it was not the algorithm that led to being stuck in a specific local minimum, but its configuration. The algorithm leaves some local minima if the step size can also deviate from one unit. A parameter of the algorithm is not the algorithm.

I can understand rejecting a vector with a minimal MSE if reasons exist in the overall context. (for example Elo regression or maybe other reasons).

The reason to ignore an MSE that exists but is achieved with other parameters is not one of them. (That was my first impression from your answer.)
I agree with Sven that something is fishy with the different step sizes. I believe that with a bigger step size the optimiser cared more about getting the averages down to very good values, and that became more important than getting the evaluation to zero. The one-unit step size meant the optimiser did not have to adjust the values so that the averages or combinations reached centipawn accuracy, and it could focus on setting everything to zero.
Sorry Pio, the step size has NOTHING to do with the mse. The value of the mse exists or it does not exist.
How the tuner scans the search space doesn't affect the value of the mse in any form.
I think you mean "the global minimum mse" (or MSE), not "the mse".

Are you able to provide an algorithm that can be practically used and that can find the global minimum MSE (and the parameter set for it, of course)?

As to step sizes: my observation simply was that larger step sizes could let the tuning jump to a different parameter set that formed a different local minimum of the MSE and (regardless of whether that second MSE was smaller than the one with step size 1 or not) led to weaker game play than the first parameter set. Of course it depends on the parameter set that I used to start tuning: if I set all values to 0 (or even to random values?!) then I would certainly need higher step sizes to reach reasonable values in a reasonable time. In reality I started with values that looked "reasonable" to a human already.

I can use different training data, or different starting values, and can be lucky with higher step sizes in that case, for sure. I just reported what I saw when I did the tuning back then with my setup, I did not intend to say that other step sizes are bad in general.
I found your report about different step sizes very interesting because, like HGM, I assumed (even though I haven't said it in this thread) that there would most likely be only one stable global minimum when optimising linear terms like piece values. The only way (without too much thinking) I can maybe explain the very different results is that the averages and relative values of the terms were more important than putting all the MG values close to zero.

The next question I asked myself was why the optimiser was trying to push the MG values to zero, and my guess was that you might not have included an equal number of wins, draws and losses in your dataset. Having too many draws would lead the optimiser to predict most positions as draws. The next problem might be that MG positions are on average further away from the end result, which reduces the correlation between position and end result. I also realised that MG positions are usually a lot more tactical in nature, which makes them harder to predict accurately. Most of these problems will be exaggerated when using the squared error function compared with the absolute error.

Using different K values for the different phases might also improve things, since I guess they (the different Ks) are very off because of tactics in the MG positions (when determining the different K values you should, however, use the squared error function). I don't even know whether the sigmoid is always such a good fit; maybe some other function fits better. Weighting positions close to the end result a lot more will probably also improve things.
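To make the objective being discussed concrete, here is a minimal C++ sketch of the Texel-style error: each position's evaluation is squashed through a sigmoid scaled by K to give an expected score, the squared difference to the game result is averaged, and K itself can be fitted by minimizing that same error (per phase, if desired, by running the fit on per-phase subsets). This is only an illustration under assumed names (Sample, expectedScore, meanSquaredError, fitK), not anyone's actual tuner.

#include <cmath>
#include <vector>

// One training position: a static evaluation in centipawns and the result
// of the game it came from (1.0 = White win, 0.5 = draw, 0.0 = White loss).
struct Sample {
    double evalCp;
    double result;
};

// Map a centipawn score to an expected game result in [0,1].
double expectedScore(double evalCp, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * evalCp / 400.0));
}

// Mean squared error of the prediction over the data set.
// (The absolute error mentioned above would use std::fabs(err) instead of err * err.)
double meanSquaredError(const std::vector<Sample>& data, double K) {
    double sum = 0.0;
    for (const Sample& s : data) {
        double err = s.result - expectedScore(s.evalCp, K);
        sum += err * err;
    }
    return data.empty() ? 0.0 : sum / data.size();
}

// Crude grid search for the K that minimizes the MSE with the current evaluation.
// A per-phase K would simply run this on the MG and EG subsets separately.
double fitK(const std::vector<Sample>& data) {
    double bestK = 1.0;
    double bestMse = meanSquaredError(data, bestK);
    for (double K = 0.1; K <= 3.0; K += 0.01) {
        double mse = meanSquaredError(data, K);
        if (mse < bestMse) { bestMse = mse; bestK = K; }
    }
    return bestK;
}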
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Pio, in your last post it seems you are confusing me with Michael (Desperado). I did not tune just piece values (except for non-productive testing purposes), I did not write about any special effects of MG values only, and my tuning process did not return parameters close to zero. Also I still trust my training data (quiet-labeled.epd).
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Sven wrote: Thu Jan 21, 2021 12:52 am Pio, in your last post it seems you are confusing me with Michael (Desperado). I did not tune just piece values (except for non-productive testing purposes), I did not write about any special effects of MG values only, and my tuning process did not return parameters close to zero. Also I still trust my training data (quiet-labeled.epd).
Sorry, I must have been mistaken. There have been so many posts that I have got a little bit dizzy 🤣.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Desperado wrote: Wed Jan 20, 2021 10:56 pm
1.
There is only one measure, "the MSE", associated with a parameter vector.
The MSE can be computed for any vector that belongs to the search space.
Nothing else affects this value, especially not the search strategy (or step-size parameters).
Thus, pairs of vectors and MSE values exist.

2.
The optimizer searches for the lowest MSE it can find.
This is purely a search task, and this core task is identical for global and local optimizers.

3.
The fact that the results of local and global optimizers differ does not change the basic task: to search for the minimum.

4a. I have noticed your reason for rejecting a better MSE: the value is not useful for you (Elo regression).
That is OK for me. However, this value definitely remains the best value that the optimizer could provide. The optimizer thus fulfils its task perfectly.

It should search for and report the lowest MSE it is able to find. It should not find the most useful vector.
You can be happy if the two coincide. Otherwise you would need to define what "most useful" means, and then it might be able to search for that.
I am 100% sure that you will agree that a minimum MSE != most useful output in general.

4b.
My general remark on this is (was) that it would be fatal to change the algorithm just because the value was not useful.
An update is only mandatory if the optimizer fails to find a solution which is known to be the lowest MSE and which it should be able to find.
Another reason would be to make it more effective or efficient.

5.
The optimizer can only evaluate what it receives as input. So you can start with the data, so that the best result equals a useful result.
The other possibility would be to keep the data and work on parts of the engine until the data works with it.

In my case, my optimizer worked from the beginning with quiet-labeled.epd. Without much reflection I made a change to the data.
Because the results began to look very suspicious, I kept an open mind and did not rule out a bug or anything else.
In the course of the thread, while working on parts of the tuner, I introduced bugs which I was able to identify and fix.

Many people whitewash and ignore facts.

1. Look into this thread: everybody has the complete information to reproduce the results 1:1. A lot of people still ignore that.
A very simple way to prove me wrong would be to report an MSE lower than mine with a more useful vector (not including the divergence of mg/eg).
Another way would be to repeat the experiment. But there is silence and arguments like "I know what I do".
Someone could easily plug my result vector into their framework and report the MSE, a two-minute job.
But people still tell me I do something wrong, without evidence.
2. At first I got the impression, like HG, that you used the optimizer in a, say, humanized way. I know the real reason now.
But there are some professionals here too who behave similarly in a slightly different context. The level of "I like it more, so it is right"...
3. A good point (and I personally would not direct it at HG) is that talk of "should be, if, might, could, bla bla bla" does not help.
Having ideas and giving them as input to a discussion is a great thing. But it should be a mixture of experience and theoretical know-how.
At least it should take the findings so far into account.

Ferdy gave me by far the most useful information; I was able to verify my complete code base, so to say.
But HG guided me very well at some point in that thread, as you did when you gave me the idea of cpw.

I really appreciate your analytical skills, just as I appreciate HG's knowledge.
I dislike it when people behave unprofessionally, because that does not lead to anything and is very amateurish.

So, if there is an issue with my 5 points at the beginning, just let me know what the issue is.
But please do not ask me things you already know, like whether I am able to provide a global optimizer that practically does this and that.
Let's simply talk about the differences or misunderstandings. But keep it factual, please.
@Michael,

this thread (opened by yourself) is about Texel tuning, not about other possible tuning methods. Texel tuning is not designed to find a global minimum MSE; in general you can only expect to find a local minimum MSE. The algorithm stops if modifying the current parameter set does not return a lower MSE, regardless of whether a lower one exists "somewhere else" in the search space. Read its description again.

You, like HGM, are still blaming me for something I did not do: I did not reject something I "did not like" but I did what every single chess programmer would do after tuning his eval parameters (how often do I have to repeat that?):
0) Initial state was V0 with playing strength E0.
1) I tuned starting at V0 with the standard step size +/-1 and found that this improved the engine's playing strength significantly (by ~100 Elo, so E1 = E0+100).
2) I tuned with higher step sizes (starting from V0 again!), got different parameters of which I do not remember whether their MSE was higher or lower than the one from 1) - but that does not matter here!! - and found that these parameters gave significantly less improvement over V0 than with 1), so E2 < E0+100.

Therefore *for my engine* it was best to use the results from 1). For me using different step sizes does not really mean to use a "different algorithm", it is more like a configuration parameter, like K, or like the definition of the set of parameters I include into (or exclude from) the tuning process. The algorithm itself stays the same but may take a different path.

I really can't see where I did something "unprofessional", or behaved like that. "Unprofessional" would be to run a tuning process and automatically use its results without reflecting them.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Sven wrote: Thu Jan 21, 2021 1:22 am
@Michael,

this thread (opened by yourself) is about Texel tuning, not about other possible tuning methods. Texel tuning is not designed to find a global minimum MSE; in general you can only expect to find a local minimum MSE. The algorithm stops if modifying the current parameter set does not return a lower MSE, regardless of whether a lower one exists "somewhere else" in the search space. Read its description again.

You, like HGM, are still blaming me for something I did not do: I did not reject something I "did not like" but I did what every single chess programmer would do after tuning his eval parameters (how often do I have to repeat that?):
0) Initial state was V0 with playing strength E0.
1) I tuned starting at V0 with the standard step size +/-1 and found that this improved the engine's playing strength significantly (by ~100 Elo, so E1 = E0+100).
2) I tuned with higher step sizes (starting from V0 again!), got different parameters of which I do not remember whether their MSE was higher or lower than the one from 1) - but that does not matter here!! - and found that these parameters gave significantly less improvement over V0 than with 1), so E2 < E0+100.

Therefore *for my engine* it was best to use the results from 1). For me using different step sizes does not really mean to use a "different algorithm", it is more like a configuration parameter, like K, or like the definition of the set of parameters I include into (or exclude from) the tuning process. The algorithm itself stays the same but may take a different path.

I really can't see where I did something "unprofessional", or behaved like that. "Unprofessional" would be to run a tuning process and automatically use its results without reflecting them.
Sorry Sven,

it was not my intention to attribute those things to you. I mixed some things up, and I understand that...

1. It is OK to reject a solution because of the global context (like an Elo regression). I simply misunderstood you at the beginning; that is my point.

2. Well, there might be more local minima, so why do you stop at the first one you find? Maybe that is the source of the confusion, I am not sure.
All I was able to read is that you do it because you do not change the basic settings of the algorithm given by its original author.
A better answer I could imagine would be something like: this might end in an endless process / it takes too much time / first I try to validate by
gameplay and continue the process so that the ratio of time invested to usability improves / whatever...

...The algorithm stops if modifying the current parameter set does not return a lower MSE...
As you said yourself, the parameter settings of the optimizer do not change the algorithm, but they might provide a local minimum that is closer
to the global one. "...stops if modifying..." does not say anything about what the modification includes, like using a different step size for example.
So the end of the modification is not defined very accurately, and neither is the stop condition. And I do not think that we can make a Hoffmann's Tuning
out of it just by saying "I stop at the second local minimum". That would be a little bit funny. Whatever the stop condition is, at the first, second or
third local minimum, IMO it is still Texel tuning.


To be clear, I really would like to know more people who have your skills and professional attitude. No doubt.
Regarding my second point, it is unfortunate that it got mixed up with other remarks in the current thread which are confusing to me.

By the way, I know what the thread is about.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

The meaning of the first, second or any other count of local minima is ultimately related to the internal iterations the algorithm runs.
The iterations include the modification updates (e.g. the step-size logic). Of course any process stops at some point for the first time,
if you look at it from outside the box. Hopefully I can avoid another discussion with that addition :lol:
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

There seems to be some disagreement about what 'Texel tuning' means. As far as I am concerned it is minimizing the MSE in the prediction of the result of the game from which the test positions were taken. That is, finding the global minimum. If you allow yourself to get stuck in a local minimum, you can at best call it 'a failed attempt at Texel tuning'. Of course it cannot be excluded that using the results from such a failed attempt can increase the Elo of an engine. After all, there is no limit to how much the evaluation you had before the attempt sucked.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

hgm wrote: Thu Jan 21, 2021 8:57 am There seems to be some disagreement about what 'Texel tuning' means. As far as I am concerned it is minimizing the MSE in the prediction of the result of the game from which the test positions were taken. That is, finding the global minimum. If you allow yourself to get stuck in a local minimum, you can at best call it 'a failed attempt at Texel tuning'. Of course it cannot be excluded that using the results from such a failed attempt can increase the Elo of an engine. After all, there is no limit to how much the evaluation you had before the attempt sucked.
+1 😀
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

@Michael: The automated Texel tuning does not know about any possible other local minima when it stops at one. "Modifying the parameter set" is what the algorithm does all the time: it increments/decrements eval parameters by the step size and sees whether that results in a lower MSE. It stops if an iteration (= epoch) fails to improve the best MSE that was found so far. Everything else that I do, like changing my tuning configuration (e.g. step size, K, ...) is a change of the program source code and not part of the automated process.

Sorry if I had not been clear enough about that.

I thought (and still think) that my Texel tuning worked exactly as it is described by its author, except for using quiet positions only and therefore calling static eval instead of QS. I did not expect that people who claim to have followed the same method would have so much trouble in understanding what I have done.
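For readers who are not familiar with the method, here is a minimal sketch of the loop described above, with the step size as an explicit configuration parameter: each epoch tries to increment and decrement every parameter by the step and keeps a change only if it lowers the MSE, and the search stops when an epoch fails to improve the best MSE found so far. The computeMse callback stands in for the engine-specific error over the training set; this is an illustration of the published description, not Sven's code.

#include <cstddef>
#include <functional>
#include <vector>

// Texel-style coordinate descent. The step size is a configuration
// parameter of the search, not a different algorithm.
std::vector<int> localTune(std::vector<int> params, int step,
                           const std::function<double(const std::vector<int>&)>& computeMse) {
    double bestMse = computeMse(params);
    bool improved = true;
    while (improved) {                       // one pass over all parameters = one epoch
        improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            const int deltas[2] = { step, -step };
            for (int delta : deltas) {
                params[i] += delta;
                double mse = computeMse(params);
                if (mse < bestMse) {         // keep the change only if the MSE drops
                    bestMse = mse;
                    improved = true;
                } else {
                    params[i] -= delta;      // otherwise revert it
                }
            }
        }
    }                                        // stop: the last epoch brought no improvement
    return params;                           // a local minimum for this step size and start vector
}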
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Pio wrote: Thu Jan 21, 2021 9:09 am
hgm wrote: Thu Jan 21, 2021 8:57 am There seems to be some disagreement about what 'Texel tuning' means. As far as I am concerned it is minimizing the MSE in the prediction of the result of the game from which the test positions were taken. That is, finding the global minimum. If you allow yourself to get stuck in a local minimum, you can at best call it 'a failed attempt at Texel tuning'. Of course it cannot be excluded that using the results from such a failed attempt can increase the Elo of an engine. After all, there is no limit to how much the evaluation you had before the attempt sucked.
+1 😀
@HGM, @Pio: please read the original description and stop spreading false information. Texel tuning does *not* try to find the global minimum (it can't).
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)