Tapered Evaluation and MSE (Texel Tuning)


Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Desperado wrote: Wed Jan 20, 2021 5:25 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
@Sven
It is not the tuner's goal to find a local minimum; that is merely the result of the algorithm not being able to avoid it. In the course of this topic it could also be shown that it was not the algorithm that got stuck in a specific local minimum, but its configuration. The algorithm can escape some local minima if the step size is allowed to deviate from one unit. A parameter of the algorithm is not the algorithm.

I can understand rejecting a vector with a minimal MSE if there are reasons for it in the overall context (for example an Elo regression, or maybe other reasons).

Ignoring an MSE that exists but is achieved with other parameters is not one of them (that was my first impression from your answer).
I agree with Sven that something is fishy with the different step sizes. I believe that with a bigger step size the optimiser puts more weight on getting the averages down to very good values, and that became more important than getting the evaluation to zero. With the one-unit step size the optimiser did not have to adjust the values so that the averages or combinations reach centipawn accuracy, and could focus on setting everything to zero.
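
For readers following along, here is a minimal sketch of the Texel-style objective being argued about, assuming a linear evaluation and game results stored as 0, 0.5 or 1. All names below are hypothetical illustrations, not anyone's actual tuner.

Code:
#include <cmath>
#include <cstddef>
#include <vector>

// One training position: feature counts (e.g. piece counts for material terms)
// and the game result from White's point of view (0, 0.5 or 1).
struct Position {
    std::vector<double> features;
    double result;
};

// Linear evaluation in centipawns: dot product of parameters and features.
double evaluate(const std::vector<double>& params, const Position& pos) {
    double score = 0.0;
    for (std::size_t i = 0; i < params.size(); ++i)
        score += params[i] * pos.features[i];
    return score;
}

// Logistic mapping from a centipawn score to an expected result in [0,1].
// K controls how many centipawns span the rising part of the curve.
double sigmoid(double scoreCp, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * scoreCp / 400.0));
}

// The MSE is a pure function of the parameter vector and the data set;
// no step size or search strategy appears anywhere in it.
double mse(const std::vector<double>& params,
           const std::vector<Position>& data, double K) {
    double sum = 0.0;
    for (const Position& pos : data) {
        double err = pos.result - sigmoid(evaluate(params, pos), K);
        sum += err * err;
    }
    return sum / data.size();
}
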
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Pio wrote: Wed Jan 20, 2021 5:29 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
I am not so sure about that. I think using a different K, an absolute error, and weighting positions close to the end result will contribute to much more reasonable values. I don't think anything is wrong with their optimisers.

We will see 😁
I guess it was a general remark. Any optimizer has the goal of finding the global optimum, but some algorithms are limited to a local optimum because their mechanics do not allow them to leave it. Furthermore, this does not mean that a found local optimum is not also the global optimum.
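
Regarding Pio's point about K: in the usual Texel-tuning recipe the scaling constant is not tuned together with the evaluation parameters but fitted once beforehand, by minimizing the MSE with the evaluation held fixed. Below is a hedged sketch of that step, plus an absolute-error variant of the objective; it reuses the hypothetical Position/evaluate/sigmoid/mse from the snippet above, and the scan range and step are illustrative only.

Code:
// Fit the sigmoid scale K once, with the evaluation parameters held fixed,
// using a simple grid scan over a plausible range (hypothetical range/step).
double fitK(const std::vector<double>& params,
            const std::vector<Position>& data) {
    double bestK = 1.0, bestE = mse(params, data, bestK);
    for (double K = 0.5; K <= 2.0; K += 0.01) {
        double e = mse(params, data, K);
        if (e < bestE) { bestE = e; bestK = K; }
    }
    return bestK;
}

// Mean absolute error variant of the objective, one of the alternatives
// Pio hints at; whether it behaves better here is an open question.
double mae(const std::vector<double>& params,
           const std::vector<Position>& data, double K) {
    double sum = 0.0;
    for (const Position& pos : data)
        sum += std::fabs(pos.result - sigmoid(evaluate(params, pos), K));
    return sum / data.size();
}
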
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Pio wrote: Wed Jan 20, 2021 5:36 pm
Desperado wrote: Wed Jan 20, 2021 5:25 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
@Sven
It is not the tuner's goal to find a local minimum; that is merely the result of the algorithm not being able to avoid it. In the course of this topic it could also be shown that it was not the algorithm that got stuck in a specific local minimum, but its configuration. The algorithm can escape some local minima if the step size is allowed to deviate from one unit. A parameter of the algorithm is not the algorithm.

I can understand rejecting a vector with a minimal MSE if there are reasons for it in the overall context (for example an Elo regression, or maybe other reasons).

Ignoring an MSE that exists but is achieved with other parameters is not one of them (that was my first impression from your answer).
I agree with Sven that something is fishy with the different step sizes. I believe that with a bigger step size the optimiser puts more weight on getting the averages down to very good values, and that became more important than getting the evaluation to zero. With the one-unit step size the optimiser did not have to adjust the values so that the averages or combinations reach centipawn accuracy, and could focus on setting everything to zero.
Sorry Pio, the step size has NOTHING to do with the mse. The value of the mse exists or it does not exist.
How the tuner scans the search space doesn't affect the value of the mse in any form.
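
To make the point concrete: with an objective like the hypothetical mse() sketched earlier, the step size only decides which candidate vectors get evaluated during the search; the value returned for a given vector is the same however the search arrived at it.

Code:
// Fragment; assumes 'data' and 'K' from the surrounding context,
// and the mse() sketch above. The piece values are hypothetical.
std::vector<double> candidate = {100, 320, 330, 500, 1000};
double e1 = mse(candidate, data, K);   // proposed by a tuner with step size 1
double e5 = mse(candidate, data, K);   // proposed by a tuner with step size 5
// e1 == e5: the objective has no notion of a step size.
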
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
According to the original description of the method you are wrong: Texel tuning does nothing other than search for a local minimum.

As to the rest of your statement, again I get the impression that you never did automated eval parameter tuning yourself.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Desperado wrote: Wed Jan 20, 2021 5:45 pm
Pio wrote: Wed Jan 20, 2021 5:36 pm
Desperado wrote: Wed Jan 20, 2021 5:25 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
@Sven
It is not the tuner's goal to find a local minimum; that is merely the result of the algorithm not being able to avoid it. In the course of this topic it could also be shown that it was not the algorithm that got stuck in a specific local minimum, but its configuration. The algorithm can escape some local minima if the step size is allowed to deviate from one unit. A parameter of the algorithm is not the algorithm.

I can understand rejecting a vector with a minimal MSE if there are reasons for it in the overall context (for example an Elo regression, or maybe other reasons).

Ignoring an MSE that exists but is achieved with other parameters is not one of them (that was my first impression from your answer).
I agree with Sven that something is fishy with the different step sizes. I believe that with a bigger step size the optimiser puts more weight on getting the averages down to very good values, and that became more important than getting the evaluation to zero. With the one-unit step size the optimiser did not have to adjust the values so that the averages or combinations reach centipawn accuracy, and could focus on setting everything to zero.
Sorry Pio, the step size has NOTHING to do with the mse. The value of the mse exists or it does not exist.
How the tuner scans the search space doesn't affect the value of the mse in any form.
I think you mean "the global minimum mse" (or MSE), not "the mse".

Are you able to provide an algorithm that can be practically used and that can find the global minimum MSE (and the parameter set for it, of course)?

As to step sizes: my observation simply was that larger step sizes could let the tuning jump to a different parameter set that formed a different local minimum of the MSE and (regardless of whether that second MSE was smaller than the one with step size 1 or not) led to weaker game play than the first parameter set. Of course it depends on the parameter set that I used to start tuning: if I set all values to 0 (or even to random values?!) then I would certainly need higher step sizes to reach reasonable values in a reasonable time. In reality I started with values that looked "reasonable" to a human already.

I can use different training data, or different starting values, and can be lucky with higher step sizes in that case, for sure. I just reported what I saw when I did the tuning back then with my setup, I did not intend to say that other step sizes are bad in general.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
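
For context, the local-search loop from the original Texel tuning description works roughly as sketched below: nudge each parameter by ±step and keep the change only when the MSE drops. The step size is exactly the knob being discussed here; the names are again hypothetical and mse() is the one from the earlier sketch.

Code:
// Builds on the Position/mse sketch above.
// Greedy coordinate descent over the parameter vector.
// Terminates when a full pass over all parameters yields no improvement,
// i.e. at a local minimum with respect to +/- step moves.
std::vector<double> localSearch(std::vector<double> params,
                                const std::vector<Position>& data,
                                double K, double step) {
    double bestE = mse(params, data, K);
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            for (double delta : {+step, -step}) {
                params[i] += delta;
                double e = mse(params, data, K);
                if (e < bestE) {          // keep the nudge
                    bestE = e;
                    improved = true;
                } else {
                    params[i] -= delta;   // undo it
                }
            }
        }
    }
    return params;
}

With step = 1 this is the classic version; a larger step can hop over small bumps in the error surface but can also step over narrow minima, which is consistent with both observations reported in this thread.
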
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

It is rather easy to see why local minima cannot be a problem (for linear evaluation terms), provided the position evaluations are more or less homogeneously spaced on the scale of the width of the rising part of the sigmoid. Each position would result in an eval in which the piece value is an additive term, so varying the piece value will just move the point along the curve, and the squared error for that point would be the sigmoid squared (for losses), or 1 minus the sigmoid, squared, for the wins (which is actually just the 180-degree rotated version of that curve). The function describing the total squared error of all positions would be the sum of many such curves, each shifted horizontally by the evaluation belonging to that position. This will give a quite smooth curve if these evaluations are relatively close. Only when the cP evals cluster into far-apart groups would you get a 'bumpy' behavior of the total MSE. E.g. if the sigmoid rises from near 0 to near 1 over the course of 200 cP, the SE for a drawn position would be a horizontal line at 0.25, with a dip to 0 about 200 cP wide in the middle. If you would tune on 5 positions spaced 500 cP apart, the MSE would also be a horizontal line at 0.25, but with 5 dips to 0.20 in it, spaced by 500 cP and 200 cP wide. The central dip would be marginally deeper; the other 4 would be local optima. Or in other words, you can only predict one of the draws right; all the others are so far off in score that they would always be maximally wrong, no matter what you do.

If you would have 200 positions, one every 10 cP between -1000 cP and +1000 cP, all the dips would coalesce into one wide minimum, though, with a very slight corrugation of period 10 cP. The maximum slope of this corrugation would not be able to overcome the average slope of the large dip (which it would have to do to create a local minimum) unless you are already very close to the minimum itself (where the graph of the large dip runs nearly horizontal). So you might have a few local minima very close to the global minimum, which would result in an MSE even closer to the optimal MSE. You wouldn't care much which of those you ended up in.
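
This picture can be checked numerically: sum the per-position squared-error curves described above as a function of a common evaluation offset, once for 5 drawn positions spaced 500 cP apart and once for draws every 10 cP, and look for bumps. A rough, self-contained sketch follows; the sigmoid scale is chosen so the rising part is roughly 200 cP wide, as in the example above, and all figures are illustrative.

Code:
#include <cmath>
#include <cstdio>
#include <vector>

// Squared error of a drawn position (target 0.5) whose total evaluation
// is evalCp centipawns, under a logistic result model whose rising part
// is roughly 200 cP wide.
double drawSE(double evalCp) {
    double p = 1.0 / (1.0 + std::pow(10.0, -evalCp / 100.0));
    double err = 0.5 - p;
    return err * err;
}

int main() {
    // Case 1: 5 drawn positions whose evaluations are 500 cP apart.
    // Case 2: 201 drawn positions, one every 10 cP from -1000 to +1000 cP.
    std::vector<double> sparse, dense;
    for (int i = -2; i <= 2; ++i)     sparse.push_back(500.0 * i);
    for (int i = -100; i <= 100; ++i) dense.push_back(10.0 * i);

    // Scan a common offset added to every evaluation (playing the role of a
    // piece value being tuned) and print both MSE curves: the sparse case
    // shows separate dips (local minima), the dense one a single wide,
    // almost smooth minimum.
    for (double offset = -1500.0; offset <= 1500.0; offset += 25.0) {
        double mseSparse = 0.0, mseDense = 0.0;
        for (double e : sparse) mseSparse += drawSE(e + offset);
        for (double e : dense)  mseDense  += drawSE(e + offset);
        std::printf("%8.1f  %.4f  %.4f\n",
                    offset, mseSparse / sparse.size(), mseDense / dense.size());
    }
    return 0;
}
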
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

"if", "would", "might", did I miss some?

You have a theory about it. Go ahead and implement it, then test it and make it work under real conditions. Then come back. Until then, please accept my apologies for ignoring what you say about this topic.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Apparently you lack a proper understanding of the word 'would'.

But suit yourself. As long as others are forewarned that they'd better ignore you...
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Sven wrote: Wed Jan 20, 2021 6:45 pm
Desperado wrote: Wed Jan 20, 2021 5:45 pm
Pio wrote: Wed Jan 20, 2021 5:36 pm
Desperado wrote: Wed Jan 20, 2021 5:25 pm
hgm wrote: Wed Jan 20, 2021 5:04 pm
Sven wrote: Tue Jan 19, 2021 11:52 pm It is unlikely to find "the minimum MSE" with Texel tuning. The algorithm stops at a local minimum of the MSE, and there may be several of that kind. It seems you ignore this fact.
A good optimizer should not get stuck in local optima. And note that for terms in which the evaluation is linear (like most terms in hand-crafted evaluations, and certainly the piece values) there shouldn't be any local optima at all. I don't think the sigmoid correction can change that, as it is equivalent to fitting the centi-Pawn scores with a diminishing weight for the scores far from zero.
@Sven
It is not the tuner's goal to find a local minimum; that is merely the result of the algorithm not being able to avoid it. In the course of this topic it could also be shown that it was not the algorithm that got stuck in a specific local minimum, but its configuration. The algorithm can escape some local minima if the step size is allowed to deviate from one unit. A parameter of the algorithm is not the algorithm.

I can understand rejecting a vector with a minimal MSE if there are reasons for it in the overall context (for example an Elo regression, or maybe other reasons).

Ignoring an MSE that exists but is achieved with other parameters is not one of them (that was my first impression from your answer).
I agree with Sven that something is fishy with the different step sizes. I believe that with a bigger step size the optimiser puts more weight on getting the averages down to very good values, and that became more important than getting the evaluation to zero. With the one-unit step size the optimiser did not have to adjust the values so that the averages or combinations reach centipawn accuracy, and could focus on setting everything to zero.
Sorry Pio, the step size has NOTHING to do with the mse. The value of the mse exists or it does not exist.
How the tuner scans the search space doesn't affect the value of the mse in any form.
I think you mean "the global minimum mse" (or MSE), not "the mse".

Are you able to provide an algorithm that can be practically used and that can find the global minimum MSE (and the parameter set for it, of course)?

As to step sizes: my observation simply was that larger step sizes could let the tuning jump to a different parameter set that formed a different local minimum of the MSE and (regardless of whether that second MSE was smaller than the one with step size 1 or not) led to weaker game play than the first parameter set. Of course it depends on the parameter set that I used to start tuning: if I set all values to 0 (or even to random values?!) then I would certainly need higher step sizes to reach reasonable values in a reasonable time. In reality I started with values that looked "reasonable" to a human already.

I can use different training data, or different starting values, and can be lucky with higher step sizes in that case, for sure. I just reported what I saw when I did the tuning back then with my setup, I did not intend to say that other step sizes are bad in general.
1.
There is only a measure, "the MSE", associated with a parameter vector.
The MSE can be computed for any vector that belongs to the search space.
Nothing else affects this value, especially not the search strategy (or step-size parameters).
Thus, pairs of vectors and MSEs exist.

2.
The optimizer searches for the lowest MSE it can find.
This is purely a search task, and this core task is identical for a global/local optimizer.

3.
The fact that the results of local and global optimizers differ does not change the basic task of searching for a minimum.

4a. I have noticed your reason for rejecting a better MSE: the value does not represent a useful value for you (Elo regression).
That is OK with me. However, this value definitely remains the best value the optimizer could provide. The optimizer thus fulfils its task perfectly.

It should search for and report the lowest MSE it is able to find; it is not its job to find the most useful vector.
You can be happy if the two coincide. Otherwise you would need to define what "most useful" means, and then it might be able to search for that.
I am 100% sure you will agree that a minimum MSE != most useful output in general.

4b.
My general remark on this is (was) that it would be fatal to change the algorithm just because the value was not useful.
Only if the optimizer fails to find a solution that is known to have the lowest MSE, and it should be able to find it, is an update mandatory.
Another reason would be to make it more effective or efficient.

5.
The optimizer can only evaluate what it receives as input. So you can start with the data, so that the best result equals a useful result.
The other possibility would be to keep the data and work on parts of the engine until the data works with it.

In my case, my optimizer worked from the beginning with quiet-labeled.epd. Without much reflection I then changed the data.
Because the results began to look very suspicious, I kept an open mind and did not rule out a bug or anything else.
In the course of the thread, while working on parts of the tuner, I introduced bugs which I was able to identify and fix.

Many people whitewash and ignore facts.

1. Look into this thread: everybody has the complete information to reproduce the results 1:1. A lot of people still ignore that.
A very simple way to prove me wrong would be to report an MSE < myMSE with a more useful vector (not including the divergence of mg/eg).
Another way would be to repeat the experiment. But there is silence, and arguments like "I know what I do".
Someone could easily plug my result vector into their framework and report the MSE, a two-minute job.
But people still tell me I am doing something wrong, without evidence.
2. At first I got the impression, like HG, that you used the optimizer in a, say, humanized way. I know the real reason now.
But there are some professionals here too who behave similarly in a slightly different context, at the level of "I like it more, so it is right"...
3. A good point, although I personally would not direct it at HG, is that talk of "should be, if, might, could, bla bla bla" does not help.
Having ideas and giving them as input to a discussion is a great thing, but it should be a mixture of experience and theoretical know-how.
At the very least it should take the findings so far into account.

Ferdy gave me by far the most useful information; I was able to verify my complete code base, so to speak.
But HG also guided me very well at some point in that thread, as you did when you gave me the idea of cpw.

I really appreciate your analytical skills, just as I appreciate HG's knowledge.
I dislike it when people behave unprofessionally, because that does not lead anywhere and is very amateurish.

So, if there is an issue with my 5 points at the beginning, just let me know what the issue is.
But please do not ask me things you already know, like whether I am able to provide a global optimizer that practically does this and that.
Let's simply talk about the differences or misunderstandings. But please keep it factual.