You can believe that, but...
Apart from that, I still believe that extreme outcomes of the tuning process, like those you have shown several times, are caused by the tuning algorithm (including concepts like using higher and/or variable step sizes) and not by the training data. I always got the best results when using a fixed step size of 1, as in the original Texel tuning. Higher step sizes may lead to faster convergence to "something", but that "something" is not necessarily a stronger result in terms of playing strength, even if it comes with a smaller MSE.
A tuner has a pretty clearly defined job: it must minimize or maximize a fitness function. In our case the task is to minimize the MSE. Translating that into Elo is not part of the job.
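For reference, the fitness in question is the usual Texel objective: the mean squared error between the game results and the eval scores mapped through a sigmoid. A minimal sketch in Python; `evaluate`, `k` and the data layout are illustrative placeholders, not anyone's actual code:

```python
def texel_mse(positions, results, evaluate, k=1.0):
    """Texel tuning fitness: MSE between game results (0, 0.5, 1)
    and the eval score mapped to a win probability."""
    total = 0.0
    for pos, r in zip(positions, results):
        q = evaluate(pos)                            # score in centipawns
        p = 1.0 / (1.0 + 10.0 ** (-k * q / 400.0))   # sigmoid with scaling k
        total += (r - p) ** 2
    return total / len(positions)
```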
So when your tuning algorithm stops with a step size of 1 and another instance reaches a lower MSE with a step size of 5, the latter definitely did a better job.
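To make the step-size point concrete, here is a sketch of a Texel-style local search, assuming a fitness like `texel_mse` above; `step=1` corresponds to the original scheme, and nothing in the loop cares about anything but the MSE:

```python
def tune(params, fitness, step=1, max_passes=1000):
    """Texel-style local search: nudge one parameter at a time by
    +/- step and keep the change whenever the fitness drops."""
    best = fitness(params)
    for _ in range(max_passes):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                e = fitness(params)
                if e < best:
                    best = e          # keep the improvement
                    improved = True
                    break
                params[i] -= delta    # revert, try the other direction
        if not improved:
            break                     # local minimum at this step size
    return params, best
```

Note that a larger `step` searches on a coarser grid and therefore stops in a different local minimum; the loop itself only ever compares MSE values.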
If you are not happy with the result because it does not satisfy other criteria, like better gameplay or cosmetic aspects like "I have seen such values before", that is not the tuner's problem. As I mentioned in my final conclusion, that is human, just like accepting a local minimum because you like the data better. This distorts the task and devalues the optimization.
My final conclusion was that the characteristics of the data used, and how well the evaluation function can work with them, are critical.
I was able to show some effects that stem from the data (the quiet criterion) and from the difference between a more complex and a naive evaluation function.
As long as there is no bug, the tuner simply reduces the MSE for as long as it can find a better solution. This is the desired behavior, regardless of whether the result satisfies any further purpose or criterion.