Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

@Ferdy. Did you see that post already?
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Desperado wrote: Mon Jan 18, 2021 11:14 am
Are you referring to my last answer now? I tried to shed some light on the data.
Sort of. You gave (fictitious) numbers for the piece values. But no indication of the MSE. Has that spectacularly dropped from the 0.13, for the full evaluation? To, say, around 0.06?
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

hgm wrote: Mon Jan 18, 2021 11:27 am
Desperado wrote: Mon Jan 18, 2021 11:14 am
Are you referring to my last answer now? I tried to shed some light on the data.
Sort of. You gave (fictitious) numbers for the piece values. But no indication of the MSE. Has that spectacularly dropped from the 0.13, for the full evaluation? To, say, around 0.06?
Yes, it did. I should be able to reproduce my processing steps, but that needs a little bit of time.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Desperado wrote: Mon Jan 18, 2021 11:32 am
hgm wrote: Mon Jan 18, 2021 11:27 am
Desperado wrote: Mon Jan 18, 2021 11:14 am
Are you referring to my last answer now? I tried to shed some light on the data.
Sort of. You gave (fictitious) numbers for the piece values. But no indication of the MSE. Has that spectacularly dropped from the 0.13, for the full evaluation? To, say, around 0.06?
Yes, it did. I should be able to reproduce my processing steps, but that needs a little bit of time.
Algorithm: cpw-algorithm
Stepsize: 5
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: selection.epd (private,subset of ccrl_3200_texel.epd) 1904154 positions
batchsize: 50K
Data: no modification

Code: Select all

        P    N    B    R     Q
MG:    40  340  340  360  1070
EG:    95  320  335  550   875    best: 0.088626  epoch: 32
Algorithm: cpw-algorithm
Stepsize: 5
evaltype: qs() tapered - full eval
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: selection.epd (private,subset of ccrl_3200_texel.epd) 1904154 positions
batchsize: 50K
Data: no modification

Code: Select all

        P    N    B    R     Q
MG:    60  350  355  410  1205
EG:    65  310  315  545   845    best: 0.084390  epoch: 48
Well, I am not sure about the exact settings I used yesterday, or whether it was exactly the same selection.
But you can clearly see a drop in the MSE. You can also see that the divergence of the pawn value remains for material only
and disappears when using the full evaluation. So my interpretation stays the same.
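For reference, the MSE quoted throughout this thread is presumably the standard Texel objective: the mean squared difference between the game result and a sigmoid of the quiescence score, scaled by the K from the settings above (K = 1.0). A minimal sketch, with function and variable names of my own choosing:

Code: Select all

import math

def sigmoid(score_cp, k=1.0):
    # Map a centipawn score to an expected result in [0, 1].
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def mse(samples, k=1.0):
    # samples: iterable of (qs_score_cp, result) with result in {0.0, 0.5, 1.0}.
    total = 0.0
    n = 0
    for score, result in samples:
        err = result - sigmoid(score, k)
        total += err * err
        n += 1
    return total / n

# Example: three positions with their quiescence scores and game results.
print(mse([(300, 1.0), (0, 0.5), (-150, 0.0)]))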

The criterion for a position to be included in selection.epd was as follows: loop over all positions of the main file,
compute a PV of length 3, and check whether the material phase changed (I use a phase contribution for pawns too, for that reason).
If phase(root) == phase(leaf) of the PV, the position entered the selection.

The intended effect is to get some kind of quiet positions, let's say stable positions where short tactics are excluded.
I guess I was able to achieve that, because the material-only evaluation already gets reasonable results.
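A sketch of that filter, as I read the description, using python-chess and a UCI engine to get the 3-ply PV; the phase weights (including the non-zero pawn weight mentioned above) and the file and engine names are illustrative assumptions, not the poster's actual values:

Code: Select all

import chess
import chess.engine

# Illustrative phase weights; note the non-zero pawn contribution.
PHASE = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
         chess.ROOK: 5, chess.QUEEN: 10, chess.KING: 0}

def material_phase(board):
    return sum(PHASE[p.piece_type] for p in board.piece_map().values())

def is_stable(board, engine):
    # Keep the position only if a 3-ply PV does not change the material phase.
    info = engine.analyse(board, chess.engine.Limit(depth=3))
    leaf = board.copy()
    for move in info.get("pv", [])[:3]:
        leaf.push(move)
    return material_phase(board) == material_phase(leaf)

# Usage sketch (engine path and file names are placeholders):
# engine = chess.engine.SimpleEngine.popen_uci("./myengine")
# with open("ccrl_3200_texel.epd") as src, open("selection.epd", "w") as dst:
#     for line in src:
#         board = chess.Board()
#         board.set_epd(line)
#         if is_stable(board, engine):
#             dst.write(line)
# engine.quit()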

The full evaluation underlines my interpretation that a lot of compensation is introduced by the other parameters.
The predicted value comes closer to the result, while the pawn value no longer diverges.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

In the end, I found out what was obvious from the beginning.

The quality/content of the evaluation function, in combination with the quality/selection of the data, strongly influences the result of the tuning.

What is not so obvious is how to derive such data from general sources so that it fits one's own evaluation (parameters).

The immensely high quality of engines in the 3200 range does not necessarily map onto an engine at, say, the 2300 level.
The evaluation parameters that compensate for, e.g., the value of a pawn are projected onto the "wrong" parameters in order to minimize the MSE.
It is difficult to map parameters between two different evaluation functions. If there is a big difference, like material-only versus a complex evaluation,
it easily leads to useless results. In that case I would not say that the data is poor, but rather that the target evaluation is not able to handle it (it works both ways).

Another impression I got in the course of the thread is that people still like to see results that fit their (human) perspective. That includes me too. But in most cases there are facts and indications, and it is most important to stay open to them in order to make progress.

The next topic for me, based on these findings, could be how to generate data for one's own engine.
Here I would like to focus on how to select data that maps well onto one's own evaluation function,
i.e. data whose properties project well onto the existing evaluation and do not address too many
features that are not implemented.

My final conclusion on the thread's topic is that the data from high-rated games includes properties that cannot be mapped (well) onto a material-only
evaluation. Additionally, using the raw data without any pre-selection will provide unstable / non-quiet positions, which also strongly influence the tuning result.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

In any case there now is a significant reduction of the MSE compared to that of a constant evaluation. And, as expected, material is by far the most important term; including the other terms only gives a very small improvement of the MSE.

That the optimum piece values change just means that they were not orthogonal to the positional terms, and that the latter can partly mimic their function, and sometimes do.

That the optimization can get away with piece values that are 'somewhat original' could be an indication that the test set doesn't put really tight restrictions on those. E.g. because there aren't enough Q-vs-2R positions in the set, it can afford a suspect Q/R ratio to improve the prediction. The prediction would then be good on the positions in the test set, but it would suck for Q-vs-2R imbalances in general, with the consequence that an engine playing with this evaluation would engage in many poor Q-for-2R trades against engines that do know better.
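Whether the set constrains a given ratio at all can be checked directly by counting how often the corresponding imbalance occurs, e.g. queen (and no rook) versus two rooks (and no queen). A rough sketch using python-chess; the file name is a placeholder:

Code: Select all

import chess

def has_q_vs_2r(board):
    # True if one side has a queen and no rooks against two rooks and no queen.
    for us, them in ((chess.WHITE, chess.BLACK), (chess.BLACK, chess.WHITE)):
        if (len(board.pieces(chess.QUEEN, us)) == 1
                and len(board.pieces(chess.ROOK, us)) == 0
                and len(board.pieces(chess.QUEEN, them)) == 0
                and len(board.pieces(chess.ROOK, them)) == 2):
            return True
    return False

count = total = 0
with open("selection.epd") as f:
    for line in f:
        board = chess.Board()
        board.set_epd(line)
        total += 1
        count += has_q_vs_2r(board)
print(count, "of", total, "positions have a Q-vs-2R imbalance")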

The proof as always would be in the Elo of the engine using this evaluation. It is not obvious to me that tuning on a set obtained from games that start from balanced positions would lead to a good evaluation; there is not enough information in such a test set on how really bad features are punished in terms of score. Because the playing engines would be able to avoid them for too long a time.

One would probably get better evaluations (in terms of Elo for the engines using them, not in terms of MSE) by testing on games of lower quality. The quicker development of large imbalances would probably be more important than whether an imbalance, once it occurs, would be perfectly mapped on its game-theoretical result (which would average out over many games). Quality of play should not be too low, however; a random mover certainly wouldn't do.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

hgm wrote: Mon Jan 18, 2021 1:51 pm
...

One would probably get better evaluations (in terms of Elo for the engines using them, not in terms of MSE) by testing on games of lower quality. The quicker development of large imbalances would probably be more important than whether an imbalance, once it occurs, would be perfectly mapped on its game-theoretical result (which would average out over many games). Quality of play should not be too low, however; a random mover certainly wouldn't do.
My idea is to play handicap matches: self-play games between clones, not between different versions. The handicap is time- (or depth-, or node-) based.
My hope is that the games will provide a good (configurable) WDL distribution, have a strong relationship to the engine's own evaluation,
and provide enough information about imbalances because of the handicap. Furthermore, I like the idea that the engine
trains on positions that it produced itself, so they were somehow relevant in its own gameplay.
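A minimal sketch of what such node-handicapped self-play could look like with python-chess; the engine path, the node counts, and the fixed choice of which color gets the handicap are placeholders (in practice one would alternate the handicapped side):

Code: Select all

import chess
import chess.engine

def handicap_game(engine_path, strong_nodes=100000, weak_nodes=10000):
    # One self-play game of a single engine binary against itself,
    # where White searches more nodes than Black (the handicap).
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    board = chess.Board()
    fens = []
    while not board.is_game_over(claim_draw=True):
        nodes = strong_nodes if board.turn == chess.WHITE else weak_nodes
        play = engine.play(board, chess.engine.Limit(nodes=nodes))
        board.push(play.move)
        fens.append(board.fen())
    result = board.result(claim_draw=True)
    engine.quit()
    return fens, result

# Usage sketch: positions, wdl = handicap_game("./myengine")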
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Desperado wrote: Mon Jan 18, 2021 2:41 pm
hgm wrote: Mon Jan 18, 2021 1:51 pm
...

One would probably get better evaluations (in terms of Elo for the engines using them, not in terms of MSE) by testing on games of lower quality. The quicker development of large imbalances would probably be more important than whether an imbalance, once it occurs, would be perfectly mapped on its game-theoretical result (which would average out over many games). Quality of play should not be too low, however; a random mover certainly wouldn't do.
My idea is to play handicap matches: self-play games between clones, not between different versions. The handicap is time- (or depth-, or node-) based.
My hope is that the games will provide a good (configurable) WDL distribution, have a strong relationship to the engine's own evaluation,
and provide enough information about imbalances because of the handicap. Furthermore, I like the idea that the engine
trains on positions that it produced itself, so they were somehow relevant in its own gameplay.
I think it would be very smart to weight the positions close to the end of the game a lot more, since the correlation between position and (win/loss/draw) result will be higher there, and you will also see many more unbalanced positions for the wins and losses. I don't think it is better to use self-play games than games against another opponent. It might actually be good to use games against other opponents, because you can learn that some positions are losing even though your own engine cannot finalise the win, and it would therefore never have learnt that if it had played itself from that position.
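A hedged sketch of that kind of weighting, assuming each training sample is tagged with how many moves remain until the end of its game; the exponential form and the half-life of 20 moves are arbitrary choices of mine, not from the post:

Code: Select all

import math

def weight(moves_to_end, half_life=20.0):
    # Positions closer to the game's end get exponentially larger weights.
    return math.exp(-moves_to_end * math.log(2.0) / half_life)

def weighted_mse(samples, k=1.0):
    # samples: iterable of (qs_score_cp, result, moves_to_end).
    num = den = 0.0
    for score, result, moves_to_end in samples:
        w = weight(moves_to_end)
        err = result - 1.0 / (1.0 + 10.0 ** (-k * score / 400.0))
        num += w * err * err
        den += w
    return num / den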
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Desperado wrote: Mon Jan 18, 2021 2:41 pm
hgm wrote: Mon Jan 18, 2021 1:51 pm
...

One would probably get better evaluations (in terms of Elo for the engines using them, not in terms of MSE) by testing on games of lower quality. The quicker development of large imbalances would probably be more important than whether an imbalance, once it occurs, would be perfectly mapped on its game-theoretical result (which would average out over many games). Quality of play should not be too low, however; a random mover certainly wouldn't do.
My idea is to play handicap matches: self-play games between clones, not between different versions. The handicap is time- (or depth-, or node-) based.
My hope is that the games will provide a good (configurable) WDL distribution, have a strong relationship to the engine's own evaluation,
and provide enough information about imbalances because of the handicap. Furthermore, I like the idea that the engine
trains on positions that it produced itself, so they were somehow relevant in its own gameplay.
The original Texel tuning method description speaks about using a lot of games between recent versions of the tuning candidate engine itself. The well-known "quiet-labeled.epd" file, described here, has been created that way, for instance.
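For reference, the local search from the original Texel tuning description adjusts one parameter at a time by ±1 and keeps the change only if the error drops, repeating until no single change helps. Roughly (the mse callback stands in for a full pass over the training set; the names are mine):

Code: Select all

def texel_local_search(params, mse, step=1):
    # Repeat until no single +/-step change to any parameter lowers the error.
    best = mse(params)
    improved = True
    while improved:
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                err = mse(params)
                if err < best:
                    best = err
                    improved = True
                    break                # keep this change, go to the next parameter
                params[i] -= delta       # revert the change
    return params, best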

Apart from that, I still believe that extreme outcomes of the tuning process, like those you have shown several times, are caused by the tuning algorithm (including concepts like using higher and/or variable step sizes) and not by the training data. I always got the best results when using a fixed step size of 1, as in the original Texel tuning. Higher step sizes may lead to a faster convergence to "something" but that "something" is not necessarily a stronger result in terms of playing strength, even if it would result in a smaller MSE.

Did I miss a proof of the opposite?
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Handicap matches (and even approximately equal material imbalanced start positions) are a good idea, and would help enormously for getting reliable piece values. But I don't think it is the only thing that is required. There are also weaknesses in Pawn structure that good engines will avoid, and could only be present in significant numbers when you incorporate them in the start position. Playing with weak engines would create such weaknesses much more often. Perhaps you should just generate opening lines of 5-10 moves from games between random movers, and let strong engines play from those.
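A minimal sketch of generating such openings with python-chess; the ply range follows the 5-10 moves suggested above, and the output file name and number of lines are placeholders:

Code: Select all

import random
import chess

def random_opening(plies):
    # Play random legal moves from the start position and return the final FEN.
    board = chess.Board()
    for _ in range(plies):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))
    return board.fen()

with open("random_openings.fen", "w") as out:
    for _ in range(1000):
        out.write(random_opening(random.randint(10, 20)) + "\n")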

BTW, using too high a level of play for playing out the games would be very detrimental. In the extreme case of perfect play, each position would get its game-theoretical value, W, D or L. That would not help much in distinguishing drawn positions where you are much better from those that are worse. Which is what the evaluation must do to get a strong engine. An engine with a perfect evaluation would never be able to beat a moderately strong opponent, because it starts in a drawn position, and will allow the opponent to improve his position without offering any resistance. Only when it is at the brink of losing will it come up with brilliant defenses that preserve the draw, but it would never make any attempt to push the opponent to positions that are close to winning (so that it would soon turn into a won game because of opponent imperfections). Because it cannot recognize what such positions are.

So to get an evaluation that makes the engine strong, you must make it recognize which positions are likely to cause opponent mistakes. And it can only learn that from games where the players make sufficiently many mistakes. Then it can see things like "this position is very good for black, because imperfect players are likely to blunder here". Otherwise it would just think "this position is always a draw", even when you have to be devilishly clever to maintain that draw.