Question on Texel's Tuning Method

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Question on Texel's Tuning Method

Post by odomobo »

I have a theoretical question on Texel's tuning method. If I understand it correctly, the method takes a set of positions and minimizes the mean squared error over them. The error for a position is the difference between the (adjusted) evaluation and the result of the game.
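
To check my understanding, here is a minimal sketch of that error computation. The sigmoid mapping from centipawn evals to expected scores and the constant K are my assumptions (K would have to be fitted to the engine), and the sample data is invented:

Code:

import math

# Map a centipawn evaluation to an expected score in [0, 1].
# K is a scaling constant that has to be fitted to the engine's own evals;
# 1.13 here is just a placeholder.
def expected_score(eval_cp, K=1.13):
    return 1.0 / (1.0 + 10.0 ** (-K * eval_cp / 400.0))

# positions: list of (eval_cp, game_result) pairs, result encoded as 1.0 / 0.5 / 0.0
def mean_squared_error(positions, K=1.13):
    total = 0.0
    for eval_cp, result in positions:
        total += (result - expected_score(eval_cp, K)) ** 2
    return total / len(positions)

# toy data: positions from games that ended in a win, a draw and a loss
sample = [(80, 1.0), (10, 0.5), (-150, 0.0)]
print(mean_squared_error(sample))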

Clearly this is an effective technique, but I can't help but think that most of the positions being analyzed don't have a clear winner. In a pathological case, the game's winner may be slightly losing in the middle game. This would be asking the algorithm to tune a winning position to evaluate as a losing position, and vice versa.

Similarly, wouldn't these unclear positions dominate the contribution to the mean squared error? That is, highly accurate late-game evaluations will be ignored in favor of slightly inflated early- and mid-game evaluations.

My thought was, wouldn't it make more sense to instead compute a position's error as the difference between its evaluation and the evaluation of the position, say, 5 moves into the future (perhaps an average over several future moves)? You'd need a technique to prevent it from tuning all parameters to 0, but you only need to prevent the final positions of the games from substantially lowering their scores (there are a few techniques that come to mind).

The effect of this should be that the mean squared error is now dominated by positions on the cusp of a large eval swing, and that positions throughout the game all contribute similarly to the mean squared error.

Is my thinking sound, or am I missing something?
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question on Texel's Tuning Method

Post by hgm »

I think the normal Texel method is sound enough, because what you in fact do is tune the evaluation to the average score of many very similar positions. And yes, that includes positions that were slightly better, but were still lost. But on average the number of wins from positions that are slightly better would be larger than the number of losses. The better the position, the larger the difference. That is what you want the evaluation to quantify: your prospects of scoring points.

By artificially making attempts to remove losses from positions that were better, you distort the average, and the positions start to look better than they actually are. This would wreck the tuning.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Question on Texel's Tuning Method

Post by chrisw »

hgm wrote: Wed Jul 08, 2020 11:17 pm I think the normal Texel method is sound enough, because what you in fact do is tune the evaluation to the average score of many very similar positions. And yes, that includes positions that were slightly better, but were still lost. But on average the number of wins from positions that are slightly better would be larger than the number of losses. The better the position, the larger the difference. That is what you want the evaluation to quantify: your prospects of scoring points.

By artificially making attempts to remove losses from positions that were better, you distort the average, and the positions start to look better than they actually are. This would wreck the tuning.
There’s a great deal of power in “averaging”. My own life-changing moment (haha, but I learnt something) was a physics experiment from way back: we had to do one a week for two years and write it up, which usually drove me mad with boredom. However, the worst was being told (everybody did something different each week) to measure a short steel bar with a micrometer screw gauge, highly accurate, and then with a crude wooden ruler with millimetre markings. The idea was to guess at the first significant decimal place (e.g. 45.7 mm or whatever), put the bar down, write down the guess (well, eye-sighted guess) and repeat 100 times. Blow me down, but the average of the guesses matched the micrometer screw gauge to several decimal places.
The average of the noise approximates well to the actuality.
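
A quick simulation of the same idea, with an invented true length and unbiased eye-sighting noise, typically recovers the value to within about a tenth of a millimetre:

Code:

import random

true_length = 45.73                                       # the "micrometer" value in mm
guesses = [round(true_length + random.gauss(0, 0.5), 1)   # eye-sighted to 0.1 mm
           for _ in range(100)]
print(sum(guesses) / len(guesses))                        # typically within ~0.1 mm of 45.73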
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Question on Texel's Tuning Method

Post by Ferdy »

odomobo wrote: Wed Jul 08, 2020 6:32 pm I have a theoretical question on Texel's tuning method. If I understand it correctly, the function takes a set of positions, and minimizes the mean squared error of the positions. The error in a position is the difference between the (adjusted) evaluation and the result of the game.

Clearly this is an effective technique, but I can't help but think that most of the positions being analyzed don't have a clear winner.
Depends on your training positions. Texel tuning can also handle draw and loss results, not just wins. If a training position has a draw result, the engine being trained has to show a drawish score (close to 0); if the result is a loss, it has to show a losing score.
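
Numerically, with the kind of sigmoid mapping assumed earlier in the thread (the constant K is again just a placeholder), the targets work out like this:

Code:

import math

# Invert the assumed sigmoid: which centipawn eval corresponds to a given expected score?
def eval_for_score(p, K=1.13):
    return -400.0 / K * math.log10(1.0 / p - 1.0)

print(eval_for_score(0.5))   # 0.0 cp    -> a drawish score for a draw label
print(eval_for_score(0.9))   # ~ +338 cp -> a clearly winning score for a win label
print(eval_for_score(0.1))   # ~ -338 cp -> a losing score for a loss label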
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Question on Texel's Tuning Method

Post by jdart »

chrisw wrote: Thu Jul 09, 2020 9:39 am There’s a great deal of power in “averaging”.
See the Central Limit Theorem for some mathematical justification of this.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Question on Texel's Tuning Method

Post by chrisw »

jdart wrote: Thu Jul 09, 2020 8:42 pm
chrisw wrote: Thu Jul 09, 2020 9:39 am There’s a great deal of power in “averaging”.
See the Central Limit Theorem for some mathematical justification of this.
I tried to win the “guess the weight of the leg of ham” competition at a village event once by going last and trying to do a running average of everybody else’s guesses as they announced them. Failed dismally.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Question on Texel's Tuning Method

Post by Pio »

chrisw wrote: Thu Jul 09, 2020 9:19 pm
jdart wrote: Thu Jul 09, 2020 8:42 pm
chrisw wrote: Thu Jul 09, 2020 9:39 am There’s a great deal of power in “averaging”.
See the Central Limit Theorem for some mathematical justification of this.
I tried to win the “guess the weight of the leg of ham” competition at a village event once by going last and trying to do a running average of everybody else’s guesses as they announced them. Failed dismally.
I guess you guessed too high.

I would have guessed e^(sum(ln(Wi))/N), i.e. the geometric mean of the guesses, where e is 2.718..., ln is the natural logarithm and N is the number of samples/guesses. Of course you could choose whatever base/log-base pair you want.

The reason behind my formula is that I think one is as likely to guess true_value/x as true_value*x.
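
A quick sketch of that geometric mean (the guesses are invented numbers):

Code:

import math

guesses = [6.0, 8.5, 7.2, 12.0, 5.5]   # invented weight guesses in kg
geo_mean = math.exp(sum(math.log(w) for w in guesses) / len(guesses))
print(geo_mean)                        # the geometric mean of the guesses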

/Pio
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Question on Texel's Tuning Method

Post by hgm »

There is a good analogy here with a 'strip detector'. When you want to determine the location of a point source (e.g. a tight light beam) you can make a detector that consists of adjacent light-sensitive strips. But you would never know the location within the strip that detects it. You could of course solve that by making very many very narrow strips. But it is far simpler to just put a 'diffuser' somewhat in front of the strips, which spreads out the light beam to a spot about the width of a strip. By measuring which fraction then falls onto one strip, and which onto the next one, you can get a pretty precise idea of where the center of the beam was w.r.t. the strips.

In chess the strips are the win / draw / loss results. The diffuser is the imperfect play between the position and the game end, which causes the same position to not always fall on the same strip at game end. As a result, the ratio of the various results can be used to measure the quality of the position quite precisely.
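
A toy simulation of this (the 'true' expected score and the draw rate are invented numbers) shows how the result ratio recovers the underlying value:

Code:

import random

# Imperfect play as a 'diffuser': the same position ends up on different
# result strips. Assume a true expected score of 0.62 and a 15% draw rate.
TRUE_SCORE = 0.62
DRAW_RATE = 0.15
WIN_RATE = (TRUE_SCORE - 0.5 * DRAW_RATE) / (1.0 - DRAW_RATE)

def one_game():
    r = random.random()
    if r < DRAW_RATE:
        return 0.5
    return 1.0 if random.random() < WIN_RATE else 0.0

results = [one_game() for _ in range(100000)]
print(sum(results) / len(results))   # approaches 0.62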
odomobo
Posts: 96
Joined: Fri Jul 06, 2018 1:09 am
Location: Chicago, IL
Full name: Josh Odom

Re: Question on Texel's Tuning Method

Post by odomobo »

The explanation of average scores of many similar positions does partially answer my question, but there's another aspect that I don't think I expressed well.

When tuning something (an eval in chess, or anything else), your error function will determine what things are most important to your tuning algorithm. For example, let's say we're tuning a simple f(x), with training data t(x), and our error function is abs[t(x) - f(x)]. Then our tuning algo is interested in absolute difference, i.e. 10000 vs 10010 is just as important as 10 vs 20.

But for the error function abs[t(x) - f(x)]/abs[t(x)], it only cares about the relative difference, i.e. 10000 vs 11000 is just as important as 10 vs 11. In the latter case, smaller numbers must be much more accurate (in absolute terms) than larger numbers, while in the former case, all numbers must be equally accurate. Depending on the nature of your f(x), the nature of your training data, and the error function, the tuning algorithm might spend all of its effort on large values, or might spend all of its effort on small values. Another way to put it: large values may end up dominating the mean squared error, or small values may dominate it.
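
A tiny numeric check of those two examples (t is the training value, f the tuned value):

Code:

pairs = [(10000, 10010), (10, 20), (10000, 11000), (10, 11)]
for t, f in pairs:
    print(t, f, "absolute:", abs(t - f), "relative:", abs(t - f) / abs(t))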

For Texel tuning, here are the cases that it finds important and unimportant (as I understand it). Note that this assumes all of the evaluations are relatively accurate (let's say within 100 cp).
  • Decisive positions: mostly unimportant
  • Unclear positions (in a game that eventually ends in draw): mostly unimportant
  • Unclear positions (in a game that eventually ends in win/loss): very important
The reason for this is that for similar unclear positions (ending in W/L), we're averaging huge squared error values; we aren't taking the mean of the errors and then squaring it.

Intuitively, this doesn't seem like the best approach. To me, it seems a better approach would be to value unclear middlegame positions (ending in W/L) more similarly to other positions. But maybe I'm missing something.
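
To put rough numbers on that, here is a quick sketch using the same assumed sigmoid mapping as before (the evals are invented); the unclear positions from decisive games contribute more than a hundred times the squared error of the other cases:

Code:

import math

def expected_score(eval_cp, K=1.13):
    return 1.0 / (1.0 + 10.0 ** (-K * eval_cp / 400.0))

cases = [
    ("decisive, eval +500, game won",   500, 1.0),
    ("unclear,  eval  +20, game drawn",  20, 0.5),
    ("unclear,  eval  +20, game won",    20, 1.0),
    ("unclear,  eval  +20, game lost",   20, 0.0),
]
for name, eval_cp, result in cases:
    err = (result - expected_score(eval_cp)) ** 2
    print(f"{name}: squared error {err:.3f}")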
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Question on Texel's Tuning Method

Post by Pio »

odomobo wrote: Fri Jul 10, 2020 3:32 pm The explanation of average scores of many similar positions does partially answer my question, but there's another aspect that I don't think I expressed well.

When tuning something (an eval in chess, or anything else), your error function will determine what things are most important to your tuning algorithm. For example, let's say we're tuning a simple f(x), with training data t(x), and our error function is abs[t(x) - f(x)]. Then our tuning algo is interested in absolute difference, i.e. 10000 vs 10010 is just as important as 10 vs 20.

But for the error function abs[t(x) - f(x)]/abs[t(x)], it only cares about the relative difference, i.e. 10000 vs 11000 is just as important as 10 vs 11. In the latter case, smaller numbers must be much more accurate (in absolute terms) than larger numbers, while in the former case, all numbers must be equally accurate. Depending on the nature of your f(x), the nature of your training data, and the error function, the tuning algorithm might spend all of its effort on large values, or might spend all of its effort on small values. Another way to put it: large values may end up dominating the mean squared error, or small values may dominate it.

For Texel tuning, here are the cases that it finds important and unimportant (as I understand it). Note that this assumes all of the evaluations are relatively accurate (let's say within 100 cp).
  • Decisive positions: mostly unimportant
  • Unclear positions (in a game that eventually ends in draw): mostly unimportant
  • Unclear positions (in a game that eventually ends in win/loss): very important
The reason for this is that for similar unclear positions (ending in W/L), we're averaging huge squared error values; we aren't taking the mean of the errors and then squaring it.

Intuitively, this doesn't seem like the best approach. To me, it seems a better approach would be to value unclear middlegame positions (ending in W/L) more similarly to other positions. But maybe I'm missing something.
You should first convert your scores to win probabilities so the big scores won't dominate. It is the mean squared error of the win probabilities you want to minimise. I am not sure, however, whether it is best to minimise the squares. I would probably test minimising the absolute error instead, which has the advantage of handling bad labelling better.
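
A minimal sketch of the difference on invented data, assuming the usual sigmoid conversion from evals to win probabilities:

Code:

import math

def win_prob(eval_cp, K=1.13):
    return 1.0 / (1.0 + 10.0 ** (-K * eval_cp / 400.0))

# (eval_cp, result); the last entry is deliberately mislabelled
data = [(80, 1.0), (10, 0.5), (-150, 0.0), (300, 0.0)]

mse = sum((r - win_prob(e)) ** 2 for e, r in data) / len(data)
mae = sum(abs(r - win_prob(e)) for e, r in data) / len(data)
print(f"MSE: {mse:.3f}   MAE: {mae:.3f}")
# The bad label accounts for a much larger share of the squared error
# than of the absolute error.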

BR
/Pio