Proof of strict convexity for logistic loss function:
http://qwone.com/~jason/writing/convexLR.pdf
--Jon
tuning for the uninformed
Moderators: hgm, Rebel, chrisw
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
-
- Posts: 349
- Joined: Sat Aug 06, 2016 8:31 pm
- Location: United States
Re: tuning for the uninformed
jdart wrote:Texel tuning is basically a form of logistic regression. There is a large literature on this, and a lot of explanatory material online. As initially described it is slightly unorthodox because it uses a squared-error loss function to handle the 3 categories of win, loss, and draw. But the general principle is still the same.
--Jon

If this was in response to my post directly above it: I have Texel tuning implemented using the local-optimization function given as pseudocode on the CPW, but it sounds like there are ways to get closer first via "gradient descent or Gauss-Newton". Unfortunately, what I've read of these so far was over my head.
Regarding the 3 categories (win, loss, draw) being unusual, does that mean using an oracle eval would actually be more appropriate? Would it just plug in where the result (Ri) goes, or does it have to be converted to a -1 to 1 range first (using the sigmoid function)?
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: tuning for the uninformed
The theory of logistic regression with discrete outcomes (dependent variables) deals mainly with three cases:
1. binary (0 or 1) outcomes
2. ordinal outcomes (such as poor, average, good).
3. discrete but unordered outcomes.
None of these commonly use a squared error distance to measure goodness of fit, as the Texel method does.
However, our problem is basically case 2, but with a twist: there are three possible values for a game (0, 0.5, or 1), and the values are meaningful in the sense that 0.5 is really equidistant between 0 and 1. So in that case the Texel method may be a reasonable approach; it is just not as theoretically grounded.
As for using the eval as an oracle: yes, you can do this. Your label is then the oracle's value and you regress to find the best match between predicted and the oracle value. All this changes in the whole procedure is the "loss function" that measures goodness of fit. You could use mean absolute or squared difference between predicted and oracle eval, for example.
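As a minimal sketch of the loss functions mentioned here (the function and variable names are my own, not from any engine), regressing against an oracle eval could look like:

```python
def oracle_loss(predicted, oracle, squared=True):
    """Goodness of fit against an oracle evaluation instead of game
    results: mean squared or mean absolute difference between the
    tuned eval's predictions and the oracle's values."""
    n = len(oracle)
    if squared:
        return sum((p - o) ** 2 for p, o in zip(predicted, oracle)) / n
    return sum(abs(p - o) for p, o in zip(predicted, oracle)) / n
```

The tuner then adjusts the eval parameters to minimize this quantity over the training positions.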
Yet another approach is to make predicted moves based on your eval match actually played moves by an "oracle" such as a strong program or a strong human player. This approach has been used in Shogi - there is a paper by Hoki and Kaneko.
But realize that then you are really solving a different problem. The regression will make your eval conform to the oracle's eval. In effect, this is a fancy way of reverse engineering the oracle's eval. It may or may not make your eval predict game results better, or actually play better, but if the oracle is much stronger then it likely will cause improvement.
--Jon
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: tuning for the uninformed - Poisson?
It has occurred to me that the correct model here might be Poisson regression (https://en.wikipedia.org/wiki/Poisson_regression). If we ran a match of n games for each position, and if the probability of a win is some value p, then the game results (times 2, so that they are integers) could be modeled as a Poisson distribution, and the parameters could be tuned using that method to model the outcomes. I don't know if this is completely valid, but it seems plausible. Texel tuning as initially described is the special case of n=1.
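As a sketch of what such a model would maximize (names are my own; the rate for position i would be something like n times the predicted win probability), the Poisson log-likelihood of the observed counts is:

```python
import math

def poisson_log_likelihood(counts, rates):
    """Log-likelihood of observed counts under a Poisson model.
    counts[i]: observed integer result count for position i
               (e.g. wins, or doubled game points, in an n-game match).
    rates[i]:  expected count, e.g. n * p_i where p_i comes from the
               sigmoid of the position's evaluation."""
    return sum(c * math.log(lam) - lam - math.lgamma(c + 1)
               for c, lam in zip(counts, rates))
```

Tuning would then mean choosing eval parameters that maximize this sum over all positions, rather than minimizing a squared error.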
--Jon
-
- Posts: 1334
- Joined: Sun Jul 17, 2011 11:14 am
Re: tuning for the uninformed - Poisson?
jdart wrote:It has occurred to me that the correct model here might be Poisson regression (https://en.wikipedia.org/wiki/Poisson_regression). If we ran a match of n games for each position, and if the probability of a win is some value p, then the game results (times 2, so that they are integers) could be modeled as a Poisson distribution, and the parameters could be tuned using that method to model the outcomes. I don't know if this is completely valid, but it seems plausible. Texel tuning as initially described is the special case of n=1.
--Jon

This is only mildly related to the topic, but Jon, I think you said that Arasan used a closed-form equation for the derivatives of your eval. Would you mind going over how you calculated those? It confuses me a lot.
For example, even if I take a stupid material only tapered eval with mean-squared error, I get this:
Code:
Let count_p_w, count_n_w etc be the count of pawns, knights etc for white.
Let count_p_b, count_n_b etc be the count of pawns, knights etc for black.
Let phase_p, phase_n etc be the phase weight of pawns, knights, etc for tapered eval.
Let value_p_o, value_n_o etc be the material value of pawns, knights, etc in the opening.
Let value_p_e, value_n_e etc be the material value of pawns, knights, etc in the endgame
value_o = (count_p_w - count_p_b) * value_p_o + (count_n_w - count_n_b) * value_n_o + ...
value_e = (count_p_w - count_p_b) * value_p_e + (count_n_w - count_n_b) * value_n_e + ...
total_phase = 16 * phase_p + 4 * phase_n + ...
phase = (count_p_w + count_p_b) * phase_p + (count_n_w + count_n_b) * phase_n + ...
value = ((phase * value_o) + ((total_phase - phase) * value_e)) / total_phase
sigmoid = 1 / (1 + 10 ** (-K*value))
error = (result - sigmoid) * (result - sigmoid)
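The pseudocode above can be transcribed directly into Python (a sketch with my own names; the per-piece-type lists are assumed to be ordered pawn, knight, bishop, rook, queen):

```python
def tapered_material_error(counts_w, counts_b, values_o, values_e,
                           phases, start_counts, k, result):
    """Squared error for a material-only tapered eval, following the
    pseudocode above.
    counts_w/counts_b: piece counts per type for white/black.
    values_o/values_e: opening/endgame material value per type.
    phases:            phase weight per type.
    start_counts:      piece counts per type in the start position
                       (16 pawns, 4 knights, ...), giving total_phase."""
    value_o = sum((w - b) * v for w, b, v in zip(counts_w, counts_b, values_o))
    value_e = sum((w - b) * v for w, b, v in zip(counts_w, counts_b, values_e))
    total_phase = sum(n * p for n, p in zip(start_counts, phases))
    phase = sum((w + b) * p for w, b, p in zip(counts_w, counts_b, phases))
    value = (phase * value_o + (total_phase - phase) * value_e) / total_phase
    sigmoid = 1.0 / (1.0 + 10.0 ** (-k * value))
    return (result - sigmoid) ** 2
```

With equal material the value is 0, the sigmoid is 0.5, and a drawn result gives zero error.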
Some believe in the almighty dollar.
I believe in the almighty printf statement.
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: tuning for the uninformed - Poisson?
ZirconiumX wrote:This is only mildly related to the topic, but Jon, I think you said that Arasan used a closed-form equation for the derivatives of your eval. Would you mind going over how you calculated those? It confuses me a lot. For example, even if I take a stupid material-only tapered eval with mean-squared error, I get this (see the code above). And I have no idea how to then differentiate that, because that's a lot of variables.

I'll give you the beginning of an explanation here. If you want to see the whole derivation, look up "backpropagation".
I am going to compute the partial derivative of `error' with respect to each quantity in your formulas, `d_error/d_quantity'. We'll go from the end of the computation.
d_error/d_error = 1
Then we see this:
error = (result - sigmoid)^2
Take the derivative of both sides with respect to `sigmoid' and you get
d_error/d_sigmoid = -2 * (result - sigmoid)
The next one is:
sigmoid = 1 / (1 + 10^(-K*value))
I am interested in computing `d_error/d_value', which by the chain rule is
d_error/d_value = d_error/d_sigmoid * d_sigmoid/d_value
We have already computed `d_error/d_sigmoid', and `d_sigmoid/d_value' can be obtained by differentiating `sigmoid = 1 / (1 + 10^(-K*value))' with respect to `value':
d_sigmoid/d_value = (K * log(10) * 10^(K * value))/(10^(K * value) + 1)^2
If you understand how this last step works, you can keep going back and compute the derivative of the error with respect to everything, in particular with respect to your evaluation parameters.
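To make this concrete, here is a small sketch (my own function names) of the analytic derivative above, checked against a central finite difference:

```python
import math

def texel_error(value, result, k):
    """Squared error between a game result and the sigmoid-mapped eval."""
    sigmoid = 1.0 / (1.0 + 10.0 ** (-k * value))
    return (result - sigmoid) ** 2

def d_error_d_value(value, result, k):
    """Chain rule: d_error/d_value = d_error/d_sigmoid * d_sigmoid/d_value."""
    sigmoid = 1.0 / (1.0 + 10.0 ** (-k * value))
    d_error_d_sigmoid = -2.0 * (result - sigmoid)
    d_sigmoid_d_value = (k * math.log(10) * 10.0 ** (k * value)
                         / (10.0 ** (k * value) + 1.0) ** 2)
    return d_error_d_sigmoid * d_sigmoid_d_value
```

Comparing the analytic value against `(texel_error(v+h, ...) - texel_error(v-h, ...)) / (2h)` for a small `h` is a cheap way to catch sign and algebra mistakes before wiring the gradient into a tuner.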
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: derivatives, scaling
If your calculus is rusty, you can use this online tool:
https://www.derivative-calculator.net/
I handle midgame/endgame weighting as follows: each parameter has an entry in an array of structures that records how it should be scaled.
Each time a gradient is updated, I call a scale method that applies the scaling factor for the associated parameter.
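As a rough sketch of what such a per-parameter structure might look like (the names here are invented for illustration, not taken from tune.cpp):

```python
class TunableParam:
    """One entry in the parameter array: a value plus the scaling
    factor applied to its gradient (e.g. a midgame/endgame weight)."""
    def __init__(self, name, value, scale_factor):
        self.name = name
        self.value = value
        self.scale_factor = scale_factor
        self.gradient = 0.0

    def scale(self, raw_gradient):
        """Apply this parameter's scaling factor to a raw gradient."""
        self.gradient = raw_gradient * self.scale_factor
        return self.gradient
```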
See tune.cpp and specifically the scale method here:
https://github.com/jdart1/arasan-chess
--Jon
-
- Posts: 72
- Joined: Mon Mar 07, 2016 4:41 pm
- Location: Zürich, Switzerland
Re: tuning for the uninformed
Is there anyone who has a large EPD file, with the results of the games, available for download somewhere?
cheers
Martin
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: tuning for the uninformed
fierz wrote:Is there anyone who has a large epd file for download somewhere with the results of the games?

Give this a try: https://bitbucket.org/alonamaloh/ruy_tu ... th_results
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: tuning for the uninformed
I have done something similar.
It has occurred to me, though, that the current tuning method would work just the same if each position had a winning probability, not just a 0, 0.5, 1 result. So if you really want to burn a lot of CPU cycles, you could run a match of say, 10 games with each position. The result would be a training set with more accurate labels.
--Jon
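A tiny sketch of that idea (names hypothetical): play n games from each position and use the average score as a smoother probability label in place of a single 0/0.5/1 result.

```python
def labeled_positions(match_results):
    """match_results: {position: [r1, ..., rn]} with each r in {0, 0.5, 1}.
    Returns (position, probability) pairs usable in place of
    single-game results in the tuning loss."""
    return [(pos, sum(rs) / len(rs)) for pos, rs in match_results.items()]
```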