Computing the gradient of the MSE cost function

Discussion of chess software programming and technical issues.

Moderator: Ras

algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Computing the gradient of the MSE cost function

Post by algerbrex »

I've been working on implementing a gradient descent tuner lately, so I started by sitting down and computing the gradient of the cost function.
Since the gradient is defined as the vector of partial derivatives with respect to each parameter of the function, I derived a formula for the partial of the cost function E with respect to a given weight n:

$$\frac{\partial E}{\partial w_n} = -\frac{2}{N}\sum_{i=1}^{N}\bigl(R_i - \sigma(K s_i)\bigr)\,\sigma(K s_i)^2\,K\,x_{i,n}\,e^{-K s_i}$$

(where $R_i$ is the outcome of position $i$, $s_i$ its static evaluation, $K$ the scaling factor, $x_{i,n}$ the coefficient of weight $n$ in position $i$, and $\sigma$ the sigmoid). That's the final answer I got.

Now, I think I've done something wrong or misunderstood something about applying gradient descent to evaluation tuning, because using the above formula is horribly slow: I apply the derived formula to every weight to build the gradient vector, and each application iterates over the whole dataset of N positions. So the complexity ends up being O(M × N), where M is the number of weights I have (in this case 778):

Code: Select all

// computePartialDerivative returns the partial of the mean squared error E
// with respect to the weight at weightIdx.
func computePartialDerivative(weights [NumWeights]float64, weightIdx int) (sum float64) {
	for i := 0; i < NumPositions; i++ {
		score := evaluate(weights, Entries[i].Coefficients)
		eTerm := math.Exp(-ScalingFactor * score)
		sigmoid := 1 / (1 + eTerm)
		err := Entries[i].Outcome - sigmoid
		sum += err * sigmoid * sigmoid * (ScalingFactor * Entries[i].Coefficients[weightIdx] * eTerm)
	}
	// The 2 comes from differentiating the square, the minus from the chain
	// rule; average over the full set, since the loop covers every position.
	return -(2 * sum / float64(NumPositions))
}
Using the gradients to update the weights:

Code: Select all

copyWeights := Weights
for j := 0; j < NumWeights; j++ {
	// Descent steps against the gradient, so subtract.
	Weights[j] -= LearningRate * computePartialDerivative(copyWeights, j)
}
Now, did I derive the wrong formula, or should I be calculating the gradient using some other approach?
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Computing the gradient of the MSE cost function

Post by clayt »

You have a slightly different model from me, so I'll explain what I'm doing in my own derivation. You can save yourself a lot of algebra by using the handy property of the logistic function: $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$. The algebra is also a little easier if you compute the gradient all in one go; I suspect that you made an algebra mistake somewhere in the details of computing the partial with respect to one element of w.

$$\nabla_w E = -2k\sum_{i=1}^{N}\bigl(R_i - \sigma(k\,\hat{R}_i)\bigr)\,\sigma(k\,\hat{R}_i)\,\bigl(1 - \sigma(k\,\hat{R}_i)\bigr)\,X_{i,\cdot}$$

(where $X_{i,\cdot}$ is the i-th feature vector of your dataset, $\hat{R}_i = X_{i,\cdot}\,w$ is the raw evaluation, $R_i$ the expected output, and $\sigma(x)$ is the logistic sigmoid function).
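
Spelled out, the chain rule for a single term of the sum, using that property (and $\partial \hat{R}_i / \partial w = X_{i,\cdot}$ since $\hat{R}_i = X_{i,\cdot}\,w$):

$$\frac{\partial}{\partial w}\bigl(R_i - \sigma(k\hat{R}_i)\bigr)^2 = -2\bigl(R_i - \sigma(k\hat{R}_i)\bigr)\,\frac{\partial}{\partial w}\,\sigma(k\hat{R}_i) = -2k\,\bigl(R_i - \sigma(k\hat{R}_i)\bigr)\,\sigma(k\hat{R}_i)\bigl(1 - \sigma(k\hat{R}_i)\bigr)\,X_{i,\cdot}$$

Summing over i gives the expression above.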

You can compute the partial derivative for the entire weight vector in one go. Here's a cleaned sample of my Rust tuner code, which uses a sparse representation of feature vectors:

Code: Select all

/// Construct the gradient vector for the input data.
/// Each element of `input` contains the feature vector and the expected output R_i.
fn gradient(
    input: &[(Vec<(usize, f32)>, f32)],
    weights: &[f32],
    k: f32,
) -> Vec<f32> {
    let mut grad = vec![0.; weights.len()];

    for datum in input {
        let features = &datum.0;
        let sigm_eval = sigmoid(evaluate(features, weights) * k);
        let err = datum.1 - sigm_eval;
        let coeff = -2. * k * sigm_eval * (1. - sigm_eval) * err;
        // Accumulate this position's contribution: only the features actually
        // present in the sparse list touch the gradient.
        for &(idx, feat_val) in features {
            grad[idx] += feat_val * coeff;
        }
    }

    grad
}
If you're extremely tight on performance, you can throw away those constant multiples by folding them into your learning rate parameter. You'll find that the sparse representation leads to about a 30x speedup. I also multithreaded my tuner (not shown here) for a roughly 16x speedup.
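
To make the folding concrete: the $-2k$ out front is the same constant for every position and weight, so with learning rate $\eta$,

$$w \leftarrow w - \eta\,\nabla_w E \quad\Longleftrightarrow\quad w \leftarrow w + \eta' \sum_{i}\bigl(R_i - \sigma(k\hat{R}_i)\bigr)\,\sigma(k\hat{R}_i)\bigl(1 - \sigma(k\hat{R}_i)\bigr)\,X_{i,\cdot},\qquad \eta' = 2k\eta$$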
Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Re: Computing the gradient of the MSE cost function

Post by Chessnut1071 »

algerbrex wrote: Fri Jun 10, 2022 3:15 am ...
After years of using gradient methods, I've found the chess optimization problem is better served by a Hooke & Jeeves type optimization. Gradient methods have severe issues with local minima and convergence to non-optimal solutions. A relatively cheap Hooke & Jeeves search can outperform gradient methods by significant margins, and it's an easy algorithm to write.
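
The basic exploratory-move loop really is only a few lines. A rough sketch in Go (illustrative only, with a hypothetical black-box cost function such as the MSE over the training set; full Hooke & Jeeves adds a pattern-move acceleration step on top of this):

Code: Select all

func hookeJeeves(weights []float64, cost func([]float64) float64, step, minStep float64) []float64 {
	best := cost(weights)
	for step > minStep {
		improved := false
		for i := range weights {
			// Probe each coordinate in both directions, keeping the
			// first move that lowers the cost.
			for _, delta := range []float64{step, -step} {
				weights[i] += delta
				if err := cost(weights); err < best {
					best = err
					improved = true
					break
				}
				weights[i] -= delta // revert the failed probe
			}
		}
		if !improved {
			step /= 2 // no move helped anywhere: refine the step size
		}
	}
	return weights
}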
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

clayt wrote: Fri Jun 10, 2022 5:02 am ...
Thanks a lot for this very detailed answer!

I didn't know about that property of the logistic function, very handy. I probably did make a mistake somewhere in my algebra while computing all those partials. I'll go back and redo my derivation using the logistic property, so hopefully it comes out cleaner and correct.

And thank you for the code sample. I don't know why it didn't occur to me that I could simply accumulate the partial for every weight at each position, and that way only need one pass over the positions to compute the whole gradient. That makes much more sense, and I'll rewrite my code to work that way.

I'll also switch over to a sparse representation for the feature vector, as you mentioned, since the majority of those coefficients are zero for any given position. Something like the sketch below is what I have in mind.
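
(Hypothetical names; each entry would store only its nonzero coefficients as index/value pairs, and evaluate would walk the same sparse list:)

Code: Select all

// A nonzero feature for one position: which weight it touches and its value.
type Coefficient struct {
	Index int
	Value float64
}

func computeGradientSparse(entries []Entry, weights []float64, scalingFactor float64) []float64 {
	gradients := make([]float64, len(weights))
	N := float64(len(entries))
	for i := range entries {
		score := evaluate(weights, entries[i].Coefficients)
		sigmoid := 1 / (1 + math.Exp(-scalingFactor*score))
		term := -2 * scalingFactor / N * (entries[i].Outcome - sigmoid) * (1 - sigmoid) * sigmoid
		// Only the features actually present in the position touch the gradient.
		for _, c := range entries[i].Coefficients {
			gradients[c.Index] += term * c.Value
		}
	}
	return gradients
}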

One question though, mostly just out of curiosity. When you compute the partial with respect to $\hat{R}_i$ in your math, it seems your final term is the entire i-th feature vector, but shouldn't it be the n-th component of the i-th feature vector, where n corresponds to the specific weight you're computing the partial for? Your code does show this, so I know it's correct, but I was a little unsure of your notation.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

Chessnut1071 wrote: Fri Jun 10, 2022 6:55 am After years of using gradient methods, I've found the chess optimization problem is better served by a Hooke & Jeeves type optimization. Gradient methods have severe issues with local minima and convergence to non-optimal solutions. A relatively cheap Hooke & Jeeves search can outperform gradient methods by significant margins, and it's an easy algorithm to write.
Hmm, you make some interesting points.

I'm new to gradient descent, but doesn't its trouble finding good minima have more to do with choosing the right learning rate than with the algorithm itself being fundamentally flawed when applied to chess? I mean, the theory seems sound: the gradient gives us the direction of steepest ascent, so we can minimize a function by stepping along its negation. The issue then becomes finding a good learning rate, no?
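
That is, with learning rate $\eta$ each step is just

$$w_{t+1} = w_t - \eta\,\nabla_w E(w_t)$$

and the choice of $\eta$ decides whether we overshoot minima or crawl toward them.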

The Hooke & Jeeves algorithm sounds interesting though, thank you, I'll check it out. To be honest, I selected gradient descent mostly because it seemed to work well for others and seemed straightforward enough to implement, so I can't say I researched the alternatives exhaustively, beyond some of gradient descent's variations (mini-batch GD, SGD, etc.).
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

algerbrex wrote: Fri Jun 10, 2022 9:19 am ...
Seems I still have a bit more work to do though...

I'm computing my gradients:

Code: Select all

func computeGradient(entries []Entry, weights []float64, scalingFactor float64) (gradients []float64) {
	N := float64(len(entries))
	numWeights := len(weights)
	gradients = make([]float64, numWeights)

	for i := 0; i < len(entries); i++ {
		score := evaluate(weights, entries[i].Coefficients)
		sigmoid := 1 / (1 + math.Exp(-scalingFactor*score))
		err := entries[i].Outcome - sigmoid
		// Shared scalar from the chain rule: -2K/N * (R - s) * s * (1 - s).
		term := -2 * scalingFactor / N * err * (1 - sigmoid) * sigmoid

		// Scale this position's feature vector by the shared term and
		// accumulate it into the gradient.
		for k := 0; k < numWeights; k++ {
			gradients[k] += term * entries[i].Coefficients[k]
		}
	}

	return gradients
}
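For completeness, since the sign from the $-2K/N$ factor already lives inside the gradient, the matching update step is a plain subtraction (sketch):

Code: Select all

for j := range weights {
	weights[j] -= learningRate * gradients[j]
}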
And finding a K value:

Code: Select all

// findScalingFactor does a coarse-to-fine line search for the K that
// minimizes the mean squared error, narrowing the search window and
// shrinking the step by 10x on each iteration.
func findScalingFactor(entries []Entry, weights []float64) float64 {
	start, end, step := float64(0), float64(10), float64(1)
	err := float64(0)

	curr := start
	best := meanSquaredError(entries, weights, start)

	for i := 0; i < KPrecision; i++ {
		// Scan the current window for the K with the lowest error.
		curr = start - step
		for curr < end {
			curr = curr + step
			err = meanSquaredError(entries, weights, curr)
			if err <= best {
				best = err
				start = curr
			}
		}

		fmt.Printf("Best K of %f on iteration %d\n", start, i)

		// Re-center the window on the best K and refine the step.
		end = start + step
		start = start - step
		step = step / 10.0
	}

	return start
}
But I'm getting odd results for the best K value, like -0.00035. And on top of that, even if I supply a more realistic K value myself, like 1.5 or 2, the gradient values are ridiculously small, much too small to make any significant changes to the weights. Upon investigation, this seems to be caused by the (1/N) term from my original cost function, mean squared error.

Removing this factor does seem to give more realistic gradient values, but using them still causes odd and incorrect weight results.

I suppose I either made another mistake in my algebra, or I have a bug in my code.

However, the one big upside is that my code is now actually as fast as expected. With my old Texel tuner, I normally used about 700K positions. I loaded the same set into this (buggy) gradient descent tuner, and it easily ran 100 epochs in under a minute, each epoch taking less than a second. And that's without any optimizations yet (sparse feature vectors, parallelization, etc.), so I'm very excited about this tuner's potential... whenever I get it working :lol:
Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Re: Computing the gradient of the MSE cost function

Post by Chessnut1071 »

algerbrex wrote: Fri Jun 10, 2022 9:30 am ...
Gradient methods work best on continuous functions and can fail ignominiously on discontinuous functions like chess, especially as the number of variables increases. The issue with Hooke & Jeeves is the extra time it requires over gradient methods. Try both and see for yourself.
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Computing the gradient of the MSE cost function

Post by lithander »

When I implemented Leorik's tuner, I revisited an old machine learning course I took a few years ago. I believe it was this video where it all came together:
https://youtu.be/GtSf2T6Co80

I tried to reproduce the calculation of the derivative terms myself, but I've since forgotten the details again! But Andrew said not to worry about it, and the answer is also in the video ;)

After implementing it successfully I think it also makes intuitive sense. If you want I can quote some code from my tuner and try to explain the intuitive reasoning behind it.

And once it worked the rest was just about performance optimizations. You probably saw my post on that?
Last edited by lithander on Fri Jun 10, 2022 9:07 pm, edited 1 time in total.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Computing the gradient of the MSE cost function

Post by clayt »

algerbrex wrote: Fri Jun 10, 2022 11:52 am
But I'm getting odd results for the best K value, like -0.00035. And on top of that, even if I supply a more realistic K value myself, like 1.5 or 2, the gradient values are ridiculously small, much too small to make any significant changes to the weights. Upon investigation, this seems to be caused by the (1/N) term from my original cost function, mean squared error.

Removing this factor does seem to give more realistic gradient values, but using them still causes odd and incorrect weight results.

I suppose I either made another mistake in my algebra, or I have a bug in my code.
It shouldn't surprise you too much that a puny value of k will minimize error, since it flattens out the slope of the logistic function, making only the sign of your evaluation important to its correctness. I suggest that you don't try to optimize k, and instead pick a value of k which will make error in the evaluation of positions from -3 to +3 pawns significant, and errors in positions with more extreme evaluations unimportant. I (completely arbitrarily) set k to be 0.6 when I started tuning, and have had pretty good success with that. I also recommend that you "fuzz" your weights at the start of a training period (not an epoch), since that can sometimes help with the local minima issue.
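
To put rough numbers on the choice of k (assuming the evaluation is measured in pawns): with $k = 0.6$,

$$\sigma(0.6 \cdot 1) \approx 0.65, \qquad \sigma(0.6 \cdot 3) \approx 0.86, \qquad \sigma(0.6 \cdot 6) \approx 0.97,$$

so errors within a few pawns still move the predicted score noticeably, while anything past roughly six pawns saturates.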

With regards to whether gradient descent is good enough for an engine, I think it's alright, especially when you can get an enormous training set and also some sane initial values for training. Using PeSTO's values as my initial values, and this as my dataset (37M positions), I was able to tune my engine from about 2200 elo to somewhere around 2600. Not bad for ten minutes of training! I don't have exact elo numbers because I haven't run any mini-matches yet, but it's now able to consistently draw against most 2600 bots.

In practice, I think the final round of tuning would have to be done by running matches, instead of comparing against a training dataset, so I won't be able to use the gradient descent tuner forever. However, for as long as I'm still tweaking the engine, it gives "good enough" results to not be worth using a more time- and resource-intensive approach.

Lastly: I think there may be a little bit of confusion here. A *partial derivative* is a single value - it's the rate of change of one parameter with respect to another with all else being fixed. For instance, $\frac{\partial}{\partial w_n} E$ would be the rate of change of error with respect to the n-th weight value. Meanwhile, a *gradient* is a vector of partial derivatives. For instance, $\nabla_w E$ would be the vector containing the rate of change of E with respect to $w_n$ for each n. The sample code I gave computes the whole gradient in one go. In the mathematical expression, $X_{i, .}$ is a vector, and is multiplied by a unique scalar for each i.
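
In symbols, the gradient is just the vector of those partials,

$$\nabla_w E = \left(\frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_M}\right),$$

and the sample code accumulates every component of that vector simultaneously.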
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

clayt wrote: Fri Jun 10, 2022 9:06 pm It shouldn't surprise you too much that a puny value of k will minimize error, since it flattens out the slope of the logistic function, making only the sign of your evaluation important to its correctness.
Ah, duh. That makes perfect sense. I need to stop trying to do math and program past midnight. I'll play around to find a good K value myself, thanks.
clayt wrote: Fri Jun 10, 2022 9:06 pm With regards to whether gradient descent is good enough for an engine, I think it's alright, especially when you can get an enormous training set and also some sane initial values for training. Using PeSTO's values as my initial values, and this as my dataset (37M positions), I was able to tune my engine from about 2200 elo to somewhere around 2600. Not bad for ten minutes of training! I don't have exact elo numbers because I haven't run any mini-matches yet, but it's now able to consistently draw against most 2600 bots.
Nice results. I agree, for my purposes right now gradient descent will be perfectly fine, especially since its speed means I no longer have to limit myself to half a million positions, the way I did with my horrendously slow naive local optimization method.
clayt wrote: Fri Jun 10, 2022 9:06 pm Lastly: I think there may be a little bit of confusion here. A *partial derivative* is a single value - it's the rate of change of one parameter with respect to another with all else being fixed. For instance, $\frac{\partial}{\partial w_n} E$ would be the rate of change of error with respect to the n-th weight value. Meanwhile, a *gradient* is a vector of partial derivatives. For instance, $\nabla_w E$ would be the vector containing the rate of change of E with respect to $w_n$ for each n.
Right, right. I understand the difference between a partial derivative and a gradient.
clayt wrote: Fri Jun 10, 2022 9:06 pm The sample code I gave computes the whole gradient in one go. In the mathematical expression, $X_{i, .}$ is a vector, and is multiplied by a unique scalar for each i.
I think I follow.

For the expression you gave, does it represent computing the entire gradient vector at once by taking the i-th feature vector, $X_{i, .}$, and multiplying it by the scalar computed by everything to its left? And does the sigma notation denote that we take each of these scaled feature vectors and add them together, resulting in our gradient vector?
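
In other words:

$$\nabla_w E = \sum_{i=1}^{N} c_i\,X_{i,\cdot}, \qquad c_i = -2k\bigl(R_i - \sigma(k\hat{R}_i)\bigr)\,\sigma(k\hat{R}_i)\bigl(1 - \sigma(k\hat{R}_i)\bigr),$$

with each $c_i$ a scalar?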

I apologize if the notation is pretty standard. I've only taken undergraduate multivariable calculus so far, and notation isn't exactly one of my strong suits.