Computing the gradient of the MSE cost function

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Computing the gradient of the MSE cost function

Post by dangi12012 »

clayt wrote: Fri Jun 10, 2022 9:06 pm With regards to whether gradient descent is good enough for an engine, I think it's alright, especially when you can get an enormous training set and also some sane initial values for training. Using PeSTO's values as my initial values, and this as my dataset (37M positions), I was able to tune my engine from about 2200 elo to somewhere around 2600. Not bad for ten minutes of training! I don't have exact elo numbers because I haven't run any mini-matches yet, but it's now able to consistently draw against most 2600 bots.
Is your method efficient enough to run at engine runtime, let's say per position in a 10+2 game?
Training would only need to be incremental from one ply to the next: it starts with optimal values at ply 0, but tweaks them continuously further along the game, wherever they fit best.

That implies millions of matches against itself at runtime.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

lithander wrote: Fri Jun 10, 2022 9:01 pm When I implemented Leoriks tuner I revisited some old machine learning course I did a few years ago. I believe it was this video where it all came together:
https://youtu.be/GtSf2T6Co80

I tried to reproduce the calculation of the derivative terms myself, but since then I have forgotten the details again! But Andrew said don't worry about it, and the answer is also in the video ;)
I see, thanks. I don't think I've seen that video before. Probably would've been easier to just use his answer, but I just finished my multivariable calculus course last semester and thought it would be fun to try to compute the correct partial formula by hand :wink:
lithander wrote: Fri Jun 10, 2022 9:01 pm After implementing it successfully I think it also makes intuitive sense. If you want I can quote some code from my tuner and try to explain the intuitive reasoning behind it.
Thanks, that's alright though. I won't make you go to the trouble. I think I'm close to figuring things out from where I have them right now. Some of the math involved with computing the gradient and properties of the logistic function just took a little time for me to wrap my head around I think.
lithander wrote: Fri Jun 10, 2022 9:01 pm And once it worked the rest was just about performance optimizations. You probably saw my post on that?
I did actually! That's what prompted me to investigate creating my own gradient descent tuner. I've hit an upper limit on the search optimizations for Blunder 8.0.0 right now, so I turned to the evaluation and my old tuner. And waiting 12 hours to tune evaluation weights didn't sound at all appealing anymore, especially since I knew there were much faster options out there.

It was pretty incredible to see my current tuner (although still a bit buggy) easily run through 100 epochs of the classic 725K positions from Zurichess in less than a minute, probably less than 30 seconds. And that's without using goroutines to parallelize the process, or using a sparse feature vector.
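For anyone curious, the sparse-feature idea would look roughly like this (just a sketch, names made up): each position stores only its nonzero features as (weight index, occurrence count) pairs, so the per-position cost scales with the number of active features rather than the total number of weights.

```go
package main

import "fmt"

// Coefficient is one nonzero entry of a position's feature vector:
// Idx is the index into the weight slice, Value is how often the
// feature occurs (white's count minus black's).
type Coefficient struct {
	Idx   int
	Value float64
}

// evaluate computes the dot product of the sparse feature vector
// with the dense weight vector.
func evaluate(weights []float64, coeffs []Coefficient) float64 {
	score := 0.0
	for _, c := range coeffs {
		score += weights[c.Idx] * c.Value
	}
	return score
}

func main() {
	weights := []float64{100, 320, 330, 500, 900} // toy material weights
	// white is up one pawn (index 0) and one rook (index 3)
	features := []Coefficient{{Idx: 0, Value: 1}, {Idx: 3, Value: 1}}
	fmt.Println(evaluate(weights, features)) // 600
}
```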
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

So, after a little more debugging, I realized I had a sign error in my code, and I wasn't negating the gradient. So my tuner, ironically, was trying to tune values to be as bad as possible. No wonder my evaluation sucked.

After fixing this bug it now looks like the tuner is working decently well, and after playing with the learning rate more I was able to reproduce a realistic set of PSQT values and material values from scratch. They're still ~30-40 Elo worse than the current master values, but I think this is likely because I need to find the optimal learning rate and K value. I'm going to keep playing around to see if I can get the gradient descent tuner to produce something equal to or stronger than the current dev values.

I'm also going to switch to trying a couple million positions, as well as starting from common sense values, instead of completely from scratch (i.e. default material values and zeros in the PSQTs).
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

algerbrex wrote: Sat Jun 11, 2022 4:43 pm ...
Hmm, unfortunately, the best error is actually always worse after tuning, and during tuning it bounces between falling and rising, eventually continuing to rise. So it seems somewhere I still have a sign error, or I'm misunderstanding whether the error should be dropping, or gradient descent's ability to tune values from scratch.

I'm quite sure at this point my issue is calculating the gradient. I've been able to verify the correctness of every other part of the tuner (calculating coefficients correctly, calculating evaluations correctly, calculating the mean squared error, etc.) So I'll have to look at it more closely:

Code: Select all

func computeGradient(entries []Entry, weights []float64, scalingFactor float64) (gradients []float64) {
	numWeights := len(weights)
	gradients = make([]float64, numWeights)
	n := float64(len(entries))

	for i := range entries {
		score := evaluate(weights, entries[i].Coefficents)
		sigmoid := 1 / (1 + math.Exp(-(scalingFactor * score)))
		err := entries[i].Outcome - sigmoid
		// the -2K/N factor comes from the chain rule on the MSE
		term := -2 * scalingFactor / n * err * (1 - sigmoid) * sigmoid

		for k := range entries[i].Coefficents {
			coefficent := &entries[i].Coefficents[k]
			gradients[coefficent.Idx] += term * coefficent.Value
		}
	}

	return gradients
}

...

for i := 0; i < epochs; i++ {
	gradients := computeGradient(entries, weights, scalingFactor)
	for k, gradient := range gradients {
		weights[k] -= learningRate * gradient
	}
}
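For reference, writing $K$ for the scaling factor, $N$ for the number of positions, $y_i$ for the outcome, $s_i$ for the evaluation, $\sigma_i = \sigma(K s_i)$ for the sigmoid, and $x_{i,j}$ for position $i$'s coefficient of weight $j$, the quantity the code above computes should be (if I have the chain rule right):

E = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \sigma(K s_i) \right)^2,
\qquad
\frac{\partial E}{\partial w_j} = -\frac{2K}{N} \sum_{i=1}^{N} (y_i - \sigma_i)\, \sigma_i (1 - \sigma_i)\, x_{i,j}

which matches accumulating `term * coefficent.Value` over all positions.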

Seems straightforward enough to me. Because of the loss function being used, however, I do notice the learning rate has to be quite high for any meaningful learning to happen in a reasonable number of iterations, since the 1/N term carried over into the partial derivative makes the gradient values quite small. I had to use something like 500 before I saw any change, if I remember correctly.

Perhaps I'll experiment with the loss function being used, and try something like Mr. Ramsey's.

EDIT: Changing my loss function to remove the (1/N) term and adjusting the learning rate accordingly still made the overall error rise a good bit, but the evaluation terms it gives actually looked pretty good, and so far in SPRT testing they seem to be much better than the master evaluation terms. I'll have to think a bit about why this might be. I imagine the issue right now is that I still need to fine-tune the learning rate to make sure I'm not overstepping a good minimum, which would explain the error rising again.

Writing this tuner is really making me start to question whether I actually ever understood the multivariable calc course I took last semester :P
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Computing the gradient of the MSE cost function

Post by clayt »

You may want to add a “temperature” scaling to your learning rate. This is commonly used for simulated annealing to allow the weights to escape local minima while also being able to fully optimize. In general, the easiest way is to multiply your learning rate by some fraction (usually around 0.99) at every epoch loop. There are more sophisticated methods which tune the learning rate according to whether the score is improving, but the dumb method I suggested is probably good enough for this.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

clayt wrote: Sat Jun 11, 2022 10:44 pm You may want to add a “temperature” scaling to your learning rate. This is commonly used for simulated annealing to allow the weights to escape local minima while also being able to fully optimize. In general, the easiest way is to multiply your learning rate by some fraction (usually around 0.99) at every epoch loop. There are more sophisticated methods which tune the learning rate according to whether the score is improving, but the dumb method I suggested is probably good enough for this.
Hmm, thanks. I didn't know about that and will definitely try that too. For now since your loss function performed best I'll stick to it, and will see if this temperature scaling helps with the odd error behavior. I suppose technically average error doesn't correspond to Elo gain but still.
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

Just in case anyone was curious, I finally seemed to figure things out.

Basically, this section of code:

Code: Select all

for k := range entries[i].Coefficents {
	coefficent := &entries[i].Coefficents[k]
	gradients[coefficent.Idx] += term * coefficent.Value
}
Should've been this section:

Code: Select all

for k := range entries[i].Coefficents {
	coefficent := &entries[i].Coefficents[k]
	gradients[coefficent.Idx] += term * coefficent.Value / 256
}
With the coefficient value being divided by 256. In the partial derivative derivation, the `term` value should be multiplied by x sub i, n, i.e., the part of the feature vector for the position such that, when multiplied by the evaluation weight, it adds the correct value to the evaluation. In this case I almost had the right value, but needed to scale it back down by 256, since I packed the phasing information into the coefficient's value. I was doing this correctly in the evaluation, but not in the gradient computation.
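If I've described the scheme right, the mismatch looks something like this (toy numbers, names made up): the stored value packs the occurrence count together with a phase weight scaled by 256, so both the evaluation and the gradient have to undo that scaling.

```go
package main

import "fmt"

func main() {
	// The stored coefficient packs the occurrence count together with
	// the game phase, scaled by 256 (e.g. phase 128 = halfway between
	// midgame and endgame).
	const phase = 128.0
	occurrences := 2.0
	packedValue := occurrences * phase // what the coefficient's Value holds

	weight := 50.0
	evalContribution := weight * packedValue / 256 // what the evaluation computes
	gradContribution := packedValue / 256          // d(eval)/d(weight): same /256 needed

	fmt.Println(evalContribution, gradContribution) // 50 1
}
```

Leaving the /256 out of the gradient (but not the evaluation) scales every partial derivative by 256, which is exactly the kind of silent inconsistency that lets the tuner run but converge to the wrong place.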

Now, after tweaking the learning rate to be around 1M to compensate for the small gradient values, the tuner seems to be working correctly. Though during training the error doesn't drop much, it does drop some, and the evaluation terms produced are quite good, which leads me to believe this is just a side effect of the learning rate and the number of iterations chosen. So it's not a huge issue for me right now, since the error is dropping some, and the evaluation terms produced are getting better and better.
Henk
Posts: 7251
Joined: Mon May 27, 2013 10:31 am

Re: Computing the gradient of the MSE cost function

Post by Henk »

I remember backpropagation was invented to compute gradients efficiently. So why don't you use a network?
User avatar
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Computing the gradient of the MSE cost function

Post by algerbrex »

Henk wrote: Mon Jun 13, 2022 12:50 pm I remember backpropagation was invented to compute gradients efficiently. So why don't you use a network?
I will eventually. But for now I'd like to see how strong I can make Blunder just using a classical HCE with better tuning methods like gradient descent.
clayt
Posts: 29
Joined: Thu Jun 09, 2022 5:09 am
Full name: Clayton Ramsey

Re: Computing the gradient of the MSE cost function

Post by clayt »

Henk wrote: Mon Jun 13, 2022 12:50 pm I remember backpropagation was invented to compute gradients efficiently. So why don't you use a network?
This isn't exactly correct: backpropagation is a technique for computing the gradient of multi-layer neural nets. What we're doing here is actually just a single-layer neural net (also known as an adaptive linear element, or Adaline), which does not require backpropagation at all.