algerbrex wrote: ↑Fri Jun 10, 2022 11:52 am
But I'm getting odd results for the best K value, like -0.00035. And on top of that, even if I supply a more realistic K value myself, like 1.5 or 2, the gradient values are ridiculously small, much too small to make any significant changes to the weights. Upon investigation this seems to be caused by the (1/N) term from my original cost function, mean-squared error.
Removing this factor does seem to give more realistic gradient values, but using them still causes odd and incorrect weight results.
I suppose I either made another mistake in my algebra, or I have a bug in my code.
It shouldn't surprise you too much that a puny value of k minimizes error, since it flattens out the slope of the logistic function, making only the sign of your evaluation matter to its correctness. I suggest that you don't try to optimize k at all, and instead pick a value of k that makes errors in the evaluation of positions from -3 to +3 pawns significant, and errors in positions with more extreme evaluations unimportant. I (completely arbitrarily) set k to 0.6 when I started tuning, and have had pretty good success with that.

I also recommend that you "fuzz" your weights at the start of a training period (not an epoch), since that can sometimes help with the local minima issue.
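To make the effect of k concrete, here's a minimal sketch. It assumes a plain logistic over evaluations measured in pawns (my own illustration; the function name is made up, and your sigmoid may be scaled differently):

```python
import math

def predicted_score(eval_pawns, k):
    """Map an evaluation in pawns to an expected game score in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-k * eval_pawns))

# With k = 0.6, a +3-pawn eval predicts a clear but not certain win
# (about 0.86), so errors in the -3..+3 range still move the cost.
# With a tiny k the curve is nearly flat at 0.5 for every position,
# so only the sign of the eval matters and the fit tells you nothing.
print(predicted_score(3.0, 0.6))
print(predicted_score(3.0, 1e-5))
```

Plotting the two curves side by side makes the flattening obvious: the tiny-k curve is indistinguishable from a horizontal line at 0.5.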
With regards to whether gradient descent is good enough for an engine, I think it's alright, especially when you can get an enormous training set and also some sane initial values for training. Using PeSTO's values as my initial values, and this as my dataset (37M positions), I was able to tune my engine from about 2200 elo to somewhere around 2600. Not bad for ten minutes of training! I don't have exact elo numbers because I haven't run any mini-matches yet, but it's now able to consistently draw against most 2600 bots.
In practice, I think the final round of tuning would have to be done by running matches, instead of comparing against a training dataset, so I won't be able to use the gradient descent tuner forever. However, for as long as I'm still tweaking the engine, it gives "good enough" results to not be worth using a more time- and resource-intensive approach.
Lastly: I think there may be a little bit of confusion here. A *partial derivative* is a single value - it's the rate of change of a function with respect to one parameter, with all the others held fixed. For instance, $\frac{\partial}{\partial w_n} E$ would be the rate of change of the error with respect to the n-th weight. Meanwhile, a *gradient* is a vector of partial derivatives. For instance, $\nabla_w E$ would be the vector containing the rate of change of E with respect to $w_n$ for each n. The sample code I gave computes the whole gradient in one go. In the mathematical expression, $X_{i, .}$ is a vector, and is multiplied by a unique scalar for each i.
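As an illustration of the whole-gradient-at-once idea (my own sketch, not the earlier sample code; it assumes the MSE cost $E = \frac{1}{N}\sum_i (\sigma(k\, X_{i,.} \cdot w) - y_i)^2$ with a logistic $\sigma$, and uses NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_gradient(X, y, w, k=0.6):
    """Gradient of E = (1/N) * sum_i (sigmoid(k * X[i] @ w) - y[i])**2
    with respect to every weight at once.

    X: (N, M) feature matrix, y: (N,) game results, w: (M,) weights.
    Returns a vector of M partial derivatives - the gradient."""
    N = X.shape[0]
    s = sigmoid(k * (X @ w))  # predicted scores, shape (N,)
    # Chain rule, for each weight n:
    #   dE/dw_n = (2k/N) * sum_i (s_i - y_i) * s_i * (1 - s_i) * X[i, n]
    # X.T @ (...) computes that sum for all n in one matrix product.
    return (2.0 * k / N) * (X.T @ ((s - y) * s * (1.0 - s)))
```

Each row $X_{i,.}$ is scaled by its own scalar error term, and the transpose-multiply sums those scaled rows into one gradient vector - which is exactly the "vector times a unique scalar for each i" structure described above.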