Hmm, unfortunately, the best error always ends up worse after tuning, and during tuning it bounces between falling and rising before eventually rising for good. So it seems I still have a sign error somewhere, or I'm misunderstanding either whether the error should be dropping or what gradient descent can actually do when tuning values from scratch.
I'm quite sure at this point my issue is calculating the gradient. I've been able to verify the correctness of every other part of the tuner (calculating coefficients correctly, calculating evaluations correctly, calculating the mean squared error, etc.), so I'll have to look at it more closely:
Code: Select all
// computeGradient returns the gradient of the mean squared error over
// all entries with respect to each weight. Requires the "math" package.
func computeGradient(entries []Entry, weights []float64, scalingFactor float64) (gradients []float64) {
	numWeights := len(weights)
	gradients = make([]float64, numWeights)
	// N in the loss's 1/N term is the number of training entries.
	n := float64(len(entries))
	for i := range entries {
		score := evaluate(weights, entries[i].Coefficents)
		sigmoid := 1 / (1 + math.Exp(-(scalingFactor * score)))
		err := entries[i].Outcome - sigmoid
		// Chain rule through the squared error and the sigmoid.
		term := -2 * scalingFactor / n * err * (1 - sigmoid) * sigmoid
		for k := range entries[i].Coefficents {
			coefficent := &entries[i].Coefficents[k]
			gradients[coefficent.Idx] += term * coefficent.Value
		}
	}
	return gradients
}
...
for i := 0; i < epochs; i++ {
	gradients := computeGradient(entries, weights, scalingFactor)
	for k, gradient := range gradients {
		weights[k] -= learningRate * gradient
	}
}
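Before anything else, one way to rule out a sign or scaling error in computeGradient would be a finite-difference check: nudge a weight by a small epsilon in each direction, recompute the mean squared error, and compare the slope against the analytic value. A rough sketch of the idea (meanSquaredError is a placeholder name for the error computation I already verified, and it needs "fmt" imported):

Code: Select all
// checkGradient compares the analytic gradient against a central
// finite-difference estimate for the first few weights.
func checkGradient(entries []Entry, weights []float64, scalingFactor float64) {
	const eps = 1e-6
	gradients := computeGradient(entries, weights, scalingFactor)
	for k := 0; k < 5 && k < len(weights); k++ {
		original := weights[k]
		weights[k] = original + eps
		errUp := meanSquaredError(entries, weights, scalingFactor)
		weights[k] = original - eps
		errDown := meanSquaredError(entries, weights, scalingFactor)
		weights[k] = original // restore before moving on
		numeric := (errUp - errDown) / (2 * eps)
		fmt.Printf("weight %d: analytic %g, numeric %g\n", k, gradients[k], numeric)
	}
}

If the two values agree in sign and rough magnitude for each weight, the gradient is fine and the problem is elsewhere (most likely the learning rate).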
The formula itself seems straightforward enough to me. Because of the loss function being used, however, I do notice the learning rate has to be quite high for any meaningful learning to happen in a reasonable number of iterations, since the 1/N term carried over into the partial derivative makes the gradient values quite small. I had to use something like 500 before I saw any change, if I remember correctly.
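For reference, writing the derivative out shows exactly where that smallness comes from. With K the scaling factor, s_i the linear evaluation of entry i, y_i its outcome, and x_{ij} its j-th coefficient:

L = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \sigma(K s_i) \right)^2

\frac{\partial L}{\partial w_j} = -\frac{2K}{N} \sum_{i=1}^{N} \left( y_i - \sigma(K s_i) \right) \sigma(K s_i) \left( 1 - \sigma(K s_i) \right) x_{ij}

Every component of the gradient carries that 2K/N factor, and with a large dataset N makes it tiny, which is what the big learning rate is compensating for.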
Perhaps I'll experiment with the loss function being used, and try something like Mr. Ramsey's.
EDIT: Changing my loss function to remove the (1/N) term and adjusting the learning rate accordingly still made the overall error rise a good bit, but the evaluation terms it gives actually look pretty good, and so far in SPRT testing they seem to be much better than the master evaluation terms. I'll have to think a bit about why this might be. I imagine the issue right now is that I still need to fine-tune the learning rate to make sure I'm not overstepping a good minimum, which would explain the error rising again.
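If overstepping is the problem, one cheap guard would be tracking the best error seen so far and shrinking the learning rate whenever an epoch makes the error worse, so later epochs take smaller steps near the minimum. A rough sketch (the 0.5 decay factor is arbitrary, and meanSquaredError is the same placeholder name as above):

Code: Select all
bestError := meanSquaredError(entries, weights, scalingFactor)
for i := 0; i < epochs; i++ {
	gradients := computeGradient(entries, weights, scalingFactor)
	for k, gradient := range gradients {
		weights[k] -= learningRate * gradient
	}
	// Halve the step size whenever the error gets worse instead of
	// better, so we stop bouncing back and forth around the minimum.
	currentError := meanSquaredError(entries, weights, scalingFactor)
	if currentError > bestError {
		learningRate *= 0.5
	} else {
		bestError = currentError
	}
}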
Writing this tuner is really making me start to question whether I ever actually understood the multivariable calc course I took last semester.
