I've written a blog article about my experience with auto-tuning in chess4j and Prophet.
http://jamesswafford.com/2022/07/02/aut ... n-chess4j/
I hope you enjoy.

--
James
dangi12012 wrote: ↑Sat Jul 02, 2022 7:26 pm
You don't have to dive deep into neural networks, because what you are describing is already a single-layer network. What you call the hypothesis is the activation function.

Correct me if I'm wrong, since I'm also planning to begin working on a simple (768 -> 16 -> 1) neural network to experiment with in Blunder, but isn't the activation function something that's applied per neuron? Or is the term also used to apply to the whole network?
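My understanding is that the activation is applied per neuron (element-wise). Here's a minimal sketch of a 768 -> 16 -> 1 forward pass, purely illustrative, with made-up weight names, not Blunder's actual code:

Code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 768 -> 16 -> 1 network; in practice the weights come from training.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (16, 768)), np.zeros(16)  # hidden layer
W2, b2 = rng.normal(0, 0.01, (1, 16)), np.zeros(1)     # output layer

def forward(features):                      # features: length-768 0/1 vector
    hidden = sigmoid(W1 @ features + b1)    # activation applied to each hidden neuron
    return sigmoid(W2 @ hidden + b2)[0]     # ...and to the single output neuron

Strip away the hidden layer and forward() collapses to sigmoid(w · x), a single neuron, which is dangi12012's point: the texel-style "hypothesis" is just that one neuron's activation function.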
algerbrex wrote: ↑Sat Jul 02, 2022 9:58 pm
Indeed, gradient descent is much more practical than naive texel tuning. Now that I've finally switched over, development time is much quicker. I'm now able to run 50K tuning iterations on 1-2M positions in only about 4 hours, whereas with the old tuner I could only use 400-600K at most, and that would take 10-12 hours, or sometimes even longer. And the values I get are better!

It was pretty incredible running the tuner for the first time on the Zurichess dataset: in only about 5 minutes and 10K iterations, I had a set of evaluation parameters, tuned completely from scratch, that beat the old ones I had spent hours tuning by 60 Elo!

Tens of thousands of iterations? Do you not have stopping conditions? My tuning sessions usually only need 100 or so iterations before the error stops improving.
I currently don't, no, but that's something I should look into. What I found, with my tuner anyway, is that I got much higher quality results using a much smaller learning rate and more iterations than using a higher learning rate and only a couple hundred iterations: 50K iterations gave better results than 10K.
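For what it's worth, a common stopping condition is patience-based early stopping: quit once the validation error hasn't improved in some number of iterations. A generic sketch (the names and thresholds are made up, not from any of the tuners discussed here):

Code:

# Generic patience-based early stopping (illustrative only).
# `step` is assumed to run one tuning iteration and return the current validation error.
def tune(step, max_iters=50_000, patience=200, min_delta=1e-7):
    best_err, stale = float("inf"), 0
    for _ in range(max_iters):
        err = step()
        if err < best_err - min_delta:
            best_err, stale = err, 0        # meaningful improvement: reset the clock
        else:
            stale += 1
            if stale >= patience:
                break                       # validation error has stopped improving
    return best_err

This still lets you keep a small learning rate; it just stops burning iterations once they stop paying for themselves.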
jswaff wrote: ↑Sat Jul 02, 2022 6:37 pm
I've written a blog article about my experience with auto-tuning in chess4j and Prophet.
http://jamesswafford.com/2022/07/02/aut ... n-chess4j/

Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking!
From the blog post:
For each training session, the test data was shuffled and then split into two subsets. The first subset contained 80% of the records and was used for the actual training. The other 20% were used for validation (technically the “test set”). Ideally, the training error should decrease after every iteration, but with gradient descent (particularly stochastic gradient descent) that may not be the case. The important point is that the error trends downward (this is tough if you’re a perfectionist).

There is a problem here, in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A more sound option would be to take 20% of the games and turn those into validation data, and the other 80% into training data; that way the subsets are as independent as they can be.
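A minimal sketch of that game-level split, assuming a hypothetical layout where positions are grouped by game id:

Code:

import random

# games: a dict mapping game_id -> list of positions (hypothetical layout)
def split_by_game(games, val_frac=0.20, seed=42):
    ids = list(games)
    random.Random(seed).shuffle(ids)
    val_ids = set(ids[:int(len(ids) * val_frac)])
    train, val = [], []
    for gid, positions in games.items():
        (val if gid in val_ids else train).extend(positions)
    return train, val   # no game contributes positions to both sets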
AndrewGrant wrote: ↑Sun Jul 03, 2022 11:46 am
Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking!

I've been thinking about ways to tune the search parameters a good bit, and after reading Thomas Petzke's blog from a couple of years ago, I've been tempted to try some sort of genetic tuning of them.
AndrewGrant wrote: ↑Sun Jul 03, 2022 11:46 am
There is a problem here, in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A more sound option would be to take 20% of the games and turn those into validation data, and the other 80% into training data; that way the subsets are as independent as they can be.

Good point about the data in each set being independent, but in practice I'm not sure it's mattered. The datasets you've provided are large enough that I don't think overfitting is really an issue anyway, at least not for my program.
But my experience with NNUE (and HCE years ago) is that validation loss and training loss don't seem to be great metrics for anything. I'll do runs and get much lower loss, but not have it equate to Elo. I'll do runs with equal loss, and have it equate to Elo.
---
An aside -- my trainer always used Adagrad as well, mostly because my first attempt at implementing ADAM was bugged (!), and I only found out some time later when writing the NNUE trainer. I had put ADAM into the HCE trainer and done additional runs, gaining no Elo.
HOWEVER, it is my belief that if you are going to add new terms to the trainer, and then train only those terms, that ADAM is a better solution. My reason stems from my NNUE experience, which is that Adagrad simply cannot get the job done from a random init. As a result, if you add a new term and set the defaults to 0, I would expect ADAM to do a better job than Adagrad.
Speculation, however.
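For reference, here's the difference between the two update rules as a generic sketch (illustrative only, not taken from any engine's trainer). Adagrad's accumulated squared gradients only ever grow, so its effective step size only shrinks; Adam's decaying averages let it keep taking full-size steps, which would plausibly help a parameter that starts at 0 and has far to travel:

Code:

import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache += g * g                          # squared gradients accumulate forever
    w -= lr * g / (np.sqrt(cache) + eps)    # so the effective step size only shrinks
    return w, cache

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # decaying average of gradients
    v = b2 * v + (1 - b2) * g * g           # decaying average of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction (t is the 1-based step)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v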