I've written a blog article about my experience with auto-tuning in chess4j and Prophet.
http://jamesswafford.com/2022/07/02/aut ... n-chess4j/
I hope you enjoy.

--
James
dangi12012 wrote: ↑Sat Jul 02, 2022 7:26 pm
You don't have to dive deep into neural networks, because what you are describing is already a single-layer network. What you call the hypothesis is the activation function.

Correct me if I'm wrong, since I'm also planning to begin working on a simple (768 -> 16 -> 1) neural network to experiment with in Blunder, but isn't the activation function something that's applied per neuron? Or is the term also used to apply to the whole network?
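My understanding is that the activation is applied per neuron (element-wise). Here's a minimal sketch of a 768 -> 16 -> 1 forward pass, purely illustrative, with made-up weight names, not Blunder's actual code:

Code:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical 768 -> 16 -> 1 network; in practice the weights come from training.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.01, (16, 768)), np.zeros(16)  # hidden layer
W2, b2 = rng.normal(0, 0.01, (1, 16)), np.zeros(1)     # output layer

def forward(features):                      # features: length-768 0/1 vector
    hidden = sigmoid(W1 @ features + b1)    # activation applied to each hidden neuron
    return sigmoid(W2 @ hidden + b2)[0]     # ...and to the single output neuron

Strip away the hidden layer and forward() collapses to sigmoid(w · x), a single neuron, which is dangi12012's point: the texel-style "hypothesis" is just that one neuron's activation function.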
algerbrex wrote: ↑Sat Jul 02, 2022 9:58 pm
Indeed, gradient descent is much more practical than naive texel tuning. Now that I've finally switched over, development time is much quicker. I'm now able to run 50K tuning iterations on 1-2M positions in only about 4 hours, whereas with the old tuner I could only use 400-600K at most, and that would take 10-12 hours, or sometimes even longer. And the values I get are better!

It was pretty incredible running the tuner for the first time on the Zurichess dataset: in only about 5 minutes and 10K iterations, I had a set of evaluation parameters, tuned completely from scratch, that beat the old ones I had spent hours tuning by 60 Elo!

Tens of thousands of iterations? Do you not have stopping conditions? My tuning sessions usually only need 100 or so iterations before the error stops improving.
I currently don't, no, but that's something I should look into. What I found, with my tuner anyway, is that I got much higher quality results using a much smaller learning rate and more iterations than using a higher learning rate and only a couple hundred iterations: 50K iterations gave better results than 10K.
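For what it's worth, a common stopping condition is patience-based early stopping: quit once the validation error hasn't improved in some number of iterations. A generic sketch (the names and thresholds are made up, not from any of the tuners discussed here):

Code:

# Generic patience-based early stopping (illustrative only).
# `step` is assumed to run one tuning iteration and return the current validation error.
def tune(step, max_iters=50_000, patience=200, min_delta=1e-7):
    best_err, stale = float("inf"), 0
    for _ in range(max_iters):
        err = step()
        if err < best_err - min_delta:
            best_err, stale = err, 0        # meaningful improvement: reset the clock
        else:
            stale += 1
            if stale >= patience:
                break                       # validation error has stopped improving
    return best_err

This still lets you keep a small learning rate; it just stops burning iterations once they stop paying for themselves.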
jswaff wrote: ↑Sat Jul 02, 2022 6:37 pm
I've written a blog article about my experience with auto-tuning in chess4j and Prophet.
http://jamesswafford.com/2022/07/02/aut ... n-chess4j/

Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking!
From the blog post:
For each training session, the test data was shuffled and then split into two subsets. The first subset contained 80% of the records and was used for the actual training. The other 20% were used for validation (technically the “test set”). Ideally, the training error should decrease after every iteration, but with gradient descent (particularly stochastic gradient descent) that may not be the case. The important point is that the error trends downward (this is tough if you’re a perfectionist).

There is a problem here, in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A more sound option would be to take 20% of the games and turn those into validation data, and the other 80% into training data; that way the subsets are as independent as they can be.
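A minimal sketch of that game-level split, assuming a hypothetical layout where positions are grouped by game id:

Code:

import random

# games: a dict mapping game_id -> list of positions (hypothetical layout)
def split_by_game(games, val_frac=0.20, seed=42):
    ids = list(games)
    random.Random(seed).shuffle(ids)
    val_ids = set(ids[:int(len(ids) * val_frac)])
    train, val = [], []
    for gid, positions in games.items():
        (val if gid in val_ids else train).extend(positions)
    return train, val   # no game contributes positions to both sets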
AndrewGrant wrote: ↑Sun Jul 03, 2022 11:46 am
Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking!

I've been thinking about ways to tune the search parameters a good bit, and after reading Thomas Petzke's blog from a couple of years ago, I've been tempted to try some sort of genetic tuning of them.
AndrewGrant wrote: ↑Sun Jul 03, 2022 11:46 am
There is a problem here, in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A more sound option would be to take 20% of the games and turn those into validation data, and the other 80% into training data; that way the subsets are as independent as they can be.

Good point about the data in each set being independent, but in practice I'm not sure it's mattered. The datasets you've provided are large enough that I don't think overfitting is really an issue anyway, at least not for my program.
But my experience with NNUE (and HCE years ago) is that validation loss and training loss don't seem to be great metrics for anything. I'll do runs and get much lower loss, but not have it equate to Elo. I'll do runs with equal loss, and have it equate to Elo.
---
An aside -- my trainer always used Adagrad as well, mostly because my first attempt at implementing ADAM was bugged (!), and I only found out some time later when writing the NNUE trainer. I had put ADAM into the HCE trainer and done additional runs, gaining no Elo.
HOWEVER, it is my belief that if you are going to add new terms to the trainer, and then train only those terms, that ADAM is a better solution. My reason stems from my NNUE experience, which is that Adagrad simply cannot get the job done from a random init. As a result, if you add a new term and set the defaults to 0, I would expect ADAM to do a better job than Adagrad.
Speculation, however.
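For reference, here's the difference between the two update rules as a generic sketch (illustrative only, not taken from any engine's trainer). Adagrad's accumulated squared gradients only ever grow, so its effective step size only shrinks; Adam's decaying averages let it keep taking full-size steps, which would plausibly help a parameter that starts at 0 and has far to travel:

Code:

import numpy as np

def adagrad_step(w, g, cache, lr=0.01, eps=1e-8):
    cache += g * g                          # squared gradients accumulate forever
    w -= lr * g / (np.sqrt(cache) + eps)    # so the effective step size only shrinks
    return w, cache

def adam_step(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # decaying average of gradients
    v = b2 * v + (1 - b2) * g * g           # decaying average of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction (t is the 1-based step)
    v_hat = v / (1 - b2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v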