How to calc the derivative for gradient descent?

BrianNeal · Post by **BrianNeal** » Mon Jan 04, 2021 10:03 pm

Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.

AndrewGrant · Post by **AndrewGrant** » Tue Jan 05, 2021 1:15 am

BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.

Might not directly answer your question, but I can plug something I wrote on the topic.

There is a section dedicated to the derivative of each term.

AlvaroBegue · Post by **AlvaroBegue** » Tue Jan 05, 2021 5:00 pm

This is much easier to do in C++, because you can just use a custom data type instead of `float' and the gradient will be computed for you automatically.

For instance, see the example here: https://github.com/autodiff/autodiff

It's not hard to write your own reverse-mode automatic differentiation library in C++.

D Sceviour · Post by **D Sceviour** » Tue Jan 05, 2021 6:00 pm

BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.

My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.

tomitank · Post by **tomitank** » Tue Jan 05, 2021 9:32 pm

D Sceviour wrote: ↑Tue Jan 05, 2021 6:00 pm
BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.
My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.

If the eval is linear, then you have only a single global minimum.

Pio · Post by **Pio** » Tue Jan 05, 2021 9:38 pm

D Sceviour wrote: ↑Tue Jan 05, 2021 6:00 pm
BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.
My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.

One way of making convergence faster and more reliable is to do batch updates. The reason is that with batch updates you get much more reliable approximation of the direction. When I made some simple CNN’s for picture recognition I observed that doing big batches helped reducing the validation error enormously and my net could become much much better. The training error was a little bit worse (if I don’t remember it wrong) and that is just good. What it shows is that when you are doing single updates, each update will be like a Brownian motion just drifting more and more from optimum/optima and the optimum it will attract is probably only good for your training set.

If you have a very good set of values and just want to get a good value for a new feature I have a really good way of doing so.

D Sceviour · Post by **D Sceviour** » Tue Jan 05, 2021 11:02 pm

tomitank wrote: ↑Tue Jan 05, 2021 9:32 pm
D Sceviour wrote: ↑Tue Jan 05, 2021 6:00 pm
BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.
My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.
If the eval is linear, then you have only a single global minimum.

How does one prove the eval is linear? From the information available at a glance it appears to be a struggle for mathematicians:

https://math.stackexchange.com/question ... al-minimum
https://math.stackexchange.com/question ... al-minimum

Can anybody translate the links to NN chess for dummies? King placement is very important and this seems to be a starting point for NNUE tuning. What about pawn promotion which may or may not be king placement sensitive?

D Sceviour · Post by **D Sceviour** » Tue Jan 05, 2021 11:08 pm

Pio wrote: ↑Tue Jan 05, 2021 9:38 pm
D Sceviour wrote: ↑Tue Jan 05, 2021 6:00 pm
BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.
My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.
One way of making convergence faster and more reliable is to do batch updates. The reason is that with batch updates you get much more reliable approximation of the direction. When I made some simple CNN’s for picture recognition I observed that doing big batches helped reducing the validation error enormously and my net could become much much better. The training error was a little bit worse (if I don’t remember it wrong) and that is just good. What it shows is that when you are doing single updates, each update will be like a Brownian motion just drifting more and more from optimum/optima and the optimum it will attract is probably only good for your training set.

If you have a very good set of values and just want to get a good value for a new feature I have a really good way of doing so.

Thank you. That is very good explanation of the importance of batch updating. This looks worth further experimentation.

Pio · Post by **Pio** » Tue Jan 05, 2021 11:47 pm

D Sceviour wrote: ↑Tue Jan 05, 2021 11:02 pm
tomitank wrote: ↑Tue Jan 05, 2021 9:32 pm
D Sceviour wrote: ↑Tue Jan 05, 2021 6:00 pm
BrianNeal wrote: ↑Mon Jan 04, 2021 10:03 pm Hello, this is my first post.
I'd like to know which way you suggest for calculating the derivative of the evaluation for each parameter for gradient descent with Texel tuning. I've read about (Eval(xi+1)-Eval(xi-1))/2, Eval(xi+1)-Eval(xi), auto differentiation libraries, Jacobian matrix and so forth.
I'm currently using local search. My engine is written in C.
My question is : What does the slope do once it is calculated? The calculation is for a variable based upon a local minimum. However, there may be several local minimums for a position (king safety, pawn promotions, etc.) , and the current best estimator will change with the next iteration, and the slope is in the wrong direction for the convergence.
If the eval is linear, then you have only a single global minimum.
How does one prove the eval is linear? From the information available at a glance it appears to be a struggle for mathematicians:

https://math.stackexchange.com/question ... al-minimum
https://math.stackexchange.com/question ... al-minimum

Can anybody translate the links to NN chess for dummies? King placement is very important and this seems to be a starting point for NNUE tuning. What about pawn promotion which may or may not be king placement sensitive?

The discussion about linearity in evaluation function is very bad. You might have a function that is linear with respect to your features, I.e. a linear combination of those features, but the features themselves might not be linear. You can also create a linear function of your highly non linear evaluation function. Just define your entire evaluation function as a feature and you have a linear evaluation function.

I have previously had the idea about doing a pawn eval using an NN since it can be reused 99 % of the time. I guess you could do a very advanced neural network function for the pawns only, maybe learn 16 bytes output for each square that will later be used as input to kings’ placement, combining the info with a new NN producing a more complex representation per square where the original pawn output will be identity mapped and the combined pawn and king output will go to another 8 bytes or so. The nice thing is that you will seldom move your king so a big part of the calculation can be saved.

You can do the same thing for other pieces but I don’t think you will gain so much since they usually move a lot.

Good luck with your engine !!!

D Sceviour · Post by **D Sceviour** » Wed Jan 06, 2021 12:52 am

Pio wrote: ↑Tue Jan 05, 2021 11:47 pm The discussion about linearity in evaluation function is very bad. You might have a function that is linear with respect to your features, I.e. a linear combination of those features, but the features themselves might not be linear. You can also create a linear function of your highly non linear evaluation function. Just define your entire evaluation function as a feature and you have a linear evaluation function.

I believe chess evaluation is non-linear, buy any argument is welcome that would demonstrate linearity. However, creating a linear function of a highly non-linear evaluation function should lead to very bad results.

How to calc the derivative for gradient descent?

How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?

Re: How to calc the derivative for gradient descent?