https://www.google.com/search?client=sa ... 8&oe=UTF8
Are there others?
Anyone know the relative advantages and disadvantages?
methods for tuning coefficients
Moderators: hgm, Dann Corbit, Harvey Williamson
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 719
 Joined: Wed Mar 08, 2006 7:08 pm
 Location: Orange County California
 Full name: Stuart Cracraft
 Contact:

 Posts: 4099
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: methods for tuning coefficients
This was discussed (among other topics) quite recently in this thread: http://talkchess.com/forum3/viewtopic.php?f=7&t=68326.
Jon
Jon

 Posts: 4099
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: methods for tuning coefficients
To answer your question more directly though:
NelderMead simplex is a "black box" optimization method, which means it does not rely on the gradient of the function being available. Therefore it is commonly used for nonlinear functions (but even in that domain it is not the best method).
For evaluation functions, mostly they are linear functions of the eval parameters, so you can compute the gradient. If you know the gradient, generally you can use methods that are faster/more efficient than black box methods.
The standard method in machine learning for very large data sets is Stochastic Gradient Descent or some variant of it.
There is also a family of socalled batch gradient algorithms that do multiple iterative passes over the whole data set. I use one of these called ADAM.
Jon
NelderMead simplex is a "black box" optimization method, which means it does not rely on the gradient of the function being available. Therefore it is commonly used for nonlinear functions (but even in that domain it is not the best method).
For evaluation functions, mostly they are linear functions of the eval parameters, so you can compute the gradient. If you know the gradient, generally you can use methods that are faster/more efficient than black box methods.
The standard method in machine learning for very large data sets is Stochastic Gradient Descent or some variant of it.
There is also a family of socalled batch gradient algorithms that do multiple iterative passes over the whole data set. I use one of these called ADAM.
Jon
Re: methods for tuning coefficients
Think of this:
The static position eval is of the form:
v = Sigma Ci x Fi where C are coefficients and F are features.
Now compare that to the equation for an SLP (single layer perceptron = a neural network with only one weight layer):
O = Sigma Wi x Ii where W are weights and I are inputs.
These equations are the same and thus all chess programs have an SLP in them. Therefore, methods available
for training neural networks are open for chess programs. One interesting method is TDL (temporal difference learning)
that can be done via selfplay, opponent play or even training positions. The nice thing about TDL is that you don't need an oracle
because it is based on win, draw lose.
I suggest reading papers written by Gerald Tessauro.
The static position eval is of the form:
v = Sigma Ci x Fi where C are coefficients and F are features.
Now compare that to the equation for an SLP (single layer perceptron = a neural network with only one weight layer):
O = Sigma Wi x Ii where W are weights and I are inputs.
These equations are the same and thus all chess programs have an SLP in them. Therefore, methods available
for training neural networks are open for chess programs. One interesting method is TDL (temporal difference learning)
that can be done via selfplay, opponent play or even training positions. The nice thing about TDL is that you don't need an oracle
because it is based on win, draw lose.
I suggest reading papers written by Gerald Tessauro.

 Posts: 61
 Joined: Wed Feb 19, 2014 9:11 pm
Re: methods for tuning coefficients
Pretty sure your Temporal Difference Learning is implemented incorrectly (and highly inefficient) if you are only propagating WDL back.
TDL's "oracle" should be the engines own search()
The idea being to push the static evaluation of the current position towards the score returned by a search of it.
If you dont see why this is TDL as often explained, consider that at some point previously the current position was a leaf node and its score was the current static evaluation.
W.R.T. this stuff all using similar maths, that is because all of this stuff can be described as operations on vectors. The differences manifest not at the pairwise multiplications, but in the summations. A traditional static evaluator uses a straight summation of the products, while ML techniques will frequently shape this sum somehow (sigmoid, relu, etc..) and more advanced informationtheory approaches may be summing the squares of those products.
TDL's "oracle" should be the engines own search()
The idea being to push the static evaluation of the current position towards the score returned by a search of it.
If you dont see why this is TDL as often explained, consider that at some point previously the current position was a leaf node and its score was the current static evaluation.
W.R.T. this stuff all using similar maths, that is because all of this stuff can be described as operations on vectors. The differences manifest not at the pairwise multiplications, but in the summations. A traditional static evaluator uses a straight summation of the products, while ML techniques will frequently shape this sum somehow (sigmoid, relu, etc..) and more advanced informationtheory approaches may be summing the squares of those products.
Re: methods for tuning coefficients
I think you are confusing TDL with Reinforcement Learning. RL is as you describe. The key difference with TDL is that there is the assumption that moves made near the end of the game have more to do with the outcome than moves near the beginning. Ex. In a 120 move chess game the first move c4 has little to nothing to do with losing. Of course, a weak player can make a critically bad move on move 2DustyMonkey wrote: ↑Tue Oct 30, 2018 4:34 amPretty sure your Temporal Difference Learning is implemented incorrectly (and highly inefficient) if you are only propagating WDL back.
TDL's "oracle" should be the engines own search()
The idea being to push the static evaluation of the current position towards the score returned by a search of it.
If you dont see why this is TDL as often explained, consider that at some point previously the current position was a leaf node and its score was the current static evaluation.
(for example the classic fools mate). I am referring to far better played game than that.
I think the Leela team would be better off with TDL instead of RL.
Here you are talking about neuron activation functions and they are typically sigmoid or relu. However, theyDustyMonkey wrote: ↑Tue Oct 30, 2018 4:34 amW.R.T. this stuff all using similar maths, that is because all of this stuff can be described as operations on vectors. The differences manifest not at the pairwise multiplications, but in the summations. A traditional static evaluator uses a straight summation of the products, while ML techniques will frequently shape this sum somehow (sigmoid, relu, etc..) and more advanced informationtheory approaches may be summing the squares of those products.
can be linear  if so then that is the same as a chess programs method.
The static eval in a chess program (a good one) is a sophisticated and highly tuned expert system with
the math of a SLP with a linear activation function. Of course, some programs have some nonlinear features in them.