NNUE accessible explanation

jorose · Post by **jorose** » Fri Jul 24, 2020 10:22 pm

I think in general a lot of why certain architectures are chosen over others boils down to "because we tried others and this worked the best". That's not to say there is no justification, just that it's not black and white. Perhaps, however, I can give you a bit of my intuition with different aspects.

Convolutional Networks and Stride
Generally, when working in the image space, convolutional neural networks are superior to fully connected networks, because they are easier to train (they have some implicit regularization due to weight re-use) and their modularity makes them more suitable for GPU computation. Since chessboards are very small (8x8 vs 1080x1920 for a 1080p image) it's less clear that this is the case for our game. Convolutional networks usually use fully connected layers at the end as they try to move from the image space to a more general feature space. Stride > 1 is mostly a tool for dimensionality reduction. For chess this is pretty much useless, as there is little reason to want to reduce to 4x4 or smaller. I would much sooner just switch to fully connected layers.
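To make the stride remark concrete, here is a small sketch using the usual output-size formula for a square convolution. The 3x3 kernel and padding of 1 are assumptions picked for illustration, not something any particular engine is claimed to use.

```python
def conv_output_size(n, kernel=3, stride=1, padding=1):
    """Side length of the output of a square convolution on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1


# An 8x8 board keeps its size with stride 1 and shrinks with stride 2.
print(conv_output_size(8, stride=1))
print(conv_output_size(8, stride=2))
```

With these assumed settings a stride of 2 already takes the board down to 4x4, which is the kind of reduction described above as having little use for chess.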

Filter Size
For general problems filter size selection is a slightly more interesting question, but even there a 3x3 filter is very popular, so you should have a very good reason if you are doing something else. I actually have experimented with other values for this in Winter, but I'm not in the mood for getting into those details at the moment.

Activation Functions
There is a lot of research trying to find better activation functions. The goal is usually to obtain better gradients in order to be able to train even larger and more complex neural networks. In recent literature people have been training networks with hundreds of layers. OpenAI recently released information about their latest GPT-3 natural language model. It has 175 billion parameters, which would take 700GB to store assuming 32 bit precision. This is all to say the research being done is mostly to push boundaries and solve problems you probably do not have. A solid rule of thumb is to use ReLU for all layers with the exception of the output layer. The output layer should be one of linear, sigmoid, softmax, tanh or ReLU, depending on what the data demands. ReLU is a very good activation function as it is very efficient to compute and suffers much less from the vanishing gradient problem when compared to sigmoid.
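A rough way to see the vanishing gradient point is to multiply the activation derivative through a stack of layers. The sketch below ignores the weights completely, which is a simplifying assumption, and the depth of 10 and the pre-activation value of 2.0 are arbitrary choices for illustration.

```python
import math


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid; it never exceeds 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


depth = 10            # hypothetical number of layers
pre_activation = 2.0  # hypothetical value seen at every layer

# Chain rule factor contributed by the activations alone.
relu_chain = relu_grad(pre_activation) ** depth
sigmoid_chain = sigmoid_grad(pre_activation) ** depth
print(relu_chain, sigmoid_chain)
```

The ReLU factor stays at 1 as long as the unit is active, while the sigmoid factor shrinks with every layer, which is roughly why deep sigmoid networks are harder to train.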

Layer Size and Neurons Per Layer
The optimal number of layers and number of neurons per layer is going to be very problem specific. More layers increase computational complexity and the number of parameters linearly in the number of layers, while simultaneously allowing the neurons to contain higher level information. The main downside is that more layers tend to be quite a bit harder to train. This is more problematic for fully connected networks than for convolutional networks, but is definitely problematic in both cases. More neurons per layer will increase the number of parameters quadratically. If your network is very small, an increase in the number of units may not have too big an impact. For larger networks, I believe computational complexity will increase quadratically as well, but don't quote me on that. The important thing is that you need enough units to represent the information you want the network to learn.
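The linear versus quadratic growth can be checked by simply counting weights and biases of a fully connected network. The layer sizes below are made up for the example.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


base = dense_param_count([64, 32, 32, 1])
deeper = dense_param_count([64, 32, 32, 32, 1])  # one more hidden layer
wider = dense_param_count([64, 64, 64, 1])       # hidden width doubled
print(base, deeper, wider)
```

Adding a hidden layer adds one more block of 32 * 32 + 32 parameters, while doubling the width quadruples the hidden-to-hidden weight matrix from 32 * 32 to 64 * 64.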

Overparametrization
As a side note, most state of the art neural networks are heavily overparametrized. Since for most problems we only care about reducing the horrible training times, there is not as much work on reducing this issue. Thanks to a desire to have neural networks on mobile devices and the progressively larger networks we are able to train this has changed a bit, but nevertheless, for chess this is actually much more important. We care very much about inference time, as better inference time implies more nodes per second for our search algorithm. The size of the network and the inference time are related, but it's not a one to one relation.

AlphaZero and LC0 Network Architectures
AlphaZero and LC0 are both based on the well known ResNet neural network architecture. AlphaZero introduced a dual headed output for policy and evaluation. The LC0 team extended this with squeeze-and-excitation (SE) layers and, I would imagine, many other ideas I missed as I haven't been following too closely. At their heart, however, the architectures are quite standard for image recognition.

NNUE Architecture
The NNUE network, in my understanding, is very non-standard and completely designed for being efficient for the task at hand. The input layer is heavily overparametrized, which is normally a bad thing, but due to the known sparsity it is actually very efficient to compute. The number of layers and neurons after that is kept low, in order to avoid too much computational burden. This makes it extremely fast to compute relative to the LC0 network while still having a fairly high amount of power.
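A toy sketch of why a huge but sparse input layer can be cheap: only the active inputs contribute, and when a few inputs change the previous result can be adjusted instead of recomputed. This is not the actual NNUE code; the sizes, the random weights and the function names are hypothetical.

```python
import random

# Hypothetical sizes, chosen only to keep the example small.
NUM_FEATURES = 1000   # length of the sparse 0/1 input vector
HIDDEN = 8            # neurons in the first layer

random.seed(0)
# weights[f] holds the weights from input feature f to every first-layer neuron.
weights = [[random.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
           for _ in range(NUM_FEATURES)]
bias = [0.0] * HIDDEN


def first_layer(active_features):
    """Pre-activation of the first layer, touching only the active inputs."""
    acc = list(bias)
    for f in active_features:
        for j in range(HIDDEN):
            acc[j] += weights[f][j]
    return acc


def update(acc, removed, added):
    """Adjust an existing result after a few inputs switched off or on."""
    new_acc = list(acc)
    for f in removed:
        for j in range(HIDDEN):
            new_acc[j] -= weights[f][j]
    for f in added:
        for j in range(HIDDEN):
            new_acc[j] += weights[f][j]
    return new_acc


before = first_layer([3, 250, 777])
after = update(before, removed=[250], added=[251])
```

The cost of `first_layer` depends on the number of active features rather than on `NUM_FEATURES`, and `update` only touches the features that changed.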

Winter Architecture
As a final note, the Winter NN has two main parts. The first part is a non-standard convolutional neural network which uses sparsity similarly to the NNUE network. This convolutional network is used to calculate pawn structure features, so the output can be reused very often as it gets stored in a separate hash table with a high hitrate. The second part is a fully connected network which has as input the output of the convolutional network as well as a set of handcrafted features standard to classical engines such as SF. This set of features is mostly a subset of the features from before I added neural networks to Winter.

Hopefully this clarifies a lot of questions. As this actually took some time to type up I might further extend and clean this up to make a separate post for people interested in getting into neural networks for chess-like games.

fierz · Post by **fierz** » Fri Jul 24, 2020 11:14 pm

Dear Jonathan,

thanks for the very detailed explanations, they are much appreciated!
I will contribute to your idea of writing something for newbies to NNs by asking some more stupid newbie questions which you can then answer and make an even better introduction.

1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

2) Do I code my NN myself in C or do I use libraries for this? If libraries, which ones (BLAS?)? It would seem to me that it is very easy to implement the NN code "by hand" (just a few matrix multiplications) but perhaps it will be inefficient if not using the right libraries?

3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different than the Texel tuning method. Is this distinction of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?

Sorry for the really stupid questions... but perhaps other people have them too...

best regards
Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!

fabianVDW · Post by **fabianVDW** » Sat Jul 25, 2020 11:25 am

I can chime in on two of the points and maybe provide a bit of a mathematical background, although I am unsure how much of this you know already.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm
1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

There are several ways to go about this, and it all depends on what you are training your net on. What you describe is what can be done during training. You could also do this with a sigmoid, which ranges from 0 to 1, and interpret the output as a probability. If you want to use the "Centipawn" score or any other linear score internally for your engine, you would then need to take the inverse sigmoid/tanh of your output, which is the same as doing no activation at all for the last layer. So for inference it might make sense to turn off the activation of the last layer. It is to be noted here that for centipawn, it might make sense to multiply the inverse sigmoid of the output by a factor of K, where K can range from 50-200. Another approach is to have a Win-Draw-Loss output, which is done with 3 neurons and a softmax activation (don't quote me on that, not sure), so all three probabilities would be predicted separately (softmax makes them sum to 1), which has some advantages over only having one probability representing the expected game outcome.
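The inverse sigmoid step can be sketched as follows. The default K of 100 is just one value from the 50-200 range mentioned above, and the clamping constant is an assumption added to avoid taking the log of zero.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p, eps=1e-7):
    """Inverse sigmoid; p is clamped away from 0 and 1 to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def to_centipawns(p, k=100.0):
    """Map a predicted probability to a linear score scaled by K."""
    return k * logit(p)


print(to_centipawns(0.5), to_centipawns(0.75))
```

Since `logit(sigmoid(x))` gives back `x`, skipping the sigmoid at inference and scaling the raw output by K should give the same score. For a tanh output the analogous inverse is `math.atanh`.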

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100).

Texel's tuning method is actually very similar to training a dense neural network. You can think of your evaluation as a one-layer neural network with no activation function for the most part, and one output neuron (or maybe two, if you do a phased evaluation). Of course, most engines have some non-linear stuff in their evaluation, most typically found in attack evaluation.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

The gradient is essentially the same as the derivative. From what I learned, by definition the gradient is the transpose of the derivative vector, but I know there are definitions where it is exactly the same. You can think of your NN (with a fixed input) as a function F: R^n -> R, where R are the real numbers, and n is the dimension, so the number of weights of your network. Then F is differentiable for the most part, and the gradient of F at a vector x in R^n (denoted as (Grad F)(x)) points in the direction of steepest ascent, which is easy to prove from the Cauchy-Schwarz inequality and the definition of the derivative. Similarly, -1 * (Grad F)(x) points in the direction of steepest descent. This is what justifies the gradient descent method. Now there are two ways of calculating the gradient.

What you are doing is approximating the gradient by the difference quotient and the fact that (Grad F)(x) = ( (partial_1 F)(x), ..., (partial_n F)(x))^T.
The definition is (partial_j F)(x) = lim as s -> 0 of (F(x + s*e_j) - F(x)) / s, where e_j is the j-th unit vector in R^n. What you are doing is essentially approximating this limit by explicitly calculating (F(x + 1*e_j) - F(x)) / 1, so plugging in s = 1 (or other values for s). This works pretty well in practice, but has a few disadvantages:
1) This approach scales linearly in the dimension n (because for each partial derivative you will have to recalculate F(x + s*e_j)). This is particularly bad for NNs, where the number of parameters usually explodes.
2) It is just an approximation, and especially for neural networks there may be some numerical instability when doing this, for instance, in the first layer while your network has, say, 8-9 layers.
3) A big oversight in this approach is the following: "Why is F even differentiable?" If you ask yourself this question, you will find out that with this approach you never have to think about it. You can "approximate" the "gradient" of any function like that, even if it does not make sense at all, because not every function is differentiable.

What is done instead is calculating the gradient of F analytically with derivative rules, so the chain rule, addition rule and so forth, taking advantage of the fact that we can formulate F(x) analytically with sums (or matrix multiplies). Using the chain rule, you will find that you will indeed have to take the derivative of the ReLU function, which is easier to calculate than the derivative of sigmoid. You can look up the formulas or try to come up with them yourself, but if you use a framework like Tensorflow/Keras or PyTorch to train a network, you will not be confronted with this, and all you're left to do might be implementing the inference in the language of your choice.
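The two ways of getting a gradient can be compared on a tiny example. The sketch assumes a one-layer "network" with a sigmoid output and a squared error against the game result; the weights, features and step size s are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Squared error of the prediction sigmoid(w . x) against target y."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return (p - y) ** 2


def numeric_gradient(w, x, y, s=1e-6):
    """Difference quotient: one extra loss evaluation per weight."""
    base = loss(w, x, y)
    grad = []
    for j in range(len(w)):
        shifted = list(w)
        shifted[j] += s
        grad.append((loss(shifted, x, y) - base) / s)
    return grad


def analytic_gradient(w, x, y):
    """Chain rule: dL/dw_j = 2 * (p - y) * p * (1 - p) * x_j."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    common = 2.0 * (p - y) * p * (1.0 - p)
    return [common * xj for xj in x]


w = [0.2, -0.4, 0.1]   # hypothetical weights
x = [1.0, 2.0, -1.0]   # hypothetical features of one position
y = 1.0                # game result for that position
print(numeric_gradient(w, x, y))
print(analytic_gradient(w, x, y))
```

Both should agree to several decimal places here, but the numeric version needs one loss evaluation per weight while the analytic one reuses a single forward pass.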

jorose · Post by **jorose** » Mon Aug 03, 2020 6:53 am

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 1) If I want the NN to return an evaluation, which could be used in a traditional alphabeta search engine, how is this achieved most commonly? Do I have a last hidden layer that feeds into a single output neuron which uses e.g. a tanh activation function so -1 would be a loss and +1 would be a win? Or is there some other "standard" approach?

That is pretty much the standard approach for a single output value. For training purposes you can consider using sigmoid instead of tanh and then switching to either tanh or just using the logit without activation for gameplay. In this case you would use cross entropy loss and assign 1/0.5/0 as outputs for WDL.

For multiple output values (eg W/D/L) the standard approach would be to use a softmax activation function paired with cross-entropy loss. In Winter I have two outputs (W/W+D which trivially imply W/D/L) which are activated by sigmoid function with cross entropy loss. Whether this non-standard approach is still worth it at the moment, I cannot say.
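Both output schemes can be sketched in a few lines. The function names are hypothetical and this is not Winter's actual code; in particular, the two-output conversion assumes the network's W+D output is at least as large as its W output, which two separate sigmoids do not guarantee by themselves.

```python
import math


def softmax(logits):
    """Turn three raw outputs into W/D/L probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target):
    """Cross-entropy loss between predicted and target distributions."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)


def wdl_from_two_outputs(p_win, p_win_or_draw):
    """Recover (win, draw, loss) from the W and W+D outputs."""
    return p_win, p_win_or_draw - p_win, 1.0 - p_win_or_draw


def expected_score(win, draw, loss):
    """Collapse W/D/L to one number, counting a draw as half a point."""
    return win + 0.5 * draw


probs = softmax([2.0, 1.0, 0.1])
print(probs, cross_entropy(probs, [1.0, 0.0, 0.0]))
print(wdl_from_two_outputs(0.3, 0.8))
```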

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 2) Do I code my NN myself in C or do I use libraries for this? If libraries, which ones (BLAS?)? It would seem to me that it is very easy to implement the NN code "by hand" (just a few matrix multiplications) but perhaps it will be inefficient if not using the right libraries?

For training I would recommend using some high level library such as TensorFlow 2.0 (which I currently use) or PyTorch. This has a couple of benefits. You can easily experiment with different architectures containing any number of state of the art features you might not want to implement unless the architecture looks like a clear improvement. It is a natural sanity check, as you can compare the output of the network in your engine with the output from the library for some fixed input. Finally, you don't need to implement any training algorithms yourself. To me at least, the training algorithms are significantly more complicated than just implementing support for inference, so they have a bigger potential for bugs and could make your codebase bigger and more convoluted.

As for inference, different projects rely on support on different levels.

I don't know any projects which rely on high level libraries directly for neural network support. I think that is a fine route if you don't need the absolute maximum of performance and customization options. The main downside is that something like NNUE, where you are doing some unusual tricks, might be hard or even impossible to do. Furthermore, you might not actually save a lot of work if you are working with a low level language. I find documentation a bit lacking for C/C++ in combination with higher level libraries and most of the functions you have to implement for inference are not actually that complicated anyways.

On GPU, LC0 relies on Cuda and CUDNN or alternatively on OpenCL. If you want to support GPUs, I would suggest you do the same. Support for GPUs is likely not worth it for you if you implement some NNUE based system or just a small fully connected network.

On CPU, LC0 relies on Blas. Afaik Blas backends can be used interchangeably and share the same basic interface.

Allie relies on the LC0 code for its Neural Network implementation. I think this has been a minor point of discussion in terms of debates on cloning, but that is outside of the scope of this thread. I mainly mention it here, because all engines I am mentioning in this post are open source and it might give you another reference point.

I don't think SF-NNUE relies on Blas, instead working directly with optimized code and SIMD CPU instructions.

Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.

Winter doesn't rely on an external backend for inference and in fact doesn't even explicitly use SIMD instructions. The primary reason being that the last time I checked the assembly code, the compiler seemed to be doing that reasonably well. Somewhere in the middle of my TODO list is to add explicit support for SIMD instructions, so I don't have to trust the compiler anymore.

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

Based on your writing I think you are close. Perhaps you should read up some more on backpropagation, as you should be calculating the derivatives at every layer. Backpropagation is more or less just a clever and optimized use of the chain rule for derivatives.

It might help to write out the neural network in the form of L(A(B(C(x))),y) where L is the loss function, A,B and C are the layers with their respective weight parameters, x is the input and y is the target label. Then think about how you would calculate the derivative to update the parameters in A, B and C respectively.
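As a minimal sketch of that exercise, assume each of A, B and C is a single scalar weight, B and C are followed by ReLU, A is linear, and L is the squared error. Real layers are matrices, but the chain rule steps have the same shape.

```python
def relu(z):
    return z if z > 0 else 0.0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def forward(params, x):
    """Compute C, then B, then A, keeping the intermediate values."""
    a, b, c = params
    zc = c * x
    hc = relu(zc)      # output of layer C
    zb = b * hc
    hb = relu(zb)      # output of layer B
    out = a * hb       # output of layer A (no activation)
    return zc, hc, zb, hb, out


def backprop(params, x, y):
    """Gradient of L = (out - y)^2 with respect to a, b and c."""
    a, b, c = params
    zc, hc, zb, hb, out = forward(params, x)
    d_out = 2.0 * (out - y)          # dL/d(out)
    grad_a = d_out * hb              # through layer A
    d_hb = d_out * a
    d_zb = d_hb * relu_grad(zb)      # through B's ReLU
    grad_b = d_zb * hc
    d_hc = d_zb * b
    d_zc = d_hc * relu_grad(zc)      # through C's ReLU
    grad_c = d_zc * x
    return grad_a, grad_b, grad_c


print(backprop((0.5, 1.5, 2.0), x=1.0, y=0.0))
```

Each gradient reuses the quantity passed back from the layer above it, which is the "clever and optimized" part: nothing is recomputed from scratch per weight.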

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm 4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different than the Texel tuning method. Is this distinction of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?

The distinction is somewhat relevant. For more information you can read up on temporal difference learning, which is more generally what SF-NNUE is doing. Giraffe did another form of this and I have previously done temporal difference learning in Winter as well.

The main advantages are that such an approach is less data reliant (you can train with positions where you don't have a game outcome) and your labels could potentially end up being more accurate. E.g. game result alone cannot really quantify if a position is unbalanced, but dynamically equal.

Temporal difference learning and other forms of reinforcement learning have their own issues though, and I would recommend at least starting out with pure supervised learning of game outcomes. Once you know that is working, you can try out temporal difference learning or self play for comparison.

At some point the best in Winter was labelling positions as some weighted average between a low depth search and the actual game outcome.
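Such a blended label could look roughly like this. The 50/50 weight, the scaling constant K and the function names are assumptions for illustration, not Winter's actual values.

```python
import math


def search_score_to_probability(centipawns, k=100.0):
    """Squash a search score into an expected result between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-centipawns / k))


def blended_label(search_centipawns, game_result, search_weight=0.5, k=100.0):
    """Weighted average of a shallow search score and the game outcome.

    game_result is 1.0 for a win, 0.5 for a draw and 0.0 for a loss.
    """
    search_prob = search_score_to_probability(search_centipawns, k)
    return search_weight * search_prob + (1.0 - search_weight) * game_result


# A position the search calls equal, taken from a game that was won.
print(blended_label(0.0, 1.0))
```

Setting `search_weight` to 0 gives pure supervised learning on game outcomes, and setting it to 1 trains purely on the search score.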

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!

Indeed I am in Indiana and doing ok. I just moved out of my previous apartment and will be moving into my new one after a brief stay with friends. Hopefully I will be able to visit everyone back home for a month or so around Christmas!

Milos · Post by **Milos** » Mon Aug 03, 2020 8:21 am

jorose wrote: ↑Mon Aug 03, 2020 6:53 am As for inference, different projects rely on support on different levels.

I don't know any projects which rely on high level libraries directly for neural network support. I think that is a fine route if you don't need the absolute maximum of performance and customization options. The main downside is that something like NNUE, where you are doing some unusual tricks, might be hard or even impossible to do. Furthermore, you might not actually save a lot of work if you are working with a low level language. I find documentation a bit lacking for C/C++ in combination with higher level libraries and most of the functions you have to implement for inference are not actually that complicated anyways.

On GPU, LC0 relies on Cuda and CUDNN or alternatively on OpenCL. If you want to support GPUs, I would suggest you do the same. Support for GPUs is likely not worth it for you if you implement some NNUE based system or just a small fully connected network.

On CPU, LC0 relies on Blas. Afaik Blas backends can be used interchangeably and share the same basic interface.

Allie relies on the LC0 code for its Neural Network implementation. I think this has been a minor point of discussion in terms of debates on cloning, but that is outside of the scope of this thread. I mainly mention it here, because all engines I am mentioning in this post are open source and it might give you another reference point.

I don't think SF-NNUE relies on Blas, instead working directly with optimized code and SIMD CPU instructions.

Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.

Nice write-up Jonathan.
Here are a couple of additional clarifications.
SF-NNUE indeed uses handcrafted SIMD code, mainly because it has a lot of sparse update operations. Here is what it actually looks like:
https://github.com/nodchip/Stockfish/bl ... nsformer.h

Eigen is very good for more complex operations (beyond winograd convolutions, matrix-matrix multiply and dot products) on a single thread. On multi-threads nothing much except general matrix-matrix products is implemented.
In general using Blas as backend is a much safer bet.

P.S. Btw. do you by any chance know Celestine Dünner from your ETHZ days?

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Aug 04, 2020 7:59 pm

Milos wrote: ↑Mon Aug 03, 2020 8:21 am On CPU, LC0 relies on Blas. Afaik Blas backends can be used interchangeably and share the same basic interface.

There's an Eigen backend as well, but it is just used for basic matrix/matrix multiplies, i.e. as an alternative BLAS library.

Giraffe relies on Eigen. I am not really familiar with Eigen aside from knowing that it is a general purpose linear algebra library. If we are lucky, Matthew Lai could pop in and explain its benefits and whether he would recommend using that.
Eigen is very good for more complex operations (beyond winograd convolutions, matrix-matrix multiply and dot products) on a single thread. On multi-threads nothing much except general matrix-matrix products is implemented.
In general using Blas as backend is a much safer bet.

One reason why Eigen is nice is that it is a header only library, so you don't need to compile and link it separately. This simplifies the build process. For small matrices Eigen's operations can be inlined better and perform faster because the compiler's optimizer has more visibility. Also the free BLAS libraries like OpenBLAS can be rather buggy.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Aug 04, 2020 8:30 pm

jorose wrote: ↑Thu Jul 23, 2020 4:12 pm Notably, my understanding is nodchip also has as input for each square whether a capture occurred there.

Ha, interesting, there's a lot of tricks one can do here. You can also cheat with piece types. It's not clear to me whether NNUE can distinguish between castling state right now? But you can fix such issues by making separate "king" and "king that can never castle" pieces. Or "pawn" and "passed pawn" pieces, if you want to "help" the neural network a bit.

Pio · Post by **Pio** » Tue Aug 04, 2020 10:32 pm

Gian-Carlo Pascutto wrote: ↑Tue Aug 04, 2020 8:30 pm
jorose wrote: ↑Thu Jul 23, 2020 4:12 pm Notably, my understanding is nodchip also has as input for each square whether a capture occurred there.
Ha, interesting, there's a lot of tricks one can do here. You can also cheat with piece types. It's not clear to me whether NNUE can distinguish between castling state right now? But you can fix such issues by making separate "king" and "king that can never castle" pieces. Or "pawn" and "passed pawn" pieces, if you want to "help" the neural network a bit.

Actually you need to have the castling rights in the rooks, so that a rook with a castling opportunity is a castlingRook piece. I do that and it works. Whenever the king moves, both castlingRooks, if existing, will be transformed to normal rooks, and whenever a castlingRook moves it will become a normal rook. I also put the en passant pawn as a special piece. In that way I have eight piece types that can be encoded in three bits.
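A sketch of how such an encoding might look. The exact list of eight types, the enum values and the board layout as a dictionary are guesses for illustration; only the castlingRook and en passant pawn ideas come from the description above.

```python
from enum import IntEnum


class Piece(IntEnum):
    """Eight piece types, so each fits in three bits (colour kept separately)."""
    PAWN = 0
    KNIGHT = 1
    BISHOP = 2
    ROOK = 3
    QUEEN = 4
    KING = 5
    CASTLING_ROOK = 6   # rook that may still take part in castling
    EP_PAWN = 7         # pawn that can currently be captured en passant


def after_king_move(board, color):
    """All castling rooks of the side whose king moved become normal rooks."""
    return {sq: (c, Piece.ROOK if c == color and p == Piece.CASTLING_ROOK else p)
            for sq, (c, p) in board.items()}


def after_rook_move(board, origin, target):
    """Move a rook; a castling rook loses its special status when it moves."""
    new_board = dict(board)
    color, piece = new_board.pop(origin)
    if piece == Piece.CASTLING_ROOK:
        piece = Piece.ROOK
    new_board[target] = (color, piece)
    return new_board


# Hypothetical partial position: square name -> (colour, piece type).
board = {
    "a1": ("w", Piece.CASTLING_ROOK),
    "h1": ("w", Piece.CASTLING_ROOK),
    "e1": ("w", Piece.KING),
    "a8": ("b", Piece.CASTLING_ROOK),
}
print(after_king_move(board, "w"))
```

With this layout the castling state lives entirely in the piece placement, so no separate castling flags need to be fed to the network.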

It would be interesting to make a neural network that is vertical flip symmetrical plus a diff part when castling is available. If no pawns are present it could be vertically, horizontally and diagonally symmetrical.

Andrew · Post by **Andrew** » Tue Nov 17, 2020 9:21 am

fierz wrote: ↑Fri Jul 24, 2020 11:14 pm

3) When it comes to training, all that I know so far is Texel's tuning method which I recently used to improve my checkers engine (http://www.fierz.ch/cake186.php) by using logistic regression on win-loss-draw information on a few million (N) positions to improve the weights of my handwritten eval function. So essentially I do some kind of gradient descent for a rather small handful of parameters (a few 100). When training an NN, I read that the relu activation has the advantage that its derivative is easy to compute, but I'm not sure if/how I would need to use that. If I think of Texel's tuning method, I would set up some small NN to start with, and try to do the same as I did there = calculate the output of the NN for all N positions, calculate the error vs the game results; then change the weights of all parameters layer by layer starting from the last one, by a small amount to calculate a gradient, and do the same I did before? Is this totally wrong (because I don't calculate any derivatives here?)?

4) From reading about Stockfish NNUE I get the impression that they are not doing a regression vs game results, but rather vs the evaluation of a Stockfish search to X ply and try to learn that rather than the game result, which is different than the Texel tuning method. Is this distinction of trying to learn search eval vs trying to learn game results actually relevant or are the two +/- equal?

Sorry for the really stupid questions... but perhaps other people have them too...

best regards
Martin-who-didn't-realize-you-were-in-Indiana! Hope you are doing well there!

A followup to this post, is anyone aware of any NNUE engines that have been developed for Checkers/Draughts?

The 8x8 version of course has been proved as a draw, but 10x10 and other variants might be a nice challenge!

In fact 8x8 might be an interesting experiment to see how simple an initial evaluation function you can use to train the network.

I found this interesting paper from 1999 using networks for checkers but in a different way.

https://www.researchgate.net/publicatio ... _knowledge

Andrew

Rein Halbersma · Post by **Rein Halbersma** » Tue Nov 17, 2020 10:58 am

Andrew wrote: ↑Tue Nov 17, 2020 9:21 am A followup to this post, is anyone aware of any NNUE engines that have been developed for Checkers/Draughts?

The 8x8 version of course has been proved as a draw, but 10x10 and other variants might be a nice challenge!

In fact 8x8 might be an interesting experiment to see how simple an initial evaluation function you can use to train the network.

I found this interesting paper from 1999 using networks for checkers but in a different way.

https://www.researchgate.net/publicatio ... _knowledge

Andrew

Perhaps you missed this post by Jonathan Kreuzer: http://talkchess.com/forum3/viewtopic.p ... 20#p872289
It's not the same as NNUE, but similar in spirit. Just the raw board representation (121 bytes per position, including side to move) fed into 3 fully connected layers of 192, 32 and 32 neurons each, squashed with a sigmoid onto an eval score. Works like a charm.
