Re: NNUE accessible explanation
Posted: Fri Jul 24, 2020 10:22 pm
I think in general a lot of why certain architectures are chosen over others boils down to "because we tried others and this worked the best". That's not to say there is no justification, just that it's not black and white. Perhaps, however, I can give you a bit of my intuition on the different aspects.
Convolutional Networks and Stride
Generally, when working in the image space, convolutional neural networks are superior to fully connected networks, because they are easier to train (they have some implicit regularization due to weight re-use) and their modularity makes them more suitable for GPU computation. Since chessboards are very small (8x8 vs 1080x1920 for a 1080p image) it's less clear that this is the case for our game. Convolutional networks usually use fully connected layers at the end as they try to move from the image space to a more general feature space. Stride > 1 is mostly a tool for dimensionality reduction. For chess this is pretty much useless, as there is little reason to want to reduce to 4x4 or smaller. I would much sooner just switch to fully connected layers.
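To make the stride point concrete, here is a minimal sketch (PyTorch, with an illustrative 12-plane board encoding that isn't taken from any particular engine) showing how stride 1 preserves the 8x8 board while stride 2 immediately shrinks it to 4x4:

```python
# Minimal sketch: how stride affects spatial size on an 8x8 board.
import torch
import torch.nn as nn

board = torch.randn(1, 12, 8, 8)  # batch of 1, 12 piece planes, 8x8 squares

conv_s1 = nn.Conv2d(12, 32, kernel_size=3, stride=1, padding=1)
conv_s2 = nn.Conv2d(12, 32, kernel_size=3, stride=2, padding=1)

print(conv_s1(board).shape)  # torch.Size([1, 32, 8, 8]) -> board size preserved
print(conv_s2(board).shape)  # torch.Size([1, 32, 4, 4]) -> stride 2 halves each dimension
```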
Filter Size
For general problems filter size selection is a slightly more interesting question, but even there the 3x3 filter is very popular, so you should have a very good reason if you are doing something else. I have actually experimented with other values for this in Winter, but I'm not in the mood for getting into those details at the moment.
Activation Functions
There is a lot of research trying to find the optimal activation function. The goal is usually to obtain better gradients in order to be able to train even larger and more complex neural networks. In recent literature people have been training with hundreds of layers. OpenAI recently released information about their latest GPT-3 natural language model. It has 175 billion parameters, which would take 700GB to store assuming 32 bit precision. This is all to say the research being done is mostly to push boundaries and solve problems you probably do not have. A solid rule of thumb is to use relu for all layers with the exception of the output layer. The output layer should be one of linear, sigmoid, softmax, tanh or relu, depending on what the data demands. Relu is a very good activation function as it is very efficient to compute and suffers much less from the vanishing gradient problem than sigmoid does.
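As a sketch of that rule of thumb (PyTorch again; the 768-feature input encoding and the layer sizes are just illustrative):

```python
# Sketch: relu in the hidden layers, output activation chosen by the target.
import torch
import torch.nn as nn

class EvalNet(nn.Module):
    def __init__(self, n_inputs: int = 768):  # 768 = 12 piece planes x 64 squares, illustrative
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_inputs, 64), nn.ReLU(),  # relu: cheap, resists vanishing gradients
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.out = nn.Linear(64, 1)

    def forward(self, x):
        # linear output suits an unbounded score; wrap in torch.tanh for a
        # [-1, 1] game-result target, or torch.sigmoid for a [0, 1] win probability
        return self.out(self.hidden(x))
```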
Layer Size and Neurons Per Layer
The optimal number of layers and the number of neurons per layer are going to be very problem specific. Adding layers increases computational complexity and the number of parameters linearly in the number of layers, while simultaneously allowing the neurons to contain higher level information. The main downside is that deeper networks tend to be quite a bit harder to train. This is more problematic for fully connected networks than for convolutional networks, but is definitely problematic in both cases. More neurons per layer will increase the number of parameters quadratically. If your network is very small, an increase in the number of units may not have too big an impact. For larger networks, I believe computational complexity will increase quadratically as well, but don't quote me on that. The important thing is that you have enough units to represent the information you want the network to learn.
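A quick back-of-the-envelope check of those scaling claims (plain Python; all sizes are made up for illustration):

```python
# Parameters of a fully connected net with d hidden layers of width w.
def param_count(n_in: int, w: int, d: int, n_out: int = 1) -> int:
    total = (n_in + 1) * w           # input layer (+1 for bias)
    total += (d - 1) * (w + 1) * w   # each extra hidden layer adds ~w^2 parameters
    total += (w + 1) * n_out         # output layer
    return total

print(param_count(768, 64, 2))   # 53441: baseline, ~53k
print(param_count(768, 64, 4))   # 61761: depth grows it linearly, ~4.2k per layer
print(param_count(768, 128, 2))  # 115073: doubling width roughly quadruples the hidden-layer term
```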
Overparametrization
As a side note, most state of the art neural networks are heavily overparametrized. Since for most problems we mainly care about reducing the horrendous training times, there has not been much work on reducing this issue. Thanks to the desire to run neural networks on mobile devices and the progressively larger networks we are able to train, this has changed a bit. For chess, however, this is actually much more important. We care very much about inference time, as better inference time implies more nodes per second for our search algorithm. The size of the network and the inference time are related, but it's not a one-to-one relation.
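That last point is easy to probe with a toy benchmark (numpy, made-up sizes): evaluations per second track parameter count, but things like memory layout, vectorization and sparsity are exactly why the relation isn't one-to-one.

```python
# Toy benchmark: evals/s vs. parameter count for two dense relu networks.
import time
import numpy as np

def make_layers(widths, seed=0):
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((a, b)).astype(np.float32)
            for a, b in zip(widths, widths[1:])]

def infer(layers, x):
    for w in layers:
        x = np.maximum(x @ w, 0.0)  # relu after every layer, for simplicity
    return x

for widths in [(768, 64, 64, 1), (768, 256, 256, 1)]:
    layers = make_layers(widths)
    x = np.ones((1, widths[0]), dtype=np.float32)
    t0 = time.perf_counter()
    for _ in range(10_000):
        infer(layers, x)
    elapsed = time.perf_counter() - t0
    params = sum(w.size for w in layers)
    print(f"{params:>7} params: {10_000 / elapsed:,.0f} evals/s")
```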
AlphaZero and LC0 Network Architectures
AlphaZero and LC0 are both based on the well known ResNet neural network architecture. AlphaZero introduced a dual-headed output for policy and evaluation. The LC0 team extended this with Squeeze-and-Excitation (SE) layers, and I would imagine many other ideas I have missed, as I haven't been following too closely. At their heart, however, the architectures are quite standard for image recognition.
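For the flavour of it, here is a toy version of such a tower (PyTorch; the channel count, block count and policy size are placeholders, and I've left out the batch norm and SE layers the real networks use):

```python
# Toy ResNet tower with AlphaZero-style dual heads (sizes are placeholders).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, x):
        y = torch.relu(self.conv1(x))
        y = self.conv2(y)
        return torch.relu(x + y)  # the skip connection is the "residual" part

class DualHeadNet(nn.Module):
    def __init__(self, ch: int = 64, blocks: int = 4, n_moves: int = 1858):
        super().__init__()
        self.stem = nn.Conv2d(12, ch, kernel_size=3, padding=1)
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.policy = nn.Linear(ch * 64, n_moves)  # move logits
        self.value = nn.Linear(ch * 64, 1)         # position evaluation

    def forward(self, x):  # x: (batch, 12 planes, 8, 8)
        h = self.tower(torch.relu(self.stem(x))).flatten(1)
        return self.policy(h), torch.tanh(self.value(h))
```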
NNUE Architecture
The NNUE network, in my understanding, is very non-standard and designed entirely around being efficient for the task at hand. The input layer is heavily overparametrized, which is normally a bad thing, but due to the known sparsity it is actually very efficient to compute. The number of layers and neurons after that is kept low, in order to avoid too much computational burden. This makes it extremely fast to compute relative to the LC0 network while still having a fairly high amount of power.
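The trick with the sparse input layer, as I understand it, is that its output can be maintained incrementally as pieces move. A numpy sketch of the idea (feature counts and sizes are illustrative, not the actual NNUE dimensions):

```python
# The "efficiently updatable" idea: the first layer's output (the accumulator)
# is the sum of weight rows for the few active features, updated incrementally.
import numpy as np

N_FEATURES = 40_960  # illustrative, in the ballpark of HalfKP-style inputs
HIDDEN = 256

rng = np.random.default_rng(0)
W = rng.standard_normal((N_FEATURES, HIDDEN)).astype(np.float32)
b = np.zeros(HIDDEN, dtype=np.float32)

def fresh_accumulator(active):
    # full recomputation: only ~30 of the 40k features are ever active at once
    return b + W[active].sum(axis=0)

def update_accumulator(acc, removed, added):
    # a move toggles a handful of features: O(hidden) work, not O(features)
    return acc - W[removed].sum(axis=0) + W[added].sum(axis=0)

acc = fresh_accumulator([11, 4_242, 30_000])
acc = update_accumulator(acc, removed=[4_242], added=[4_300])  # a piece moved
hidden = np.maximum(acc, 0.0)  # the small dense layers then run from here
```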
Winter Architecture
As a final note, the Winter NN has two main parts. The first part is a non-standard convolutional neural network which exploits sparsity similarly to the NNUE network. This convolutional network is used to calculate pawn structure features, so its output can be reused very often, as it gets stored in a separate hash table with a high hit rate. The second part is a fully connected network whose input is the output of the convolutional network as well as a set of handcrafted features standard in classical engines such as SF. This set of features is mostly a subset of the features Winter used before I added neural networks.
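The caching idea looks roughly like this (a hypothetical sketch; the names and stand-in functions are mine, not Winter's actual code):

```python
# Hypothetical sketch of caching pawn-structure network outputs: the conv net
# only runs on a cache miss; on a hit its output is reused for free.
import numpy as np

rng = np.random.default_rng(0)

def pawn_conv_net(pawn_key: int) -> np.ndarray:
    # stand-in for the real convolutional network over the pawn structure
    return rng.standard_normal(16).astype(np.float32)

pawn_cache: dict[int, np.ndarray] = {}

def pawn_features(pawn_key: int) -> np.ndarray:
    # pawn_key would be a Zobrist-style hash over pawns only, so positions
    # sharing a pawn structure share one cache entry (hence the high hit rate)
    feats = pawn_cache.get(pawn_key)
    if feats is None:
        feats = pawn_conv_net(pawn_key)
        pawn_cache[pawn_key] = feats
    return feats

def evaluate(pawn_key: int, handcrafted: np.ndarray) -> float:
    x = np.concatenate([pawn_features(pawn_key), handcrafted])
    return float(x.sum())  # stand-in for the fully connected network
```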
Hopefully this clears up a lot of questions. As this actually took some time to type up, I might further extend and clean this up into a separate post for people interested in getting into neural networks for chess-like games.