Don't understand NNUE
Don't understand NNUE
I cannot find an explanation in English of the structure of the network. Quite simply, how is a forward pass through the network computed?
Based on posts I've read, it is simply a multilayer perceptron, with 512 inputs (each is one bit), 2 hidden layers of 32 neurons each, and a single output (the eval). So I would expect (512+1)*32 + (32+1)*32 + (32+1) = 17505 weights.
Yet the NNUE file is 20MB, which is several orders of magnitude larger than that...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

 Posts: 2238
 Joined: Wed Mar 08, 2006 7:47 pm
 Location: Hattingen, Germany
Re: Don't understand NNUE
lucasart wrote: ↑Thu Aug 13, 2020 11:56 pm
I cannot find an explanation in English of the structure of the network. Quite simply, how is a forward pass through the network computed?
The HalfKP NNUE has 2x20480 (plus some more) one-bit inputs ([pnbrq] x 64 squares x 64 king squares) connected to the 2x256 neurons of the first, dominant hidden layer, which is updated incrementally.

 Posts: 436
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: Don't understand NNUE
The Chess Programming Wiki has an excellent description:
https://www.chessprogramming.org/Stockf ... _Structure
Re: Don't understand NNUE
Gerd Isenberg wrote: ↑Fri Aug 14, 2020 2:24 pm
The HalfKP NNUE has 2x20480 (plus some more) one-bit inputs ([pnbrq] x 64 squares x 64 king squares) connected to the 2x256 neurons of the first, dominant hidden layer, which is updated incrementally.
Thanks. Well done on the CPW article. Very clear.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Re: Don't understand NNUE
An attempt to translate the original article on NNUE into English:
https://cdn.discordapp.com/attachments/ ... 4/NNUE.zip
Re: Don't understand NNUE
https://discord.com/channels/4359437104 ... 4605209601
message.txt
nodchip
41,024 = 64 * 641. The 64 comes from the number of squares where the king may be. 641 = 64 * 5 * 2 + 1. The 64 here comes from the number of squares where a piece other than the king may be, 5 is the number of piece types other than the king, 2 is the number of colors (white and black), and the 1 is for a captured piece.
Twipply
That +1 is the part I couldn't figure out
I came 64 short
nodchip
Please refer to this source code. https://github.com/nodchip/Stockfish/bl ... uate.h#L57
"+ 1" is BONA_PIECE_ZERO.
Here "bona" means "bonanza" which is a popular computer shogi engine. It introduced the feature "p" for the first time.
Twipply
I see
So what's the purpose of the extra 64 features?
Twipply
@nodchip Could you perhaps explain what this +1 is for a little more?
nodchip
@Twipply The Position struct in Stockfish+NNUE has an evalList. https://github.com/nodchip/Stockfish/bl ... ion.h#L248
The BonaPieces are contained in the evalList. It is updated by Position::do_move() and Position::undo_move(), and used by NNUE to calculate the network parameters between the input layer and the first hidden layer.
About the calculation, the following text will be helpful. This text was sent to RocketMiningPoo on Twitter.
message.txt
When a piece is captured, BONA_PIECE_ZERO is set in the corresponding element of evalList. NNUE will skip the calculation of the network parameters of a piece if the corresponding element is BONA_PIECE_ZERO.
nodchip
uh... In message.txt, there are some wrong descriptions...
nodchip
"We add the ith COLUMN of the W{0} to the z{0} for each i, where the ith element is set to 1. And we subtract the ith COULMN of the W{0} from the z{0} for each i, where the ith element is set to 0. This operation is "accumulate" in your question." may be right... I hope that someone will double check.
message.txt
NNUE uses a fully-connected neural network. The output of the network can be expressed as:
[y]_{1x1} = z_{L+1}
z_{l} = b_{l} + W_{l} a_{l-1}
a_{l} = \sigma(z_{l}) if l > 0
a_{l} = x if l == 0
where x, L, W_{l}, b_{l}, \sigma and y represent the input feature vector, the number of hidden layers, the weight matrix of the l-th layer, the bias vector of the l-th layer, the activation function, and the output value, respectively.
If we write the calculation of one layer as simple C++ code, it will look like this:
float input[N];
float bias[M];
float matrix[M][N];
float output[M];
for (int m = 0; m < M; ++m) {
    float accumulated = bias[m];
    for (int n = 0; n < N; ++n) {
        accumulated += matrix[m][n] * input[n];
    }
    output[m] = ActivationFunction(accumulated);
}
If we calculate the output of the neural network in this straightforward way, we need to execute the calculation above for each layer. It takes a long time, and the nps will be extremely low.
However, we can speed up the calculation by using a special characteristic of the feature vector. When a move is played in a position, only a few elements in the corresponding feature vector change. For example, if a pawn is advanced, the element in the feature vector corresponding to the previous position of the pawn will be set to 0, and the element corresponding to the current position will be set to 1. If we process only the changed elements, we can speed up the calculation. This technique is called "difference calculation" in computer shogi.
For difference calculation, we hold z_{0} when we calculate the output of the neural network. When a move is played, some elements in the feature vector are set to 1, and others are set to 0. We add the ith row of W_{0} to z_{0} for each i where the ith element is set to 1, and we subtract the ith row of W_{0} from z_{0} for each i where the ith element is set to 0. This operation is the "accumulate" in your question.
Difference calculation can be applied only to the calculation between the input layer and the first hidden layer, because the input vector of the second and later hidden layers changes drastically when a move is played. But this is fine, because the number of network parameters between the input layer and the first hidden layer is very large, while the number in the other layers is very small.
We cannot use difference calculation if a king is moved in halfkp, because all the elements in the feature vector corresponding to that king will change. In this case, we reset z_{0} and recalculate it by adding all the row parameters corresponding to the feature vector.
 hgm
 Posts: 26134
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: Don't understand NNUE
Note that the first layer of the NN is simply a King-location-dependent piece-square table. Two sets, one per friendly King, one per enemy King. This is a well-known technique, which allows easy incremental updating similar to a normal PST, except when the Kings move (in which case the new total value has to be calculated from scratch).
The new thing is that NNUE uses 256 of such King-Piece-Square tables, and then an NN with two hidden layers of 32 to integrate the scores these provide.
One can wonder if this would also work for other evaluation terms than PST or KPST. One could take a conventional evaluation, but take 256 differently tuned copies of it, and feed their outputs (which normally would be the evaluation result) to a 2-layer NN.
Re: Don't understand NNUE
hgm wrote: ↑Mon Aug 17, 2020 8:11 am
Note that the first layer of the NN is simply a King-location-dependent piece-square table. Two sets, one per friendly King, one per enemy King.
It works very well. It is called transfer learning. I have done it on image classification, just training the last fully connected layer on a network trained for other images/classifications. I think it is even better if you use networks with different topologies and/or functions and connect their final layers. You can see it as: each network is a person, and what you do is a weighted democracy of all the networks. I have long thought of making a very big ANN that, given a pawn/king constellation, returns maybe eight weights (64 bits might be enough) for each square. You could then save the result in a pawn hash and use the weights, together with the rest, in a smaller, faster network. You could then run the big network on the GPU, and it would not matter so much, since the results would be reused by a factor of maybe 20.
 hgm
Re: Don't understand NNUE
The Chessprogramming Wiki very nicely shows the topology of the NN, but it lacks any information on what type of neurons are used in each layer (i.e. what their activation function is, and whether they have a trainable bias). Are these rectifiers, step functions, sigmoids?

Re: Don't understand NNUE
hgm wrote: ↑Mon Aug 17, 2020 10:05 am
The Chessprogramming Wiki very nicely shows the topology of the NN, but it lacks any information on what type of neurons are used in each layer...
Some snippets from Nasu's paper, via Google Translate:
https://www.apply.computershogi.org/wc ... D/nnue.pdf
2.8. SIMD
The activations a_l and the weight matrix W_l are expressed as 8-bit integers, and the inner product is calculated using an instruction (VPMADDUBSW) that computes sums of products of two 8-bit integer vectors. The activation function is calculated with instructions that convert the bit width of integer vectors (VPACKSSDW, VPACKSSWB) and an instruction (VPMAXSB) that takes the element-wise maximum of two 8-bit integer vectors.
So VPMAXSB looks like a rectifier. I have to study the code a little bit more...