Don't understand NNUE

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Don't understand NNUE

Post by lucasart »

I cannot find an explanation in English of the structure of the network. Quite simply, how is a forward pass through the network computed?

Based on posts I've read, it is simply a multi-layer perceptron, with 512 inputs (each is one bit), 2 hidden layers of 32 neurons each, and a single output (the eval). So I would expect (512+1)*32 + (32+1)*32 + (32+1) = 17505 weights.

Yet the NNUE file is 20MB, which is several orders of magnitude larger than that...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Don't understand NNUE

Post by Gerd Isenberg »

lucasart wrote: Fri Aug 14, 2020 1:56 am I cannot find an explanation in English of the structure of the network. Quite simply, how is a forward pass through the network computed?

Based on posts I've read, it is simply a multi-layer perceptron, with 512 inputs (each is one bit), 2 hidden layers of 32 neurons each, and a single output (the eval). So I would expect (512+1)*32 + (32+1)*32 + (32+1) = 17505 weights.

Yet the NNUE file is 20MB, which is several orders of magnitude larger than that...
The HalfKP NNUE has 2x40960 (plus some more) one-bit inputs ([pnbrq] x 2 colors x 64 squares x 64 king squares) connected to 2x256 neurons of the first, dominant hidden layer, which is updated incrementally.
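That first layer is also where the 20 MB comes from: assuming 16-bit weights for the input-to-first-layer matrix (as in the nodchip code), 41024 x 256 x 2 bytes is about 21 million bytes, which accounts for essentially the whole file; the remaining layers only add the roughly 17.5k weights you computed.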
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Don't understand NNUE

Post by Rémi Coulom »

The Chess Programming Wiki has an excellent description:
https://www.chessprogramming.org/Stockf ... _Structure
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Don't understand NNUE

Post by lucasart »

Gerd Isenberg wrote: Fri Aug 14, 2020 4:24 pm
lucasart wrote: Fri Aug 14, 2020 1:56 am I cannot find an explanation in English of the structure of the network. Quite simply, how is a forward pass through the network computed?

Based on posts I've read, it is simply a multi-layer perceptron, with 512 inputs (each is one bit), 2 hidden layers of 32 neurons each, and a single output (the eval). So I would expect (512+1)*32 + (32+1)*32 + (32+1) = 17505 weights.

Yet the NNUE file is 20MB, which is several orders of magnitude larger than that...
The HalfKP NNUE has 2x40960 (plus some more) one-bit inputs ([pnbrq] x 2 colors x 64 squares x 64 king squares) connected to 2x256 neurons of the first, dominant hidden layer, which is updated incrementally.
Thanks.

Well done on the CPW article. Very clear.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Rom77
Posts: 45
Joined: Wed Oct 24, 2018 7:37 am
Full name: Roman Zhukov

Re: Don't understand NNUE

Post by Rom77 »

An attempt to translate the original article on NNUE into English:

https://cdn.discordapp.com/attachments/ ... 4/NNUE.zip
Rom77
Posts: 45
Joined: Wed Oct 24, 2018 7:37 am
Full name: Roman Zhukov

Re: Don't understand NNUE

Post by Rom77 »

https://discord.com/channels/4359437104 ... 4605209601
nodchip
41,024 = 64 * 641. 64 comes from the number of squares where the king may stand. 641 = 64 * 5 * 2 + 1. 64 here comes from the number of squares where a piece other than the king may stand. 5 is the number of piece types other than the king. 2 is the number of colors, white and black. 1 is for a captured piece.
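A minimal sketch of that indexing (illustrative names and ordering only, not the actual Stockfish+NNUE enumeration):

// Hypothetical HalfKP feature index matching the 64 * 641 layout above.
constexpr int kPieceTypes = 5;                              // p, n, b, r, q
constexpr int kSquares    = 64;
constexpr int kPerKing    = kPieceTypes * 2 * kSquares + 1; // 641 (+1 = BONA_PIECE_ZERO)
constexpr int kInputs     = kSquares * kPerKing;            // 41024

// king_sq, piece_sq in 0..63; piece_type in 0..4; colour in 0..1
int halfkp_index(int king_sq, int piece_type, int colour, int piece_sq) {
    int bona_piece = 1 + (piece_type * 2 + colour) * kSquares + piece_sq; // index 0 is reserved
    return king_sq * kPerKing + bona_piece;
}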

Twipply
That +1 is the part I couldn't figure out
I came 64 short

nodchip
Please refer to this source code. https://github.com/nodchip/Stockfish/bl ... uate.h#L57
"+ 1" is BONA_PIECE_ZERO.
Here "bona" means "bonanza" which is a popular computer shogi engine. It introduced the feature "p" for the first time.

Twipply
I see
So what's the purpose of the extra 64 features?

Twipply
@nodchip Could you perhaps explain what this +1 is for a little more?

nodchip
@Twipply The Position struct in Stockfish+NNUE has an evalList. https://github.com/nodchip/Stockfish/bl ... ion.h#L248
BonaPieces are contained in the evalList. It is updated by Position::do_move() and Position::undo_move(), and used by NNUE for the calculation with the network parameters between the input layer and the first hidden layer.
About the calculation, the following text will be helpful. This text was sent to RocketMiningPoo on Twitter.

(attachment: message.txt, quoted in full below)

When a piece is captured, BONA_PIECE_ZERO is set to the corresponding element in evalList. NNUE will skip the calculation of the network parameters of a piece, if the corresponding element is BONA_PIECE_ZERO.

nodchip
uh... In message.txt, there are some wrong descriptions...

nodchip
"We add the i-th COLUMN of the W{0} to the z{0} for each i, where the i-th element is set to 1. And we subtract the i-th COULMN of the W{0} from the z{0} for each i, where the i-th element is set to 0. This operation is "accumulate" in your question." may be right... I hope that someone will double check.

message.txt
NNUE uses a fully-connected neural network. The output of the network can be expressed as:
y = z_{L+1}
z_{l} = b_{l} + W_{l} a_{l-1}
a_{l} = \sigma(z_{l}) if l > 0, and a_{0} = x
where x, L, W_{l}, b_{l}, \sigma and y are the input feature vector, the number of hidden layers, the weight matrix of the l-th layer, the bias vector of the l-th layer, the activation function, and the (scalar) output value, respectively.

If we write the calculation of a layer as simple C++ code, it will look like this:

float input[N];
float bias[M];
float matrix[M][N];
float output[M];

for (int m = 0; m < M; ++m) {
    float accumulated = bias[m];  // start from the bias of this neuron
    for (int n = 0; n < N; ++n) {
        accumulated += matrix[m][n] * input[n];
    }
    output[m] = ActivationFunction(accumulated);
}

If we calculate the output of the neural network in this naive way, we need to execute the calculation above for each layer. It takes a long time, and the nps will be extremely low.
However, we can speed up the calculation by using a special characteristic of the feature vector. When a move is played in a position, only a few elements of the corresponding feature vector change. For example, if a pawn is advanced, the element corresponding to the pawn's previous square is set to 0, and the element corresponding to its current square is set to 1. If we recompute only the changed elements, we can speed up the calculation. This technique is called "difference calculation" in computer shogi.
For difference calculation, we keep z_{0} around when we calculate the output of the neural network. When a move is played, some elements of the feature vector are set to 1 and others are set to 0. We add the i-th row of W_{0} to z_{0} for each i whose element is set to 1, and we subtract the i-th row of W_{0} from z_{0} for each i whose element is set to 0. This operation is the "accumulate" in your question.
Difference calculation can be applied only to the calculation between the input layer and the first hidden layer, because the input vectors of the second and later hidden layers change drastically when a move is played. But this is fine, because the number of network parameters between the input layer and the first hidden layer is very large, and the rest are very small.
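(For HalfKP those counts are roughly 41024 x 256, about 10.5 million weights, in the first layer, versus 512 x 32 + 32 x 32 + 32 x 1 + 65 biases = 17505 in the rest of the network.)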
We cannot use difference calculation if a king is moved in halfkp, because all the elements of the feature vector corresponding to that king change. In this case, we reset z_{0} and recalculate it by adding all the rows of W_{0} corresponding to the active features.
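A minimal sketch of this accumulator update (illustrative only, not the actual Stockfish+NNUE code; it assumes 16-bit first-layer weights and stores W_{0} so that W0[feature] holds the 256 weights of one input feature):

#include <cstdint>

constexpr int kHidden = 256;

// z0 = b_{0} + W_{0} x for one perspective, kept incrementally
struct Accumulator {
    int16_t z0[kHidden];
};

void add_feature(Accumulator& acc, const int16_t (*W0)[kHidden], int feature) {
    for (int j = 0; j < kHidden; ++j)
        acc.z0[j] += W0[feature][j];   // feature changed from 0 to 1
}

void remove_feature(Accumulator& acc, const int16_t (*W0)[kHidden], int feature) {
    for (int j = 0; j < kHidden; ++j)
        acc.z0[j] -= W0[feature][j];   // feature changed from 1 to 0
}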
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Don't understand NNUE

Post by hgm »

Note that the first layer of the NN is simply a King-location-dependent piece-square table. Two sets, one for the friendly King, one for the enemy King. This is a well-known technique, which allows easy incremental updating similar to a normal PST, except when the Kings move (in which case the new total value has to be calculated from scratch).

The new thing is that NNUE uses 256 such King-Piece-Square tables, and then an NN with two hidden layers of 32 neurons each to integrate the scores these provide.
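Concretely, the j-th first-layer neuron computes b_{0}[j] + the sum over pieces of W_{0}[index(king square, piece, square)][j], which is exactly a king-relative piece-square sum; the 256 neurons are 256 such tables evaluated in parallel.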

One can wonder whether this would also work for evaluation terms other than PST or KPST. One could take a conventional evaluation, make 256 differently tuned copies of it, and feed their outputs (which normally would be the evaluation result) into a 2-layer NN.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Don't understand NNUE

Post by Pio »

hgm wrote: Mon Aug 17, 2020 10:11 am Note that the first layer of the NN is simply a King-location-dependent piece-square table. Two sets, one for the friendly King, one for the enemy King. This is a well-known technique, which allows easy incremental updating similar to a normal PST, except when the Kings move (in which case the new total value has to be calculated from scratch).

The new thing is that NNUE uses 256 such King-Piece-Square tables, and then an NN with two hidden layers of 32 neurons each to integrate the scores these provide.

One can wonder whether this would also work for evaluation terms other than PST or KPST. One could take a conventional evaluation, make 256 differently tuned copies of it, and feed their outputs (which normally would be the evaluation result) into a 2-layer NN.
It works very well. It is called transfer learning. I have done it on image classification, training just the last fully connected layer of a network trained on other images/classifications. I think it is even better if you use networks with different topologies and/or functions and connect their final layers. You can see it as if each network were a person, and what you get is a weighted democracy of all the networks. I have long thought of making a very big ANN that, given a pawn/king constellation, returns maybe eight weights (64 bits might be enough) for each square. You could then save the result in a pawn hash and feed the weights, together with the rest, to a smaller, faster network. You could then run the big network on the GPU and it would not matter so much, since the results would be reused by a factor of 20 maybe.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Don't understand NNUE

Post by hgm »

The Chessprogramming Wiki very nicely shows the topology of the NN, but it lacks any information on what type of neurons are used in each layer (i.e. what their activation function is, and whether they have a trainable bias). Are these rectifiers, step functions, sigmoids?
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Don't understand NNUE

Post by Gerd Isenberg »

hgm wrote: Mon Aug 17, 2020 12:05 pm The Chessprogramming Wiki very nicely shows the topology of the NN, but it lacks any information on what type of neurons are used in each layer (i.e what their activation function is, and if they have trainable bias). Are these rectifiers, step functions, sigmoids?
Some snippets (via Google Translate) from Nasu's paper:
https://www.apply.computer-shogi.org/wc ... D/nnue.pdf

2.8. SIMD
The activations a_l and the weight matrices W_l are expressed as 8-bit integers, and the inner products are computed with an instruction (VPMADDUBSW) that multiplies corresponding 8-bit integers of two vectors and adds adjacent products. The activation function is computed with instructions that narrow the bit width of integer vectors (VPACKSSDW, VPACKSSWB) and an instruction (VPMAXSB) that takes the element-wise maximum of two 8-bit integer vectors.
So VPMAXSB looks like a rectifier. I have to study the code a little bit more...
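A scalar sketch of what that saturating pack followed by a max against zero computes per element (assuming the activation is a clipped ReLU; any fixed-point scaling or shift before the clamp is omitted here):

#include <algorithm>
#include <cstdint>

int8_t clipped_relu(int32_t pre_activation) {
    // VPACKSSDW/VPACKSSWB: signed saturation down to the 8-bit range
    int32_t saturated = std::clamp(pre_activation, -128, 127);
    // VPMAXSB against a zero vector: rectification
    return static_cast<int8_t>(std::max<int32_t>(saturated, 0));
}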