NNUE - Efficiently Updatable Network - understanding

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

NNUE - Efficiently Updatable Network - understanding

Post by dangi12012 »

I read the whitepaper on NNUE and it clearly states that it is a fully connected network. It goes on to say it's 8-bit internally for the hidden layers.
https://github.com/asdfjkl/nnue/blob/main/nnue_en.pdf

The weight matrix sizes are:
W1: 125388 x 256
W2: 512 x 32
W3: 32 x 32
W4: 32 x 1

Which means that the input goes from 125388 -> 256 -> 512 -> 32 -> 32 -> 1
I can see why 32 and 8 bit are really optimal here, since 8 x 32 = 256 bits fits perfectly into a single AVX2 register, which also has the right instructions for multiplication / add. The paper also describes how linear models hit their limits sooner in shogi than in chess - since what we know intrinsically is that the value of a piece really depends on the position, which a linear model cannot capture. For example, if a queen is blocked out of the game, it's not worth much even when placed on a "good" square. Neural networks solve that problem.

2 Questions
How is it efficiently updatable since it has so many inputs and is fully connected?

What exactly is the overparametrisation for the network, since it has more inputs than pieces? The first multiplication is by W1, a 125388 x 256 matrix, applied to an input vector of length 125388, which yields a vector of length 256.
The paper talks about "TNK evolution Turbo type D" which is a shogi program.

AVX2 Cheat sheet:
https://db.in.tum.de/~finis/x86%20intri ... 20v1.0.pdf
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: NNUE - Efficiently Updatable Network - understanding

Post by mvanthoor »

dangi12012 wrote: Mon Oct 11, 2021 2:19 pm What exactly is the overparametrisation for the network since it has more inputs than pieces?
Without having read the whitepaper and without being bothered or blessed with any kind of knowledge about NNUE at this point, I'm going to hazard a guess that the extra inputs are more information.

A chess engine needs more than only the placement of the pieces; it also needs lots of information about the relationships between those pieces: are there doubled pawns? Are there open files? Is the king safe / protected on the edge, or in the middle of the board (the first is good in the middlegame, the second in the endgame)? Are the knights on outposts, are the bishops not blocked in? Etc...

You need all sorts of information to accurately evaluate a position, and I assume the inputs are used to put this information about the current position into the network.

I _could_ be completely wrong though...
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
yeni_sekme
Posts: 39
Joined: Mon Mar 01, 2021 7:51 pm
Location: İstanbul, Turkey
Full name: Ömer Faruk Tutkun

Re: NNUE - Efficiently Updatable Network - understanding

Post by yeni_sekme »

Great resource for Stockfish NNUE:
https://github.com/glinscott/nnue-pytor ... cs/nnue.md

- There are very many input neurons, but only a few of them are active at a time, so you don't do a multiplication in the first layer - you just add up the weights of the active neurons.
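That sparse first layer can be sketched like this - a toy C++ example with made-up sizes (`kFeatures`, `kOut`; real Stockfish nets use tens of thousands of features and 256 accumulator outputs per perspective). Because each input is 0 or 1, the "multiplication" degenerates into summing the weight rows of the active features:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Toy sizes; real nets use ~41k features and 256 outputs per perspective.
constexpr int kFeatures = 8;  // hypothetical small feature count
constexpr int kOut      = 4;  // hypothetical accumulator width

// First-layer weights: one row of kOut values per input feature.
using Weights = std::array<std::array<int16_t, kOut>, kFeatures>;

// Inputs are 0/1, so the first layer is just a sum of the weight rows
// belonging to the active features -- no multiplication needed.
std::array<int32_t, kOut> first_layer(const Weights& w,
                                      const std::vector<int>& active) {
    std::array<int32_t, kOut> acc{};  // zero-initialised accumulator
    for (int f : active)
        for (int o = 0; o < kOut; ++o)
            acc[o] += w[f][o];
    return acc;
}
```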
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: NNUE - Efficiently Updatable Network - understanding

Post by dangi12012 »

yeni_sekme wrote: Mon Oct 11, 2021 4:13 pm Great resource for Stockfish NNUE:
https://github.com/glinscott/nnue-pytor ... cs/nnue.md
Thank you. That was the level of detail I was looking for. The whitepaper is too sparse.
I also thought in terms of neurons - that's why I thought a Neuron is a class in programming that has a list of inputs.
https://www.codeproject.com/Articles/79 ... he-Unknown
Naive approach - around 3 million iops.

But you need to know that fundamentally you multiply a weight matrix https://en.wikipedia.org/wiki/Matrix_(mathematics) by your input vector. That does the same thing - but is much cheaper in terms of compute. So it's literally only (matrix multiplication + activation function) per layer.
Math approach on CPU like stockfish - 3 Billion iops.

Matrix multiplication is insanely fast. Training is the hard part. But if you have a trained model - you literally have baked-in weights (the matrix) and can use the tensor cores of your GPU - which are two orders of magnitude faster than any CPU.
Optimized GPU Cublas on Tensorcores - 285 Trillion iops.
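The "(matrix multiplication + activation) per layer" idea can be sketched in a few lines of C++. This is a minimal illustration, not Stockfish's actual code: it assumes int8 weights with int32 accumulation and a clipped ReLU, which is how NNUE-style nets are typically quantised; the name `dense_layer` is made up.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One dense layer: out = clipped_relu(W * in + bias).
// int8 weights with int32 accumulation; the clipped ReLU bounds the
// activations so they stay representable in 8 bits for the next layer.
std::vector<int32_t> dense_layer(const std::vector<std::vector<int8_t>>& W,
                                 const std::vector<int32_t>& bias,
                                 const std::vector<int8_t>& in) {
    std::vector<int32_t> out(W.size());
    for (size_t r = 0; r < W.size(); ++r) {
        int32_t sum = bias[r];
        for (size_t c = 0; c < in.size(); ++c)
            sum += int32_t(W[r][c]) * int32_t(in[c]);
        out[r] = std::clamp(sum, 0, 127);  // clipped ReLU
    }
    return out;
}
```

On a CPU the inner loop maps naturally onto AVX2 8-bit multiply-add instructions, which is where the speed the post mentions comes from.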
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: NNUE - Efficiently Updatable Network - understanding

Post by JohnWoe »

Fading gradient problem when you have many layers. Deep network. In RNN especially. Tho NNUE is just CNN. Deeper layers learn shit when you multiply deep gradients.
derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: NNUE - Efficiently Updatable Network - understanding

Post by derjack »

You have around 40k inputs, but at most 32 of them will be 'ones' in the input layer, so you only have 32 * 512 (or 1024?) calculations for the first layer. Furthermore, when you move a piece other than the king, you may calculate only the change in the input. The rest of the NNUE is small and relatively cheap.
dangi12012 wrote: Mon Oct 11, 2021 5:37 pm But you need to know that fundamentally you multiply a weight matrix https://en.wikipedia.org/wiki/Matrix_(mathematics) by your inputs. Which will do the same thing - but is much cheaper in terms of compute. So its literally only (matrix mult - activation function) per layer.
Math approach on CPU like stockfish - 3 Billion iops.

Matrix multiplication is insanely fast. Training is the hard part. But if you have a trained model - you literally have baked in weights (the matrix) and can use the tensor cores of your gpu - which are two order of magnitude faster than any cpu.
Optimized GPU Cublas on Tensorcores - 285 Trillion iops.
Unfortunately, even if GPU speed were infinite, the latency of the GPU<->CPU transfer is the biggest bottleneck here, making it not worth it during play.
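The incremental update described above can be sketched as follows (a toy example; `update_accumulator` and the width `kOut` are hypothetical names/sizes). A quiet non-king move flips exactly two input features - the old square's feature turns off, the new square's turns on - so the accumulator is patched with one row subtraction and one row addition instead of a full recomputation:

```cpp
#include <array>
#include <cstdint>

constexpr int kOut = 4;  // hypothetical accumulator width
using Row = std::array<int16_t, kOut>;

// Patch the first-layer accumulator in O(kOut): subtract the weight row
// of the feature that turned off, add the row of the one that turned on.
void update_accumulator(std::array<int32_t, kOut>& acc,
                        const Row& removed, const Row& added) {
    for (int o = 0; o < kOut; ++o)
        acc[o] += added[o] - removed[o];
}
```

Captures remove one extra feature and king moves invalidate the whole accumulator, but the common case stays this cheap - which is the "efficiently updatable" part of the name.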
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: NNUE - Efficiently Updatable Network - understanding

Post by connor_mcmonigle »

JohnWoe wrote: Mon Oct 11, 2021 10:50 pm Fading gradient problem when you have many layers. Deep network. In RNN especially. Tho NNUE is just CNN. Deeper layers learn shit when you multiply deep gradients.
Nope.
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: NNUE - Efficiently Updatable Network - understanding

Post by hgm »

dangi12012 wrote: Mon Oct 11, 2021 2:19 pmThe weight matrix sizes are:
W1: 125388 x 256
W2: 512 x 32
W3: 32 x 32
W4: 32 x 1

Which means that the input goes from 125388 -> 256 -> 512 -> 32 -> 32 -> 1
I don't think that is correct; there is no 256 -> 512 step. The point is that the first weight matrix uses all weights twice, because there is color-reversal symmetry.

The first layer is really just 256 x 64 x 2 sets of Piece-Square Tables: 256 for each location of the King of the side to move, and 256 for each location of the other King. Each set of PST has 64x10 weights (as there is no PST for the King itself, just for P, N, B, R and Q). So the first-layer weights are an array PST[n][kingColor][kingSqr][square][coloredPieceType], where n is the set number (0-255). For each n and kingColor you then sum the PST weights for the given king location over all other pieces, as is usual for calculating a PST evaluation. So you end up with 2 (=colors) x 256 PST evaluations.

The easily updatable aspect comes from the fact that you update the 256 PST evaluations incrementally, only adjusting for the moved piece (unless the King moves, which is rare).
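The PST view of the first layer could be sketched like this - sizes are shrunk for illustration (the paper's values are 256 sets, 64 king squares, 64 piece squares, 10 colored piece types), and the names `PST` and `pst_eval` are made up:

```cpp
#include <cstdint>

// Hypothetical reduced sizes; see the text for the real dimensions.
constexpr int kSets   = 2;  // number of PST sets (256 in the paper)
constexpr int kKingSq = 4;  // king squares (64 in reality)
constexpr int kSq     = 4;  // piece squares (64 in reality)
constexpr int kPiece  = 2;  // colored non-king piece types (10 in reality)

// First-layer weights, viewed as king-square-indexed piece-square tables.
int16_t PST[kSets][kKingSq][kSq][kPiece];

// One first-layer output = one PST evaluation: sum the weight of every
// non-king piece on the board, with the table selected by the king square.
int32_t pst_eval(int set, int kingSq,
                 const int squares[], const int pieces[], int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += PST[set][kingSq][squares[i]][pieces[i]];
    return sum;
}
```

Doing this for both colors' king squares gives the 2 x 256 evaluations the post describes, and a non-king move only changes one term of each sum.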