NNUE - Efficiently Updatable Network - understanding

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

NNUE - Efficiently Updatable Network - understanding

Post by dangi12012 »

I read the whitepaper on NNUE and it clearly states that it is a fully connected network. It goes on to say it's 8-bit internally for the hidden layers.
https://github.com/asdfjkl/nnue/blob/main/nnue_en.pdf

The weight matrix sizes are:
W1: 125388 x 256
W2: 512 x 32
W3: 32 x 32
W4: 32 x 1

Which means that the input goes from 125388 -> 256 -> 512 -> 32 -> 32 -> 1
I can see why 32 and 8 bit are really optimal here, since 8 x 32 = 256 bits fits perfectly into a single AVX2 register, which also has the right instructions for multiplication / add. The paper also describes how linear models hit their limits sooner in shogi than in chess - since what we know intrinsically is that the value of a piece really depends on the position, which a linear model cannot capture. For example, if a queen is blocked out of the game, it's not worth much even when placed on a "good" square. Neural networks solve that problem.

2 Questions
How is it efficiently updatable since it has so many inputs and is fully connected?

What exactly is the overparametrisation for the network, since it has more inputs than pieces? The first multiplication is by W1, a 125388 x 256 matrix, applied to an input vector of length 125388, which yields a vector of length 256.
The paper talks about "TNK evolution Turbo type D" which is a shogi program.

AVX2 Cheat sheet:
https://db.in.tum.de/~finis/x86%20intri ... 20v1.0.pdf
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: NNUE - Efficiently Updatable Network - understanding

Post by mvanthoor »

dangi12012 wrote: Mon Oct 11, 2021 2:19 pm What exactly is the overparametrisation for the network since it has more inputs than pieces?
Without having read the whitepaper and without being bothered or blessed with any kind of knowledge about NNUE at this point, I'm going to hazard a guess that the extra inputs are more information.

A chess engine needs more than only the placement of the pieces; it also needs lots of information about the relationships between those pieces: are there doubled pawns? Are there open files? Is the king safe / protected on the edge, or in the middle of the board (the first is good in the middlegame, the second in the endgame)? Are the knights on outposts, are the bishops not blocked in? Etc...

You need all sorts of information to accurately evaluate a position, and I assume the inputs are used to put this information about the current position into the network.

I _could_ be completely wrong though...
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
yeni_sekme
Posts: 39
Joined: Mon Mar 01, 2021 7:51 pm
Location: İstanbul, Turkey
Full name: Ömer Faruk Tutkun

Re: NNUE - Efficiently Updatable Network - understanding

Post by yeni_sekme »

Great resource for Stockfish NNUE:
https://github.com/glinscott/nnue-pytor ... cs/nnue.md

- There are very many input neurons, but only a few of them are active at a time, so you don't do a multiplication in the first layer - you just add up the weights of the active neurons.
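That sparse first layer can be sketched like this - a toy C++ example with made-up sizes (`kFeatures`, `kOut`; real Stockfish nets use tens of thousands of features and 256 accumulator outputs per perspective). Because each input is 0 or 1, the "multiplication" degenerates into summing the weight rows of the active features:

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Toy sizes; real nets use ~41k features and 256 outputs per perspective.
constexpr int kFeatures = 8;  // hypothetical small feature count
constexpr int kOut      = 4;  // hypothetical accumulator width

// First-layer weights: one row of kOut values per input feature.
using Weights = std::array<std::array<int16_t, kOut>, kFeatures>;

// Inputs are 0/1, so the first layer is just a sum of the weight rows
// belonging to the active features -- no multiplication needed.
std::array<int32_t, kOut> first_layer(const Weights& w,
                                      const std::vector<int>& active) {
    std::array<int32_t, kOut> acc{};  // zero-initialised accumulator
    for (int f : active)
        for (int o = 0; o < kOut; ++o)
            acc[o] += w[f][o];
    return acc;
}
```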
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: NNUE - Efficiently Updatable Network - understanding

Post by dangi12012 »

yeni_sekme wrote: Mon Oct 11, 2021 4:13 pm Great resource for Stockfish NNUE:
https://github.com/glinscott/nnue-pytor ... cs/nnue.md
Thank you. That was the level of detail I was looking for. The whitepaper is too sparse.
I also thought in terms of neurons - that's why I thought a Neuron is a class in programming that has a list of inputs.
https://www.codeproject.com/Articles/79 ... he-Unknown
Naive approach - around 3 million iops.

But you need to know that fundamentally you multiply a weight matrix https://en.wikipedia.org/wiki/Matrix_(mathematics) by your input vector. That does the same thing - but is much cheaper in terms of compute. So it's literally only (matrix multiplication + activation function) per layer.
Math approach on CPU like stockfish - 3 Billion iops.

Matrix multiplication is insanely fast. Training is the hard part. But if you have a trained model - you literally have baked-in weights (the matrix) and can use the tensor cores of your GPU - which are two orders of magnitude faster than any CPU.
Optimized GPU Cublas on Tensorcores - 285 Trillion iops.
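The "(matrix multiplication + activation) per layer" idea can be sketched in a few lines of C++. This is a minimal illustration, not Stockfish's actual code: it assumes int8 weights with int32 accumulation and a clipped ReLU, which is how NNUE-style nets are typically quantised; the name `dense_layer` is made up.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One dense layer: out = clipped_relu(W * in + bias).
// int8 weights with int32 accumulation; the clipped ReLU bounds the
// activations so they stay representable in 8 bits for the next layer.
std::vector<int32_t> dense_layer(const std::vector<std::vector<int8_t>>& W,
                                 const std::vector<int32_t>& bias,
                                 const std::vector<int8_t>& in) {
    std::vector<int32_t> out(W.size());
    for (size_t r = 0; r < W.size(); ++r) {
        int32_t sum = bias[r];
        for (size_t c = 0; c < in.size(); ++c)
            sum += int32_t(W[r][c]) * int32_t(in[c]);
        out[r] = std::clamp(sum, 0, 127);  // clipped ReLU
    }
    return out;
}
```

On a CPU the inner loop maps naturally onto AVX2 8-bit multiply-add instructions, which is where the speed the post mentions comes from.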
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
JohnWoe
Posts: 491
Joined: Sat Mar 02, 2013 11:31 pm

Re: NNUE - Efficiently Updatable Network - understanding

Post by JohnWoe »

Fading gradient problem when you have many layers. Deep network. In RNN especially. Tho NNUE is just CNN. Deeper layers learn shit when you multiply deep gradients.
derjack
Posts: 16
Joined: Fri Dec 27, 2019 8:47 pm
Full name: Jacek Dermont

Re: NNUE - Efficiently Updatable Network - understanding

Post by derjack »

You have around 40k inputs, but at most 32 of them will be 'ones' in the input layer, so you only have 32 * 512 (or 1024?) calculations for the first layer. Furthermore, when you move a piece other than the king, you may calculate only the change in the input. The rest of the NNUE is small and relatively cheap.
dangi12012 wrote: Mon Oct 11, 2021 5:37 pm But you need to know that fundamentally you multiply a weight matrix https://en.wikipedia.org/wiki/Matrix_(mathematics) by your inputs. Which will do the same thing - but is much cheaper in terms of compute. So its literally only (matrix mult - activation function) per layer.
Math approach on CPU like stockfish - 3 Billion iops.

Matrix multiplication is insanely fast. Training is the hard part. But if you have a trained model - you literally have baked in weights (the matrix) and can use the tensor cores of your gpu - which are two order of magnitude faster than any cpu.
Optimized GPU Cublas on Tensorcores - 285 Trillion iops.
Unfortunately, even if GPU speed were infinite, the latency of the GPU<->CPU transfer is the biggest bottleneck here, making it not worth it during play.
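The incremental update described above can be sketched as follows (a toy example; `update_accumulator` and the width `kOut` are hypothetical names/sizes). A quiet non-king move flips exactly two input features - the old square's feature turns off, the new square's turns on - so the accumulator is patched with one row subtraction and one row addition instead of a full recomputation:

```cpp
#include <array>
#include <cstdint>

constexpr int kOut = 4;  // hypothetical accumulator width
using Row = std::array<int16_t, kOut>;

// Patch the first-layer accumulator in O(kOut): subtract the weight row
// of the feature that turned off, add the row of the one that turned on.
void update_accumulator(std::array<int32_t, kOut>& acc,
                        const Row& removed, const Row& added) {
    for (int o = 0; o < kOut; ++o)
        acc[o] += added[o] - removed[o];
}
```

Captures remove one extra feature and king moves invalidate the whole accumulator, but the common case stays this cheap - which is the "efficiently updatable" part of the name.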
connor_mcmonigle
Posts: 533
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: NNUE - Efficiently Updatable Network - understanding

Post by connor_mcmonigle »

JohnWoe wrote: Mon Oct 11, 2021 10:50 pm Fading gradient problem when you have many layers. Deep network. In RNN especially. Tho NNUE is just CNN. Deeper layers learn shit when you multiply deep gradients.
Nope.
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: NNUE - Efficiently Updatable Network - understanding

Post by hgm »

dangi12012 wrote: Mon Oct 11, 2021 2:19 pmThe weight matrix sizes are:
W1: 125388 x 256
W2: 512 x 32
W3: 32 x 32
W4: 32 x 1

Which means that the input goes from 125388 -> 256 -> 512 -> 32 -> 32 -> 1
I don't think that is correct; there is no 256 -> 512 step. The point is that the first weight matrix uses all weights twice, because there is color-reversal symmetry.

The first layer is really just 256 x 64 x 2 sets of Piece-Square Tables: 256 for each location of the King of the side to move, and 256 for each location of the other King. Each set of PST has 64x10 weights (as there is no PST for the King itself, just for P, N, B, R and Q). So the first-layer weights are an array PST[n][kingColor][kingSqr][square][coloredPieceType], where n is the set number (0-255). For each n and kingColor you then sum the PST weights for the given king location over all other pieces, as is usual for calculating a PST evaluation. So you end up with 2 (=colors) x 256 PST evaluations.

The easily updatable aspect comes from the fact that you update the 256 PST evaluations incrementally, only adjusting for the moved piece (unless the King moves, which is rare).
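The PST view of the first layer could be sketched like this - sizes are shrunk for illustration (the paper's values are 256 sets, 64 king squares, 64 piece squares, 10 colored piece types), and the names `PST` and `pst_eval` are made up:

```cpp
#include <cstdint>

// Hypothetical reduced sizes; see the text for the real dimensions.
constexpr int kSets   = 2;  // number of PST sets (256 in the paper)
constexpr int kKingSq = 4;  // king squares (64 in reality)
constexpr int kSq     = 4;  // piece squares (64 in reality)
constexpr int kPiece  = 2;  // colored non-king piece types (10 in reality)

// First-layer weights, viewed as king-square-indexed piece-square tables.
int16_t PST[kSets][kKingSq][kSq][kPiece];

// One first-layer output = one PST evaluation: sum the weight of every
// non-king piece on the board, with the table selected by the king square.
int32_t pst_eval(int set, int kingSq,
                 const int squares[], const int pieces[], int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += PST[set][kingSq][squares[i]][pieces[i]];
    return sum;
}
```

Doing this for both colors' king squares gives the 2 x 256 evaluations the post describes, and a non-king move only changes one term of each sum.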