256 in NNUE?

Discussion of chess software programming and technical issues.

kinderchocolate
Posts: 452
Joined: Mon Nov 01, 2010 5:55 am
Full name: Ted Wong
Contact:

256 in NNUE?

Post by kinderchocolate » Thu Jan 28, 2021 7:16 pm

https://www.chessprogramming.org/Stockfish_NNUE. In the NNUE network architecture I can see 256 neurons in the first hidden layer, and their weights are stored as int16.

If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? Also, what is the reason for keeping it at 256 and not some other number? It looks like, just one layer further into the network, the weights drop sharply to int8.
Chessable Technical Lead. PlayMagnus Group Principal Software Engineer. Leading chess AI in the group of companies. https://www.linkedin.com/in/scchess. SmallChess. http://smallchess.com. https://twitter.com/scchess. https://www.facebook.com/scchess.

tomitank
Posts: 258
Joined: Sat Mar 04, 2017 11:24 am
Location: Hungary

Re: 256 in NNUE?

Post by tomitank » Thu Jan 28, 2021 8:03 pm

kinderchocolate wrote:
Thu Jan 28, 2021 7:16 pm
https://www.chessprogramming.org/Stockfish_NNUE. In the NNUE network architecture I can see 256 neurons in the first hidden layer, and their weights are stored as int16.

If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? Also, what is the reason for keeping it at 256 and not some other number? It looks like, just one layer further into the network, the weights drop sharply to int8.
Yes, you can reduce it. 256 is just a number that tested well (I think). There are some recommendations for the number of neurons, but there is no perfect recipe.

BeyondCritics
Posts: 385
Joined: Sat May 05, 2012 12:48 pm
Full name: Oliver Roese

Re: 256 in NNUE?

Post by BeyondCritics » Thu Jan 28, 2021 8:17 pm

kinderchocolate wrote:
Thu Jan 28, 2021 7:16 pm
...
If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? ...
The number of weights tells you something about the dimension of your problem; the size of the ints tells you something about your precision. Obviously there is no correspondence between them. You can of course try to vary them, as already mentioned. AFAIK there is still no recommendable guideline for preferring one value over another, except for corner cases.
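
To make that concrete, here is a toy sketch (the names are purely illustrative, not Stockfish's actual code; 41024 is the HalfKP input size, used only as an example): the width of a layer and the integer type of its weights are two independent knobs.

#include <stdint.h>

/* Toy first-layer description: the number of neurons (dimension) and the
 * weight type (precision) are chosen independently -- shrinking one does
 * not shrink the other. In a real engine this would live on the heap. */
typedef int16_t weight_t;        /* precision knob: int16_t, int8_t, float, ... */
#define HIDDEN_NEURONS 256       /* dimension knob: 256, 128, or anything else */
#define INPUT_FEATURES 41024     /* HalfKP feature count, as an example */

typedef struct {
    weight_t weights[INPUT_FEATURES][HIDDEN_NEURONS];
    weight_t biases[HIDDEN_NEURONS];
} FirstLayer;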

AndrewGrant
Posts: 1101
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: 256 in NNUE?

Post by AndrewGrant » Fri Jan 29, 2021 4:32 am

You can have a network with 2x128 neurons in L1 as opposed to 2x256 like SF. It's what I do in Ethereal, and it works fine.
The number of weights has nothing to do with the size of the individual weights.
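
For illustration, the "2x" is just one accumulator per perspective, concatenated before the next layer; a rough sketch with made-up names, not Ethereal's actual code:

#include <stdint.h>

#define L1_NEURONS 128   /* 128 here, 256 in Stockfish -- just a constant to tune */

/* One int16 accumulator per perspective; the next layer sees both halves
 * concatenated, hence "2 x L1_NEURONS". The bit width stays int16 whether
 * L1_NEURONS is 128 or 256. */
typedef struct {
    int16_t values[2][L1_NEURONS];
} Accumulator;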

kinderchocolate
Posts: 452
Joined: Mon Nov 01, 2010 5:55 am
Full name: Ted Wong
Contact:

Re: 256 in NNUE?

Post by kinderchocolate » Sat Jan 30, 2021 10:06 am

Thanks! Follow-up questions:

1.) Why int16 for the weights? Shouldn't the weights be floating-point numbers?
2.) Why int16 in the first layer but int8 in the rest of the network?
Chessable Technical Lead. PlayMagnus Group Principal Software Engineer. Leading chess AI in the group of companies. https://www.linkedin.com/in/scchess. SmallChess. http://smallchess.com. https://twitter.com/scchess. https://www.facebook.com/scchess.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: 256 in NNUE?

Post by hgm » Sat Jan 30, 2021 10:57 am

My guess is that this is just trial and error, heavily influenced by what the hardware can do in terms of SIMD instructions. The weights of the first layer never have to be multiplied (as the inputs are just 0/1), and they are furthermore updated incrementally, which in practice means only adding / subtracting a very small fraction of them (those for which the input toggled) to the KPST sums. So I suppose there is very little gain in reducing the precision to 8 bits. Which of course would not be enough of a reason to refrain from doing it if making them 16 bits served no purpose at all.
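
Roughly, the incremental update looks like this (a minimal sketch with invented names, not Stockfish's real ones): because the inputs are 0/1, a toggled feature just adds or subtracts one column of first-layer weights to the accumulated sums.

#include <stdint.h>

#define L1_NEURONS 256

/* When a feature flips from 0 to 1, add its weight column to the running
 * sums; when it flips back, subtract it. No multiplications are needed,
 * so there is little speed to gain from narrowing these weights to 8 bits. */
static void feature_added(int16_t acc[L1_NEURONS],
                          const int16_t weights[][L1_NEURONS], int feature)
{
    for (int i = 0; i < L1_NEURONS; i++)
        acc[i] += weights[feature][i];
}

static void feature_removed(int16_t acc[L1_NEURONS],
                            const int16_t weights[][L1_NEURONS], int feature)
{
    for (int i = 0; i < L1_NEURONS; i++)
        acc[i] -= weights[feature][i];
}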

derjack
Posts: 13
Joined: Fri Dec 27, 2019 7:47 pm
Full name: Jacek Dermont

Re: 256 in NNUE?

Post by derjack » Sat Jan 30, 2021 12:15 pm

It's called neural network quantization. You convert the weights into int16 or int8 and lose some precision, which may worsen the net a little, but SIMD calculations then become faster because more weights fit into the registers. As for int16 in the first layer: this reduces the memory by 2x (floats are 32 bits), so more of the neural network, or even all of it, can fit into cache, and thus again it becomes faster.
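
A minimal sketch of that quantization step (the scale factor and names are just examples, not what any particular engine uses):

#include <math.h>
#include <stdint.h>

#define QSCALE 64   /* example fixed-point scale; real nets pick their own */

/* Map a trained float weight to int8: multiply by the scale, round, clamp.
 * You lose a little precision but use a quarter of the memory of float32,
 * and four times as many weights fit into each SIMD register. */
static int8_t quantize_weight(float w)
{
    int v = (int)lroundf(w * QSCALE);
    if (v >  127) v =  127;
    if (v < -128) v = -128;
    return (int8_t)v;
}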

AndrewGrant
Posts: 1101
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: 256 in NNUE?

Post by AndrewGrant » Sat Jan 30, 2021 2:21 pm

kinderchocolate wrote:
Sat Jan 30, 2021 10:06 am
Thanks! Follow-up questions:

1.) Why int16 for the weights? Shouldn't the weights be floating-point numbers?
2.) Why int16 in the first layer but int8 in the rest of the network?
https://chess.stackexchange.com/questio ... 3736#33736
That answers why the input layer is 16 bits and the others 8 bits. TL;DR: due to the way it is computed _without_ multiplication.

As for floats -- you can pack ints more densely. Also, you remove the possibility of variance in how compilers and platforms treat floats: not all floating-point expressions will be evaluated the same way on all compilers / platforms.
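
For example, a hidden-layer dot product done in integers is bit-identical on every compiler and platform, and packs more weights per register; a rough sketch (illustrative only):

#include <stdint.h>

/* int8 inputs times int8 weights, accumulated in int32: the result is exactly
 * the same everywhere, unlike float expressions, and a 256-bit SIMD register
 * holds 32 of these int8 values versus only 8 floats. */
static int32_t dot_product_i8(const int8_t *in, const int8_t *w, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)in[i] * (int32_t)w[i];
    return sum;
}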

Sesse
Posts: 284
Joined: Mon Apr 30, 2018 9:51 pm
Contact:

Re: 256 in NNUE?

Post by Sesse » Sat Jan 30, 2021 5:46 pm

AndrewGrant wrote:
Sat Jan 30, 2021 2:21 pm
As for floats -- you can pack ints more densely. Also, you remove the possibility of variance in how compilers and platforms treat floats: not all floating-point expressions will be evaluated the same way on all compilers / platforms.
But most certainly will be, especially now that x87 is effectively dead and gone. (The biggest differences are in library functions like exp() and the like.)
