256 in NNUE?

Discussion of chess software programming and technical issues.

kinderchocolate
Posts: 452
Joined: Mon Nov 01, 2010 5:55 am
Full name: Ted Wong
Contact:

256 in NNUE?

Post by kinderchocolate » Thu Jan 28, 2021 7:16 pm

https://www.chessprogramming.org/Stockfish_NNUE. In the NNUE network architecture I can see 256 neurons in the first hidden layer, and their weights are stored as int16.

If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? Also, what is the reason for keeping it at 256 and not some other number? It looks like, just one layer further into the network, the weights drop sharply to int8.
Chessable Technical Lead. PlayMagnus Group Principal Software Engineer. Leading chess AI in the group of companies. https://www.linkedin.com/in/scchess. SmallChess. http://smallchess.com. https://twitter.com/scchess. https://www.facebook.com/scchess.

tomitank
Posts: 258
Joined: Sat Mar 04, 2017 11:24 am
Location: Hungary

Re: 256 in NNUE?

Post by tomitank » Thu Jan 28, 2021 8:03 pm

kinderchocolate wrote:
Thu Jan 28, 2021 7:16 pm
https://www.chessprogramming.org/Stockfish_NNUE. In the NNUE network architecture I can see 256 neurons in the first hidden layer, and their weights are stored as int16.

If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? Also, what is the reason for keeping it at 256 and not some other number? It looks like, just one layer further into the network, the weights drop sharply to int8.
Yes, you can reduce it. 256 is just a number that tested well (I think). There are some recommendations for the number of neurons, but there is no perfect recipe.

BeyondCritics
Posts: 385
Joined: Sat May 05, 2012 12:48 pm
Full name: Oliver Roese

Re: 256 in NNUE?

Post by BeyondCritics » Thu Jan 28, 2021 8:17 pm

kinderchocolate wrote:
Thu Jan 28, 2021 7:16 pm
...
If we reduce the number of hidden-layer neurons from, say, 256 to 128, will that reduce the size of the weights to int8? ...
The number of weights tells you something about the dimension of your problem; the size of the ints tells you something about your precision. Obviously there is no correspondence between them. You can of course try to vary them, as already mentioned. AFAIK there is still no recommendable guideline for preferring one value over another, except for corner cases.
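
To make that concrete, here is a toy sketch (the names are purely illustrative, not Stockfish's actual code; 41024 is the HalfKP input size, used only as an example): the width of a layer and the integer type of its weights are two independent knobs.

#include <stdint.h>

/* Toy first-layer description: the number of neurons (dimension) and the
 * weight type (precision) are chosen independently -- shrinking one does
 * not shrink the other. In a real engine this would live on the heap. */
typedef int16_t weight_t;        /* precision knob: int16_t, int8_t, float, ... */
#define HIDDEN_NEURONS 256       /* dimension knob: 256, 128, or anything else */
#define INPUT_FEATURES 41024     /* HalfKP feature count, as an example */

typedef struct {
    weight_t weights[INPUT_FEATURES][HIDDEN_NEURONS];
    weight_t biases[HIDDEN_NEURONS];
} FirstLayer;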

AndrewGrant
Posts: 1101
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: 256 in NNUE?

Post by AndrewGrant » Fri Jan 29, 2021 4:32 am

You can have a network with 2x128 neurons in L1 as opposed to 2x256 like SF. It's what I do in Ethereal, and it works fine.
The number of weights has nothing to do with the size of the individual weights.
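
For illustration, the "2x" is just one accumulator per perspective, concatenated before the next layer; a rough sketch with made-up names, not Ethereal's actual code:

#include <stdint.h>

#define L1_NEURONS 128   /* 128 here, 256 in Stockfish -- just a constant to tune */

/* One int16 accumulator per perspective; the next layer sees both halves
 * concatenated, hence "2 x L1_NEURONS". The bit width stays int16 whether
 * L1_NEURONS is 128 or 256. */
typedef struct {
    int16_t values[2][L1_NEURONS];
} Accumulator;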

kinderchocolate
Posts: 452
Joined: Mon Nov 01, 2010 5:55 am
Full name: Ted Wong
Contact:

Re: 256 in NNUE?

Post by kinderchocolate » Sat Jan 30, 2021 10:06 am

Thanks! Follow-up questions:

1.) Why int16 for the weights? Shouldn't the weights be floating-point numbers?
2.) Why int16 in the first layer but int8 in the rest of the network?
Chessable Technical Lead. PlayMagnus Group Principal Software Engineer. Leading chess AI in the group of companies. https://www.linkedin.com/in/scchess. SmallChess. http://smallchess.com. https://twitter.com/scchess. https://www.facebook.com/scchess.

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: 256 in NNUE?

Post by hgm » Sat Jan 30, 2021 10:57 am

My guess is that this is just trial and error, heavily influenced by what the hardware can do in terms of SIMD instructions. The weights of the first layer never have to be multiplied (as the inputs are just 0/1), and they are furthermore updated incrementally, which in practice means only adding / subtracting a very small fraction of them (those for which the input toggled) to the KPST sums. So I suppose there is very little gain in reducing the precision to 8 bits. Which of course would not be enough of a reason to refrain from doing it if making them 16 bits served no purpose at all.
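
Roughly, the incremental update looks like this (a minimal sketch with invented names, not Stockfish's real ones): because the inputs are 0/1, a toggled feature just adds or subtracts one column of first-layer weights to the accumulated sums.

#include <stdint.h>

#define L1_NEURONS 256

/* When a feature flips from 0 to 1, add its weight column to the running
 * sums; when it flips back, subtract it. No multiplications are needed,
 * so there is little speed to gain from narrowing these weights to 8 bits. */
static void feature_added(int16_t acc[L1_NEURONS],
                          const int16_t weights[][L1_NEURONS], int feature)
{
    for (int i = 0; i < L1_NEURONS; i++)
        acc[i] += weights[feature][i];
}

static void feature_removed(int16_t acc[L1_NEURONS],
                            const int16_t weights[][L1_NEURONS], int feature)
{
    for (int i = 0; i < L1_NEURONS; i++)
        acc[i] -= weights[feature][i];
}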

derjack
Posts: 13
Joined: Fri Dec 27, 2019 7:47 pm
Full name: Jacek Dermont

Re: 256 in NNUE?

Post by derjack » Sat Jan 30, 2021 12:15 pm

It's called neural network quantization. You convert the weights into int16 or int8 and lose some precision, which may worsen the net a little, but SIMD calculations then become faster because more weights fit into the registers. As for int16 in the first layer: this reduces the memory by 2x (floats are 32 bits), so more of the neural network, or even all of it, can fit into cache, and thus again it becomes faster.
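
A minimal sketch of that quantization step (the scale factor and names are just examples, not what any particular engine uses):

#include <math.h>
#include <stdint.h>

#define QSCALE 64   /* example fixed-point scale; real nets pick their own */

/* Map a trained float weight to int8: multiply by the scale, round, clamp.
 * You lose a little precision but use a quarter of the memory of float32,
 * and four times as many weights fit into each SIMD register. */
static int8_t quantize_weight(float w)
{
    int v = (int)lroundf(w * QSCALE);
    if (v >  127) v =  127;
    if (v < -128) v = -128;
    return (int8_t)v;
}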

AndrewGrant
Posts: 1101
Joined: Tue Apr 19, 2016 4:08 am
Location: U.S.A
Full name: Andrew Grant
Contact:

Re: 256 in NNUE?

Post by AndrewGrant » Sat Jan 30, 2021 2:21 pm

kinderchocolate wrote:
Sat Jan 30, 2021 10:06 am
Thanks! Follow-up questions:

1.) Why int16 for the weights? Shouldn't the weights be floating-point numbers?
2.) Why int16 in the first layer but int8 in the rest of the network?
https://chess.stackexchange.com/questio ... 3736#33736
That answers why the input layer is 16 bits and the others 8 bits. TL;DR: due to the way it is computed _without_ multiplication.

As for floats -- you can pack ints more densely. Also, you remove the possibility of variance in how compilers and platforms treat floats: not all floating-point expressions will be evaluated the same way on all compilers / platforms.
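
For example, a hidden-layer dot product done in integers is bit-identical on every compiler and platform, and packs more weights per register; a rough sketch (illustrative only):

#include <stdint.h>

/* int8 inputs times int8 weights, accumulated in int32: the result is exactly
 * the same everywhere, unlike float expressions, and a 256-bit SIMD register
 * holds 32 of these int8 values versus only 8 floats. */
static int32_t dot_product_i8(const int8_t *in, const int8_t *w, int n)
{
    int32_t sum = 0;
    for (int i = 0; i < n; i++)
        sum += (int32_t)in[i] * (int32_t)w[i];
    return sum;
}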

Sesse
Posts: 284
Joined: Mon Apr 30, 2018 9:51 pm
Contact:

Re: 256 in NNUE?

Post by Sesse » Sat Jan 30, 2021 5:46 pm

AndrewGrant wrote:
Sat Jan 30, 2021 2:21 pm
As for floats -- you can pack ints more densely. Also, you remove the possibility of variance in how compilers and platforms treat floats: not all floating-point expressions will be evaluated the same way on all compilers / platforms.
But most certainly will be, especially now that x87 is effectively dead and gone. (The biggest differences are in library functions like exp() and the like.)
