Binary neural network for chess?

Discussion of chess software programming and technical issues.

Moderator: Ras

AAce3
Posts: 80
Joined: Fri Jul 29, 2022 1:30 am
Full name: Aaron Li

Binary neural network for chess?

Post by AAce3 »

Hey all,
Out of curiosity, has anyone tried using a binary neural network (BNN) for chess, i.e. one with weights quantized to 1 bit? I'd imagine it would be fast enough for NNUE, especially considering that the bitboard representation can be fed in natively. I'm not super well versed in the subject, but I believe that convolutions can also be quantized down to 1 bit? Perhaps an interesting idea to try.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary neural network for chess?

Post by dangi12012 »

Yes, CUDA supports experimental 1-bit convolutions, with an expected performance of more than 2 peta-ops. A 1-bit dot product is popcnt(xnor(x, y)); a threshold like popcnt(xnor(x, y)) < 32 is therefore the same as popcnt(xor(x, y)) > 32 for 64-bit operands.
But you need to read the "vertical bits", so without hardware support (a CUDA GPU, an FPGA, or a CPU with the Galois-field extensions, GFNI) it goes nowhere.
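To illustrate the identity, here is a minimal scalar sketch (my own example, not code from this thread): with 64 inputs and 64 weights packed into one word each and bits encoding +1/-1, the dot product reduces to an XOR and a popcount.

Code: Select all

#include <bit>
#include <cstdint>

// Bit convention: 1 encodes +1, 0 encodes -1.
// agreements = popcount(xnor) = 64 - popcount(xor), so
// dot = agreements - disagreements = 64 - 2 * popcount(x ^ w)
inline int binary_dot64(std::uint64_t x, std::uint64_t w) {
    return 64 - 2 * std::popcount(x ^ w);
}

// Sign activation: the dot product is positive exactly when
// popcount(xnor) > 32, i.e. popcount(xor) < 32.
inline bool binary_sign64(std::uint64_t x, std::uint64_t w) {
    return std::popcount(x ^ w) < 32;
}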

Training is hard, because with 1-bit weights, what is the gradient?
But I tackled this for a move generator to prove general viability:
https://github.com/Gigantua/Chess_BinaryNeuralNetwork
I even found a way to run it quite fast on normal CPUs by computing 32 byte-wise popcounts (each compared against a threshold of 4) in a handful of instructions.

Code to compute the up to 14 bits attacked by a rook:

Code: Select all

// ChessBNN::popcount8x32_SmallerThan4 comes from the repository linked above;
// judging by its name, it returns 0xFF in each byte whose popcount is below 4.
// For every output bit i: XOR the 256-bit input with the i-th 256-bit weight
// row, flag the bytes with fewer than 4 differing bits, and set the bit when
// more than 16 of the 32 bytes are flagged.
for (int i = 0; i < 14; ++i) {
	const __m256i row  = _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i * 32));
	const __m256i diff = _mm256_xor_si256(input, row);
	const uint32_t flags = _mm256_movemask_epi8(ChessBNN::popcount8x32_SmallerThan4(diff));
	result |= static_cast<uint64_t>(std::popcount(flags) > 16) << i;
}
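That helper isn't shown in the post. For reference, here is a plausible AVX2 reconstruction using the well-known 4-bit LUT popcount method; this is my own sketch of what such a function could look like, not necessarily the repository's actual implementation:

Code: Select all

#include <immintrin.h>

// For each of the 32 bytes in v: compute its popcount via two 4-bit
// table lookups, then return 0xFF where that count is smaller than 4.
inline __m256i popcount8x32_SmallerThan4(__m256i v) {
    const __m256i lut  = _mm256_setr_epi8(0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
                                          0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i mask = _mm256_set1_epi8(0x0F);
    const __m256i lo   = _mm256_and_si256(v, mask);
    const __m256i hi   = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask);
    const __m256i cnt  = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                         _mm256_shuffle_epi8(lut, hi));
    return _mm256_cmpgt_epi8(_mm256_set1_epi8(4), cnt);  // 0xFF iff cnt < 4
}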
In summary: a very interesting topic, especially for the first layer.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Witek
Posts: 87
Joined: Thu Oct 07, 2021 12:48 am
Location: Warsaw, Poland
Full name: Michal Witanowski

Re: Binary neural network for chess?

Post by Witek »

I think it could be useful in the first one or two layers, where it would help with recognizing patterns. But you still need a rather smooth and continuous output from the network.
Author of Caissa Chess Engine: https://github.com/Witek902/Caissa
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary neural network for chess?

Post by dangi12012 »

Witek wrote: Wed Aug 10, 2022 1:31 pm I think it could be useful in the first one or two layers, where it would help with recognizing patterns. But you still need a rather smooth and continuous output from the network.
Exactly. This would be the way to go from a bitboard to a second layer without first expanding to a 64x4-bit nibble representation.
The cuBLAS/CUTLASS templates do not support it; you have to go one level deeper, and that is very hard for non-NVIDIA engineers. You need to call the tensor-op intrinsics "by hand".

You get these types:
https://docs.nvidia.com/cuda/cuda-c-pro ... ma-subbyte
And the code will look like the example on page 6 of this paper:
https://arxiv.org/pdf/2006.16578.pdf
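To give a sense of what calling those intrinsics by hand involves, here is a minimal single-warp sketch using the experimental WMMA API from the docs above. The kernel name, tile shapes, and strides are my assumptions; a real evaluator would add tiling, layout handling, and activations on top, and the whole API lives in an experimental namespace that may change between CUDA releases (sm_75 or newer required).

Code: Select all

#include <mma.h>
using namespace nvcuda;
using namespace nvcuda::wmma::experimental;

// One warp computes an 8x8 int32 tile C = popc(A xor B) over a binary
// inner dimension of K = 128 bits. A is 8x128 bits row-major and B is
// 128x8 bits col-major, both packed 32 bits per unsigned word.
__global__ void bmma_8x8x128(const unsigned* A, const unsigned* B, int* C) {
    wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> acc;

    wmma::fill_fragment(acc, 0);
    wmma::load_matrix_sync(a, A, 128);  // leading dimension counted in bits
    wmma::load_matrix_sync(b, B, 128);
    wmma::bmma_sync(acc, a, b, acc, bmma_bit_op_xor, bmma_accumulate_op_popc);
    wmma::store_matrix_sync(C, acc, 8, wmma::mem_row_major);
}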

And you need to build a NN from that. Doable, but a few weeks of (full-time) work imo.
Still, that is the way to get maximum binary-neural-network performance out of consumer machines in 2022.
A naive approach will get you nowhere fast.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer