Binary neural network for chess?

Discussion of chess software programming and technical issues.

Moderator: Ras

AAce3
Posts: 80
Joined: Fri Jul 29, 2022 1:30 am
Full name: Aaron Li

Binary neural network for chess?

Post by AAce3 »

Hey all,
Out of curiosity, has anyone tried using a binary neural network (BNN) for chess, i.e. one with weights quantized to 1 bit? I'd imagine it would be fast enough for NNUE, especially considering that the bitboard representation can be fed in natively. I'm not super well versed in the subject, but I believe that convolutions can also be quantized down to 1 bit? Perhaps an interesting idea to try.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary neural network for chess?

Post by dangi12012 »

Yes, CUDA supports experimental 1-bit convolutions, with an expected performance of more than 2 peta-ops. A 1-bit dot product is popcnt(xnor(x, y)); a threshold like popcnt(xnor(x, y)) < 32 is therefore the same as popcnt(xor(x, y)) > 32 for 64-bit operands.
But you need to read the "vertical bits", so without hardware support (a CUDA GPU, an FPGA, or a CPU with the Galois-field extensions, GFNI) it goes nowhere.
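To illustrate the identity, here is a minimal scalar sketch (my own example, not code from this thread): with 64 inputs and 64 weights packed into one word each and bits encoding +1/-1, the dot product reduces to an XOR and a popcount.

Code: Select all

#include <bit>
#include <cstdint>

// Bit convention: 1 encodes +1, 0 encodes -1.
// agreements = popcount(xnor) = 64 - popcount(xor), so
// dot = agreements - disagreements = 64 - 2 * popcount(x ^ w)
inline int binary_dot64(std::uint64_t x, std::uint64_t w) {
    return 64 - 2 * std::popcount(x ^ w);
}

// Sign activation: the dot product is positive exactly when
// popcount(xnor) > 32, i.e. popcount(xor) < 32.
inline bool binary_sign64(std::uint64_t x, std::uint64_t w) {
    return std::popcount(x ^ w) < 32;
}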

Training is hard, because with 1-bit weights, what is the gradient?
But I tackled this for a move generator to prove general viability:
https://github.com/Gigantua/Chess_BinaryNeuralNetwork
I even found a way to run it quite fast on normal CPUs by computing 32 byte-wise popcounts (each compared against a threshold of 4) in a handful of instructions.

Code to compute the up to 14 bits attacked by a rook:

Code: Select all

// ChessBNN::popcount8x32_SmallerThan4 comes from the repository linked above;
// judging by its name, it returns 0xFF in each byte whose popcount is below 4.
// For every output bit i: XOR the 256-bit input with the i-th 256-bit weight
// row, flag the bytes with fewer than 4 differing bits, and set the bit when
// more than 16 of the 32 bytes are flagged.
for (int i = 0; i < 14; ++i) {
	const __m256i row  = _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i * 32));
	const __m256i diff = _mm256_xor_si256(input, row);
	const uint32_t flags = _mm256_movemask_epi8(ChessBNN::popcount8x32_SmallerThan4(diff));
	result |= static_cast<uint64_t>(std::popcount(flags) > 16) << i;
}
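That helper isn't shown in the post. For reference, here is a plausible AVX2 reconstruction using the well-known 4-bit LUT popcount method; this is my own sketch of what such a function could look like, not necessarily the repository's actual implementation:

Code: Select all

#include <immintrin.h>

// For each of the 32 bytes in v: compute its popcount via two 4-bit
// table lookups, then return 0xFF where that count is smaller than 4.
inline __m256i popcount8x32_SmallerThan4(__m256i v) {
    const __m256i lut  = _mm256_setr_epi8(0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
                                          0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
    const __m256i mask = _mm256_set1_epi8(0x0F);
    const __m256i lo   = _mm256_and_si256(v, mask);
    const __m256i hi   = _mm256_and_si256(_mm256_srli_epi16(v, 4), mask);
    const __m256i cnt  = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                                         _mm256_shuffle_epi8(lut, hi));
    return _mm256_cmpgt_epi8(_mm256_set1_epi8(4), cnt);  // 0xFF iff cnt < 4
}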
In summary: a very interesting topic, especially for the first layer.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Witek
Posts: 87
Joined: Thu Oct 07, 2021 12:48 am
Location: Warsaw, Poland
Full name: Michal Witanowski

Re: Binary neural network for chess?

Post by Witek »

I think it could be useful in the first one or two layers, where it would help with recognizing patterns. But you still need a rather smooth and continuous output from the network.
Author of Caissa Chess Engine: https://github.com/Witek902/Caissa
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary neural network for chess?

Post by dangi12012 »

Witek wrote: Wed Aug 10, 2022 1:31 pm I think it could be useful in the first one or two layers, where it would help with recognizing patterns. But you still need a rather smooth and continuous output from the network.
Exactly. This would be the way to go from a bitboard to a second layer without first expanding to a 64x4-bit nibble representation.
The cuBLAS/CUTLASS templates do not support it; you have to go one level deeper, and that is very hard for non-NVIDIA engineers. You need to call the tensor-op intrinsics "by hand".

You get these types:
https://docs.nvidia.com/cuda/cuda-c-pro ... ma-subbyte
And the code will look like the example on page 6 of this paper:
https://arxiv.org/pdf/2006.16578.pdf
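To give a sense of what calling those intrinsics by hand involves, here is a minimal single-warp sketch using the experimental WMMA API from the docs above. The kernel name, tile shapes, and strides are my assumptions; a real evaluator would add tiling, layout handling, and activations on top, and the whole API lives in an experimental namespace that may change between CUDA releases (sm_75 or newer required).

Code: Select all

#include <mma.h>
using namespace nvcuda;
using namespace nvcuda::wmma::experimental;

// One warp computes an 8x8 int32 tile C = popc(A xor B) over a binary
// inner dimension of K = 128 bits. A is 8x128 bits row-major and B is
// 128x8 bits col-major, both packed 32 bits per unsigned word.
__global__ void bmma_8x8x128(const unsigned* A, const unsigned* B, int* C) {
    wmma::fragment<wmma::matrix_a, 8, 8, 128, precision::b1, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 8, 8, 128, precision::b1, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 8, 8, 128, int> acc;

    wmma::fill_fragment(acc, 0);
    wmma::load_matrix_sync(a, A, 128);  // leading dimension counted in bits
    wmma::load_matrix_sync(b, B, 128);
    wmma::bmma_sync(acc, a, b, acc, bmma_bit_op_xor, bmma_accumulate_op_popc);
    wmma::store_matrix_sync(C, acc, 8, wmma::mem_row_major);
}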

And you need to build a NN from that. Doable, but a few weeks of (full-time) work imo.
Still, that is the way to get maximum binary-neural-network performance out of consumer machines in 2022.
A naive approach will get you nowhere fast.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer