Binary Neural Networks Sliding Piece Inference [Release]

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Binary Neural Networks Sliding Piece Inference [Release]

Post by dangi12012 »

Binary Neural Networks Sliding Piece Inference [Release]

This problem has bugged me for a while. While board evaluations using neural networks exist, there has been no publication that releases a trained binary neural network producing the sliding piece outputs. Until now!
So besides "By Calculation" and "By Occupancy Lookup" there is now a new method: "By Machine Learning".
This github repo can train BNNs (with a limited problem size) to 100% accuracy: https://github.com/Gigantua/Chess_BinaryNeuralNetwork

A neural network is normally implemented as an input vector multiplied by a weight matrix followed by an activation function, and it can learn all sorts of different things - as many of you know.

The lesser known and somewhat newer variant is the binary neural network. These are computationally much cheaper - but harder to train. Instead of each weight being a full 32-bit variable, each weight is a single bit.
This means that a 64-bit variable is an input vector of size 64, and the matrix multiplication happens against each bit of the weight matrix.
A normal dot product is defined as a·b = a1*b1 + ... + an*bn, which means that a 64-element row needs 64 multiplications and 64 additions per output.
In binary neural networks this dot product (horizontal reduction) collapses to a·b = popcount(a ^ b), which is just 2 instructions for a 64-element dot product!
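
To make that collapse concrete, here is a minimal scalar sketch (not taken from the repo; the exact bit encoding used there may differ). With inputs and weights restricted to +1/-1 and packed one element per bit, the whole 64-element reduction becomes one xor plus one popcount:

Code: Select all

#include <bit>
#include <cstdint>

// Conventional dot product: 64 multiplications and 64 additions.
static int dot_conventional(const int8_t* a, const int8_t* b) {
	int sum = 0;
	for (int i = 0; i < 64; i++) sum += a[i] * b[i];
	return sum;
}

// Binary dot product: bit set = +1, bit clear = -1.
// Bits that differ contribute -1, bits that agree contribute +1,
// so the result is 64 - 2 * popcount(a ^ b).
static int dot_binary(uint64_t a, uint64_t b) {
	return 64 - 2 * std::popcount(a ^ b);
}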

I won't go into more detail here - it's best to read an introduction: https://sushscience.wordpress.com/2017/ ... -networks/

The algorithm below replaces any and all code that would normally produce the up to 14 set bits that are the result of any sliding piece algorithm. Here the behaviour that infers the outputs is learned. There are 14 networks, one for each of the 14 bits per square.
These networks have learned how to transform the result of the bit extract into the correct input for the bit scatter! Normally a lookup for this operation takes 512kb of memory. The networks need 23kb!

Code: Select all

#include <immintrin.h>  // _pext_u64, _pdep_u64, AVX2 intrinsics
#include <bit>          // std::popcount
#include <cstdint>

static uint64_t Vector32(int sq, uint64_t occ, uint64_t gather, uint64_t scatter, uint32_t count, const uint8_t* weights) {
	// Gather the relevant occupancy bits and broadcast the 16-bit pext result to all lanes
	const __m256i input = _mm256_set1_epi16(static_cast<short>(_pext_u64(occ, gather)));
	uint32_t result = 0;

	// Binary neural network: [count sub-networks with this config] => 16 bit x 32 bit x 1 bit
	for (uint32_t i = 0; i < count; i++) {
		// First layer: xor the input with 32 bytes of binary weights; the ChessBNN helper
		// flags the byte lanes whose popcount is below 4 (per its name), and movemask
		// collects those lanes into a 32-bit mask
		uint32_t R1 = static_cast<uint32_t>(_mm256_movemask_epi8(ChessBNN::popcount8x32_SmallerThan4(
			_mm256_xor_si256(input, _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i * 32))))));

		// Second layer: reduce the 32 first-layer outputs to a single bit by majority vote
		result |= static_cast<uint32_t>(std::popcount(R1) > 16) << i;
	}

	// Scatter the learned output bits onto the attacked squares
	return _pdep_u64(result, scatter);
}
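
For illustration, a hypothetical call site could look like the snippet below. The per-square gather/scatter masks and the trained weight blob come out of the repo's training step; the table names used here (RookGatherMask, RookScatterMask, RookWeights) are placeholders, not the repo's actual identifiers.

Code: Select all

// Hypothetical wrapper - table names are placeholders, not part of the release.
uint64_t rook_attacks(int sq, uint64_t occ) {
	const uint64_t gather  = RookGatherMask[sq];     // squares whose occupancy matters (pext mask)
	const uint64_t scatter = RookScatterMask[sq];    // up to 14 reachable squares (pdep mask)
	const uint32_t bits    = std::popcount(scatter); // one trained sub-network per output bit
	return Vector32(sq, occ, gather, scatter, bits, RookWeights[sq]);
}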

The performance on x64 is not outstanding - but while implementing this I am looking forward to an implementation in CUDA, where there will be access to 2088 Tflops of compute for this type of operation.
https://on-demand.gputechconf.com/gtc/2 ... n-cuda.pdf
One CUDA TensorOp can fit many of these networks and could calculate the output of the code above in a single clock cycle (per tensor core!).
Even without CUDA one can see that a popcount(a ^ b) is much, much faster than 64 multiplications and 64 additions.

I trained 2 generally working networks:
1) Learned on the 64-bit occupancy & mask => 64 x 64 x 1 [per output bit]
It learned which bits in the output need to be set for all 4096 possible configurations per square.
The input is the masked occupancy and the outputs are the attacked bits directly.

2) Learned on the 16-bit pext result => 16 x 32 x 1 [per output bit]
It learned the correct translation between the gathered bitmask and the input for the scatter operation. (A scalar sketch of this layout follows below.)
This is much more efficient because the weights work from 16 bits of input (pext) into 16 bits of output (pdep).
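
Here is a scalar rendering of one such 16 x 32 x 1 sub-network, written to mirror what the AVX2 code above does lane by lane (assuming, per its name, that popcount8x32_SmallerThan4 flags the byte lanes whose popcount is below 4). It is a readability sketch, not the code the repo ships:

Code: Select all

// One output bit of a 16 x 32 x 1 binary network, scalar sketch.
// w points to the 32 weight bytes of this sub-network.
static bool infer_one_output_bit(uint16_t pext_input, const uint8_t* w) {
	uint32_t layer1 = 0;
	for (int n = 0; n < 32; n++) {
		// _mm256_set1_epi16 broadcasts the 16-bit input, so even byte lanes see its
		// low byte and odd byte lanes see its high byte
		const uint8_t in = (n & 1) ? uint8_t(pext_input >> 8) : uint8_t(pext_input);
		// a first-layer neuron fires when fewer than 4 of its 8 input bits disagree with its weights
		layer1 |= uint32_t(std::popcount<uint8_t>(uint8_t(in ^ w[n])) < 4) << n;
	}
	// second layer: majority vote over the 32 first-layer outputs
	return std::popcount(layer1) > 16;
}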


Obv. Questions:
How is the performance?
I improved the code with AVX2 to the point where it is similar in speed to well optimized loops like Dumb7Fill, and faster than naive loops!
There are no branches - and Nvidia has a native operation that can infer many more of these networks in a single clock cycle (per tensor core!).

Is it possible to create one network per square (instead of the 14 networks that are used now)?
Not with the current learning technique (too slow). But it should be a lot faster and smaller.

Is it possible to create one network for all 64 squares?
No, because you would need to test every possible occupancy to verify a 100% score.

Why not normal neural networks?
The native input format of the first layer of any network in chess machine learning code can go from char[64] to uint64_t, which is an 8x improvement in memory, and the operations to infer the output are also cheaper! (See above!)

Why movegen of all things - doesn't that seem dumb?
It is an obvious target to see how fast this can be made and whether learning can achieve 100.00% accuracy. I have definitely proven that 100% accuracy can be achieved. It is faster than many other algorithms even for movegen - and it can be the first layer of a bigger eval network starting from bitboards!

Can I use this to train other things?
Yes, you can get started with any mapping of input bits to output bits you want. This project targets 100% accuracy, but for many problems that is not needed. So any mapping - even with full 4x uint64_t Quadboards as input - can be trained directly to produce anything you need to know!

I really recommend checking it out on github, especially the vector version. 32 comparisons can be done in one operation and reduced to a bitmask with another: _mm256_movemask_epi8. Really efficient stuff compared to the serial code!
https://github.com/Gigantua/Chess_BinaryNeuralNetwork
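
As a standalone illustration of that trick (not code from the repo): 32 byte comparisons in one instruction, collapsed into a 32-bit mask by _mm256_movemask_epi8:

Code: Select all

#include <immintrin.h>
#include <cstdint>

// Compare 32 byte pairs at once and return one result bit per byte lane.
static uint32_t equal_mask32(const uint8_t* a, const uint8_t* b) {
	const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
	const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
	const __m256i eq = _mm256_cmpeq_epi8(va, vb);            // 0xFF where a[i] == b[i], else 0x00
	return static_cast<uint32_t>(_mm256_movemask_epi8(eq));  // one bit per byte lane
}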
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary Neural Networks Sliding Piece Inference [Release]

Post by dangi12012 »

TLDR:

I created a binary neural network which learned to transform uint16_t bits = _pext_u64(occ, gather) into the 14-bit solution that, when scattered with _pdep_u64, lands in exactly the right pattern to produce the result for every possible sliding piece + blocker configuration (up to 2^12 possible configurations per square). Normally 500kb of memory is needed for such a lookup - but the network learns and compresses the problem domain into 23kb.

If you were to write code that does a similar thing you would need bswap, shifting and masking. The network has no such operations available and only uses binary matrix multiplication with ReLU to achieve the same result!
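
For contrast, here is a sketch of the classical lookup that the network replaces (the table names are placeholders; a concrete engine would fill them per square): pext gathers the relevant occupancy into an index, a per-square table stores the up-to-14-bit answer, and pdep scatters it back onto the board.

Code: Select all

#include <immintrin.h>
#include <cstdint>

// Classical pext/pdep lookup - roughly 512kb of tables for rooks.
// gatherMask / scatterMask / table are placeholder per-square arrays.
static uint64_t attacks_by_lookup(int sq, uint64_t occ,
                                  const uint64_t* gatherMask,
                                  const uint64_t* scatterMask,
                                  const uint16_t* const* table) {
	const uint64_t idx = _pext_u64(occ, gatherMask[sq]);  // up to 2^12 entries per square
	return _pdep_u64(table[sq][idx], scatterMask[sq]);    // the BNN learns exactly this mapping
}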
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Binary Neural Networks Sliding Piece Inference [Release]

Post by dangi12012 »

Obligatory meme:

Question for the community here:

Does anyone here have experience with NVIDIA CUTLASS and has implemented a project with it?
https://github.com/NVIDIA/cutlass

PM me - or post here!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Luecx
Posts: 138
Joined: Thu Jun 18, 2020 9:20 pm
Full name: Finn Eggers

Re: Binary Neural Networks Sliding Piece Inference [Release]

Post by Luecx »

dangi12012 wrote: Tue Feb 15, 2022 3:07 pm Obligatory meme:

Question for the community here:

Does anyone here have experience with NVIDIA CUTLASS and has implemented a project with it?
https://github.com/NVIDIA/cutlass

PM me - or post here!
A few things first.

I implemented the Koivisto trainer in CUDA using cuBLAS.
cuBLAS is generally slightly faster than CUTLASS.

Besides that, using the GPU does not always mean it is going to be faster. A few things about the GPU:

- the clock speed is lower
- only use it for stuff which can be parallelised. Your example is fine for the GPU.

The most important fact about the GPU is the following:

- The GPU can do a huge amount of work, but ONLY if you just let it work. What this means is that any synchronization AND memory transfer between the CPU and the GPU will SLOW your engine down. So the bottleneck for any engine running a small network on the GPU is basically the memory. Every time you want to calculate something you need to tell the GPU, which will then start computing; it is a lot faster if you just let it compute while the CPU does other stuff. This does work when the networks grow very large (like Leela), but smaller networks do NOT benefit from using the GPU. (A sketch of the batching pattern follows below.)
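
For illustration, the batching pattern looks roughly like this (host-side C++ using the CUDA runtime and cuBLAS; buffer names, sizes and the 16k batch are illustrative, not Koivisto's actual trainer code). One big asynchronous copy and one GEMM per batch keep the GPU busy; copying and synchronizing per position would dominate the runtime.

Code: Select all

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: forward the first layer for one whole batch on a stream, no per-sample sync.
void forward_batch(cublasHandle_t handle, cudaStream_t stream,
                   const float* h_inputs,  // pinned host buffer, batch * inDim floats
                   float* d_inputs, const float* d_weights, float* d_outputs,
                   int batch /* e.g. 16384 */, int inDim, int outDim)
{
	// queue the whole batch in one transfer instead of one copy per position
	cudaMemcpyAsync(d_inputs, h_inputs, sizeof(float) * batch * inDim,
	                cudaMemcpyHostToDevice, stream);

	// one matrix multiply covers the first layer for the entire batch
	const float alpha = 1.0f, beta = 0.0f;
	cublasSetStream(handle, stream);
	cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
	            outDim, batch, inDim,
	            &alpha, d_weights, outDim, d_inputs, inDim,
	            &beta, d_outputs, outDim);

	// note: no cudaStreamSynchronize here - the CPU keeps preparing the next batch
	// and only synchronizes when it actually needs the results back
}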
The ability to speak does not make you intelligent. https://github.com/Luecx/Koivisto

Luecx
Posts: 138
Joined: Thu Jun 18, 2020 9:20 pm
Full name: Finn Eggers

Re: Binary Neural Networks Sliding Piece Inference [Release]

Post by Luecx »

As a sidenote, my GPU trainer is only faster because I queue the batches into the GPU and compute 16k fens at the same time. If I gave it each fen separately (we are only talking about the sparse inputs), I would 100% be like 10x slower than my CPU.
The ability to speak does not make you intelligent. https://github.com/Luecx/Koivisto
