This problem has bugged me for a while. While board evaluations using neural networks exist, there have been no publications that release a trained binary neural network producing the sliding-piece outputs. Until now!
So far there was "By Calculation" and "By Occupancy Lookup" - and now there is a new method: "By Machine Learning".
This GitHub repo can train BNNs (for a limited problem size) to 100% accuracy: https://github.com/Gigantua/Chess_BinaryNeuralNetwork
A neural network is normally implemented as an input vector multiplied by a weight matrix, followed by an activation function, and it can learn all sorts of different things - as many of you know.
The lesser-known and somewhat new variant is the binary neural network. These are computationally much cheaper - but harder to train. Instead of each weight being a full 32-bit variable, each weight is a single bit.
This means a single 64-bit variable holds an input vector of size 64, and the matrix multiplication happens bitwise against the weight matrix.
Also, a normal dot product is defined as a⋅b = a1*b1 + ... + an*bn - which means that a 64-element vector needs 64 multiplications and 64 additions per output.
In binary neural networks this dot product (a horizontal reduction) collapses to a⋅b = popcount(a^b), which is just 2 instructions for a 64-element dot product!
I won't go into more detail here - it's best to read an introduction: https://sushscience.wordpress.com/2017/ ... -networks/
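To make the difference concrete, here is a minimal scalar sketch (mine, not from the repo) of both reductions. It assumes the common convention of reading each bit as -1 or +1, under which the whole dot product collapses to the popcount mentioned above:

Code:
#include <bit>
#include <cstdint>

// Conventional dot product over 64 inputs: 64 multiplications + 64 additions.
float dot_dense(const float* a, const float* b) {
    float sum = 0.0f;
    for (int i = 0; i < 64; i++) sum += a[i] * b[i];
    return sum;
}

// Binary dot product: popcount(a ^ b) counts disagreeing positions,
// so with bits read as {-1, +1} the 64-element dot product is
// 64 - 2 * popcount(a ^ b) - one XOR and one popcount instruction.
int dot_binary(uint64_t a, uint64_t b) {
    return 64 - 2 * std::popcount(a ^ b);
}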
The algorithm below replaces any and all code that would normally produce the up to 14 set bits a sliding-piece algorithm outputs. Here, the behaviour that infers the outputs is learned. There are 14 networks per square, one for each of the up to 14 output bits.
These networks have learned how to transform the result of bit extract (pext) into the correct input for bit scatter (pdep)! Normally a lookup table for this operation takes 512 KB of memory. The networks need 23 KB!
Code:
#include <immintrin.h>   // AVX2 intrinsics, _pext_u64 / _pdep_u64
#include <bit>           // std::popcount
#include <cstdint>

static uint64_t Vector32(int sq, uint64_t occ, uint64_t gather, uint64_t scatter, uint32_t count, const uint8_t* weights) {
    // Gather the relevant occupancy bits and broadcast the 16-bit result to all lanes
    const __m256i input = _mm256_set1_epi16(_pext_u64(occ, gather));
    uint64_t result = 0;
    // Binary neural network: [count networks with this config] => 16 bit x 32 bit x 1 bit
    for (uint32_t i = 0; i < count; i++) {
        // XOR the input against 32 weight bytes, flag close-matching bytes, collect the flags
        int R1 = _mm256_movemask_epi8(ChessBNN::popcount8x32_SmallerThan4(
            _mm256_xor_si256(input, _mm256_load_si256(reinterpret_cast<const __m256i*>(weights + i * 32)))));
        // First-layer reduction of the 32 hidden bits into a single output bit (majority vote)
        result |= (uint64_t)(std::popcount<uint32_t>(R1) > 16) << i;
    }
    return _pdep_u64(result, scatter);
}
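For readers who do not think in AVX2, here is a scalar sketch of what that loop computes. It is my reconstruction, not repo code, and it assumes popcount8x32_SmallerThan4 flags each byte whose popcount is below 4 (which _mm256_movemask_epi8 then collects):

Code:
#include <immintrin.h>   // _pext_u64 / _pdep_u64
#include <bit>
#include <cstdint>

// Hypothetical scalar equivalent of Vector32, for illustration only.
static uint64_t Scalar32(uint64_t occ, uint64_t gather, uint64_t scatter,
                         uint32_t count, const uint8_t* weights) {
    const uint16_t input = (uint16_t)_pext_u64(occ, gather);
    uint64_t result = 0;
    for (uint32_t i = 0; i < count; i++) {        // one network per output bit
        const uint8_t* w = weights + i * 32;      // 32 weight bytes per network
        int active = 0;
        for (int j = 0; j < 32; j++) {            // 32 hidden units
            // The broadcast 16-bit input alternates low/high byte across lanes.
            uint8_t in_byte = (uint8_t)(input >> ((j & 1) * 8));
            // A hidden unit fires when its byte of (input ^ weight) is a close match.
            active += std::popcount((uint8_t)(in_byte ^ w[j])) < 4;
        }
        result |= (uint64_t)(active > 16) << i;   // majority vote => output bit
    }
    return _pdep_u64(result, scatter);
}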
The performance on x64 is not outstanding - but while implementing this I am looking forward to a CUDA implementation, where there will be access to 2088 Tflops of compute for this type of operation.
https://on-demand.gputechconf.com/gtc/2 ... n-cuda.pdf
One CUDA tensor op can fit many of these networks and could calculate an output like the one above in a single clock cycle (per tensor core!).
Even without CUDA one can see that popcount(a^b) is much, much faster than 64 multiplications and 64 additions.
I trained two general, working network variants:
1) Learned on the 64-bit occupancy & mask => 64 x 64 x 1 [per output bit] (sketched below, after variant 2)
It learned to calculate which output bits need to be set for all 4096 possible configurations per square.
The input is the masked occupancy and the outputs are the attacked bits directly.
2) Learned on the 16-bit pext result => 16 x 32 x 1 [per output bit]
It learned the correct translation between the gathered bitmask and the input for the scatter operation.
This is much more efficient because the weights work from 16 bits of input (pext) into 16 bits of output (pdep).
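To picture variant 1 without intrinsics, here is a minimal scalar sketch of that 64 x 64 x 1 topology. It is my illustration, not code from the repo, and the firing and majority thresholds are invented placeholders; the trained networks store their own learned parameters:

Code:
#include <bit>
#include <cstdint>

// Hedged sketch of variant 1 (64 x 64 x 1 per output bit): each hidden unit
// compares the 64-bit masked occupancy against a 64-bit weight row, and the
// output bit is a majority vote over the hidden units. The thresholds
// (< 32, > 32) stand in for the learned parameters.
static uint64_t Infer64x64x1(uint64_t masked_occ, const uint64_t* weights) {
    uint64_t out = 0;
    for (int bit = 0; bit < 64; bit++) {              // one network per output bit
        const uint64_t* row = weights + bit * 64;     // 64 hidden units, 64 bits each
        int active = 0;
        for (int h = 0; h < 64; h++)
            active += std::popcount(masked_occ ^ row[h]) < 32;  // unit fires on close match
        out |= (uint64_t)(active > 32) << bit;        // majority vote => output bit
    }
    return out;
}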

Obvious questions:
How is the performance?
I improved the code with AVX2 to the point where it is similar in speed to very optimized loops like Dumb7Fill, and faster than naive loops!
There are no branches - and Nvidia has a native operation that can infer many more networks in a single clock cycle (per tensor core!).
Is it possible to create one network for one square (and not the 14 networks that are used now)?
Not with the current learning technique (too slow). But it should be a lot faster and smaller.
Is it possible to create one network for all 64 squares?
No, because verifying a 100% score requires testing every possible occupancy, and a single network for all squares would have far too many configurations to enumerate.
Why not normal neural networks?
The native input of the first layer in any chess machine-learning code can go from char[64] to uint64_t, which is an 8x improvement in memory, and the operations to infer the output are also cheaper (see above)!
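Concretely (my illustration, not repo code):

Code:
#include <cstdint>

int8_t   planes[64];  // conventional first-layer input: one value per square, 64 bytes
uint64_t plane_bits;  // bitboard first-layer input: one bit per square, 8 bytes (8x smaller)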
Why even target movegen, which seems dumb?
It's an obvious target for seeing how fast it can be made and whether learning can achieve 100.00% accuracy. I have definitely proven that 100% accuracy can be achieved. It's faster than many other algorithms even for movegen - and it can be the first layer of a bigger eval network starting from bitboards!
Can I use this to train other things?
Yes - you can get started with any mapping of input bits to output bits you want. This project targets 100% accuracy, but for many problems that is not needed. So any mapping - even from full 4x uint64_t Quadboards - can be used directly as input and trained to produce anything you need to know!
I really recommend checking it out on GitHub - especially the vector version. 32 comparisons can be done in one operation, and these can be reduced to a bitmask with another: _mm256_movemask_epi8. Really efficient stuff compared to the serial code! (See the snippet after the link.)
https://github.com/Gigantua/Chess_BinaryNeuralNetwork
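That compare-and-movemask pattern in isolation (a generic illustration, not repo code):

Code:
#include <immintrin.h>
#include <cstdint>

// 32 byte-wise comparisons in a single instruction, then reduced to a
// 32-bit mask by a second one - the core trick in the vector version.
uint32_t compare32(const uint8_t* a, const uint8_t* b) {
    __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    __m256i eq = _mm256_cmpeq_epi8(va, vb);       // 32 comparisons at once
    return (uint32_t)_mm256_movemask_epi8(eq);    // one bit per byte lane
}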