https://github.com/Gigantua/Chess_Movegen
This was an algorithm I found when trying to remove all branches from a mailslot engine. When a square is occupied by a white piece the mailslot engine would normally need to test if the target square is occupied or of the enemies color to generate a valid move. But this can be done in a single operation (even when 8 squares would normally need to be tested).
That means for knights an array of 2^8 entries per square would have to be generated which takes occupation and color into account to direcly output the correct result. But how to index into that 2^8 array? Well since the target squares of the knight are what affects the lookup - the colored occupancy of those target squares become the input. But looking up 8 squares and generating an index is the same thing as testing 8 bits?
That brought me to the idea to do the same exact thing but incrementally. Each square that changes its occupancy from 0 to 1 only needs to change the 8 corresponding bits in the squares it can see. This is the reverse lookup per square. This doesnt test which squares are occupied - but a change of occupation updates all relevant squares. This is totally branchless and requires not 64 bit math!
A single makemove operation which does not contain branches (just setting 8 bits in 8 different squares) - all 64 squares can now be looked up for free! Literally a single lookup with a constant index (per position)
This idea can be expanded to sliders with another trick. Sliders see up to 14 squares but still no branches are needed. During makemove a single function is called that updates the corresponding 1 bits in all 14 squares. This uint16_t cfg[64] now knows all indices for all 64 squares!
Now the trick: This cfg can be chosen to be exactly the same result as the PEXT operation would yield with the mask. So to speak the hypercube lookup is calculating the pext result for all 64 squares simultaniously and without any branches - incrementally per move! This is binary compatible with PEXT - so this could be used in conjunction with pext as a fallback to still enable fast sliding lookup on machines that dont have fast 64 bit multiplication or parallel bit extract. There is no dependency on IMUL64, SHIFT, COUNTLZ etc which dont exist in some hardware or on gpus!
This is the lookup code that indexes into a normal pext lookup table -
Notice how occupancy is not needed as a parameter for this incremental lookup code:
Code: Select all
static constexpr uint64_t Rook(int sq) {
return RookMoves[sq][cfgR[sq]];
}
Square occupancy changes the unique configuration of all squares that are affected by this square. This does make edge squares mostly free since the border squares are only affected by border slider squares. Thats why its called hypercube. Each square connects to other squares - only one constant function is called to update some values and then each rook attack on the board becomes a single IMOV per rook.
The best part: When picking a "unique configuration" i picked the very same numbers that a PEXT lookup would result into. So this lookup is binary compatible with a pext lookup table and can do Rooks, RookXRay, Bishop, BishopXRay in a single MOV operation!
The second best part: This algorithm has a seperate generator project - which can emit sourcecode for knights, kings, slider, etc - which take color into consideration. So instead of testing each target square of a knight for "enemy or empty" this can be skipped completely to only get legal squares without any conditions in code.
The third best part: Since the lookup code does have every square indexed with its legal subset - it is possible to not store target squares but directly the correct form of the move struct and completely skip any notion of bit shifting into a legal 16bit representation of a move.
You could also chose to have seperate lookups for taking and silent moves which is the very best thing wanted for engines!
Result code and comparison of different games: (Table size and dependency on special instructions printed too)
Code: Select all
AMD Ryzen 9 5950X 16-Core Processor
Megalooks Simulated Game/s:
Exploading: 150.89MOps 6 kB Optimal perf: imul64
Reference: 68.93MOps 8 kB Optimal perf: none
KoggeStone: 111.98MOps 0 kB Optimal perf: none
RotatedBoard: 92.37MOps 14 kB Optimal perf: none
QBB Algo: 171.72MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 211.32MOps 8 kB Optimal perf: countr_zero, countl_zero
SlideArithm: 256.04MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
XorRookSub: 297.78MOps 2 kB Optimal perf: bswap
Hash Variable: 399.36MOps 729 kB Optimal perf: imul64
Hash Plain: 529.61MOps 2306 kB Optimal perf: imul64
Hash Fancy: 597.36MOps 694 kB Optimal perf: imul64
Pext : 925.24MOps 843 kB Optimal perf: pext_u64
HyperCube: 310.30MOps 841 kB Optimal perf: none
In the thread below I had the claim that it is faster than fany magic - which is not a true statement generally. BUT: when your movegenerator does know the square which you want to know the sliders for (for example in a 0...64 loop or hardcoded sq=23) then it is the fastest movegen in my tests.
References: http://www.talkchess.com/forum3/viewtop ... =7&t=78843
TODOS:
More processors tested
Mailslot Test
Multithreading Test
CUDA Translation
If you read this please just post your results - this should be interesting.
Also if you have an algo that is not mentioned here please pm me - or create a PR - or contact me under daniel.infuehr@live.de