Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

dangi12012 · Post by **dangi12012** » Fri Dec 31, 2021 1:02 am

Here is a git repo for most currently known algorithms including hyperchess lookup:
https://github.com/Gigantua/Chess_Movegen

This was an algorithm I found when trying to remove all branches from a mailslot engine. When a square is occupied by a white piece the mailslot engine would normally need to test if the target square is occupied or of the enemies color to generate a valid move. But this can be done in a single operation (even when 8 squares would normally need to be tested).

That means for knights an array of 2^8 entries per square would have to be generated which takes occupation and color into account to direcly output the correct result. But how to index into that 2^8 array? Well since the target squares of the knight are what affects the lookup - the colored occupancy of those target squares become the input. But looking up 8 squares and generating an index is the same thing as testing 8 bits?

That brought me to the idea to do the same exact thing but incrementally. Each square that changes its occupancy from 0 to 1 only needs to change the 8 corresponding bits in the squares it can see. This is the reverse lookup per square. This doesnt test which squares are occupied - but a change of occupation updates all relevant squares. This is totally branchless and requires not 64 bit math!
A single makemove operation which does not contain branches (just setting 8 bits in 8 different squares) - all 64 squares can now be looked up for free! Literally a single lookup with a constant index (per position)

This idea can be expanded to sliders with another trick. Sliders see up to 14 squares but still no branches are needed. During makemove a single function is called that updates the corresponding 1 bits in all 14 squares. This uint16_t cfg[64] now knows all indices for all 64 squares!

Now the trick: This cfg can be chosen to be exactly the same result as the PEXT operation would yield with the mask. So to speak the hypercube lookup is calculating the pext result for all 64 squares simultaniously and without any branches - incrementally per move! This is binary compatible with PEXT - so this could be used in conjunction with pext as a fallback to still enable fast sliding lookup on machines that dont have fast 64 bit multiplication or parallel bit extract. There is no dependency on IMUL64, SHIFT, COUNTLZ etc which dont exist in some hardware or on gpus!

This is the lookup code that indexes into a normal pext lookup table -
Notice how occupancy is not needed as a parameter for this incremental lookup code:

Code: Select all

static constexpr uint64_t Rook(int sq) {
	return RookMoves[sq][cfgR[sq]];
}

In summary:
Square occupancy changes the unique configuration of all squares that are affected by this square. This does make edge squares mostly free since the border squares are only affected by border slider squares. Thats why its called hypercube. Each square connects to other squares - only one constant function is called to update some values and then each rook attack on the board becomes a single IMOV per rook.

The best part: When picking a "unique configuration" i picked the very same numbers that a PEXT lookup would result into. So this lookup is binary compatible with a pext lookup table and can do Rooks, RookXRay, Bishop, BishopXRay in a single MOV operation!

The second best part: This algorithm has a seperate generator project - which can emit sourcecode for knights, kings, slider, etc - which take color into consideration. So instead of testing each target square of a knight for "enemy or empty" this can be skipped completely to only get legal squares without any conditions in code.

The third best part: Since the lookup code does have every square indexed with its legal subset - it is possible to not store target squares but directly the correct form of the move struct and completely skip any notion of bit shifting into a legal 16bit representation of a move.
You could also chose to have seperate lookups for taking and silent moves which is the very best thing wanted for engines!

Result code and comparison of different games: (Table size and dependency on special instructions printed too)

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
Megalooks Simulated Game/s:
Exploading:     150.89MOps      6 kB    Optimal perf: imul64
Reference:      68.93MOps       8 kB    Optimal perf: none
KoggeStone:     111.98MOps      0 kB    Optimal perf: none
RotatedBoard:   92.37MOps       14 kB   Optimal perf: none
QBB Algo:       171.72MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        211.32MOps      8 kB    Optimal perf: countr_zero, countl_zero
SlideArithm:    256.04MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
XorRookSub:     297.78MOps      2 kB    Optimal perf: bswap
Hash Variable:  399.36MOps      729 kB  Optimal perf: imul64
Hash Plain:     529.61MOps      2306 kB Optimal perf: imul64
Hash Fancy:     597.36MOps      694 kB  Optimal perf: imul64
Pext  :         925.24MOps      843 kB  Optimal perf: pext_u64
HyperCube:      310.30MOps      841 kB  Optimal perf: none

In the thread below I had the claim that it is faster than fany magic - which is not a true statement generally. BUT: when your movegenerator does know the square which you want to know the sliders for (for example in a 0...64 loop or hardcoded sq=23) then it is the fastest movegen in my tests.
References: http://www.talkchess.com/forum3/viewtop ... =7&t=78843

TODOS:
More processors tested
Mailslot Test
Multithreading Test
CUDA Translation

If you read this please just post your results - this should be interesting.
Also if you have an algo that is not mentioned here please pm me - or create a PR - or contact me under daniel.infuehr@live.de

abulmo2 · Post by **abulmo2** » Fri Dec 31, 2021 11:23 am

This was an algorithm I found when trying to remove all branches from a mailslot engine.

What is a mailslot engine? I made a query on https://www.chessprogramming.org/index. ... arch&go=Go without success.

tcusr · Post by **tcusr** » Fri Dec 31, 2021 11:58 am

abulmo2 wrote: ↑Fri Dec 31, 2021 11:23 am
This was an algorithm I found when trying to remove all branches from a mailslot engine.
What is a mailslot engine? I made a query on https://www.chessprogramming.org/index. ... arch&go=Go without success.

he calls mailbox engines mailslot

AndrewGrant · Post by **AndrewGrant** » Fri Dec 31, 2021 2:37 pm

As suspected, useless.

brianr · Post by **brianr** » Fri Dec 31, 2021 3:38 pm

AndrewGrant wrote: ↑Fri Dec 31, 2021 2:37 pm As suspected, useless.

Just wondering why (it is code after all).
Thanks.

mar · Post by **mar** » Fri Dec 31, 2021 4:27 pm

brianr wrote: ↑Fri Dec 31, 2021 3:38 pm Just wondering why (it is code after all).
Thanks.

because it's 4MB worth of precomputed tables and code. unfortunately when used in a real world scenario - as dangi himself confirms - it's slower than magics
still an interesting approach

brianr · Post by **brianr** » Fri Dec 31, 2021 4:39 pm

mar wrote: ↑Fri Dec 31, 2021 4:27 pm
brianr wrote: ↑Fri Dec 31, 2021 3:38 pm Just wondering why (it is code after all).
Thanks.
because it's 4MB worth of precomputed tables and code. unfortunately when used in a real world scenario - as dangi himself confirms - it's slower than magics
still an interesting approach

How about the pext version?
I doubt if my old 1920X Threadripper supports that other than via emulation, and I don't have MSVC 2020 installed (still on 2019 version), so I can't easily compile the solution (or I could update my Ubuntu config as an option). Just curious, but not enough to bother with those updates. Thanks.

dangi12012 · Post by **dangi12012** » Fri Dec 31, 2021 5:10 pm

mar wrote: ↑Fri Dec 31, 2021 4:27 pm
brianr wrote: ↑Fri Dec 31, 2021 3:38 pm Just wondering why (it is code after all).
Thanks.
because it's 4MB worth of precomputed tables and code. unfortunately when used in a real world scenario - as dangi himself confirms - it's slower than magics
still an interesting approach

4mb of code and not table size!

Performance is hardware dependent. My approach does not depend on 64 bit imul64 or pext and can still lookup into a binary compatible pext table.

Its really a hardcoded 16 bit OR instruction on up to 14 squares.
So for gpus which are 32 bits or 16/32 bit cpus this is it!

leanchess · Post by **leanchess** » Sat Jan 01, 2022 2:58 pm

Cheers! Looks very interesting, but a bit too complex for New Year's me.

However, I believe I spotted a minor omission (apparently localised to Main.cpp): as rand()'s return values range between 0 and (1<<31)-1, rand64()'s will always have their bits 31 and 63 cleared.

Sven · Post by **Sven** » Sat Jan 01, 2022 3:12 pm

leanchess wrote: ↑Sat Jan 01, 2022 2:58 pm Cheers! Looks very interesting, but a bit too complex for New Year's me.

However, I believe I spotted a minor omission (apparently localised to Main.cpp): as rand()'s return values range between 0 and (1<<31)-1, rand64()'s will always have their bits 31 and 63 cleared.

The range of values returned by rand() depends on the constant RAND_MAX. "This value is library-dependent, but is guaranteed to be at least 32767 on any standard library implementation." https://www.cplusplus.com/reference/cstdlib/RAND_MAX/ So even the assumption of having 31 "real" bits can be wrong.

Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]

Re: Hypercube Slider Lookup - New Sliding Piece Algorithm [RELEASE]