Original Slider Lookup - by Hardware Synthesis

dangi12012 · Post by **dangi12012** » Thu Sep 23, 2021 12:20 pm

While writing the highest move/s chess generator I could think of: http://www.talkchess.com/forum3/viewtop ... =7&t=78230
I went through multiple iterations of what would become a pext lookup table. But before that I experimented with loops and hardware synthesis tools.

TLDR: Compile and run the cpp under Slidertest and lets discuss performance! https://gigantua.page.link/Download
- its one of the many things I tried and performance is not like seed hashing or pext lookup

There is a hardware synthesis tool (or boolean minimisation tool) called Espresso heuristic logic minimizer (written in 1993)
It is the only algorithm which can compress binary input tables (which are perfect for chess) in a range of 64 bit if we give it enough dont care bits.
It does so by transforming cubes of higher dimensions like a karnaugh map. It goes way over my head!
But its a good tool for chess!

For example a Rook at position 1 would need a text file with 2^11 entries which would correspond to all possible blocker configurations as input. And the squares it cannot see are dont care bits or -.

This software can generate a very good minimisation (like you maybe learned in school with boolean minimisation) of any boolean input term.
Anyways I created a windows compatible version of the tool

I seperated all 4 ray directions for simplicity and the tool generated garbled output which can be translated to c++.
A rook at position 0:
result = 2ull;
if ((occupy & 2ull) == 0) result |= 4ull;
if ((occupy & 6ull) == 0) result |= 8ull;
if ((occupy & 14ull) == 0) result |= 16ull;
if ((occupy & 30ull) == 0) result |= 32ull;
if ((occupy & 62ull) == 0) result |= 64ull;
if ((occupy & 126ull) == 0) result |= 128ull;

Empirically its faster to have if instead of if else in the general case - because for other squares the lookup goes into two directions.
Anyways please download the CPP file here and tell me what performance you can get on your machine!

Espresso_Windows build will come to github soon! - also the tool to convert logic functions into a c++ method.
Download compile and run - lets discuss!
Folder SliderTest:
https://gigantua.page.link/Download

.i 64
.o 64
.type fr
.ilb a63 a62 a61 a60 a59 a58 a57 a56 a55 a54 a53 a52 a51 a50 a49 a48 a47 a46 a45 a44 a43 a42 a41 a40 a39 a38 a37 a36 a35 a34 a33 a32 a31 a30 a29 a28 a27 a26 a25 a24 a23 a22 a21 a20 a19 a18 a17 a16 a15 a14 a13 a12 a11 a10 a09 a08 a07 a06 a05 a04 a03 a02 a01 a00
.ob b63 b62 b61 b60 b59 b58 b57 b56 b55 b54 b53 b52 b51 b50 b49 b48 b47 b46 b45 b44 b43 b42 b41 b40 b39 b38 b37 b36 b35 b34 b33 b32 b31 b30 b29 b28 b27 b26 b25 b24 b23 b22 b21 b20 b19 b18 b17 b16 b15 b14 b13 b12 b11 b10 b09 b08 b07 b06 b05 b04 b03 b02 b01 b00
-----0-------0--0----0---0---0----0--0-----0-0-0----000-00000-00|0000010000000100100001000100010000100100000101010000111011111011
-----0-------0--0----0---0---0----0--0-----0-0-0----000-00000-01|0000010000000100100001000100010000100100000101010000111011111011
-----0-------0--0----0---0---0----0--0-----0-0-0----000-00000-10|0000010000000100100001000100010000100100000101010000111011111010
-----0-------0--0----0---0---0----0--0-----0-0-0----000-00000-11|0000010000000100100001000100010000100100000101010000111011111010
-----0-------0--0----0---0---0----0--0-----0-0-0----000-00001-00|0000010000000100100001000100010000100100000101010000111000001011
.....2^11 more entries

dangi12012 · Post by **dangi12012** » Thu Sep 23, 2021 12:21 pm

Yes I know its a slow lookup - but I think its faster than most loop implementations. And it can be used as a compiletime constexpr check evaluator for any fen string.

Running this on Ryzen 5950X and Microsoft MSVC yields:
PERF SWITCH: 137.741 MLookups/s

klx · Post by **klx** » Thu Sep 23, 2021 10:06 pm

Two years effort, and it takes me 20 seconds to come up with something faster. (No offense.)

Original code:

Code: Select all

result = 2ull;
if ((occupy & 2ull) == 0) result |= 4ull;

Emanuel's Heuristic Logic Minimizer:

Code: Select all

result = 6 - (occupy<<1&4);

amanjpro · Post by **amanjpro** » Thu Sep 23, 2021 10:51 pm

klx wrote: ↑Thu Sep 23, 2021 10:06 pm Two years effort, and it takes me 20 seconds to come up with something faster. (No offense.)

Original code:
Code: Select all
result = 2ull;
if ((occupy & 2ull) == 0) result |= 4ull;
Emanuel's Heuristic Logic Minimizer:
Code: Select all
result = 6 - (occupy<<1&4);

Go season yourself

dangi12012 · Post by **dangi12012** » Thu Sep 23, 2021 11:06 pm

klx wrote: ↑Thu Sep 23, 2021 10:06 pm Two years effort, and it takes me 20 seconds to come up with something faster. (No offense.)

Original code:
Code: Select all
result = 2ull;
if ((occupy & 2ull) == 0) result |= 4ull;
Emanuel's Heuristic Logic Minimizer:
Code: Select all
result = 6 - (occupy<<1&4);

Do you mind doing that for the other 255 branches too? I want to see if its even 1% faster. In compiler explorer the if gets compiled away into something like your code anyways. Therese is not a single jmpeq instruction with -O2.

I wrote here because I am sure that this can be done faster and with less branching. I dont mind its slowness as I already found the optimal slider lookup which is PEXT at the moment with 4 instructions in total - and 8 for the queen without any conditionals.

Code: Select all

slider(unsigned long, unsigned long):
        sal     rdi, 4
        mov     rax, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi]
        pext    rsi, rsi, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi+8]
        mov     rax, QWORD PTR [rax+rsi*8]
        ret

R. Tomasi · Post by **R. Tomasi** » Thu Sep 23, 2021 11:21 pm

dangi12012 wrote: ↑Thu Sep 23, 2021 11:06 pm I wrote here because I am sure that this can be done faster and with less branching. I dont mind its slowness as I already found the optimal slider lookup which is PEXT at the moment with 4 instructions in total - and 8 for the queen without any conditionals.
Code: Select all
slider(unsigned long, unsigned long):
        sal     rdi, 4
        mov     rax, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi]
        pext    rsi, rsi, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi+8]
        mov     rax, QWORD PTR [rax+rsi*8]
        ret

I don't mean to be rude... but that's not a new invention: https://www.chessprogramming.org/BMI2#PEXTBitboards. On my CPU, unfortunately, it's slower than magic bitboards. I'm running against 11th Gen Intel(R) Core(TM) i7-1185G7 and 10th Gen Intel(R) Core(TM) i9 10900K.

dangi12012 · Post by **dangi12012** » Thu Sep 23, 2021 11:55 pm

R. Tomasi wrote: ↑Thu Sep 23, 2021 11:21 pm
dangi12012 wrote: ↑Thu Sep 23, 2021 11:06 pm I wrote here because I am sure that this can be done faster and with less branching. I dont mind its slowness as I already found the optimal slider lookup which is PEXT at the moment with 4 instructions in total - and 8 for the queen without any conditionals.
Code: Select all
slider(unsigned long, unsigned long):
        sal     rdi, 4
        mov     rax, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi]
        pext    rsi, rsi, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi+8]
        mov     rax, QWORD PTR [rax+rsi*8]
        ret
I don't mean to be rude... but that's not a new invention: https://www.chessprogramming.org/BMI2#PEXTBitboards. On my CPU, unfortunately, it's slower than magic bitboards. I'm running against 11th Gen Intel(R) Core(TM) i7-1185G7 and 10th Gen Intel(R) Core(TM) i9 10900K.

I know that. I wanted to share espresso - which can do magic. Just give it any boolean input table with dont cares and it will compact it into the smallest form. Its a np hard problem. And Chess is a 64 bit input table. - it can generate the optimal code for so many bitpattern problems you throw at it - its not even funny.

Also thx for sharing the link to the wiki. Its not the optimal way.
Heres why its not optimal:

Code: Select all

return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
 
//Can be simplified to 
return Attack[sq][_pext_u64(occ, arrRookMask[sq])];

Boom you just saved 1 add instruction where the wiki told you "do it like that"
where Attack pointer[64] gets initialized to arrAttacks + arrRookBase[0..64] BEFORE execution. Check it out in compiler explorer

R. Tomasi · Post by **R. Tomasi** » Fri Sep 24, 2021 12:02 am

dangi12012 wrote: ↑Thu Sep 23, 2021 11:55 pm
R. Tomasi wrote: ↑Thu Sep 23, 2021 11:21 pm
dangi12012 wrote: ↑Thu Sep 23, 2021 11:06 pm I wrote here because I am sure that this can be done faster and with less branching. I dont mind its slowness as I already found the optimal slider lookup which is PEXT at the moment with 4 instructions in total - and 8 for the queen without any conditionals.
Code: Select all
slider(unsigned long, unsigned long):
        sal     rdi, 4
        mov     rax, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi]
        pext    rsi, rsi, QWORD PTR Chess_Lookup::Pext_RookAttacks[rdi+8]
        mov     rax, QWORD PTR [rax+rsi*8]
        ret
I don't mean to be rude... but that's not a new invention: https://www.chessprogramming.org/BMI2#PEXTBitboards. On my CPU, unfortunately, it's slower than magic bitboards. I'm running against 11th Gen Intel(R) Core(TM) i7-1185G7 and 10th Gen Intel(R) Core(TM) i9 10900K.
I know that. I wanted to share espresso - which can do magic. Just give it any boolean input table with dont cares and it will compact it into the smallest form. Its a np hard problem. And Chess is a 64 bit input table. - it can generate the optimal code for so many bitpattern problems you throw at it - its not even funny.

Also thx for sharing the link to the wiki. Its not the optimal way.
Heres why its not optimal:
Code: Select all
return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
 
//Can be simplified to 
return Attack[sq][_pext_u64(occ, arrRookMask[sq])];
Boom you just saved 1 add instruction where the wiki told you "do it like that"
where Attack pointer[64] gets initialized to arrAttacks + arrRookBase[0..64] BEFORE execution. Check it out in compiler explorer

The arrRookBase[sq] lookup is just the offset into the array, which is implicitely there in your simplified version aswell (since it uses jagged arrays). The reason the wiki code says "do it that way" is because the inner arrays have different lengths. The wiki version needs less memory and is more cache-friendly - hard to tell which is faster without measuring, but usually the memory access is the bottleneck and not the number of instructions. Odds are, the wiki version is actually faster on most hardware. The guys who came up with the pseudo-code in the wiki are no fools, you should keep that in mind before quickly jumping to conclusions.

dangi12012 · Post by **dangi12012** » Fri Sep 24, 2021 12:24 am

The arrRookBase[sq] lookup is just the offset into the array, which is implicitely there in your simplified version aswell (since it uses jagged arrays). The reason the wiki code says "do it that way" is because the inner arrays have different lengths. The wiki version needs less memory and is more cache-friendly - hard to tell which is faster without measuring, but usually the memory access is the bottleneck and not the number of instructions. Odds are, the wiki version is actually faster on most hardware. The guys who came up with the pseudo-code in the wiki are no fools, you should keep that in mind before quickly jumping to conclusions.

Thats what made me so happy. I though about this for 2 years to come up with literally hundreds of different implementations for movegen in C# java and C++ and about 12 real perfT performance relevant ideas to take it to a new level. After looking into the forums and the wiki the first time 8 or 9 were discussed - the other 3-4 new ideas will be in my codeproject article. and github. Soon!

I think you didnt understand what I meant. Its literally the same array. Its literally the same offsets.
If you think about it you will see that the addition can be moved to the init phase. Yes I understand its jagged. Yes its good if its small. No its not optimal in the wiki...

R. Tomasi · Post by **R. Tomasi** » Fri Sep 24, 2021 12:34 am

dangi12012 wrote: ↑Fri Sep 24, 2021 12:24 am I think you didnt understand what I meant. Its literally the same array. Its literally the same offsets.
If you think about it you will see that the addition can be moved to the init phase. Yes I understand its jagged. Yes its good if its small. No its not optimal in the wiki...

If it's in the done in init phase, then both versions are equivalent: both will compile to LOAD+LEA. If you want the LOAD to be a compile-time constant in the cases where sq is known at compile-time, then just mark the arrRookBase as constexpr...

The only difference is that the wiki makes it explicitly known to someone who reads the code, that the tables for the individual squares may have different lengths. Which seem's like a good idea, since the pseude-code is supposed to be educational.

Original Slider Lookup - by Hardware Synthesis

Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis

Re: Original Slider Lookup - by Hardware Synthesis