Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

dangi12012 · Post by **dangi12012** » Fri Apr 15, 2022 1:33 pm

Given the 8 ray masks for any of the 64 board squares what is the optimal solution given a number of C++ constants and functions?

I wrote a optimized sifting tool that enumerates all possible C++ Syntax trees with a given node count in order starting from the smallest tree (less code) up to bigger trees.
16 tokens are supported (binary functions, unary functions, constants, paramters) but in theory there is no limit.

Here is the list of 16 tokens used for this attempt at the optimal sliding piece algorithm:

Code: Select all

1,2,56, 
square, occ, raymask
std::popcount, std::countl_zero, std::countr_zero, bit_reverse,
(arg0 | arg1) (arg0 & ~arg1) (arg0 ^ arg1) (arg0 - arg1) (arg0 << arg1) (arg0 >> arg1)

The tool can realistically enumerate (and execute) up to 16^11 possible code trees with any 16 tokens overnight.
That is any solution with 8 tokens or less are basically solved instantly and bigger trees (16^11 = 1.7592186e+13) can be done overnight.

I am very proud of this achievement since this tool can solve basically any bit transformation problem with a handful of instructions. But skill and intuition is still required to know which kind of binary operations and functions can solve a problem.

Here is the proven minimum solution for the 8 ray problem (with the tokens above):

Code: Select all

//HO UPPER: (occ ^ (occ - 1ull))
//HO LOWER: (occ ^ bit_reverse((bit_reverse(occ) - 1ull)))

//VE UPPER: ((occ - 1ull) << 1ull)
//VE LOWER: (bit_reverse((bit_reverse(occ) - 1ull)) >> 1ull)

//D1 UPPER: ((occ - 1ull) << 1ull)
//D1 LOWER: (bit_reverse((bit_reverse(occ) - 1ull)) >> 1ull)

//D2 UPPER: ((occ - 1ull) << 1ull)
//D2 LOWER: (bit_reverse((bit_reverse(occ) - 1ull)) >> 1ull)

Notice how the square is not needed to solve this! Even if it was supplied as a possibility - the minimum solution does not need it!

All rays share the same solution! - except the horizontal rays. The inputs parameters were pruned and in a sense this algorithm does not need to know which square we are operating on. This is a really good thing because enumerating set bits in a Bitboards yields sing bit masks and not the squares. This would be an algorithm to work without the source square (just its bitmask) and thus shaving off another countLZero instruction in real movegen code.

Now lets compare the 6 solution rays to hyperbola quiescence:

Code: Select all

	/* Generate attack using the sifted 8 ray approach */
	static constexpr uint64_t SolveGenetic(uint64_t occ, uint64_t mask1, uint64_t mask2)
	{
		uint64_t occ1 = occ & mask1; uint64_t occ2 = occ & mask2;
		return ((occ1 - 1ull) << 1ull) & mask1 | (bit_bswap((bit_bswap(occ) - 1ull)) >> 1ull) & mask2
	}

	/* Generate attack using the hyperbola quintessence approach */
	static constexpr uint64_t SolveHyperbola(uint64_t pieces, uint32_t x, uint64_t mask) {
		uint64_t o = pieces & mask;
		return ((o - (1ull << x)) ^ bit_bswap(bit_bswap(o) - (1ull << (x ^ 56)))) & mask;
	}

Lets look under the hood what is going on:

https://godbolt.org/z/MW1vo5Ed6

They are different but somehow weirdly identical. Not to mention its two totally different problems. Mine is working with 8 rays - and hyperbola quiescense is solving 4 rays but the number of instructions is really the same (since I use one extra mask with one & instuction extra) and godbolt gives my algorithm one instruction less compared to hyperbola.

You can try out the new algorithm today! - Github release
https://github.com/Gigantua/Chess_Moveg ... ic8Ray.hpp

What is there to do next?
Of course taking inputs from other very fast non table lookup algorithms like QBB or SBAMG and seeing if a uniform smaller solution exists.

This tool was created out of pure lazyness. Of course bithacks are widely known but putting them together in bigger expressions gives me headaches. Also there was no way to proof if an algorithm is optimal. UNTIL NOW.

Greetings - Daniel

dangi12012 · Post by **dangi12012** » Sat Apr 16, 2022 8:09 pm

Update: I just verified the hyperbola quiescence extension when the bitrotation intrinsics is available. (Then all 4 paths converge into a single function). With AVX512 bitrotation is also available with the "galois field" extensions. (byteswap is not enought for a horizontal solution).

I found these alternative forms:

Code: Select all

(o - (1ull << x)) ^ bit_reverse((bit_reverse(occ) - (1ull << (x ^ 63)))
(o - (1ull << x)) ^ bit_reverse((bit_reverse(occ) - (1ull << (63ull - sq))))  	#24929596
(o - (1ull << x)) ^ bit_reverse((bit_reverse(occ) - bit_reverse(1ull << sq)))   #35865916

So its proven that the currently used CUDA code here is optimal with these intrinsics! - https://github.com/Gigantua/Chess_Moveg ... rotation.h

What is remaining is to find alternative forms for both these functions:
https://github.com/Gigantua/Chess_Moveg ... iff_IL.hpp
https://github.com/Gigantua/Chess_Moveg ... in/QBB.hpp

Other than that I can only see special hardware like FPGAs to outperform any of these very fast algorithms.

I will also try to find a fast hash function that could replace fancy magic that does not rely on the really quite slow IMUL64. (maybe 10-12 xor + shifts).

dangi12012 · Post by **dangi12012** » Sun Apr 17, 2022 7:52 pm

Big News: New Minimal Form for Obstruction Difference found!

The core algorithm for Obstruction Difference can be improved by more than 30% runtime!
I found new minimal form for the Obstruction Difference
- so the same input masks but faster:

The very core of the change is this equivalence found by AST sifting:

Code: Select all

        
uint64_t msb = (1ull << 63ull) >> std::countl_zero(lower | 1);  //Extract Highest Set Isolated Bit
uint64_t lsb = upper & -upper;                                                //Extract Lowest Set Isolated Bit
uint64_t oDif = lsb * 2 - msb;
return mask & oDif;

Above is equivalent to below

Code: Select all

const uint64_t msb = 0x8000000000000000ull >> std::countl_zero(lower | 1);
return (mask & (upper ^ (upper - msb))); //Skip LSB and LEA calculation

Proof of improvement:
https://github.com/Gigantua/Chess_Moveg ... onDiff.hpp

It would be great to update the cpw with this information: https://www.chessprogramming.org/Obstruction_Difference

Now I can see that there are Bitboard native algorithms emerging. These are algorithms that would like the source square to be passed into them in the form of (1ull << sq) and not the commonly used [0..64] sq.
The great thing about Bitboard nativity would be that enumerating set bits could skip the source square resultion - and directly pass the current bitmask.
This will improve performance of any bitboard engine that currently works in 0..64 space.

tcusr · Post by **tcusr** » Sun Apr 17, 2022 9:45 pm

how did the sifting algorithm come up with the constant?

mar · Post by **mar** » Mon Apr 18, 2022 8:38 am

bswap is byte swap, not bit reverse

dangi12012 · Post by **dangi12012** » Sat Apr 23, 2022 1:33 pm

This project is done. All sliding piece algorithms that had an improvement were implemented.
I will keep this tool around and make it public in time if there is enough intereset.

It can optimize any C++ expression into simpler forms if you provide plausible alternative tokens.
It can prove if an expression is optimal with the tokens provided. This was very useful in finding alternative raymask algorithms

Below images show what I mean exactly - (these algorithmic improvements are non trivial and I was suprised how many chess sliding piece algos are very far from the vanilla versions by the respective authors by now). (no more linex, lower and upper are looked up more effieciently etc)

Proof of optimisation
(This is already optimized compared to the vanilla author code)

I improved above to this in 2022 - and proved its the optimal form.

New Algorithms found
The sifting tool also helped me discover a version of the bitrotation algorithm (which in itself is an improvement upon hyperbola qsc) which works with 8 rays instead of 4: https://github.com/Gigantua/Chess_Moveg ... ic8Ray.hpp

Another example
This code:

Code: Select all

(((0x8040201008040201ULL >> LSB(piecesup)) & (0x8040201008040201ULL << MSB(piecesdo))) ^
((0x8102040810204081ULL >> LSB(piecesle)) & (0x8102040810204081ULL << MSB(piecesri))))

and this code:

Code: Select all

(((0x8040201008040201ULL << msb_do) << lsb_up) >> lsb_up) ^ 
(((0x8102040810204081ULL << msb_le) << lsb_ri) >> lsb_ri);

Are equivalent. Same number of instructions - but one constant less. This is also proven to be the shortest form of itself.

So all in all a verygreat success.

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Apr 25, 2022 6:05 pm

dangi12012 wrote: ↑Sun Apr 17, 2022 7:52 pm Big News: New Minimal Form for Obstruction Difference found!

The core algorithm for Obstruction Difference can be improved by more than 30% runtime!
I found new minimal form for the Obstruction Difference
- so the same input masks but faster:

The very core of the change is this equivalence found by AST sifting:
Code: Select all
        
uint64_t msb = (1ull << 63ull) >> std::countl_zero(lower | 1);  //Extract Highest Set Isolated Bit
uint64_t lsb = upper & -upper;                                                //Extract Lowest Set Isolated Bit
uint64_t oDif = lsb * 2 - msb;
return mask & oDif;
Above is equivalent to below
Code: Select all
const uint64_t msb = 0x8000000000000000ull >> std::countl_zero(lower | 1);
return (mask & (upper ^ (upper - msb))); //Skip LSB and LEA calculation
Proof of improvement:
https://github.com/Gigantua/Chess_Moveg ... onDiff.hpp

It would be great to update the cpw with this information: https://www.chessprogramming.org/Obstruction_Difference

Now I can see that there are Bitboard native algorithms emerging. These are algorithms that would like the source square to be passed into them in the form of (1ull << sq) and not the commonly used [0..64] sq.
The great thing about Bitboard nativity would be that enumerating set bits could skip the source square resultion - and directly pass the current bitmask.
This will improve performance of any bitboard engine that currently works in 0..64 space.

Congrats, nice finding!
CPW Page updated

Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]

Re: Sliding Piece Algorithm by Abstract Syntax Tree Sifting [RELEASE]