I had a productive lunch break and implemented two novel move generators. I'm not even kidding. PEXT makes this so easy.
So my original PEXT implementation (which I understand to be the 'classic' version) looks like this:
Code: Select all
using System.Runtime.Intrinsics.X86; // for Bmi2.X64.ParallelBitExtract

static readonly ulong[] Attacks = new ulong[5248 + 102400];
static readonly ulong[] BishopOffset = new ulong[64];
static readonly ulong[] RookOffset = new ulong[64];
static readonly ulong[] BishopMask = new ulong[64];
static readonly ulong[] RookMask = new ulong[64];

public static ulong RookAttacks(ulong occupation, int square)
{
    return Attacks[RookOffset[square] + Bmi2.X64.ParallelBitExtract(occupation, RookMask[square])];
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    return Attacks[BishopOffset[square] + Bmi2.X64.ParallelBitExtract(occupation, BishopMask[square])];
}
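In case you haven't used PEXT before: ParallelBitExtract gathers the occupation bits selected by the mask into the low end of the result, and that packed value is exactly the index into the attack table. A tiny example (values made up for illustration):
Code: Select all
// PEXT collects the source bits at the mask's set positions
// and packs them into the low bits of the result.
ulong occupation = 0b0010_1000; // bits 3 and 5 are set
ulong mask       = 0b0110_1010; // selects bits 1, 3, 5 and 6
ulong index = Bmi2.X64.ParallelBitExtract(occupation, mask);
// bit 1 -> 0, bit 3 -> 1, bit 5 -> 1, bit 6 -> 0  =>  index == 0b0110 == 6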
Here all the bitboards are pregenerated. There exist 5248 bishop attacks for all legal combinations of bishop positions and blockers! And a whopping 102400 rook attacks! So that's more than 100K bitboards we need to store.
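If you want to double-check those numbers yourself, summing 2^popcount over every square's relevant-occupancy mask reproduces them. Here's a throwaway program (not from Leorik, just for verification) that does exactly that:
Code: Select all
using System;
using System.Numerics;

class CountTableEntries
{
    // Walks one ray but stops before the board edge: the last square of a
    // ray never changes the attack pattern, so it is not part of the mask.
    static ulong RelevantRay(int rank, int file, int dr, int df)
    {
        ulong mask = 0;
        int r = rank + dr, f = file + df;
        while (r + dr >= 0 && r + dr <= 7 && f + df >= 0 && f + df <= 7)
        {
            mask |= 1UL << (r * 8 + f);
            r += dr; f += df;
        }
        return mask;
    }

    static void Main()
    {
        (int dr, int df)[] rookDirs = { (0, 1), (0, -1), (1, 0), (-1, 0) };
        (int dr, int df)[] bishopDirs = { (1, 1), (1, -1), (-1, 1), (-1, -1) };
        long rook = 0, bishop = 0;
        for (int square = 0; square < 64; square++)
        {
            ulong rookMask = 0, bishopMask = 0;
            foreach (var (dr, df) in rookDirs)
                rookMask |= RelevantRay(square / 8, square % 8, dr, df);
            foreach (var (dr, df) in bishopDirs)
                bishopMask |= RelevantRay(square / 8, square % 8, dr, df);
            // a mask with N relevant bits contributes 2^N table entries
            rook += 1L << BitOperations.PopCount(rookMask);
            bishop += 1L << BitOperations.PopCount(bishopMask);
        }
        Console.WriteLine($"Rook: {rook}, Bishop: {bishop}"); // 102400 and 5248
    }
}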
Now the subset idea as used in KiSS comes into play: if we can reconstruct the attack bitboard from two subsets at runtime, we don't need to store so many bitboards. How many subsets do we have to store if we pack them as densely as possible? The answer is 1664 for the bishop and 5120 for the rook:
Code: Select all
static readonly ulong[] DiagonalMask = new ulong[64];
static readonly ulong[] AntiDiagonalMask = new ulong[64];
static readonly ulong[] HorizontalMask = new ulong[64];
static readonly ulong[] VerticalMask = new ulong[64];
static readonly ulong[] Attacks = new ulong[1664 + 5120]; // 6784 entries total
static readonly int[] DiagonalOffset = new int[64];
static readonly int[] AntiDiagonalOffset = new int[64];
static readonly int[] HorizontalOffset = new int[64];
static readonly int[] VerticalOffset = new int[64];

public static ulong RookAttacks(ulong occupation, int square)
{
    // the rook attack set is the union of its horizontal and vertical line attacks
    ulong a = Attacks[HorizontalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, HorizontalMask[square])];
    ulong b = Attacks[VerticalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, VerticalMask[square])];
    return a | b;
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    // same for the bishop: diagonal | anti-diagonal
    ulong a = Attacks[DiagonalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, DiagonalMask[square])];
    ulong b = Attacks[AntiDiagonalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, AntiDiagonalMask[square])];
    return a | b;
}
Note how we now need twice as many 64-entry arrays to look up offsets, and twice as many masks, so you may add that on top. But the memory footprint is still way less than 10K bitboards: 6784 bitboards (roughly 53 KB) plus a few small offset and mask arrays, versus more than 100K bitboards (roughly 841 KB) for classical PEXT. An order of magnitude less!
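For completeness, here is roughly how such a densely packed table can be filled at startup. This is a sketch of the idea, not the exact initializer from Leorik: SlideAttacks is a plain (slow) ray-walker that only runs at init time, and PDEP, the inverse of PEXT, enumerates every blocker configuration of a mask. It assumes the declarations from the block above plus System.Numerics for BitOperations.
Code: Select all
// Slow, loop-based slider attacks - only used while building the tables.
static ulong SlideAttacks(int square, ulong blockers, (int dr, int df)[] dirs)
{
    ulong attacks = 0;
    foreach (var (dr, df) in dirs)
    {
        for (int r = square / 8 + dr, f = square % 8 + df;
             r >= 0 && r <= 7 && f >= 0 && f <= 7; r += dr, f += df)
        {
            attacks |= 1UL << (r * 8 + f);
            if ((blockers & (1UL << (r * 8 + f))) != 0)
                break; // a ray ends at the first blocker
        }
    }
    return attacks;
}

// Fills the table for one line type and records the offsets. Each square's
// slice holds exactly 2^popcount(mask) entries - that's the dense packing.
static int InitLine(ulong[] masks, int[] offsets, (int dr, int df)[] dirs, int nextOffset)
{
    for (int square = 0; square < 64; square++)
    {
        offsets[square] = nextOffset;
        int bits = BitOperations.PopCount(masks[square]);
        for (ulong index = 0; index < (1UL << bits); index++)
        {
            // PDEP scatters the index bits back onto the mask, reconstructing
            // the one blocker configuration that belongs to this table slot.
            ulong blockers = Bmi2.X64.ParallelBitDeposit(index, masks[square]);
            Attacks[nextOffset + (int)index] = SlideAttacks(square, blockers, dirs);
        }
        nextOffset += 1 << bits;
    }
    return nextOffset;
}
You'd call this four times, e.g. InitLine(HorizontalMask, HorizontalOffset, new[] { (0, 1), (0, -1) }, nextOffset), starting the bishop lines at offset 0 and chaining the returned offset into the next call; it reaches 1664 after the two bishop lines and 6784 after all four.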
But we don't have to store the bitboards densely. If you think about it (or print it to the console^^), for each square there can't be more than 64 distinct attack patterns along one line: a line has at most 6 relevant blocker squares (the outermost square of a line and the slider's own square never change the pattern), so PEXT yields an index below 2^6 = 64. If we allow ourselves a bit more space to store the bitboards, we can store them sparsely and don't have to keep track of the offsets, but can instead just compute them on the fly.
Code: Select all
static readonly ulong[] DiagonalMask = new ulong[64];
static readonly ulong[] AntiDiagonalMask = new ulong[64];
static readonly ulong[] HorizontalMask = new ulong[64];
static readonly ulong[] VerticalMask = new ulong[64];
static readonly ulong[] Attacks = new ulong[4 * 64 * 64]; // 16384 entries

public static ulong RookAttacks(ulong occupation, int square)
{
    // rook tables live in the upper half (8192 = 64 squares * 128); each square
    // owns a 128-entry slot: 64 for the horizontal line, 64 for the vertical line
    int offset = 8192 + square * 128;
    ulong a = Attacks[offset + (int)Bmi2.X64.ParallelBitExtract(occupation, HorizontalMask[square])];
    ulong b = Attacks[offset + 64 + (int)Bmi2.X64.ParallelBitExtract(occupation, VerticalMask[square])];
    return a | b;
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    // bishop tables live in the lower half: 64 diagonal + 64 anti-diagonal entries per square
    int offset = square * 128;
    ulong a = Attacks[offset + (int)Bmi2.X64.ParallelBitExtract(occupation, DiagonalMask[square])];
    ulong b = Attacks[offset + 64 + (int)Bmi2.X64.ParallelBitExtract(occupation, AntiDiagonalMask[square])];
    return a | b;
}
So more than half of the slots in the Attacks table are now empty. Wasted space. But not a lot of wasted space compared to PEXT: 16K bitboards is still 90K bitboards *less* than the roughly 105K of classical PEXT, so this could do well in real-world scenarios, provided that the Perft speed is still competitive.
And not surprisingly it is:
Code: Select all
Leorik Perft Sparse SubsetPEXT (16K)
1 OK! 1121ms, 106192K NPS
2 OK! 1703ms, 113734K NPS
3 OK! 1629ms, 109633K NPS
4 OK! 6716ms, 105122K NPS
5 OK! 0ms, 115093K NPS
6 OK! 1410ms, 116317K NPS
Total: 1361558651 Nodes, 12580ms, 108223K NPS
Code: Select all
Leorik Perft Dense SubsetPEXT (7K)
1 OK! 1063ms, 111974K NPS
2 OK! 1735ms, 111630K NPS
3 OK! 1652ms, 108100K NPS
4 OK! 7091ms, 99569K NPS
5 OK! 0ms, 84334K NPS
6 OK! 1475ms, 111225K NPS
Total: 1361558651 Nodes, 13017ms, 104593K NPS
You can see that both versions perform a little better than KiSS on hardware with native PEXT support (like my Ryzen 5900X), and both also have a small memory footprint: the sparse one's is equal to KiSS's, and the dense one uses less than half of that. On the other hand, the densely packed version is about 4M NPS slower than the one that uses twice the memory. So both seem equally promising for a real-life scenario with high cache pressure. And both stand a good chance of performing better than classical PEXT and also better than KiSS!
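And since this whole detour started with a slider-move-gen bug: the cheapest insurance after a rewrite like this is a differential test of the new generator against the old one on random occupancies. A quick sketch (SparsePext and ClassicPext are placeholder names for the two implementations above):
Code: Select all
using System;

class SliderSanityCheck
{
    static void Main()
    {
        var rng = new Random(12345);
        for (int i = 0; i < 1_000_000; i++)
        {
            // AND-ing two random values gives a sparser, more board-like occupancy
            ulong occupation = (ulong)rng.NextInt64() & (ulong)rng.NextInt64();
            int square = rng.Next(64);
            // placeholder classes: new sparse generator vs. trusted classic PEXT
            if (SparsePext.RookAttacks(occupation, square) != ClassicPext.RookAttacks(occupation, square) ||
                SparsePext.BishopAttacks(occupation, square) != ClassicPext.BishopAttacks(occupation, square))
                Console.WriteLine($"Mismatch at square {square}, occupation 0x{occupation:X16}");
        }
        Console.WriteLine("Done.");
    }
}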
I didn't think I'd catch the slider-move-gen bug, but I've fallen right down the rabbit hole this weekend, haha. I bet at least Dangi and Michael will agree that this is exciting! I don't know about the other readers... if you want me to clarify something, feel free to ask.
