However, reading this forum and the wiki gave me the impression that I'd need magic bitboards to be allowed at the cool kids' table. But I'm also lazy and don't like the idea of having almost megabyte-sized, sparsely populated tables (see me worrying about funky memory interactions above). So I wanted to know how much I leave on the table with the alternatives.
Throughout my code I use separate sliding move generation per direction for things like pins, discoveries, skewers and so on. That makes the "less magic" and much denser kindergarten bitboards attractive to me, since they are a drop-in replacement, whereas magics would be much more invasive: their benefit comes from producing both directions at once.
I set out to do some synthetic and practical benchmarking of these three approaches. For the arithmetic and kindergarten generators I use my own code; for the magics I borrowed the code from Pradu Kannan.
For synthetic testing I use the following loop. It repeatedly XORs the generated move set back into the blocker board, so each of the M chains is serially dependent on its own previous result: M = 1 exposes pure latency, while larger M gives the core independent work to overlap and approaches the throughput limit.
unsigned int tsc_aux;
uint64_t start = __rdtscp(&tsc_aux);
for(uint64_t i = 0u; i < N; ++i) {
    #pragma unroll
    for(int j = 0; j < M; ++j) {
        block[j] ^= movegen_function((i + 7u*j) & 63u, block[j]);
    }
}
uint64_t end = __rdtscp(&tsc_aux);
In the tables, t = (end-start)/N is the average time per iteration of the outer loop, in RDTSC ticks. Measurements were done on an AMD 3900X. The RDTSC frequency measures at 3.8 GHz, which is consistent with the specced base clock, while the actual clock speeds observed during measurement were around 4.35 GHz. So if one wanted the values in "real clock cycles", they would have to be multiplied by about 1.15.
Rooks
M    arithmetic       kindergarten     magic
     t      t/M       t      t/M       t      t/M
1    12.36  12.36     15.09  15.09     26.64  26.64
2    20.59  10.29     16.93   8.46     27.62  13.81
3    31.18  10.39     24.46   8.15     28.38   9.46
4    42.07  10.52     31.58   7.89     29.24   7.31
5    50.76  10.15     39.03   7.81     30.53   6.11
6    60.57  10.10     47.37   7.89     32.57   5.43
7    77.19  11.03     56.69   8.10     35.00   5.00
8    86.91  10.86     65.01   8.13     37.78   4.72
The arithmetic approach scales directly with the amount of computation, whereas the magics are dominated by lookup latency and each additional independent lookup costs only about one extra cycle.
Bishops
M    arithmetic       kindergarten     magic
     t      t/M       t      t/M       t      t/M
1     15.36  15.36    15.13  15.13     10.80  10.80
2     23.75  11.88    16.05   8.03     11.00   5.50
3     35.41  11.80    20.81   6.94     11.18   3.73
4     46.16  11.54    25.33   6.33     11.75   2.94
5     59.87  11.97    33.62   6.72     13.00   2.60
6     73.74  12.29    40.30   6.72     15.29   2.55
7     93.07  13.30    47.94   6.85     17.64   2.52
8    108.51  13.56    55.19   6.90     20.60   2.57
Queens
M    arithmetic       kindergarten      magic
     t      t/M       t       t/M      t      t/M
1     24.12  24.12     19.32  19.32    32.85  32.85
2     47.78  23.89     26.92  13.46    33.74  16.87
3     65.85  21.95     39.97  13.32    35.66  11.89
4     95.15  23.79     55.30  13.82    41.77  10.44
5    122.70  24.54     68.36  13.67    43.55   8.71
6    150.33  25.05     90.59  15.10    50.24   8.37
7    167.43  23.92    102.42  14.63    57.98   8.28
8    189.12  23.64    116.76  14.60    66.81   8.35
At this point someone might object that this is all meaningless and pathological. And they would be right, I guess. From a distance it is difficult to guess whether a specific piece of "real" code will expose the latency or just be throughput limited. You can't even make that statement globally for the entire engine: move generation might be a bulk operation and thus throughput limited, while pin detection might be latency limited.
All of these measure in the one- to low-two-digit range of clock cycles. For comparison, that is about as expensive as a single 64-bit integer division, and a full cache miss from the TT will single-handedly cost a multiple of any of these operations.
To get the real impact I have to test it within my actual search. As mentioned above, I can't easily integrate magics into my search, but I can just drop in the kindergarten lookups. The "real world" result was that the difference is barely distinguishable from measurement noise, with a tiny advantage (1-3%) for the kindergarten bitboards over the arithmetic ones. My personal conclusion is that magics might get me another single-digit percentage of linear speed, which seems insignificant compared to the more "exponential" gains to be had by improving the search algorithm and heuristics.
One last observation: this was all done on an AMD Zen 2 processor, which can execute most of the relevant logic ops (AND, OR, XOR, SHL, SHR, CLZ, POPCNT, ...) at throughputs of 2-4 per cycle. Intel CPUs of recent generations list lower throughputs (1-2 per cycle) for the same family of instructions, so the relative advantage of the lookup-based approaches could be more pronounced there? I'll have to rerun this experiment on my Intel laptop eventually.
And finally, the code for the arithmetic sliding attacks mentioned above. The line masks are obtained by arithmetic for rooks and by lookups for bishops.
uint64_t slide_arithmetic(uint64_t piece, uint64_t line, uint64_t block) {
    // restrict to possible blockers
    block &= line;
    if(!block) {
        // don't bother if there are no blockers
        return line;
    }
    // split the line into upper and lower rays
    uint64_t bottom = piece ^ (piece - UINT64_C(1));
    // for the upper part we can use the x^(x-1) trick to fill in from the bottom
    uint64_t masked_up = block & ~bottom;
    uint64_t blocked_up = masked_up ^ (masked_up - UINT64_C(1));
    // for the bottom we use CLZ + a shift to fill in from the top
    uint64_t masked_down = block & bottom & ~piece;
    uint64_t blocked_down = UINT64_C(0x7FFFFFFFFFFFFFFF) >> CLZ64(masked_down | UINT64_C(1));
    // blocked_down is a subset of blocked_up, so XOR leaves exactly the
    // span between the two nearest blockers; mask with the line
    return (blocked_up ^ blocked_down) & line;
}