Comparison of bitboard attack-getter variants

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

ZirconiumX
Posts: 1334
Joined: Sun Jul 17, 2011 11:14 am

Re: Comparison of bitboard attack-getter variants

Post by ZirconiumX »

Sven, if you don't mind me asking, could you run a quick comparison of whether the extra bit twiddling of this article is worth the unified implementation?

Gerd might find the article useful too, although whether it can be suitably vectorized is not in my field of expertise (I'm a musician...).

Matthew:out
Some believe in the almighty dollar.

I believe in the almighty printf statement.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Comparison of bitboard attack-getter variants

Post by Gerd Isenberg »

Sven Schüle wrote: The following five approaches were tested:
- Magics
- Hyperbola Quintessence
- Dumb7Fill
- Obstruction Difference
- and an unnamed approach which I simply call "No Headache" (at least the author knows why).

A more realistic test would of course include a "real" static evaluation function, use SEE, improve the search (to reach much greater search depths), and more. Although I doubt that this would change the results significantly, I plan to repeat the same test after improving the engine substantially. Other attack-getter implementations are also on my ToDo list, including Kindergarten bitboards and a few more.

Test result overview:

Code: Select all

Attack-getter variant   Perft NPS  -% vs. Magics  Search NPS  -% vs. Magics
===========================================================================
Magics                    1509538                    4689055
Obstruction Difference    1305917         -13.5%     4480245          -4.5%
"No Headache"             1290733         -14.5%     4411828          -5.9%
Hyperbola Quintessence    1327537         -12.1%     4380272          -6.6%
Dumb7fill                 1174150         -22.2%     4102345         -12.5%
My brief summary: Magics are still on rank 1. Several other techniques are not far away from Magics, and they might serve as a good starting point if program speed is not (yet) an issue.
Hi Sven,

ahh, attack-getters! Nice test. Interesting that, besides the huge-memory approach of magic bitboards being the expected winner (inlined?), "No Headache" with its loops and branches is faster than I expected. Of course, both approaches may decline if more memory (Magics) or branch-prediction resources ("No Headache") are consumed by other areas of a further developed engine.
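
A minimal sketch of what such a plain magic lookup boils down to, assuming the magic multipliers, masks and attack tables are pre-computed at startup; the names Magic, rookMagics and rookAttacks are illustrative, not taken from Sven's engine:

Code: Select all

#include <stdint.h>

typedef uint64_t U64;

/* Per-square data for plain magic bitboards; everything here is
   assumed to be filled in once at engine startup. */
typedef struct {
    U64  mask;     /* relevant blocker squares for this square   */
    U64  magic;    /* magic multiplier found offline             */
    U64 *attacks;  /* pointer into the pre-computed attack table */
    int  shift;    /* 64 minus the number of relevant bits       */
} Magic;

extern Magic rookMagics[64];   /* assumed to be initialized elsewhere */

/* The whole attack getter: one AND, one multiply, one shift, one load. */
static inline U64 rookAttacks(int sq, U64 occ) {
    const Magic *m = &rookMagics[sq];
    return m->attacks[((occ & m->mask) * m->magic) >> m->shift];
}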

Concerning Harald's former test and Rotated Indices, it is somewhat difficult to compare approaches with different incremental-update versus on-the-fly computation costs. Rotated Indices have higher costs in make/unmake or copy/make than rotated bitboards, which in turn have higher incremental update costs than non-rotated approaches. On the other hand, this may pay off if that deconcentrated (rotated or reversed) occupancy is probed often per node. But if you frequently use temporarily modified occupancies per node, e.g. to get x-ray attacks, it is more work on the fly.

Of course, fill stuff like Dumb7Fill is not intended for single sliding pieces, but for sets of multiple pieces in a direction-wise approach, especially with AVX2 in mind à la DirGolem, or if you like to compute progressive mobility, i.e. all squares a bishop may attack in two moves. Anyway, is Kogge-Stone faster than Dumb7Fill in your engine?
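
To make the Kogge-Stone versus Dumb7Fill question concrete, here is the textbook form of both fills, shown for the south direction only (which needs no wrap masks); this is a sketch of the general technique, not code from Sven's engine:

Code: Select all

#include <stdint.h>

typedef uint64_t U64;

/* Dumb7Fill: a plain sequential flood, one shift per step. */
U64 southAttacksDumb7Fill(U64 rooks, U64 empty) {
    U64 flood = rooks;
    flood |= rooks = (rooks >> 8) & empty;
    flood |= rooks = (rooks >> 8) & empty;
    flood |= rooks = (rooks >> 8) & empty;
    flood |= rooks = (rooks >> 8) & empty;
    flood |= rooks = (rooks >> 8) & empty;
    flood |=         (rooks >> 8) & empty;
    return flood >> 8;          /* shift once more to get the attacks */
}

/* Kogge-Stone: parallel prefix, doubling the propagation distance each step. */
U64 southAttacksKoggeStone(U64 rooks, U64 empty) {
    rooks |= empty & (rooks >>  8);
    empty &=         (empty >>  8);
    rooks |= empty & (rooks >> 16);
    empty &=         (empty >> 16);
    rooks |= empty & (rooks >> 32);
    return rooks >> 8;          /* shift once more to get the attacks */
}

Both return the attack set including the first blocker; the other seven directions are analogous, with the appropriate wrap masks for the east/west components.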

Thanks,
Gerd
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Comparison of bitboard attack-getter variants

Post by hgm »

Sven Schüle wrote:System environment
- PC: Lenovo Laptop W530 running Windows 7 Pro SP1
- CPU: Intel Core i7-3610QM CPU @2.30 GHz
Unfortunately this makes the results a bit difficult to interpret on an absolute scale, as this i7 model has the 'turbo boost' feature, so that in cases where not all of its cores are fully loaded the clock frequency might rise up to 3.30 GHz. And when doing a single-core test this is likely to happen.

Just to put your results in perspective, could you measure the nps of micro-Max 1.6 (WB version), which is also an engine with material-only eval and no hash table, on your machine? On my 2.4 GHz i3 it did 3.17 Mnps (on the opening position), but the i3 does not have turbo boost. Converting that to 3.3 GHz would give 4.36 Mnps.

Micro-Max has no piece list, though, so it has to scan the board to find its own pieces, which of course is not very good for its nps.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Comparison of bitboard attack-getter variants

Post by Gerd Isenberg »

Thanks for the link. Funny, but looks a bit expensive to me.

Code: Select all

>> rank*8
*  0x0101010101010101
&  0x8040201008040201
usual HQ attack from the diagonal
*  0x0101010101010101
/  0x0100000000000000   (hopefully replaced by >> 56)
<< rank*8
compared to a "usual" rank-getter with a 512-byte lookup:

Code: Select all

BYTE arrFirstRankAttacks64x8[64*8]; // 512 Bytes = 1/2 KByte

U64 rankAttacks(U64 occ, enumSquare sq) {
   unsigned int file = sq &  7;
   unsigned int rkx8 = sq & 56; // rank * 8
   occ = (occ >> rkx8) & 2*63;
   U64 attacks = arrFirstRankAttacks64x8[4*occ + file];
   return attacks << rkx8;
}
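
The initialization of that table is not shown above; one way to fill it, consistent with the 4*occ + file indexing (the function name initFirstRankAttacks is illustrative only, not from the post):

Code: Select all

typedef unsigned char BYTE;

BYTE arrFirstRankAttacks64x8[64*8];    /* 512 bytes, as above */

/* Fills arrFirstRankAttacks64x8. The index 4*occ + file used above
   equals 8*occ6 + file, where occ6 are the six inner occupancy bits
   (files b..g) of the rank, because occ arrives already doubled.   */
void initFirstRankAttacks(void) {
    for (int occ6 = 0; occ6 < 64; occ6++) {
        BYTE occ = (BYTE)(occ6 << 1);               /* blockers on files b..g */
        for (int file = 0; file < 8; file++) {
            BYTE attacks = 0;
            for (int f = file + 1; f < 8; f++) {    /* walk east */
                attacks |= (BYTE)(1 << f);
                if (occ & (1 << f)) break;          /* stop at the blocker */
            }
            for (int f = file - 1; f >= 0; f--) {   /* walk west */
                attacks |= (BYTE)(1 << f);
                if (occ & (1 << f)) break;          /* stop at the blocker */
            }
            arrFirstRankAttacks64x8[8*occ6 + file] = attacks;
        }
    }
}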
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Comparison of bitboard attack-getter variants

Post by Sven »

ZirconiumX wrote:Sven, if you don't mind me asking, could you run a quick comparison of whether the extra bit twiddling of this article is worth the unified implementation?

Gerd might find the article useful too, although whether it can be suitably vectorized is not in my field of expertise (I'm a musician...).
I'll try that as soon as I can. Might take a few days, though.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Comparison of bitboard attack-getter variants

Post by Sven »

hgm wrote:
Sven Schüle wrote:System environment
- PC: Lenovo Laptop W530 running Windows 7 Pro SP1
- CPU: Intel Core i7-3610QM CPU @2.30 GHz
Unfortunately this makes the results a bit difficult to interpret on an absolute scale, as this i7 model has the 'turbo boost' feature, so that in cases where not all of its cores are fully loaded the clock frequency might rise up to 3.30 GHz. And when doing a single-core test this is likely to happen.

Just to put your results in perspective, could you measure the nps of micro-Max 1.6 (WB version), which is also an engine with material-only eval and no hash table, on your machine? On my 2.4 GHz i3 it did 3.17 Mnps (on the opening position), but the i3 does not have turbo boost. Converting that to 3.3 GHz would give 4.36 Mnps.

Micro-Max has no piece list, though, so it has to scan the board to find its own pieces, which of course is not very good for its nps.
I'll try that as soon as possible.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Comparison of bitboard attack-getter variants

Post by Sven »

Sven Schüle wrote:
hgm wrote:
Sven Schüle wrote:System environment
- PC: Lenovo Laptop W530 running Windows 7 Pro SP1
- CPU: Intel Core i7-3610QM CPU @2.30 GHz
Unfortunately this makes the results a bit difficult to interpret on an absolute scale, as this i7 model has the 'turbo boost' feature, so that in cases where not all of its cores are fully loaded the clock frequency might rise up to 3.30 GHz. And when doing a single-core test this is likely to happen.

Just to put your results in perspective, could you measure the nps of micro-Max 1.6 (WB version), which is also an engine with material-only eval and no hash table, on your machine? On my 2.4 GHz i3 it did 3.17 Mnps (on the opening position), but the i3 does not have turbo boost. Converting that to 3.3 GHz would give 4.36 Mnps.

Micro-Max has no piece list, though, so it has to scan the board to find its own pieces, which of course is not very good for its nps.
I'll try that as soon as possible.
On my machine, the speed of umax1_6w.exe seems to be comparable to that of my bitboard engine:

Code: Select all

tellics say         micro-Max 1.6w
tellics say         by H.G. Muller
new
level 40 120 0
post
go
 0      0        0          1 a8a8
 1     15        0         22 c2c4
 2      0        0        142 c2c4
 3     15        0       1784 c2c4
 4      0        1      15572 d2d4
 5     15        4     143800 b1c3
 6      0       43    1637236 b1c3
 7     14      291   13173109 e2e3
 8      2     3864  168516567 d2d4
 9     12    23255 1060333645 g1f3
move g1f3
13173109 / 2.91 = 4526842
168516567 / 38.64 = 4361195
1060333645 / 232.55 = 4559594

so it is roughly 4.4 .. 4.5 Mnps.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Comparison of bitboard attack-getter variants

Post by hgm »

Seems your i7 is running in full turbo mode (3.3 GHz) in these tests.

And not bad for a slow mailbox (0x88 instead of boundary guards, no piece list).
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Comparison of bitboard attack-getter variants

Post by Henk »

How many moves do you need to store to make magic bitboards work? I think only storing the bitboards that contain the moves will be too slow, for you still have to extract the moves out of these bitboards.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Comparison of bitboard attack-getter variants

Post by Sven »

Henk wrote:How many moves do you need to store to make magic bitboards work? I think only storing the bitboards that contain the moves will be too slow, for you still have to extract the moves out of these bitboards.
When generating moves I always store the moves themselves in the move list, not just their bitboards, which may encode more than one move at once. There is one simple reason for it: move ordering would become quite difficult without per-move scoring. I don't know exactly what you mean by "how many moves do you need to store ...", but I guess the answer is: "all". Performance might be another reason, but in my opinion move ordering is the main issue here.
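
The extraction Henk is worried about is usually just a bit-scan loop over the attack set. A minimal sketch, where the Move layout and the GCC/Clang builtin are assumptions rather than Sven's actual code:

Code: Select all

#include <stdint.h>

typedef uint64_t U64;

typedef struct {
    int from, to;
    int score;                  /* filled in later by move ordering */
} Move;

/* Serializes an attack/target bitboard into individual moves and
   returns the new move count. Extracting the least significant one
   bit and then clearing it is the usual idiom.                     */
int addMoves(int from, U64 targets, Move *list, int n) {
    while (targets) {
        int to = __builtin_ctzll(targets);  /* index of the LS1B */
        targets &= targets - 1;             /* clear the LS1B    */
        list[n].from  = from;
        list[n].to    = to;
        list[n].score = 0;
        n++;
    }
    return n;
}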