
Re: Questions about chess programming from a newbie

Posted: Sat Apr 25, 2015 10:42 pm
by zd3nik
mattbrah wrote:2. How much of a speed difference does storing piece positions and movegen via bitboards make? Is the trade-off in speed worth the time that it takes to understand and implement bitboards? What is the highest realistic level that it could play at using its current method and instead focusing on optimizing the search/evaluation?
Welcome! I'm relatively new to this site as well.

Since I've written many chess engines using bitboards as well as several different types of array-based implementations, I'll add my 2 cents to help answer this question.

In my experience bitboard implementations, if done correctly, provide about a 2x increase in nodes per second. They offer more efficient solutions to very important aspects of the game. Writing a quiescence move generator (captures, promotions, and checks only) is much faster with bitboards. Calculating mobility, king safety, trapped pieces, locked pawns, etc. is also much faster with bitboards.
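
To make that concrete, here is a minimal sketch (not taken from any particular engine; SliderAttacks() and the board layout are just placeholders): once you have an attack set for a piece, quiescence capture targets and mobility each fall out of a couple of bitwise operations.

Code:

#include <stdint.h>

typedef uint64_t Bitboard;

/* Placeholder attack generator for whatever slider sits on 'sq' given the
   current occupancy (magic, rotated, or any other scheme would do here). */
extern Bitboard SliderAttacks(int sq, Bitboard occupied);

/* Capture targets for quiescence and mobility for eval, from one attack set. */
static int CapturesAndMobility(int sq, Bitboard occupied, Bitboard own,
                               Bitboard enemy, Bitboard *captures)
{
    Bitboard attacks = SliderAttacks(sq, occupied);
    *captures = attacks & enemy;                  /* QS capture targets           */
    return __builtin_popcountll(attacks & ~own);  /* mobility (GCC/Clang builtin) */
}

The same attack sets feed king safety and trapped-piece tests, which is where much of the evaluation speed-up comes from.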

That being said, faster move generation and positional evaluation are not going to make an engine a lot stronger on their own. The high branching factor of chess makes sure of that. Good move ordering, smart selective search (at inner nodes as well as quiescence nodes), and good positional evaluation are far more important. Bitboards just help you do those things more efficiently.

I've been working on two engines recently; one uses 0x88 board representation and the other uses bitboards. Aside from using bitboards for move generation and positional evaluation, the bitboard engine is identical to the 0x88 engine. The bitboard engine is over twice as fast as the 0x88 engine, but it only achieves slightly higher search depths (about 1 extra ply per move in a blitz game). It's really just the smarter positional evaluation made possible in an efficient manner by the bitboard representation that makes it stronger than the 0x88 engine. The extra speed is of course valuable as well (1 extra ply per move is nothing to scoff at), but I believe the smarter positional evaluation plays a bigger role in its better play.

In summary, I would recommend thinking of bitboard implementations as an exercise in optimization and a learning experience, but not a priority in developing your first engine. Understanding bit twiddling will be very valuable to you as a programmer in general so it's worth doing in the long run. But don't trip yourself up with it if you're just getting started.

STC

Re: Questions about chess programming from a newbie

Posted: Sat Apr 25, 2015 11:39 pm
by hgm
zd3nik wrote:The bitboard engine is over twice as fast as the 0x88 engine, ...
This should also depend on the details of the 0x88 implementation. E.g. do you just generate all moves and then extract the captures, or do you have a separate capture generator that generates per victim, so that you can stop after a beta cutoff or due to futility? And do you do a scan over the board to determine whether slider captures are blocked, or do you just read the distance to the nearest blocker from a nearest-neighbor table?
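
As a rough sketch of that second option (the table names and the incremental update that maintains them are only illustrative, not taken from any existing engine):

Code:

/* viewDistance[dir][sq]: distance from 'sq' to the nearest occupied square
   (or the board edge) in direction 'dir', kept up to date incrementally;
   vector[dir] is the 0x88 step for that direction. */
extern int viewDistance[8][128];
extern int vector[8];

static int NearestBlocker(int sq, int dir)
{
    int target = sq + viewDistance[dir][sq] * vector[dir];
    if (target & 0x88) return -1;   /* view runs off the board: no blocker */
    return target;                  /* square of the first piece in this direction */
}

The caller then only has to inspect board[target] to see whether the blocker is a capturable enemy piece, without stepping over the empty squares in between.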

In my estimate a properly designed 0x88 implementation could be ~5 times faster than a 'naive' one.

I am thinking of defining a simple reference engine for comparing the speed of various implementations. Something like a fixed-depth full-width search followed by a capture-only search, and an evaluation consisting of Piece-Square Tables (including piece value) and mobility. It would be interesting to compare bitboard with an advanced 0x88 implementation. I would be disappointed if the 0x88 did not win by a large margin.
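
For reference, the evaluation of such an engine could be as small as this (a sketch only; the piece values are folded into the tables, and the mobility function and weight are placeholders):

Code:

#define EMPTY      0
#define MOB_WEIGHT 2                    /* placeholder weight                 */
enum { WHITE, BLACK };

extern int pst[16][128];                /* piece value + square bonus         */
extern int board[128];                  /* 0x88 board                         */
extern int Color(int piece);            /* WHITE or BLACK                     */
extern int Mobility(int sq);            /* move count of the piece on 'sq'    */

static int Evaluate(int stm)
{
    int score = 0;
    for (int sq = 0; sq < 128; sq++) {
        if (sq & 0x88) continue;                       /* skip off-board half */
        int piece = board[sq];
        if (piece == EMPTY) continue;
        int term = pst[piece][sq] + MOB_WEIGHT * Mobility(sq);
        score += (Color(piece) == WHITE) ? term : -term;
    }
    return (stm == WHITE) ? score : -score;            /* side-to-move view   */
}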

Re: Questions about chess programming from a newbie

Posted: Sun Apr 26, 2015 7:27 pm
by zd3nik
hgm wrote:
zd3nik wrote:The bitboard engine is over twice as fast as the 0x88 engine, ...
This should also depend on the details of the 0x88 implementation. E.g. do you just generate all moves and then extract the captures, or do you have a separate capture generator that generates per victim, so that you can stop after a beta cutoff or due to futility? And do you do a scan over the board to determine whether slider captures are blocked, or do you just read the distance to the nearest blocker from a nearest-neighbor table?

In my estimate a properly designed 0x88 implementation could be ~5 times faster than a 'naive' one.

I am thinking of defining a simple reference engine for comparing the speed of various implementations. Something like a fixed-depth full-width search followed by a capture-only search, and an evaluation consisting of Piece-Square Tables (including piece value) and mobility. It would be interesting to compare bitboard with an advanced 0x88 implementation. I would be disappointed if the 0x88 did not win by a large margin.
A reference of that type would be very useful, particularly if it includes implementation details. I'd also love to see a reference of 0x88-specific optimization tips and tricks, as would many other people, I'm sure.

I don't claim to have used the most optimal techniques in my 0x88 engine. Nor do I claim to have the most optimal bitboard implementation. For example, I don't use magic bitboards or rotated bitboards (though I have used those in the past). I use a technique I devised myself a few years ago (which is simpler and faster than rotated bitboards, but probably not as fast as magic bitboards). So needless to say, my bitboard implementation isn't any more of a perfect representation of the potential of bitboards than my 0x88 implementation is a perfect representation of 0x88.

But I do claim to have a near-perfect basis for comparison. The two engines I speak of are basically identical, design flaws and all. And simply swapping out the board representation from 0x88 to bitboard (which only affects move generation and positional evaluation) gave a 2-3x increase in nodes/second, even though the bitboard engine's positional evaluation is easily 10x more sophisticated than the 0x88 engine's in this case. I think that says something about the doors that bitboards open up in terms of performance.

STC

Re: Questions about chess programming from a newbie

Posted: Sun Apr 26, 2015 9:41 pm
by hgm
Well, if the incremental 0x88 implementation would speed it up by a factor of 5, the bitboards would not have opened up all that much...

Re: Questions about chess programming from a newbie

Posted: Sun Apr 26, 2015 10:46 pm
by zd3nik
hgm wrote:Well, if the incremental 0x88 implementation would speed it up by a factor of 5, the bitboards would not have opened up all that much...
A 5x increase by going to an incremental implementation? That would be something. Do you have any proof of this, or is it just your estimate? And are we talking about going from 100K nodes/sec to 500K, or something more like 500K to 2500K?

Here's the nps of Clubfoot, my 0x88 engine (which uses the naive approach):

Code:

info depth 17 seldepth 33 nodes 38271405 time 87422 nps 437777 score cp 8 pv e2e4 e7e6 g1f3 d7d5 e4d5 e6d5 d2d4 g8f6 f1b5 c7c6 b5d3 f8b4 c2c3 d8e7 c1e3 b4d6 e1g1 e8g8 f1e1 b8d7
And the bitboard engine:

Code:

info depth 17 seldepth 32 nodes 50882124 time 55796 nps 911931 score cp 32 pv e2e4 e7e6 d2d4 d7d5 b1c3 d5e4 c3e4 b8c6 g1f3 f7f5 e4c3 f8b4 d1d2 g8f6 f1b5 e8g8 b5c6 b7c6
If what you say is true, then clearly all I need to do is change my 0x88 engine to use the incremental approach and stop wasting my time with the bitboard engine. Is there any reference material available that defines what this incremental approach is? Or an open source engine that uses it?

It may sound like I'm being flippant, and I am to a degree, but I'm also being perfectly serious. If the incremental approach can give a 5x speed boost, or even a 1.5x speed boost, I'm definitely going to try it out.

Thanks,
STC

Re: Questions about chess programming from a newbie

Posted: Mon Apr 27, 2015 12:58 pm
by hgm
zd3nik wrote:A 5x increase by going to an incremental implementation? That would be something. Do you have any proof of this, or is it just your estimate? And are we talking about going from 100K nodes/sec to 500K, or something more like 500K to 2500K?
So far this is just an estimate; I would like to put it to the test by actually implementing this, in an experiment similar to what you did with 0x88 vs bitboard. I think 2500K should not be impossible.

My engine Joker, which is a 'naive' 0x88, but has no mobility evaluation, does ~1750Knps on a single i7 core in the opening position. With mobility you have to run the move generator twice more in every node, so that would slow it down to ~600Knps. But the incremental scheme would give you the mobility nearly for free as a side effect, and the incremental update of the capture set should be faster than a naive 0x88 capture generation. This is how I got the 5x estimate: 3x by saving on mobility, and the remaining factor of 1.5 because it is intrinsically faster to generate the captures.
Is there any reference material available that defines what this incremental approach is? Or an open source engine that uses it?
It was discussed here this month, and I also posted some tentative code to illustrate the idea. (This did not address the incremental update of mobility, however.) I am not aware of any engines that actually use this.

This is why I would want to make a 'reference implementation', where one can test the performance in real life. To save some work on nasty details, I will probably make it for a Chess variant that does not have castling, or e.p. capture (and perhaps promotion). After all, I am only interested in the nps it does, not in playing real games with it.

Re: Questions about chess programming from a newbie

Posted: Mon Sep 21, 2015 1:15 am
by zd3nik
hgm wrote:
zd3nik wrote:A 5x increase by going to an incremental implementation? That would be something. Do you have any proof of this, or is it just your estimate? And are we talking about going from 100K nodes/sec to 500K, or something more like 500K to 2500K?
So far this is just an estimate; I would like to put it to the test by actually implementing this, in an experiment similar to what you did with 0x88 vs bitboard. I think 2500K should not be impossible.

My engine Joker, which is a 'naive' 0x88, but has no mobility evaluation, does ~1750Knps on a single i7 core in the opening position. With mobility you have to run the move generator twice more in every node, so that would slow it down to ~600Knps. But the incremental scheme would give you the mobility nearly for free as a side effect, and the incremental update of the capture set should be faster than a naive 0x88 capture generation. This is how I got the 5x estimate: 3x by saving on mobility, and the remaining factor of 1.5 because it is intrinsically faster to generate the captures.
Is there any reference material available that defines what this incremental approach is? Or an open source engine that uses it?
It was discussed here this month, and I also posted some tentative code to illustrate the idea. (This did not address the incremental update of mobility, however.) I am not aware of any engines that actually use this.

This is why I would want to make a 'reference implementation', where one can test the performance in real life. To save some work on nasty details, I will probably make it for a Chess variant that does not have castling, or e.p. capture (and perhaps promotion). After all, I am only interested in the nps it does, not in playing real games with it.
I've been working on a new 0x88 engine for the sake of satisfying my curiosity about this. I'm currently calling this engine "Clunk".

It uses incrementally updated attack maps, as discussed in this thread. It's probably implemented very differently from what H.G. has in mind, but it's still the same basic idea.

It doesn't use the 0x88 trick during move generation, so it's not technically an 0x88 move generator even though it uses the 0x88 board representation. The primary move generator uses pre-compiled move vectors, so there is no need for the 0x88 trick for destination square validation. I wrote a couple of raw perft engines to test this, and the pre-compiled vectors performed better than the simple 0x88 approach.
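
For anyone curious what "pre-compiled move vectors" means, here is a rough sketch of the idea (the table layout is only illustrative, not Clunk's actual data structure): each piece/square pair indexes a list of rays whose entries are already valid board squares, so no 0x88 destination test is needed at generation time.

Code:

#include <stdint.h>

#define EMPTY 0

/* rays[piece][from] is a NULL-terminated list of rays; each ray is a list of
   destination squares in order of increasing distance, terminated by -1.
   Only on-board squares are ever stored. */
extern const int8_t *rays[16][128][9];
extern int board[128];
extern int Color(int piece);             /* color of a (non-empty) piece */

static int GenerateSliderMoves(int piece, int from, int stm, int *moves)
{
    int count = 0;
    for (const int8_t **ray = rays[piece][from]; *ray; ray++) {
        for (const int8_t *to = *ray; *to >= 0; to++) {
            int victim = board[*to];
            if (victim == EMPTY) {
                moves[count++] = (from << 8) | *to;   /* quiet move             */
                continue;
            }
            if (Color(victim) != stm)
                moves[count++] = (from << 8) | *to;   /* capture                */
            break;                                    /* ray blocked either way */
        }
    }
    return count;
}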

Anyway, getting back to the main point, I've got it incrementally updating attack vectors. So for any slider at any time I can determine the end point of its attacks in each direction. This makes calculating mobility for each slider trivial. And it makes capture move generation in quiescence search trivial as well.

Counter-intuitively, it is slightly faster to use the pre-compiled vectors rather than the attack maps for primary move generation. This is probably because the attack maps are organized by direction, so there can be gaps between attack vectors in those maps; the pre-compiled vector maps have no gaps. Perhaps my implementation is sub-optimal. But it does suffice for getting mobility info quickly as well as making capture generation very fast. The attack maps will also be useful for SEE calculation and positional evaluation, I'm sure.
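
To illustrate what the per-direction end points buy you (the layout and names below are only a guess for illustration, not necessarily how Clunk stores things): once the end point of each slider ray is maintained incrementally, mobility is just a distance sum and the quiescence captures are just a board lookup at each end point.

Code:

#define EMPTY 0

/* endPoint[piece][d]: furthest square reached by slider 'piece' along its
   d-th direction (the first blocker, or the last on-board square of the ray),
   kept up to date incrementally.  All names here are placeholders. */
extern int endPoint[32][8];
extern int dirCount[32];                 /* number of directions this piece has */
extern int location[32];                 /* square of each piece                */
extern int board[128];                   /* 0x88 board                          */
extern int Distance(int from, int to);   /* steps between two squares on a ray  */
extern int Color(int piece);

static int SliderMobility(int piece, int stm, int *captureSquares, int *nCaps)
{
    int from = location[piece], mobility = 0;
    *nCaps = 0;
    for (int d = 0; d < dirCount[piece]; d++) {
        int end = endPoint[piece][d];
        int dist = Distance(from, end);            /* squares seen this way       */
        int victim = board[end];
        if (victim == EMPTY) {
            mobility += dist;                      /* open ray to the board edge  */
        } else if (Color(victim) != stm) {
            mobility += dist;                      /* enemy blocker: capturable   */
            captureSquares[(*nCaps)++] = end;      /* QS capture target           */
        } else {
            mobility += dist - 1;                  /* own piece: don't count its square */
        }
    }
    return mobility;
}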

With positional evaluation currently being nothing more than material balance, mobility, and very basic pawn structure analysis (doubled, backward, and passed pawn scoring), this engine is averaging around 2 to 3 million nodes per second, where a "node" is the execution of a move. This is doing alpha/beta with PVS, IID, the killer move heuristic, the history heuristic, LMR, null move pruning, and quiescence search on an i5 laptop.

I've added some preliminary/naive positional evaluation functions and they easily take that down to about 1 million nodes per second. So I'm currently trying to find a good balance between smart positional evaluation and speed - and figure out faster ways to do some of the positional eval stuff.

For reference: on the same hardware, Clubfoot (my very simple/naive 0x88 engine) averages about 500K-800K nodes per second. Keep in mind Clubfoot is by no stretch a "highly optimized" implementation of an 0x88 engine and it doesn't do any mobility calculation in its positional eval.

So it seems that a speed-up of at least 1.5x to 2x (maybe more) is certainly possible by adding incrementally updated attack maps to an array-based engine, mainly due to the savings in capture move generation, I think.

Re: Questions about chess programming from a newbie

Posted: Mon Sep 21, 2015 11:16 am
by hgm
Indeed, usually some 85% of the tree nodes are QS nodes, so speeding up capture generation has a huge impact.

I still want to try a design where the entire captures part of the move list is kept not as a list but as a bit set, where 256 bits corresponding to potential captures (16 pieces x 16 victims) are assigned in MVV/LVA order, so that you can run through the captures by simply extracting the next 1 bit. (E.g. you could distribute this set over four 64-bit words, each holding all captures of 4 pieces: one for captures of KQRR, the second for those of BBNN, and then two words for captures of four Pawns each. This way you can make sure you will extract the capture with the lowest attacker of any minor first.)

You can then easily add and delete captures by setting or clearing the corresponding bit, without upsetting the sorting. You could then limit the capture generation to moves of the previously moved piece from its new location (to add them) and from its old location (to delete them), plus some elongation of unblocked moves, based on the existing slider captures to the from-square.

This is in general much less work than generating all captures from scratch. And the beautiful thing is that (apart from the slider unblocking) it only affects the captures of the side that now does not have the move. So you can postpone it to the point where you are actually sure that you are going to search some of the captures (i.e. no stand-pat cutoff, and no futility pruning). For knowing your own moves the capture set of the parent node would already be valid, after taking account of the slider unblocks (which also calculates their new mobility), except that you would have to mask away the pieces that are captured. So you basically moved the capture generation one level down, to the parent node, so that all its daughters can share it. And nodes with no daughters would never do it at all.
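
Updating such a set really is just bit arithmetic on the pair's index (a trivial sketch, assuming the 16x16 attacker/victim numbering described above, spread over four 64-bit words):

Code:

#include <stdint.h>

/* captureNr in 0..255 identifies an (attacker, victim) pair, assigned in
   MVV/LVA order; the set is spread over four 64-bit words. */
static void AddCapture(uint64_t set[4], int captureNr)
{
    set[captureNr >> 6] |= 1ULL << (captureNr & 63);     /* order preserved */
}

static void DeleteCapture(uint64_t set[4], int captureNr)
{
    set[captureNr >> 6] &= ~(1ULL << (captureNr & 63));
}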

Re: Questions about chess programming from a newbie

Posted: Tue Sep 22, 2015 6:29 am
by zd3nik
hgm wrote:Indeed, usually some 85% of the tree nodes are QS nodes, so speeding up capture generation has a huge impact.

I still want to try a design where the entire captures part of the move list is kept not as a list but as a bit set, where 256 bits corresponding to potential captures (16 pieces x 16 victims) are assigned in MVV/LVA order, so that you can run through the captures by simply extracting the next 1 bit. (E.g. you could distribute this set over four 64-bit words, each holding all captures of 4 pieces: one for captures of KQRR, the second for those of BBNN, and then two words for captures of four Pawns each. This way you can make sure you will extract the capture with the lowest attacker of any minor first.)

You can then easily add and delete captures by setting or clearing the corresponding bit, without upsetting the sorting. You could then limit the capture generation to moves of the previously moved piece from its new location (to add them) and from its old location (to delete them), plus some elongation of unblocked moves, based on the existing slider captures to the from-square.

This is in general much less work than generating all captures from scratch. And the beautiful thing is that (apart from the slider unblocking) it only affects the captures of the side that now does not have the move. So you can postpone it to the point where you are actually sure that you are going to search some of the captures (i.e. no stand-pat cutoff, and no futility pruning). For knowing your own moves the capture set of the parent node would already be valid, after taking account of the slider unblocks (which also calculates their new mobility), except that you would have to mask away the pieces that are captured. So you basically moved the capture generation one level down, to the parent node, so that all its daughters can share it. And nodes with no daughters would never do it at all.
Sounds like a cool idea. But I'm lost on how you associate this information with square positions. You're storing "what" but not "where", or I'm missing something. Do you have one of these 256-bit maps for every [occupied] square?

Re: Questions about chess programming from a newbie

Posted: Tue Sep 22, 2015 9:46 am
by hgm
There is just one 256-bit capture set per side (and a similarly formatted set of pseudo-captures). The system is piece-based rather than square-based. But the piece list (say location[pieceNr]) is available to tell us where a given piece stands, and board (board[sqr]) is available to look up what stands on a given square.

So there should be 'decoding tables' that convert a certain bit position in the set to the piece numbers of the attacker and the victim, like:

Code:

captureNr = ExtractLSB(captureSet[stm]);
pieceNr = attackerTable[captureNr];
victimNr = victimTable[captureNr];
from = location[pieceNr];
to = location[victimNr];
If the captureSet is in reality distributed over 4 words, you would have to loop over those words:

Code:

for(i=0; i<4; i++) {
  captureNr = 64*i + ExtractLSB(captureSet[stm][i]);
  pieceNr = attackerTable[captureNr];
  victimNr = victimTable[i][captureNr];
  from = location[pieceNr];
  to = location[victimNr];
}
If you want to elongate the sliders unblocked by the move, you would do something like:

Code:

pieceNr = board[from]; // you probably already have this
index = indexTable[pieceNr]; // tells in which of the 4 words attacks on this piece are
mask = victimMaskTable[index][pieceNr]; // tells which 16 bits in this word are associated with the piece as victim
// first extend enemy sliders
blockedSet = captureSet[xstm][index] & mask & sliderMask & presenceMask;
while( (captureNr = ExtractLSB(blockedSet)) ) {
  int slider = attackerTable[captureNr]; // enemy slider attacking the from-square
  int origin = location[slider]; // from where the attack is coming
  int direction = dirTable[from-origin]; // direction of slider attack
  int step = vectors[direction]; // elementary board step in this direction
  int dist = viewDistance[direction][from]; // free view behind from-square in this direction
  int target = from + dist*step; // new end-point of slider move
  if(OnBoard(target)) {
    int newVictim = board[target]; // piece now attacked by slider
    int victimIndex = indexTable[newVictim]; // word of captureSet where attacks on this victim are flagged
    if(Color(newVictim) == stm) { // enemy slider attacks our piece
      captureSet[xstm][victimIndex] |= victimMaskTable[victimIndex][newVictim] & attackerMaskTable[victimIndex][slider];
    } else { // enemy slider protects own piece
      protectorSet[xstm][victimIndex] |= victimMaskTable[victimIndex][newVictim] & attackerMaskTable[victimIndex][slider];
    }
  }
}
// now repeat for friendly sliders
blockedSet = protectorSet[stm][index] & mask & sliderMask & presenceMask;
... // etc
This looks like a lot of code, but the secret of course is that pins and discovered attacks are not very common on the board, and most pieces do not unblock anything when moved, so that the bodies of the while loops are hardly ever executed.

The code is also largely branch-free, and could be made even more branch-free by not declaring captureSet[2] and protectorSet[2] as separate arrays, but merging them into a single captureSet[2][2]. That way the code that eventually adds the elongated slider move to the set can always write

captureSet[Color(newVictim)][Color(slider)] |= ...

(And as a further optimization, the OnBoard test could be removed by surrounding the board with GUARD pieces, where Color(GUARD) would emerge as 2, and extending the array to captureSet[3][2] with two extra dummy elements for attacks on boundary guards.)