Are Bitboards More Intoxicating Than They Are Good?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by hgm »

mar wrote: Wed Feb 24, 2021 3:51 pm the primary reason why I absolutely love bitboards is that you can recompute a lot of things from scratch very quickly without having to drag around some extra state in your board representation and/or do complicated incremental updates in your make/unmake.
But simple is not the same as fast. I dislike recalculating things from scratch all the time. If it requires extra state to avoid that, so be it.

Of course my opinion is slightly colored by the fact that I often deal with games with larger board, which often also have very many pieces. Redoing everything from scratch can then easily be one or two orders of magnitude slower.

Pop-count is nice (i.e. before CPUs supported popcount calculating mobility was pretty much a disaster even with bitboards), but you still have to do it for every piece. I'd rather do it only for the pieces that change their mobility. Not so many of those.
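
To make that concrete, here is a minimal from-scratch mobility sketch (names hypothetical; it assumes a precomputed bishop_attacks() and the GCC/Clang popcount builtin):

Code: Select all

/* Minimal sketch of from-scratch mobility with popcount (names hypothetical).
 * bishop_attacks(sq, occ) is assumed to exist, e.g. a magic or SISSY lookup.
 * An incremental scheme would only redo the pieces whose attack sets were
 * actually changed by the last move, instead of looping over all of them. */
#include <stdint.h>

typedef uint64_t u64;

u64 bishop_attacks(int sq, u64 occ);           /* assumed, not shown */

int bishop_mobility(u64 bishops, u64 occupied, u64 own)
{
    int score = 0;
    while (bishops) {
        int sq = __builtin_ctzll(bishops);     /* index of the lowest set bit */
        bishops &= bishops - 1;                /* clear that bit */
        score += __builtin_popcountll(bishop_attacks(sq, occupied) & ~own);
    }
    return score;
}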
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by Sven »

hgm wrote: Wed Feb 24, 2021 2:58 pm Sure, but the point I tried to make was that it is not easy to affect the speed of a precisely specified algorithm, while there can be an enormous difference in efficiency between different algorithms.

Although I confess that contradicting you has become kind of a habit :lol: , I thought it important to point out that you presented a comparison between one of the most advanced bitboard methods (calling it 'standard') and a very simplistic mailbox implementation, which might be 5-10 times slower than what could be achieved with state-of-the-art mailbox techniques. Otherwise the unsuspecting reader could easily have concluded that your observation reflected an intrinsic, general speed difference between these board representations.
Well, I would not call the mailbox implementation I had in Jumbo "simplistic". I can't tell whether it is really "5-10 times slower than ... state-of-the-art mailbox techniques"; it is possible, but I would certainly prefer a comparison to a mailbox engine in the Elo 2400-2600 range. Furthermore, some speed penalties in Jumbo (also in Perft) may result from code that is common to both board representations, so both versions would benefit from making those parts faster.

Here are some numbers I quickly copied from running the most recent public Jumbo version 0.7.0:

Code: Select all

Jumbo 0.7.0 (Mailbox)

Perft 6:
# Time elapsed: 1.61 s
# Nodes/second: 1195797

Search initial position to fixed depth 13 (time in centiseconds):
# total nodes = 13779874, QS nodes = 9029536 (65%), time = 1383, depth = 13, selDepth = 34, nps = 996375


Jumbo 0.7.0 (Bitboards)

Perft 6:
# Time elapsed: 1.13 s
# Nodes/second: 1703746

Search initial position to fixed depth 13 (time in centiseconds):
# total nodes = 13321155, QS nodes = 8298263 (62%), time = 924, depth = 13, selDepth = 38, nps = 1441683
So my "20-40%" claim was not exaggerated ...

Note that NPS in Perft is "visited nodes per second", not "counted leaves per second", which would be much higher due to bulk counting.
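
To make the distinction concrete: with bulk counting the moves at the last ply are only generated and counted, never made, so those leaves are never "visited". A minimal sketch (the move-list API and its types are hypothetical):

Code: Select all

/* Sketch only: MoveList, generate_moves(), make() and unmake() are
 * hypothetical stand-ins for the engine's own primitives. */
typedef unsigned long long u64;
typedef struct { int moves[256]; int count; } MoveList;   /* hypothetical */

int  generate_moves(MoveList *list);    /* assumed to return only legal moves */
void make(int move);
void unmake(int move);

u64 perft_bulk(int depth)
{
    MoveList list;
    int n = generate_moves(&list);
    if (depth == 1)
        return (u64)n;                  /* bulk counting: leaves are never made */
    u64 nodes = 0;
    for (int i = 0; i < n; i++) {
        make(list.moves[i]);
        nodes += perft_bulk(depth - 1);
        unmake(list.moves[i]);
    }
    return nodes;
}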
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by hgm »

OK, perhaps 'simplistic' is a bit of an overstatement. You probably did have a piece list, so it could have been worse; then at least you don't have to scan the board to find your own pieces. But making ray scans of the board to generate slider captures or determine slider mobility is also a very serious slowdown (plus the generated captures come out in an unpredictable order, which hurts you a second time during move sorting). And you probably did do that.

BTW, bitboard representations usually also carry around redundant extra game state: most implementations use 'occupied', 'white' and 'black' bitboards, which could be derived from the individual piece bitboards. As long as the redundant info saves you much more work than it takes to update it, it is worth keeping around. Minimizing the game-state representation is not a goal in itself.
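
A minimal sketch of that kind of redundant-but-cheap bookkeeping for a quiet move, with hypothetical structure and field names:

Code: Select all

/* Sketch (hypothetical structure and field names): for a quiet move, two
 * extra XORs keep the redundant occupancy boards in sync; recomputing
 * 'occupied' from all twelve piece bitboards on every use would cost far
 * more than this tiny update. */
#include <stdint.h>
typedef uint64_t u64;

typedef struct {
    u64 piece_bb[12];    /* one bitboard per piece type and colour */
    u64 by_color[2];     /* redundant: all white / all black pieces */
    u64 occupied;        /* redundant: all pieces */
} Board;

void make_quiet_move(Board *b, int piece, int color, int from, int to)
{
    u64 fromto = (1ULL << from) | (1ULL << to);
    b->piece_bb[piece] ^= fromto;
    b->by_color[color] ^= fromto;
    b->occupied        ^= fromto;
}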
Mike Sherwin
Posts: 867
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by Mike Sherwin »

mvanthoor wrote: Wed Feb 24, 2021 10:49 am
Mike Sherwin wrote: Wed Feb 24, 2021 10:03 am I agree. It is my experience that bitboards are not fast.
In my (limited) experience, they are easier to make fast, though.

When programming an engine that uses an array with numbers in it for the board representation, you're always looping around the board and pieces; you'll have to create extra features to optimize this, such as piece lists.

With a bitboard engine, I find things to be easier, at least for my brain. If I want to know the moves of a knight on a certain square, I can just look it up in the knight/square lookup table, and it'll return a bitboard with all valid moves for that knight on that square. If I want to know if a file is open, I can just do "if occupancy & <mask_for_file> == 0 { ... }", and I'll know.

You can cache/store many things for lookup in both bitboard and mailbox engines, but in my experience, when working with bitboards, it's easier to keep things fast.

PS: I think you're using some strange way of measuring nodes/sec. 25 MNps is REALLY fast. Not even engines like Stockfish can reach that (or have a very hard time reaching it) on current-day hardware when running on a single thread. My own engine runs perft at 36-40 million leaves/sec on my computer, but it "only" searches at 3-5 M nodes/sec depending on the position. Search is about 10 times slower than perft. (And I only do material counting and psqt evaluation, and both are even incremental.)
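
To illustrate the lookups described in the quote above, a minimal sketch (table names hypothetical; the knight table would be filled once at startup):

Code: Select all

/* Sketch of the lookups described above (names hypothetical).
 * KNIGHT_ATTACKS[64] would be precomputed once at program start. */
#include <stdint.h>
typedef uint64_t u64;

u64 KNIGHT_ATTACKS[64];                        /* precomputed per square */
static const u64 FILE_A = 0x0101010101010101ULL;

u64 knight_moves(int sq, u64 own_pieces)
{
    return KNIGHT_ATTACKS[sq] & ~own_pieces;   /* one load, one mask */
}

int file_is_open(u64 occupancy, int file)      /* file = 0..7; in practice one
                                                  usually tests the pawn boards */
{
    return (occupancy & (FILE_A << file)) == 0;
}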
In Carnivor only material counting and psqt evaluation are done, and it is also incremental. And every time a move is made in search the node counter is incremented. So, for example, searching to 13 ply deep takes 17.42 seconds and 448,943,111 moves are made. On my calculator 448943111 / 17.42 = 25,771,705 moves per second. Am I doing something wrong? My search is about 3 times slower than perft in Bricabrac. In some of your code snippets that I've looked at I remember seeing lots of "if" statements. In my code there are very few if statements. For example, in my make and unmake move routines there is not a single "if" statement and only one switch statement each. In all my pawn move generation and make/unmake there is not a single "if" statement. I'll show the code snippets for a pawn on the 5th rank (cep).

Code: Select all

// generate pawn moves on the 5th rank for white
case RANK5:
        // quiet pushes plus captures; the ep target square is OR-ed into the
        // enemy set, so an en-passant capture falls out of the same expression
        bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & (enemy | epbb[ply]));
        type = We; // White en passant or 5th rank pawn
        break;

// make
case We:
    // branchless: if the target square is the ep square, the captured pawn
    // sits one rank below the target (ts - 8); otherwise sq == ts
    sq = m->ts - ((epbb[ply] == (one << m->ts)) << 3);
    ctype = board[sq];                         // captured piece type (OO = empty)
    mat[BLACK] -= value[ctype];                // incremental material update
    pos[WHITE] += (wpPST[m->ts] - wpPST[m->fs]);   // incremental PST update
    pos[BLACK] -= pcePST[ctype][m->ts];
    piece[BLACK] ^= (u64)(ctype != OO) << sq;  // clear the bit only if something was captured
    board[sq] = OO;
    board[m->ts] = WP;
    break;

// take back
  pos[WHITE] = possav[WHITE][ply];   // restore PST scores saved before the move
  pos[BLACK] = possav[BLACK][ply];

case We:
    sq = m->ts - ((epbb[ply] == (one << m->ts)) << 3);
    board[m->ts] = OO;
    board[sq] = ctype;                         // put the captured piece (or nothing) back
    piece[BLACK] ^= (u64)(ctype != OO) << sq;  // re-set its occupancy bit if needed
    board[m->fs] = WP;
    break;
    
If statements are the bane of high performance. Mispredictions are expensive. The more "if" statements there are, the easier it is to overflow the branch prediction tables. And the constantly changing state of a chess engine makes for a high misprediction rate.
Mike Sherwin
Posts: 867
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by Mike Sherwin »

Daniel Anulliero wrote: Wed Feb 24, 2021 11:50 am Hi Mike
Good post anyway, as I'm writing a bitboard Isa 😊
Well, yesterday I did some speed measurements, with the eval this time.
With material, psqt, mobility, piece attacks on the king, bishop pair, castle safety (checking where the 3 pawns are) and pawn eval: passers, isolated, doubled.
When I run the eval() function 10 million times, I get a 20% speedup with the bitboard version (12 seconds vs 15 for the "array" Isa).
It seems to be a linear speedup, because when I run the eval 1 million times I get 1.2 and 1.5 seconds.
Is +20% good? Well... better than nothing 😊
Thanks Danny, good to see you! :D Please do not get upset with me, but I can guess why you got such a speedup. It is my guess that you use far fewer "if" statements in your bitboard code. Am I right? Good luck with your new bitboard Isa! 8-)
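
For reference, the kind of timing loop Danny describes comes down to the following; this is only a hypothetical harness, assuming an eval() function and a position that has already been set up:

Code: Select all

/* Hypothetical timing harness of the kind described above. */
#include <stdio.h>
#include <time.h>

int eval(void);                              /* the function under test, assumed to exist */

int main(void)
{
    const long iterations = 10000000L;       /* 10 million calls, as in the post */
    volatile long sink = 0;                  /* keeps the loop from being optimized away */
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        sink += eval();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("%ld evals in %.2f s (sink=%ld)\n", iterations, seconds, (long)sink);
    return 0;
}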
Mike Sherwin
Posts: 867
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by Mike Sherwin »

Sven wrote: Wed Feb 24, 2021 1:09 pm
Mike Sherwin wrote: Wed Feb 24, 2021 9:48 am
Sven wrote: Wed Feb 24, 2021 9:17 am When comparing the raw speed of two different engines, there may be a lot more to consider than just mailbox vs bitboards. So is everything else really "the same" between Carnivor and Bricabrac? E.g. search features, move ordering, node counting, PST implementation, legality checking, incremental vs. from scratch calculations, data structures, piece encoding, programming style, compiler, compiler options, ...?
Very similar. Except Carnivor was compiled with Turbo C 1.0 and Bricabrac was compiled with MSVC 2019 Community. They both generate moves and put them in a huge stack, with begin and end kept in arrays indexed by ply. The biggest difference is that Bric is already set up for mp, and therefore all data is in a thread structure, whereas Carnivor's data is all global arrays. I get what you are saying: 'there are too many variables you have no knowledge of, so how could you possibly answer the question'. But you also do not add an opinion based on your own experience! There are, however, some people that do have opinions based on experience, and giving them might help form a consensus.
Here is my experience. Jumbo started as a pure mailbox engine. Later I added a bitboard implementation. It was a lot of new code, but I decided to keep the mailbox code and to maintain two different binaries: Jumbo with mailbox and Jumbo with bitboards, built via #ifdef from exactly the same codebase. A few source files were split into two files for better readability, but most sources were either unaffected or had a very small number of #ifdef-s related to the different board representations.

It was quite some work to get that running, and it also required some abstractions here and there ... but once it worked, I was able to do a real performance comparison. My result was: bitboards were significantly faster, by some 20-40% I think; I don't remember the details anymore... No surprise here of course, that was the expected and desired result. I admit that my mailbox implementation could have been improved a lot. But, same developer and same style, so the same would apply to my bitboard code! The difference might be that it could be harder to make standard (magic) bitboard code slow than to do the same with mailbox code.

One disadvantage of my approach was a bigger maintenance effort whenever I made changes on the lower levels that were related to the board representation. For instance, adding checks in QS required adding move generator code (generating quiet checks) in two versions. Initially I accepted that challenge, but later I became lazy, and that led to the mailbox version falling behind in strength due to omitting some features that I only implemented for bitboards. That was especially true for a few new eval features. At some point I decided to drop the mailbox support in Jumbo and made it bitboards-only. The goal of getting a realistic performance comparison was reached, with virtually everything from <copy the list from my post above> being identical, so we are only talking about differences that are actually related to the board representation.

In your case I could imagine that differences in data structures (global variables vs. threads) may have a significant impact on speed, although that would not explain 25 vs. 10 Mnps... Perhaps your bitboard implementation still has some room for a big speedup?
Well, I'm using my SISSY bitboards, lol. But I know they can be optimized to be at least twice as fast as they are now. Anyway, 10 million nodes per second searched from the original position? How much more room can there be? That is a real question!
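
As an aside, the dual-build setup Sven describes above comes down to a single compile-time switch; a toy sketch with hypothetical names:

Code: Select all

/* Sketch (all names hypothetical) of the compile-time switch behind the
 * dual-build approach described above.  Build with e.g.
 *   cc -DUSE_BITBOARDS demo.c      or plain      cc demo.c                 */
#include <stdio.h>

#ifdef USE_BITBOARDS
typedef unsigned long long Board;                    /* toy stand-ins only */
static int count_pieces(Board b) { return __builtin_popcountll(b); }
#else
typedef struct { char sq[64]; } Board;
static int count_pieces(Board b)
{
    int n = 0;
    for (int i = 0; i < 64; i++) n += (b.sq[i] != 0);
    return n;
}
#endif

/* The shared layers (search, eval framework, UCI) call the same interface
 * and never see which representation was compiled in. */
int main(void)
{
    Board b = {0};
    printf("pieces on an empty board: %d\n", count_pieces(b));
    return 0;
}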
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by mvanthoor »

Mike Sherwin wrote: Wed Feb 24, 2021 6:54 pm In Carnivor only material counting and psqt evaluation are done, and it is also incremental. And every time a move is made in search the node counter is incremented. So, for example, searching to 13 ply deep takes 17.42 seconds and 448,943,111 moves are made. On my calculator 448943111 / 17.42 = 25,771,705 moves per second. Am I doing something wrong? My search is about 3 times slower than perft in Bricabrac. In some of your code snippets that I've looked at I remember seeing lots of "if" statements. In my code there are very few if statements. For example, in my make and unmake move routines there is not a single "if" statement and only one switch statement each. In all my pawn move generation and make/unmake there is not a single "if" statement. I'll show the code snippets for a pawn on the 5th rank (cep).

...

If statements are the bane of high performance. Mispredictions are expensive. The more "if" statements there are, the easier it is to overflow the branch prediction tables. And the constantly changing state of a chess engine makes for a high misprediction rate.
Seems you're not doing anything wrong; you're probably counting nodes differently. If your engine indeed does 25 Mnps on a single thread while counting the same way as, say, Stockfish, it would be 10 times faster than what I typically see on my computer.

With regard to if-statements... if you think my code has a lot of them, you should look at some other engines. My engine uses if and switch statements sparingly, compared to what I normally see.

There is such a thing as branchless programming. One day I'll have to look into it to see if there are places in my engine where I can use it.
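
The core trick is to turn the condition into a 0/1 value or an all-ones mask and fold it into the arithmetic, just like the en-passant line in Mike's snippet quoted earlier; a minimal sketch:

Code: Select all

/* Minimal branchless sketch: the comparison yields 0 or 1, which is turned
 * into an all-zeros/all-ones mask and folded into the arithmetic instead of
 * being taken as a jump -- the same idea as
 *   sq = m->ts - ((epbb[ply] == (one << m->ts)) << 3);   from the snippet above. */
#include <stdint.h>

int max_branchy(int a, int b)           /* conditional jump the predictor can miss */
{
    return (a > b) ? a : b;
}

int max_branchless(int a, int b)
{
    int32_t mask = -(int32_t)(a > b);   /* 0xFFFFFFFF if a > b, else 0 */
    return (a & mask) | (b & ~mask);
}

(Modern compilers will often emit a conditional move for the ternary version anyway, so measuring is the only way to know whether rewriting really pays off.)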
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Mike Sherwin
Posts: 867
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by Mike Sherwin »

mvanthoor wrote: Wed Feb 24, 2021 3:23 pm With regard to bitboards, and chess engine programming in general, I think we must also take into account that Mike has a penchant for writing innovative and experimental code. He has an entire class of bitboards named after him (how cool is that?!) and some other types of bitboards are based on those.

In chess programming, only the fastest techniques survive. If someone comes up with an awesome way to create a bitboard move generator, and it turns out to be 10% slower than magic bitboards, nobody is going to use it.

So I wouldn't even be surprised if Mike's Bricabrac engine uses some sort of experimental type of bitboards never seen before... but probably they're not as fast as a bitboard engine could have been with established techniques.
Thanks for the kind words! :oops: Yes, I'm using my new SISSY (Split Index Super Set Yielding) bitboards. And yes, they are about 10% slower than magic. They will be faster than magic once I figure out how to do the initialization that will fold them into fewer ranks for fewer lookups. Also SISSY gives some C/C++ compilers fits. Only Clang does a really good job at optimizing them. But, I get one error when I use Clang in MSVC 2019 that I do not know how to fix.
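
For comparison, the "established" magic lookup being referred to comes down to one masked multiply and one table read per slider; a minimal sketch, with the masks, magic factors, shifts and attack tables assumed to be precomputed at startup:

Code: Select all

/* Sketch of the standard (plain) magic-bitboard lookup; the masks, magic
 * factors, shifts and per-square attack tables are assumed precomputed. */
#include <stdint.h>
typedef uint64_t u64;

u64  rook_mask[64];       /* relevant-occupancy mask per square */
u64  rook_magic[64];      /* precomputed magic multipliers */
int  rook_shift[64];      /* 64 minus the number of relevant bits */
u64 *rook_table[64];      /* per-square attack tables */

u64 rook_attacks(int sq, u64 occupied)
{
    u64 index = ((occupied & rook_mask[sq]) * rook_magic[sq]) >> rook_shift[sq];
    return rook_table[sq][index];
}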
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by mvanthoor »

Mike Sherwin wrote: Wed Feb 24, 2021 7:31 pm Thanks for the kind words! :oops: Yes, I'm using my new SISSY (Split Index Super Set Yielding) bitboards. And yes, they are about 10% slower than magic. They will be faster than magic once I figure out how to do the initialization that will fold them into fewer ranks for fewer lookups. Also SISSY gives some C/C++ compilers fits. Only Clang does a really good job at optimizing them. But, I get one error when I use Clang in MSVC 2019 that I do not know how to fix.
Don't you _DARE_ make any bitboards faster than magics, because then I'd have to rewrite my move generator. You don't want to be responsible for causing weeks of hardship and suffering to other people, do you? Please keep this new bitboard variant 1% slower than magics. (Add a few useless statements that the compiler can't remove, if you have to.) Thanks for your consideration.

Just kidding of course :wink:

I can't help with Clang under Visual Studio. When running VS on Windows (mainly for work), I only use C# and the MS compiler. When using Clang in Windows, it's in the MSYS2 command line.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Are Bitboards More Intoxicating Than They Are Good?

Post by hgm »

So what would you say if I produced a mailbox engine that was 30% faster than magic bitboards? :roll: