Speedup with bitboards on 64-bit CPUs

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob »

Tord Romstad wrote:
frankp wrote:This strikes a cord. I was writing a new 0x88 program for the Palm z22 I bought recently, but it was just too slow (on the pc) so I have gone back to bitboards mainly because this is the way I think. Eval on 0x88 was so hard. I guess it is familiarity that matters. There seems no intrinsic reason why 0x88 would not be as fast. (Bitboards will be no good for the z22 of course.).
In my opinion, 0x88 is obsolete. It is similar to mailbox, but less intuitive and somewhat slower, without offering any advantages in compensation.

Personally, I can neither think in terms of bitboards nor in terms of mailbox boards. The only way I can think about chess programming is in terms of higher-level terms. I try to isolate the the parts of my program that make use of the internals of my board representation to a few low-level functions, and keep the interesting parts of my search and evaluation agnostic to the board representation. Most of my code is exactly the same with bitboards and mailbox.
To answer your question, IIRC my old bitboard program got the order of 50% speedup moving to amd64 (gcc/Linux), but does not use rotated bitboards.
Not bad! If my results turn out to be similar, it is likely that my bitboard version will be faster than the mailbox version on a 64-bit CPU.

Thanks to everyone for the answers! I'm getting more optimistic now.

Tord
I'm confused. 0x88 _is_ a "mailbox" approach. The term "mailbox" comes from treating an array like a series of mailboxes, each element has a unique address. Any array-based approach that uses one array element per board square is a "mailbox implementation".

As far as being obsolete, I'm not sure why. It has its interesting points, such as any two squares on the same ray have a common 'offset' that can be tested easily to answer questions about attacks and blocking rays. And it provides a cute way of avoiding walking off the edge of the board or wrapping around.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Speedup with bitboards on 64-bit CPUs

Post by Michael Sherwin »

I'm confused. 0x88 _is_ a "mailbox" approach. The term "mailbox" comes from treating an array like a series of mailboxes, each element has a unique address. Any array-based approach that uses one array element per board square is a "mailbox implementation".
In TSCP the actual board is a 64 element array, however, the board for calculating moves is 10 x 12. Therefore a mailbox array needs to be set up to translate squares of one board representation to the other. So, real_square = mailbox120[pseudo_square] or pseudo_square = mailbox64[real_square]. That is TSCP's definition of mailbox.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Speedup with bitboards on 64-bit CPUs

Post by hgm »

Yes, I noticed that the term 'mailbox' is used in different ways. In TSCP it means something very strange, and I was never able to figure out what. Most people seem to use it for an array-based representation that has a guard-band around the board of immovable uncapturable pieces, so that no special coding has to be done to intercept over-the-edge moves, but the don't-capture-own-piece code handles the board edge as well.
Guetti

post gone??

Post by Guetti »

Yesterday there was a posting here (actually a reply to Bob) about finding pieces inbetween two squares for bitboards with a link to a posting of Gerd in the winboard forum, somehow it seems to have misterioulsy disappeared now. Either that or I had very strange dreams last night.
Unfortunately I don't remember who was the poster, but since it was very late last night, I decided to read it the next day. Where hast that posting gone??? Does sombody still have the link to the winboard thread?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob »

Michael Sherwin wrote:
I'm confused. 0x88 _is_ a "mailbox" approach. The term "mailbox" comes from treating an array like a series of mailboxes, each element has a unique address. Any array-based approach that uses one array element per board square is a "mailbox implementation".
In TSCP the actual board is a 64 element array, however, the board for calculating moves is 10 x 12. Therefore a mailbox array needs to be set up to translate squares of one board representation to the other. So, real_square = mailbox120[pseudo_square] or pseudo_square = mailbox64[real_square]. That is TSCP's definition of mailbox.
That's broken. The classic definition of "mailbox" is one array element per square on the board. Like a business mailbox that has rows and columns of boxes where mail is placed...

Looks like yet another case of terminology corruption...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speedup with bitboards on 64-bit CPUs

Post by bob »

hgm wrote:Yes, I noticed that the term 'mailbox' is used in different ways. In TSCP it means something very strange, and I was never able to figure out what. Most people seem to use it for an array-based representation that has a guard-band around the board of immovable uncapturable pieces, so that no special coding has to be done to intercept over-the-edge moves, but the don't-capture-own-piece code handles the board edge as well.
I don't think you even need the "guard squares" based on old literature. Just an array of at least 64 elements for a 64-square board. More elements is OK, less is not however as then you have to compress more information into one element than is the property of one square. Bitboards is an example of this compression, which causes it to not be a "mailbox" approach..
Shaun
Posts: 322
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: Speedup with bitboards on 64-bit CPUs

Post by Shaun »

Tord,

a couple of old examples on a Core 2 Duo (e6300 at 2186mhz).

- dual boot XP Pro and XP Pro 64-bit

256mb hash

Rybka 2.1c

64-bit OS

32-bit - Depth: 20 00:15:08 113900kN
64-bit - Depth: 20 00:08:29 113900kN

Nothing short of stunning although Vas has mentioned he writes for 64-bit with no consideration for 32-bit performance and it is up to the compiler to do its best.

32-bit OS

Depth: 20 00:14:57 113900kN

Naum 2.0

64-bit OS

32-bit - Depth: 20/51 00:19:14 1139437kN
64-bit - Depth: 20/51 00:18:29 1139437kN

32-bit OS

Depth: 20/51 00:19:01 1139437kN

I noticed a 1-2% hit for running a 32-bit engine under the 64-bit OS rather than a native 32-bit OS.

Recently I tested Loop 13.5 - I do not have the figures at hand (actually I think I posted them somewhere on the forum) anyway I found the 32/64-bit versions to run almost identically with the 32-bit compile very slightly faster. My guess was that the 64-bit compile was better optimised for AMD?

Shaun
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: post gone??

Post by Gerd Isenberg »

Guetti wrote:Yesterday there was a posting here (actually a reply to Bob) about finding pieces inbetween two squares for bitboards with a link to a posting of Gerd in the winboard forum, somehow it seems to have misterioulsy disappeared now. Either that or I had very strange dreams last night.
Unfortunately I don't remember who was the poster, but since it was very late last night, I decided to read it the next day. Where hast that posting gone??? Does sombody still have the link to the winboard thread?
Yes, i saw that posting as well. It was about my proposal to shrink the 32K-lookup to 1/2K with the assumption sq1 != sq2 and sq1 and sq2 share a "sliding" ray. Since delta of 7 may occur on a rank as well on an antidiagonal, file-delta of a rank is decremented by "sameRank" from 1..7 to 0..6. The second routine of a recent thread here with pure caclulation is to slow, but usefull to initialize the arrays...

Code: Select all

u64 inbetweenByDiff[64];

u64 inBetween(u32 sq1, u32 sq2)
{
    int xdelta = sq2 ^ sq1;
    int delta  = sq2 - sq1;
    int sameRank = (xdelta - 8) >> 31; // {0,-1}
    int negaDiff = delta & (delta >> 31);
    sq1 += negaDiff; // min
    sq2 -= negaDiff; // max
    return inbetweenByDiff&#91;sq2-sq1+sameRank&#93; << sq1;
&#125;

Code: Select all

u64 inBetweenWithoutLookup&#40;u32 sq1, u32 sq2&#41;
&#123;
    const u64 o = 1, m1 = -o; 
    const u64 a2a7 = 0x0001010101010100;
    const u64 b7h1 = 0x0002040810204080;
    u64 btwnbits, raybits; 
    u32 rankDiff, fileDiff, antiDiff, diaxDiff;	

    btwnbits  = &#40;o<<sq1&#41; -  o;
    btwnbits ^= &#40;o<<sq2&#41; -  o;
    rankDiff  =(&#40;sq2 |7&#41; -  sq1&#41;>>3;
    fileDiff  = &#40;sq2 &7&#41; - &#40;sq1 &7&#41;;
    antiDiff  = rankDiff + fileDiff;
    rankDiff  = rankDiff &  15;
    fileDiff  = fileDiff &  15;
    antiDiff  = antiDiff &  15;
    diaxDiff  = rankDiff ^ fileDiff;
    raybits   =(&#40;rankDiff-1&#41;>>26&#41;*2;
    raybits  |= &#40;m1+fileDiff&#41;& a2a7;
    raybits  |= &#40;m1+antiDiff&#41;& b7h1;
    raybits  |= _byteswap_uint64&#40;&#40;m1+diaxDiff&#41; & b7h1&#41;;
    raybits  *= btwnbits &-btwnbits;
    return      raybits  & btwnbits;
&#125;
Gerd
Guetti

Re: post gone??

Post by Guetti »

Thanks, Gerd.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: post gone??

Post by hgm »

Gerd Isenberg wrote:Yes, i saw that posting as well. It was about my proposal to shrink the 32K-lookup to 1/2K with the assumption sq1 != sq2 and sq1 and sq2 share a "sliding" ray.

...
Depending on how you obtain that prior knowledge, you might arrange it such that it is also guaranteed that sq1 and sq2 are not neighboring squares on an anti-diagonal either, as in that case there are no in-between squares anyway. There is thus no reason to treat it differently from cases where sq1 and sq2 are not aligned on a ray. The disambiguation of that and the same-rank case would then no longer be needed.

Also here the 'rotate' instructions could be useful, as you don't rely on bits being clipped off as the tabulated templates are shifted in position. That would save you ordering the squares. You would need twice as large a table, though.