hyperbola-quintessence

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

hyperbola-quintessence

Post by Gerd Isenberg »

This is somehow a resurrection of the reversed bitboards idea.
Really amazing how well the bswap-approach based on o^(o-2r) performs with Aleks's improvements. To get sliding attack-sets on files and diagonals.

http://www.talkchess.com/forum/viewtopi ... 14&t=16002

Assume following masked file or diagonal - for simplicity as a flat byte (in a real bitboard with masked files or diagonals you have 6..8 scratch-bits between the bits of this byte). Thus bswap reverses the bits of this byte. The first "subtraction" is done implicitly by masking off the ray, removing the slider from the occupied set:

Code: Select all

       normal    reversed
 o     11010101  10101011 o' occupancy including slider
 r     00010000  00001000 r' slider
 o-r   11000101  10100011 o'-r'  1. sub clears the slider
 o-2r  10110101  10011011 o'-2r' 2. sub borrows "one" from next blocker in "msb"-direction
       |......|  \....../
normal 10110101   \..../
       11011001 <--XXXX   "unreverse"
single 
xor    01101100 -> to get the attack set
The second subtraction borrows a "one" from the next nearest blocker in msb-direction. Of course, if no blocker is available, it borrows a "one" in usual arithmetical manner from the overflow-bit 2**N. Only the changed bits (from original o, o') are the appropriate sliding attacks, including the blocker but exluding the slider. The fine optimization by Aleks is to safe two xors, since the changed bits, after "unreversing", are obviously always disjoint.

Code: Select all

rayattacks &#58;=  o^&#40;o-2r&#41; ^ reverse&#40;&#40;o'-2r')^o')
rayattacks &#58;=    &#40;o-2r&#41; ^ reverse &#40;o'-2r')
Beside shorter code this reduces register pressure - and clearly outperforms kindergarten bitboards - ipc-wise, in code size, memory requirements and elegance. With only 2KByte lookup it comes close to magic-bishops in a dumb snoob-test where the 38K-lookup stays in L1 as well.

I'll vote to call this approach hyperbola-quintessence ;-)
Seems the inventor of reversed bitbords was Ryan Mack.
His paper with quite too optimistic estimates seems no longer available in the net. A plain text copy again.

Gerd

The Hyperbola Project - New Technologies

What follows here is merely a glimpse at the new advances in computer chess that the Hyperbola Project will include. The revolutionary design will generate enormous speed improvements, optimized specifically for the Pentium III processor's SIMD and MMX enhamcements. The downside is that the initial releases of Hyperbola will require a P-III system to operate. As the core engine is being written in assembly code, it is very difficult to optimize for more than one chip at a time.

The past two years have yielded a serious look at the way computer chess programs operate today. The most significant problem with PC-oriented (non 64-bit) programs is the lack of speed associated with doing 64-bit arithmetic in a 32-bit environment.

MMX technology introduced some 64-bit operations to the Pentium family of microprocessors, which do increase performance noticeably, but not optimally. When using MMX, many operations still have to be completed in the old 32-bit system, which degrades the performance gain when the data has to be exchanged back and forth between the regular registers and the MMX registers.

The improvements shown below also address another significant bottleneck in chess programming: memory speeds. Despite the fact that Hyperbola is designed for systems with extremely fast memory (RDRAM), memory is still a bottleneck for some calculations, which require a very tight dependency chain. Since a memory read from the fastest (L1) cache has a latency of three clock cycles, the processor remains stalled in a tight dependency chain such as this, where no other calculations can be completed while the processor waits for the memory read to complete.
Bitboard Design and Sliding Piece Attack Generation

Generating the attacks for sliding pieces proved to be a significant bottleneck in the early trials. For example, before generating legal moves for a position, the program must determine whether the king is in check. To do this, the program has to find the attacks of every type of piece from the king's square and check if an appropriate enemy piece resides there. Since this cannot be done while analyzing other moves, this is a huge slowdown under the current technology, which uses four sets of bitboards (three rotated) and several memory lookups to determine the correct attacks.

Hyperbola's design is quite different in this respect from other programs. Hyperbola introduces the concept of a reversed bitboard in place of three rotated bitboards, and TOTALLY eliminates the lookup tables in favor of simple calculations, all of which can be done in the 64-bit MMX registers.

With the research still in progress on this topic, it is not clear as to exactly how much performance can be enhanced by this new technique, but it is clear that performance will be improved very significantly, especially with MMX hardware. Processors that are completely 64-bit, such as the Intel Itanium, will also be very much helped by this technique.
Now, how do reverse bitboards help generate sliding piece attacks?

First, let's cover what has to be maintained by the program. There is the normal set of bitboards, all generated in the normal direction. And there is a single reverse bitboard, which roughly takes the place of the three rotated bitboards in most programs today. The reverse bitboard contains the same information as the forward bitboard which denotes all occupied squares, but it is in reverse, that is, bit x in the forward bitboard corresponds to bit 63-x in the reverse bitboard.
What benefit comes from having a reversed bitboard in place of rotated bitboards?

Maintenance

A single reversed bitboard is easier to maintain than one rotated bitboard, let alone three. This is because a table lookup is required to convert a bit from the forward format to most rotated formats. With the reverse bitboard, simply subtract the bit number from 63. Considering that only one of these bitboards has to be maintained instead of three, this issue is important.
The Calculations

When taking account of memory latencies, calculating the piece attacks using the forward and reverse bitboards can be done significantly faster due to total independence on lookup tables and complex calculations. There are a few slight snags with the diagonal calculations, but they are minor.

The benefit of having the forward and reverse bitboards really shines in light of a very important bit manipulation trick: x XOR (x - 2) contains bits set to one, starting from the second bit, and stopping at the first bit of x that is one, but including it.

Now, if x is the bitboard of all occupied squares, and it is shifted to the right so that the square for which attacks are being calculated becomes bit zero, then XORing x with (x - 2) gives a mask of all squares attacked by the piece on the desired square, once the result is shifted back to the left. It does not include the piece that is making the attack, but it does include the first occupied square. This is exactly the result the lookup tables return!

The snags come from wraparound - that is, when the mask goes to the end of the file, rank, or diagonal... and keeps going where it shouldn't. This can be easily fixed in rank/file attacks by masking only the rank or file in question, but it is more complicated on diagonals.

Where do the reverse bitboards come in?

The regular bitboard is fine for calculating any attacks that go in the positive direction from the starting square (by masking the rank, file, or diagonal), but the bit-manipulating trick won't work in the opposite direction. The solution is simple: calculate attacks in the opposite direction using a reverse bitboard, so that on that bitboard they extend in the positive direction and can be calculated as above.

The real problem comes in trying to combine the forward and reverse bitboards - there is just no simple way to un-reverse the bitboard. The technique to do the switch is not terribly cumbersome, but it certainly cannot be used in cases where only one set of attacks is being generated... the delay far outweighs the gain.

However, in Hyperbola, this delay can be managed for an overall gain. Though it is unclear whether these types of calculations will even be used in generating legal moves, due to an incremental move generator, the two bitboards can be scanned separately, converting the returns from the bitscans from reverse to forward (by subtracting from 63) instead of converting the entire bitboard.

A more pertinent issue arises with determining whether the king is in check, which is done before generating any legal moves. Possible ways to go about this include converting the reverse bitboard, which is probably not optimal, maintaining reverse bitboards for the sliding pieces of both colors, which is possible, and testing the individual squares on the bitboards both in forward and in reverse, which is nearly out of the question. In the newer chip architectures any unpredictable branches severely degrade performance, and the technique to avoid that is also slow because it can only be used on 32-bit registers.

The most viable option is to maintain four reverse bitboards, one for the bishops/queens and rooks/queens of each color. Updating them does not degrade performance because only one of the four must be updated for any move, or none at all, in addition to the regular updates.

With the extra reverse bitboards, it is now very easy to determine if the king is in check without any memory tables. First, calculate the attacks from the king's square for both types of sliding pieces, and simply AND them with the appropriate bitboards of enemy pieces. After that, OR all the results of the comparisons together, and take a single branch.

A possible shortcut exists here still. If the king is not in check, which is almost always the case, and all of the possible sliding piece lines are closed or empty, which is usually true, then the king cannot be checked by a sliding piece. The merit of this idea is questionable, but a more concrete idea is much more helpful. If the previously used sliding piece attack bitboards are saved and it is determined that none of the squares on those bitboards have changed in the bitboard of occupied pieces, then the bitboards can be used again and the calculations skipped. If this shortcut is used, then it might be more economical to reverse the bitboard(s) after calculating them and ignoring the additional reverse bitboards. That depends on the hit rate.
How can the reverse bitboards be converted back to normal?

The fastest way to do this isn't a huge deal, but it cannot be run in parallel with other operations and does degrade performance if it is run too often. Suppose we have the lower 16 bits of a bitboard, x: FEDCBA9876543210.

1.

Load the constant, k = 5555555555555555h
2.

x = [(x shl 1) and k] or [(x and k) shr 1] EFCDAB8967452301
3.

Load the constant, k = 3333333333333333h
4.

x = [(x shl 2) and k] or [(x and k) shr 2] CDEF89AB45670123
5.

Load the constant, k = 0F0F0F0F0F0F0F0Fh
6.

x = [(x shl 4) and k] or [(x and k) shr 4] 89ABCDEF01234567
7.

Now, due to the availability of a MMX shift instruction that operates on 16-bit words, the bytes can now be switched more easily
8.

x = (x shlw 8) or (x shrw 8) 0123456789ABCDEF
9.

And finally, the pshufw on the P-III reorganizes the words in a single instruction
10.

. x = x pshufw 00011011b

Since all of this arithmetic can be done in the MMX registers, it runs very much in parallel and can complete in a reasonable time frame on the P-III. If this algorithm were to be redesigned for P-II or PMMX architectures (without the pshufw) it would be a little slower since two more steps need to be completed. Trying to do this without 64-bit support would be extremely slow due to the extended-precision operands, despite that most of it could be done in parallel. (This algorithm is also in parallel, so it would probably take over twice as long.)
Hyperbola - Coming 2001

Well, the new use of bitboards to accomodate simple 64-bit operations promises very high-speed calculations for Hyperbola, especially considering the decreased reliance on memory tables. The large tables just for sliding piece calculations under current programs can't even fit into the Level 2 cache of the Coppermine P-III, although most recently-used addresses would remain cached. However, even an access to the L2 cache degrades performance: the Coppermine, with the full-speed ATC, still has approximately a 9-clock latency. Since the 16K L1 cache can only hold 512 cache lines, it is probable that most of the recently used table values will get bumped to the L2 cache, or even to memory, which has a 12-clock latency even with PC800 RDRAM.
Hyperbola is a long way away from first release or even alpha testing, but the technological advancements, such as the concept of reverse bitboards explained in detail above, promise to give this new engine a significant edge. It is possible that, even with a relatively thorough evaluation function, the nps (Nodes Per Second) rate could exceed 10,000,000.


Copyright © 2000 Ryan Mack
This page was last updated on Sunday, February 13, 2000 16:27
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: hyperbola-quintessence

Post by tvrzsky »

Please Gerd, could You be so kind and give me the definition of bswap(), or am I asking for too much ... ? Is it reversing bits or just only swaping bytes? Yes, I am too lazy to figure it myself ... :oops:
Thanks
Filip
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: hyperbola-quintessence

Post by Gerd Isenberg »

tvrzsky wrote:Please Gerd, could You be so kind and give me the definition of bswap(), or am I asking for too much ... ? Is it reversing bits or just only swaping bytes? Yes, I am too lazy to figure it myself ... :oops:
Thanks
Filip
It just flips the bytes from little to big endian or vice versa.
Since a masked file or diagonal has only one set bit per byte, bswap changes the arithmetical order like a bit reversal. It should be available for all x86-64 architectures as instrinsic of recent compilers even for 32-bit mode (using a pair of 32-bit bswaps).

bswap reg64 is one cycle single direct path on K8 or K10, 1/3 reciprocal throughput, like other simple instructions. Core2duo is tad slower 4 cycles latency, 1 cycle reciprocal throughput. If you will have a closer look to the vc2005 generated assembly you will recognice what ipc-monster that is.

Code: Select all

?_bishopAttacks@@YA_K_KI@Z PROC   
  00000   40 53       push    rbx
  00002   4c 8d 15 00 00
   00 00       lea    r10, OFFSET FLAT&#58;?mask
  00009   8b d2       mov    edx, edx
  0000b   4c 8b c2    mov    r8, rdx
  0000e   48 83 f2 38    xor    rdx, 56
  00012   49 c1 e0 05    shl    r8, 5
  00016   48 c1 e2 05    shl    rdx, 5
  0001a   4b 8b 5c 10 08    mov    rbx, QWORD PTR &#91;r8+r10+8&#93;
  0001f   4f 8b 4c 10 10    mov    r9, QWORD PTR &#91;r8+r10+16&#93;
  00024   4f 8b 04 10    mov    r8, QWORD PTR &#91;r8+r10&#93;
  00028   4a 8b 14 12    mov    rdx, QWORD PTR &#91;rdx+r10&#93;
  0002c   4c 8b db    mov    r11, rbx
  0002f   49 8b c1    mov    rax, r9
  00032   48 23 c1    and    rax, rcx
  00035   4c 23 d9    and    r11, rcx
  00038   4c 8b d0    mov    r10, rax
  0003b   49 2b c0    sub    rax, r8
  0003e   49 8b cb    mov    rcx, r11
  00041   48 0f c9    bswap    rcx
  00044   49 0f ca    bswap    r10
  00047   4d 2b d8    sub    r11, r8
  0004a   48 2b ca    sub    rcx, rdx
  0004d   4c 2b d2    sub    r10, rdx
  00050   48 0f c9    bswap    rcx
  00053   49 0f ca    bswap    r10
  00056   49 33 c2    xor    rax, r10
  00059   49 33 cb    xor    rcx, r11
  0005c   49 23 c1    and    rax, r9
  0005f   48 23 cb    and    rcx, rbx
  00062   48 03 c1    add    rax, rcx
  00065   5b       pop    rbx
  00066   c3       ret    0
?_bishopAttacks@@YA_K_KI@Z ENDP 
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: hyperbola-quintessence

Post by tvrzsky »

Wow, you are really fast, thanks for reply ...
Now I see how it works, really simple and clever. It seems that reversed bitboards are not so much reversed actually :D
Filip
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: hyperbola-quintessence

Post by tvrzsky »

P.S. I remember reading the Hyperbola website a while back, maybee about some seven(?) years ago. I was starting with C and chess programming then and toying with some basic bit fiddling and assembler. I was taken with the idea but reversing bitboards or updating them during the make/unmake_move appeared to be too expensive to me and I got out of MMX registers in my make_move asm routine, so I abandoned the thing. :P
I do it my way now and I am quite satisfied with the performance but also curious how it compares to all those magic and other complex approaches summarized in Haralds test. However amount of the code in his excelent test is just overwhelming and I was discouraged by the idea of adapting those into my program. But this one looks so easy, I will surely try it!
P.P.S. Am I the only person using this mapping of bitboards: A1 = 0, A8 = (1<<7), H8 = (1<<63)? I could never get why should someone start at A8 or H1 or H8 ... :lol:
Filip
Aleks Peshkov
Posts: 892
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia

Re: hyperbola-quintessence

Post by Aleks Peshkov »

To make more clear: suggested "hyperbola-quintessence" approach is direct computation of only 6 from 8 sliding rays. It needs 3 arrays of 64 bitboards with masks for vertical, and both diagonals attacks from any square (excluding starting attacker's square).

Missing horizontal-attacks need a help for non-rotated table lookup part of rotated bitboards or any other method.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: hyperbola-quintessence

Post by Gerd Isenberg »

tvrzsky wrote: P.P.S. Am I the only person using this mapping of bitboards: A1 = 0, A8 = (1<<7), H8 = (1<<63)? I could never get why should someone start at A8 or H1 or H8 ... :lol:
Filip
Every arbitrary mapping is fine and likely a matter of taste and habit. Your scheme with enumerating along a file has some advantages for pawn-attacks. My natural way to enumerate squares on the chess-board was row after row from a to h. I use a1=0,b1=1,..,a2=8,..,a8=56,..,h8=63. The disadvantage is that you become a mirror in your brain and start to confuse left and right or east and west ;-)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: hyperbola-quintessence

Post by wgarvin »

Gerd, how does it compare with this one which you posted a while ago in the resource for bit twiddlers thread? (the new one looks like it has fewer bswaps, for one thing)

Code: Select all

u64 bishopAttacks&#40;u64 occ, u32 sq&#41; &#123; 
    u64 rvsdia, rvsant, bishop; 
    u64 occdia = occ, occant = occ; 
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx; 
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx; 
    bishop  =  smsk&#91;sq&#93;.bitMask; 
    rvsdia  = _byteswap_uint64&#40;occdia&#41;; 
    rvsant  = _byteswap_uint64&#40;occant&#41;; 
    occdia ^=  occdia - bishop; 
    occant ^=  occant - bishop; 
    bishop  = _byteswap_uint64&#40;bishop&#41;; 
    rvsdia ^=  rvsdia - bishop; 
    rvsant ^=  rvsant - bishop; 
    occdia ^= _byteswap_uint64&#40;rvsdia&#41;; 
    occant ^= _byteswap_uint64&#40;rvsant&#41;; 
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx; 
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx; 
    return occdia ^ occant; 
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: hyperbola-quintessence

Post by Gerd Isenberg »

wgarvin wrote:Gerd, how does it compare with this one which you posted a while ago in the resource for bit twiddlers thread? (the new one looks like it has fewer bswaps, for one thing)

Code: Select all

u64 bishopAttacks&#40;u64 occ, u32 sq&#41; &#123; 
    u64 rvsdia, rvsant, bishop; 
    u64 occdia = occ, occant = occ; 
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx; 
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx; 
    bishop  =  smsk&#91;sq&#93;.bitMask; 
    rvsdia  = _byteswap_uint64&#40;occdia&#41;; 
    rvsant  = _byteswap_uint64&#40;occant&#41;; 
    occdia ^=  occdia - bishop; 
    occant ^=  occant - bishop; 
    bishop  = _byteswap_uint64&#40;bishop&#41;; 
    rvsdia ^=  rvsdia - bishop; 
    rvsant ^=  rvsant - bishop; 
    occdia ^= _byteswap_uint64&#40;rvsdia&#41;; 
    occant ^= _byteswap_uint64&#40;rvsant&#41;; 
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx; 
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx; 
    return occdia ^ occant; 
Yes, but the main issue are the mentioned xors.

Code: Select all

rayattacks &#58;=  o^&#40;o-2r&#41; ^ reverse&#40;&#40;o'-2r')^o')
is equivalent to

Code: Select all

rayattacks &#58;=    &#40;o-2r&#41; ^ reverse &#40;o'-2r') 
Four less xors, and no need to safe o, o', which safes some registers and allows more parallel computation.

Code: Select all

u64 bishopAttacks&#40;u64 occ, u32 sq&#41; &#123;
    u64 rvsdia, rvsant, bishop;
    u64 occdia = occ, occant = occ;
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx;
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx;
    bishop  =  smsk&#91;sq&#93;.bitMask;
    rvsdia  = _byteswap_uint64&#40;occdia&#41;;
    rvsant  = _byteswap_uint64&#40;occant&#41;;
    occdia -= bishop;
    occant -= bishop;
    bishop  = _byteswap_uint64&#40;bishop&#41;;
    rvsdia -= bishop;
    rvsant -= bishop;
    occdia ^= _byteswap_uint64&#40;rvsdia&#41;;
    occant ^= _byteswap_uint64&#40;rvsant&#41;;
    occdia &=  smsk&#91;sq&#93;.diagonalMaskEx;
    occant &=  smsk&#91;sq&#93;.antidiagMaskEx;
    return occdia ^ occant; // ^,+,|
&#125;
The second minor point is related to the bishop reversal.
The single bishop-bit may be computed by 1<<sq or by lookup smsk[sq].bitMask. Getting the single bit from the same base-address and cacheline as the ray-masks is superior here - rather than loading a one plus shift left by square which requires register cl.

For the bishop-flip we need either a lookup of the flipped square, smsk[sq^56].bitMask or another bswap. Yes, using bswap seems superior for K8, since it safes the xor 56 plus the second cacheline read. The first time something was faster than magic-bishop in the snoob-test!

vc2005 does a rather good job by inlining the separate diagonal-attack getters here inside bishopAttacks. Register usage far from conserative for max parallel gain.

Code: Select all

u64 diagonalAttacks&#40;u64 occ, u32 sq&#41; &#123;
   u64 forward, reverse;
   forward = occ & smsk&#91;sq&#93;.diagonalMaskEx;
   reverse  = _byteswap_uint64&#40;forward&#41;;
   forward -= smsk&#91;sq&#93;.bitMask;
   reverse -= _byteswap_uint64&#40;smsk&#91;sq&#93;.bitMask&#41;;
   forward ^= _byteswap_uint64&#40;reverse&#41;;
   forward &= smsk&#91;sq&#93;.diagonalMaskEx;
   return forward;
&#125;

u64 antiDiagAttacks&#40;u64 occ, u32 sq&#41; &#123;
   u64 forward, reverse;
   forward  = occ & smsk&#91;sq&#93;.antidiagMaskEx;
   reverse  = _byteswap_uint64&#40;forward&#41;;
   forward -= smsk&#91;sq&#93;.bitMask;
   reverse -= _byteswap_uint64&#40;smsk&#91;sq&#93;.bitMask&#41;;
   forward ^= _byteswap_uint64&#40;reverse&#41;;
   forward &= smsk&#91;sq&#93;.antidiagMaskEx;
   return forward;
&#125;

u64 bishopAttacks&#40;u64 occ, u32 sq&#41; &#123;
   return diagonalAttacks &#40;occ, sq&#41;
        + antiDiagAttacks &#40;occ, sq&#41;;
&#125;

Code: Select all

occ$ = 16
sq$ = 24
?bishopAttacks@@YA_K_KI@Z PROC
  00000	40 53		 push	 rbx
  00002	8b c2		 mov	 eax, edx
  00004	4c 8d 15 00 00
	00 00		 lea	 r10, OFFSET FLAT&#58;?smsk
  0000b	48 c1 e0 05	 shl	 rax, 5
  0000f	4a 8b 5c 10 08	 mov	 rbx, QWORD PTR &#91;rax+r10+8&#93;
  00014	4e 8b 4c 10 10	 mov	 r9, QWORD PTR &#91;rax+r10+16&#93;
  00019	4e 8b 14 10	 mov	 r10, QWORD PTR &#91;rax+r10&#93;
  0001d	4c 8b db	 mov	 r11, rbx
  00020	49 8b d1	 mov	 rdx, r9
  00023	4d 8b c2	 mov	 r8, r10
  00026	48 23 d1	 and	 rdx, rcx
  00029	4c 23 d9	 and	 r11, rcx
  0002c	49 0f c8	 bswap	 r8
  0002f	48 8b c2	 mov	 rax, rdx
  00032	49 8b cb	 mov	 rcx, r11
  00035	49 2b d2	 sub	 rdx, r10
  00038	48 0f c8	 bswap	 rax
  0003b	48 0f c9	 bswap	 rcx
  0003e	4d 2b da	 sub	 r11, r10
  00041	49 2b c0	 sub	 rax, r8
  00044	49 2b c8	 sub	 rcx, r8
  00047	48 0f c8	 bswap	 rax
  0004a	48 0f c9	 bswap	 rcx
  0004d	48 33 c2	 xor	 rax, rdx
  00050	49 33 cb	 xor	 rcx, r11
  00053	49 23 c1	 and	 rax, r9
  00056	48 23 cb	 and	 rcx, rbx
  00059	48 03 c1	 add	 rax, rcx
  0005c	5b		 pop	 rbx
  0005d	c3		 ret	 0
?bishopAttacks@@YA_K_KI@Z ENDP
Gerd
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: hyperbola-quintessence

Post by Dann Corbit »

tvrzsky wrote:P.S. I remember reading the Hyperbola website a while back, maybee about some seven(?) years ago. I was starting with C and chess programming then and toying with some basic bit fiddling and assembler. I was taken with the idea but reversing bitboards or updating them during the make/unmake_move appeared to be too expensive to me and I got out of MMX registers in my make_move asm routine, so I abandoned the thing. :P
I do it my way now and I am quite satisfied with the performance but also curious how it compares to all those magic and other complex approaches summarized in Haralds test. However amount of the code in his excelent test is just overwhelming and I was discouraged by the idea of adapting those into my program. But this one looks so easy, I will surely try it!
P.P.S. Am I the only person using this mapping of bitboards: A1 = 0, A8 = (1<<7), H8 = (1<<63)? I could never get why should someone start at A8 or H1 or H8 ... :lol:
Filip
The reason a lot of people use A8-0 is that it maps transparently to FEN and EPD (where the first square is A8).
A1=0 is also quite natural, if you imagine playing as white.