DIRECT BITBOARD MOVEGENERATION

rjgibert · Post by **rjgibert** » Sun Jul 26, 2009 8:34 pm

xsadar wrote:
Gerd Isenberg wrote:
Desperado wrote:hi.

well, if i am thinking for a name, what do you think about this?

"Significant Bit Solution (SBS)".
Sounds a bit frumpy, imho. My proposals:

Hoffmann's Obstructed Difference Action
Desperado's Obstructed Difference
Obstructed Difference
I would vote "Obstruction Difference" (instead of obstructed) or "Hoffman's Obstruction Difference" (omit "action", but we can leave the smiley in the name if you want ).
"Obstructed difference" sounds like it's obstructing the difference somehow, but "obstruction difference" sounds like what it's doing: taking the difference of the obstructions.

By the way, thanks to both of you for working this out. Oddly, just before I read this the other day, I was thinking there had to be a way to generate moves similar to this, but couldn't quite work it out in my head.

How about "Hoffman's Obstruction Transformation?" Then the acronym would be "HOT" like his idea

Or given that Isenberg's contribution has been significant, how about "Hoffman-Isenberg Obstruction Delta?"

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Jul 26, 2009 9:35 pm

rjgibert wrote:
xsadar wrote:
Gerd Isenberg wrote:
Desperado wrote:hi.

well, if i am thinking for a name, what do you think about this?

"Significant Bit Solution (SBS)".
Sounds a bit frumpy, imho. My proposals:

Hoffmann's Obstructed Difference Action
Desperado's Obstructed Difference
Obstructed Difference
I would vote "Obstruction Difference" (instead of obstructed) or "Hoffman's Obstruction Difference" (omit "action", but we can leave the smiley in the name if you want ).
"Obstructed difference" sounds like it's obstructing the difference somehow, but "obstruction difference" sounds like what it's doing: taking the difference of the obstructions.

By the way, thanks to both of you for working this out. Oddly, just before I read this the other day, I was thinking there had to be a way to generate moves similar to this, but couldn't quite work it out in my head.
How about "Hoffman's Obstruction Transformation?" Then the acronym would be "HOT" like his idea

Or given that Isenberg's contribution has been significant, how about "Hoffman-Isenberg Obstruction Delta?"

Hehe, I like Mike's "Obstruction Difference" and "HOT" as well.

Of course it is about Michael to choose a name if he likes, since it was alone his idea, while I helped with the implementation. It needs a fast 64-bit bsr and is not that 32-bit friendly as other approaches, I guess in most cases it will not do better than magic bitboards in real positions inside a search with a appropriate L1 hit rate on the magic database. It is nice to have one more interesting, resource friendly and general approach (same code for all lines, or even bishops and rooks) to determine attacks of sliding pieces.

I also wonder whether the idea is really new, but was not applicable on old, or none 64-bit hardware in former times, for instance by the Soviet and American bitboard pioneers in the 60ties and 70ties.

Desperado · Post by **Desperado** » Mon Jul 27, 2009 9:53 am

Hi everyone again.

Yesterday i ve got an hour to do the first comparison with magics.
Because my test conditions aren t optimal this is just a first impression.

Well the magics still faster about a factor of 1.7.

But by improving the seperation of the ms1b this may shrink and
of course just poor move generation is not comparable to the use
in an engine where the memory issues may become more significant.
(Therefore I think in practice the performance will be more closely to
the magic approach).

An optimal case seems to be, if the ms1b could be isolated as cheap like the ls1b, so i just ignored the correct return value and measured time (so to say just for fun). In this optimal case it would be slighty faster than magics, even by poor move-generation-frame.

So for now i just like to note a further micro-optinization.
If there is only a need for the attacks (instead of controls), the last line
of code is self explaining unnecessary code. We can just return the significant bits.

So at last: Obstruction Difference is nice name, and because Gerd likes it too we may name it just like this. (Hoffmann's Obstruction Difference would make me proud too, but i like it more neutral, thx everyone )

mcostalba · Post by **mcostalba** » Mon Jul 27, 2009 11:51 am

Desperado wrote: An optimal case seems to be, if the ms1b could be isolated as cheap like the ls1b

Yes this is the key. If someone comes up with a way to remove the bsr64 to find the msb of the negative ray would be great.

I have tought a bit on a subtraction between the lsb of the positive ray and all the remaining bits below the lsb, where remaining = all & (lsb -1)

I think once you get the lsb, remove the rook square and consider only the bits below the lsb could be possible to set to 1 all the bits between the lsb and the next lower one, and this should make the trick.

Desperado · Post by **Desperado** » Mon Jul 27, 2009 3:32 pm

hi marco,

well i was thinking the last weeks in the same direction as you are doing now.
One thing is to find the ms1b in a "general bitboard", maybe another thing on masked lines (where there are existing sth. like "special symetries",or other important attributes).

This thought was an initiator for the "Obstruction Difference"

method.

Unfortunately i wasn t able to do better then using some kind of bitscan reverse technique.(and of course this part is the one, which needs improvement, and which will lead to different performance results).

But i trust in my feelings that there is a clever way to find ms1b in masked lines. i will especially now examine this and hope to find an improvement. (but maybe there isn t a solution that does better than reverse bitscan)

Desperado · Post by **Desperado** » Mon Jul 27, 2009 4:19 pm

oh i forgot to say. Maybe here a relative cheap lookup for negative rays
is possible.(i will try).

ex: UI_08 ww[256];
UI_08 ss[256];
UI_08 sw[256];
UI_08 se[256];

where the index is given by "lo" and the content is the msb-index.
The index needs of course a line-conversion.(dont know how costly this
is)

Well just an unproven (until now ignored) idea...

Desperado · Post by **Desperado** » Mon Jul 27, 2009 5:04 pm

i just want to correct my last post.

what i meant was:

UI_08 id_msb[256]; //only one array is necessery

index : line_to_byte(lo,sq64)
content: msb-index

which should lead to:

lo = BB(msk.id_msb[rnk_to_byte(lo,sq64)])

instead of

//lo = (BTB)-1 << bsr64(lo|1);

that looks better...and may work faster then bsr

(just tried out with rooks-rank computation, if there wasnt an
error the time was decreased from original 18.xx s to 15.xx s)
i ll do the same for the file...(the magic approach need 10.xx s)

enough for me to work it out!

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Jul 27, 2009 8:35 pm

mcostalba wrote:
Desperado wrote: An optimal case seems to be, if the ms1b could be isolated as cheap like the ls1b
Yes this is the key. If someone comes up with a way to remove the bsr64 to find the msb of the negative ray would be great.

The left/right iniquity of arithmetic, which overflows in one direction only ;-(

MS1B isolation is too expensive for the general case.

Code: Select all

x |= x >> 32;
x |= x >> 16;
x |= x >> 8;
x |= x >> 4;
x |= x >> 2;
x |= x >> 1;
MS1B = &#40;x >> 1&#41; + 1;

I don't think any line specialized MS1B lookup of up to seven scattered bits (of file and (anti)diagonals) is competitive. Then one may even better use Kindergarten like lookups directly, that is mul, shift right and lookup.

I fear there is no faster way than 1 << bsr (or -1 << bsr), which is fine for PIII, Core and Nehalem (CPUID family/stepping from 06_1AH, 3 cycles latency and throughput 1), but sucks for AMD K8 (which otoh has fast bswap with 1/3 througput for Hyperbola Q) and of course P4 and ATOM. AMD K10 has 4 cycle vector path bsr which is better than K8 (11 cycles vector path), but still quite slow because AMD vector path blocks other CPU-resources.

For bsr latencies and throughput, see
AMD:
http://www.amd.com/us-en/assets/content ... /40546.pdf
Intel:
http://www.intel.com/Assets/PDF/manual/248966.pdf

Intel Core Solo and Intel Core Duo processors are
represented by 06_0EH. Processorsbases on 65 nm Intel Core microarchitecture are represented by 06_0FH.Processors based on Enhanced Intel Core microarchitecture are represented by06_17H and 06_1DH. CPUID family/stepping signatures of processors based on Intelmicroarchitecture (Nehalem) starts with 06_1AH.

Instruction Latency/ThroughputTable C-13a.
General Purpose InstructionsBSF/BSR Latency 3 (06_1A) 1 (06_17H,06_1DH) 2 (06_0F) 2 (06_0EH)
Throughput 1 1 1 1

We are talking about 9 operations for one line, with two (or four per bishop/rook) independent chains.

Intended assembly for bishop/rook, I would guess ~16..17 cycles on a core2 or i7, but only those cpus in 64-bit mode:

Code: Select all

; input
; rcx - occ
; rdx - pointer to mask structure&#91;sq&#93;&#91;piece_kind&#93; &#40;not changed&#41;
; output&#58; rax - line attacks
mov rax, rcx                1
mov r09, rcx                1
mov r10, rcx                1
and rcx, &#91;rdx + LOW1&#93;        111
and r09, &#91;rdx + LOW2&#93;        111
or  rcx, 1                      1
or  r09, 1                      1
and rax, &#91;rdx + HIGH1&#93;        111 
and r10, &#91;rdx + HIGH2&#93;        111
mov r8, -1                     1
mov r11, -1                    1
bsr rcx, rcx                     11
shl r8, cl                         1 
bsr rcx, r09                     11 
shl r11, cl                        1
mov rcx, rax                       1
mov r09, r10                       1
neg rax                             1
neg r09                             1
and rax, rcx                         1
and r09, r10                         1
lea rax, &#91;2*rax + r8&#93;                 11
lea rcx, &#91;2*r09 + r11&#93;                11
and rax, &#91;rdx + MASKEX1&#93;                111
and rcx, &#91;rdx + MASKEX2&#93;                111
or  rax, rcx                               1

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Jul 27, 2009 9:46 pm

Salvage for AMD K10: Leading Zero Count == bsr(noneEmpty) ^ 63

LZCNT reg, reg DirectPath Single Latency 2 Note 6: This operation is restricted to scheduling in pipe 2.

see http://msdn.microsoft.com/en-us/library/bb384809.aspx

We can save the xor 63 by shifting 0x8000000000000000 arithmetically right (ones from left) by leading zeros, "or" one still required.

Desperado · Post by **Desperado** » Wed Jul 29, 2009 8:16 pm

hi everyone again.

today i tested some ideas. Let me start up with what i have done
to compute the msb. i order from slow to fast.

1. fasttest() -> performs -97%
testing bits, starting from sq to msb, obstructed by end of line
(in simple loop)

2. divconq() -> performs -78%
check always upper half of line, splits until msb is found

3. BitScanReverse64 intrinsic -> performs -10%

4. bsr (double conversion, from Gerd). my original one +-0

5. msb-lookup +30%

Code: Select all

//******************************************************************************

	 //...
	 for&#40;BTB k=1;k<256;k++)
		&#123;
		 msk.msb_rank&#91;k&#93; = bsr64&#40;k&#41;; //UI_08 msb_rank&#91;256&#93;; maybe 128 is enough
		 msk.msb_file&#91;k&#93; = bsf64&#40;k&#41;; // ""
		&#125;
	 //...

//******************************************************************************

BTB msb_rank&#40;BTB bb,SQR_ID sq64&#41;
	&#123;
	 UI_08 shift = &#40;sq64>>3&#41; << 3;

	 return&#40;&#40;BTB&#41; 1 << &#40;msk.msb_rank&#91;bb>>shift&#93; + shift&#41;);
	&#125;

//******************************************************************************

BTB msb_file&#40;BTB bb,SQR_ID sq64&#41;
	&#123;
	 //obstruction to rank 1 has to be set...
	 BTB x;

	 x   = bb >> &#40;sq64 & 7&#41;;
	 x  *= diag;
	 x >>= 56;
	 x   = bR8 >> &#40;msk.msb_file&#91;x&#93;*8&#41;;

	 return&#40;x & bb&#41;;
	&#125;

//******************************************************************************

So what conclusion has to be done here, with respect to Kindergarten BB ?

The 2 msb lookups are faster than computing by bitscan-reverse !?
should someone now use KG-BB ? Where are the differences ?
Because i am not (of course not) expert for KG-BB. What s about the memory usage ?
Or which kind of bsr should be used ? (Gerd's double conversion is the fastest i used until now. it s faster than the intrinsic ? or am i doing sth totally wrong ?)

note:

- in the case how the msb is used for the obstruction difference, it
is enough to isolate the corresponding file _OR_ rank, to build the ms1b.
That may lead to some other ideas.

- maybe a fast magic bsr-function without lookup can be created, due
to the simpleness of a masked line ?? any idea for that ? (would be enough to have it for one line,all other lines maybe shifted...)

well, now with the kind of msb-lookup the performance comes closer to
the magic approach, some more 20 % speedup and it will beat it...

DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION

Re: DIRECT BITBOARD MOVEGENERATION