Gerd Isenberg wrote: diep wrote: Don wrote: hgm wrote: What do you mean by 'without memory access', then? Surely indexing a table counts as a memory access...
I meant without memory access but I would settle for anything that produced a speedup on modern processors.
The mailbox programs would index a much smaller array by using (destination - origin) as the index. To make that work with a 6-bit square, a little bit twiddling would do the job. So if having a small array is a win, it might still be a win despite some additional bit twiddling to convert a 6-bit square to a 7-bit square, but it's hard to believe that would make any difference.
Why would a 64x64 array be slower than executing more code with smaller array access?
The real problem of modern processors is mainly decode speed of instructions.
Really? I thought still branches and L1, L2 misses - at least in chess programs.
Sometimes smaller memory pays off, 32K vs. < 2K, likely more often in L1. A few additional instructions may improve IPC and hide latencies. How much is an L1 miss on today's processors?
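As a sketch of the delta-index trick Don describes in the quote above: widen the 6-bit square (0..63) to a 7-bit 0x88 square, and (to - from) then becomes unique per geometric direction and distance, so one roughly 240-entry table can replace a 64x64 one. This is a minimal, illustrative C version (knight moves only; the names and table layout are my own, not from any particular engine):

```c
#include <string.h>

/* Widen a 6-bit square (rank*8 + file) to a 7-bit 0x88
   square (rank*16 + file): sq & ~7 equals rank*8.      */
static int to88(int sq) { return sq + (sq & ~7); }

/* On a 0x88 board the difference (to88 - from88) is unique per
   direction and distance, so a 240-entry table indexed by
   (diff + 0x77) answers "can this piece type go from->to".    */
static unsigned char attack_tab[240];

enum { KNIGHT_FLAG = 1 };

static void init_attack_tab(void) {
    static const int knight_delta[8] = { 14, 18, 31, 33, -14, -18, -31, -33 };
    memset(attack_tab, 0, sizeof attack_tab);
    for (int i = 0; i < 8; i++)
        attack_tab[0x77 + knight_delta[i]] |= KNIGHT_FLAG;
}

/* Under 2K of table instead of a 64x64 array per piece type. */
static int knight_can_reach(int from, int to) {
    return attack_tab[to88(to) - to88(from) + 0x77] & KNIGHT_FLAG;
}
```

Whether the smaller footprint actually wins over a plain 64x64 lookup is exactly the open question in the quote: a few extra instructions versus less L1 pressure.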
From what I understand from the hardware engineers posting, the problem nowadays at the fastest CPUs is the decoding.
There is a lot of truth to this if you think about it.
Making caches faster or predicting branches even better just isn't going to speed you up significantly, as they are already near the limits there if you ask me.
More threads, however, will speed you up easily.
Those extra threads can only run on those cores if the cores manage to decode more instructions.
For Diep it's totally trivial that the problem is the instruction stream.
If they managed to decode more instructions, you could run 2x more threads, as the processors have plenty of execution units nowadays.
The question is how motivated Intel is to progress fast with so little competition.
I understood from Joel Hruska that AMD's new design would split the 1x4 decoding into 2x2 decoding.
That should speed up the IPC, yet the real limitation is that it's still decoding 4 instructions a clock in total, whereas Diep's IPC there already is around 1.7.
So by decoding 2x3 instructions instead of the current 1x4, it would be possible, for a small increase in transistor count, to run 4 threads rather than 2 at each module.
The same logic applies to Intel as well, though they would need more of a redesign to benefit from this.
We see that on Intel's latest HPC CPU, Larrabee, they now run 4 threads at each core as well. Of course that's a vector design, so it's not directly comparable.
Yet if they decoded more instructions, more threads would simply make sense.
Let's look objectively at the i7: the IPC already is nearly 1.8 for Diep. If I fix things like changing 'int' to 'unsigned int' and remove branches by doing a single compare in an 'if' instead of several, I'm sure I could get it up to 2.2 if I really put in the effort.
This is where the CPU really has problems delivering 4 instructions a clock.
The theoretical max would be somewhere around 60% or so for code like this, so that's an IPC of 2.4.
A few rare codes that the designers tuned for achieve over 70%, yet those are throughput codes; let's not compare with those.
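To make the single-compare point concrete, here is a generic C illustration (not Diep's actual code) of folding several signed compares into one unsigned compare per value:

```c
/* Four compares, four branches the predictor can miss: */
static int on_board_slow(int rank, int file) {
    if (rank < 0) return 0;
    if (rank > 7) return 0;
    if (file < 0) return 0;
    if (file > 7) return 0;
    return 1;
}

/* Casting to unsigned turns a negative value into a huge one, so the
   two-sided range check collapses into one compare per value, and
   using '&' instead of '&&' keeps the combination branch-free.     */
static int on_board_fast(int rank, int file) {
    return ((unsigned)rank <= 7u) & ((unsigned)file <= 7u);
}
```

Fewer branches and fewer instructions in the hot path is exactly what helps when the decoder, not the execution units, is the bottleneck.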
Of course what you can achieve with bitboards, without vectorisation that is, is probably lower than what I can achieve with Diep, as there is only 1 execution unit that can do certain instructions and only 2 execution units that can shift.
That will also drag down the average IPC, probably to not much over 1.5 with 2 threads on 1 core.
See how close Diep virtually already is there, despite its huge number of branches, if we include the possible improvements, which I actually might go for one day (the last systematic speedup of Diep's code was in 1998 or so, so another sprint there makes sense).
So the statement of the hardware engineers really makes sense. Also, it wasn't contradicted by any expert, and in such forums silence means agreement. Meaning they would fix the rest of the CPU if it were able to decode more!