bitsets - extraction

Desperado · Post by **Desperado** » Tue Oct 13, 2009 8:44 pm

Hi everyone,

because i try to design my engine-concept with incremental attacksets
i have to make a decision of the following issue.

Code: Select all


//------------------------------------------------------------------------------
// access on bitset by list-type
// requires one extraction per stage &#40;by bitscan&#41;,
// ex&#58; attack-sets may be handled like this...

//struct LIST_T 
//	&#123;
//	 UI_08 len;
//	 UI_08 id&#91;64&#93;;
//	&#125;;
//------------------------------------------------------------------------------

void loopExtracted&#40;UI_16 id&#41;

	&#123;for&#40;UI_08 i=0;i<list&#91;id&#93;.len;i++) nmb^=list&#91;id&#93;.id&#91;i&#93;;return;&#125;

//------------------------------------------------------------------------------
// direct extraction of bitset by bitscan
//------------------------------------------------------------------------------

void loopCompact&#40;UI_64 bb&#41;

	&#123;while&#40;bb&#41;&#123;nmb^=bsf64&#40;bb&#41;;bb&=bb-1;&#125;;return;&#125;

//------------------------------------------------------------------------------
// nmb is just a global that doesnt do anything. it should only avoid
// compiler optim. for my testloops...
//------------------------------------------------------------------------------

Well, at least on my machine, the _extractedLoop_ is 2x,3x,4x faster
than the _pop1_ version (on the same data). And that is the problem.
Maybe on fast bitscan machines the _pop1_ version performs as
well as the _extractedLoop_ version.

Now i am interested in what kind of version you would prefer and
the reasons for, of course.

(as mentioned, i want to use the indices of attacksets for movegeneration,
evaluation terms..., so one extraction per stage seems logic with later
access to the indices by list-data-structure _IF_ that would perform at least equal on fast bitscan machines)

Micha

bob · Post by **bob** » Tue Oct 13, 2009 8:50 pm

Desperado wrote:Hi everyone,

because i try to design my engine-concept with incremental attacksets
i have to make a decision of the following issue.
Code: Select all
//------------------------------------------------------------------------------
// access on bitset by list-type
// requires one extraction per stage &#40;by bitscan&#41;,
// ex&#58; attack-sets may be handled like this...

//struct LIST_T 
//	&#123;
//	 UI_08 len;
//	 UI_08 id&#91;64&#93;;
//	&#125;;
//------------------------------------------------------------------------------

void loopExtracted&#40;UI_16 id&#41;

	&#123;for&#40;UI_08 i=0;i<list&#91;id&#93;.len;i++) nmb^=list&#91;id&#93;.id&#91;i&#93;;return;&#125;

//------------------------------------------------------------------------------
// direct extraction of bitset by bitscan
//------------------------------------------------------------------------------

void loopCompact&#40;UI_64 bb&#41;

	&#123;while&#40;bb&#41;&#123;nmb^=bsf64&#40;bb&#41;;bb&=bb-1;&#125;;return;&#125;

//------------------------------------------------------------------------------
// nmb is just a global that doesnt do anything. it should only avoid
// compiler optim. for my testloops...
//------------------------------------------------------------------------------
Well, at least on my machine, the _extractedLoop_ is 2x,3x,4x faster
than the _pop1_ version (on the same data). And that is the problem.
Maybe on fast bitscan machines the _pop1_ version performs as
well as the _extractedLoop_ version.

Now i am interested in what kind of version you would prefer and
the reasons for, of course.

(as mentioned, i want to use the indices of attacksets for movegeneration,
evaluation terms..., so one extraction per stage seems logic with later
access to the indices by list-data-structure _IF_ that would perform at least equal on fast bitscan machines)

Micha

You did not say how you are testing. If you use a simple driver and just call those over and over, the results will be meaningless when compared to what happens in a real chess engine. Those functions will be inlined, and then the instructions will be moved around to reduce dependency latencies. And then there are cache issues as well. What really counts is how each works in a real engine where there is cache/cpu pressure from other instructions besides just these.

Desperado · Post by **Desperado** » Tue Oct 13, 2009 9:17 pm

Hi Robert,

thx for this quick reply.Well,think you hit the point.
Like you described, it seems to be a _meaningless_ test i used,
outside a real chess engine with artificial generated data.

My hope was(is), someone was faced with similar decisions and
can give me some hints and experience feedback.

As i said, i am currently designing the engine, and the use of
the incremental attacksets(and how i will handle them), will
lead to a totally different _look_ of my functions. I fear this
cannot be handled with preprocessor-compile-options only.

If i could expect _about_ equal performance of both trials,
i would absolutley avoid the _loopExtracted_ version, causing
redundant data,polluting the cache...(that s my intuition,
but the _meaninless_ results making me totally insecure).

Well, in other words i am interested in your intuition so to say.

thx

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Oct 13, 2009 9:38 pm

Desperado wrote:Well, at least on my machine, the _extractedLoop_ is 2x,3x,4x faster
than the _pop1_ version (on the same data). And that is the problem.
Maybe on fast bitscan machines the _pop1_ version performs as
well as the _extractedLoop_ version.

Now i am interested in what kind of version you would prefer and
the reasons for, of course.

(as mentioned, i want to use the indices of attacksets for movegeneration,
evaluation terms..., so one extraction per stage seems logic with later
access to the indices by list-data-structure _IF_ that would perform at least equal on fast bitscan machines)

Micha

If loopExtracted requires extraction by bitscan aka loopCompact anyway, I don't get your point. Why to serialize the same bitboard several times per stage? Also, you need a list of up to max cardinality elements, instead of one bitboard which might be serialized in a small loop keeping the bb inside a register. Of course K8 with its dead slow bitscan is slower than to get the precomputed list from L1, but for core2, I7, K10 it might be the opposite, specially if L1 becomes a bottleneck inside a real chess program.

Code: Select all

; input rdx bb
l1&#58; lea rcx, &#91;rdx-1&#93;
    bsf rax, rdx
    ...
    and rdx, rcx
    jnz l1

Desperado · Post by **Desperado** » Tue Oct 13, 2009 10:15 pm

Hi Gerd,

in my case, i simply want to use the extracted information at
different points of the code.

- first, when generating moves putting the destination sq.
to the move-data
(so this is done before doing any iternal move)

- later updating the SquareControl-counter (inc,dec) as
example.
(this is done when updating the board by executing the move)

... and maybe in future there are some more situations i dont think
of at the moment, especially for the evaluation function.

So, i think (i am sure) i will use the _extracted_ information of my
attacktable more than once.

But all in all, your arguements underline my intuition ( and i have
to forget the irritating test i made)

cheers

mcostalba · Post by **mcostalba** » Wed Oct 14, 2009 8:59 am

In Stockfish / Glaurung is used the _extractedLoop_

When introduced (in Glaurung times) it proved faster then bit scan. I have to say that bitscan is done with De Bruijn multiplication not with bsf instruction. It seems to be of same speed and be more portable.

Gian-Carlo Pascutto · Wed Oct 14, 2009 10:51 am

What is this used for? I don't really get what you're trying to do.

Desperado · Post by **Desperado** » Wed Oct 14, 2009 11:37 am

Gian-Carlo Pascutto wrote:What is this used for? I don't really get what you're trying to do.

The background:

2 months ago, i decided to write my engine from scratch.
The most important reason was the introduction of the attackgetterApi
using Obstruction difference.
Of course a lot of other stuff changed beside.
Now it was time to add features on my evaluation functions, where
i always think in terms of attacks,controls etc...(i dont know why
but i do

)

All ideas which were coming to my mind, always needed information
a full attacktable includes.To compute this information on different
situation on demand turned out to be too costly.

So i decided to introduce incremental attacktable which is also very
costly if wrong implemented. And it took several days until i found
a propper,simple,working solution to have full information about
controls(attacks) of the complete board.

Just implementing the update controls(but dont using it at any point),
gave me a speed loss of 10-15 %, which seems to be a very good result.

The next step was to think about the profits which can be done, having
these controls at hand everywhere (and using them

)

And this begins with the movegeneration, ends on evaluation terms.
Beside there are a lot of other issues, where you can benefit from.
(incheck detection(legal moves)...).

So i decided to give it a try, to get full advantage of incremental attacktables, the whole concept must base on it, i think.
(it seems not enough to build up an already running engine in my eyes).

So the speed loss of 15% should easily be regained, because no
attack computations necessary for the movegeneration,legal move detection,and for example mobility evaluation.(All these things are
for free now).

So, in my eyes it is very hard to take advantage of incremental Attacktables when using them as additional feature.
It is a complete different situation when the total concept is based
on it. (it is like with the board-representation and its dependencies
and influence on the complete code)

Micha

bitsets - extraction

bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction

Re: bitsets - extraction