Is there such a thing as branchless move generation?

smatovic · Post by **smatovic** » Mon Jun 11, 2012 11:35 pm

Hey Gerd,

Your qbb legal move generation in zeta.cl with domove/undomove, king in check test, x-times getting piece codes from qbb in domove/undomove and elsewhere, and lots of branches in inner loops, looks horrorful inefficient and ugly to me Wink

Yep, i admit it is ugly, copy and paste from different sources and ported to OpenCL...at least it works, somehow...

Fillstuff with all 8 ray-directions inside one thread with some local memory or SIMD register file to keep own and disjoint opponent dir-attacks of sliders (you know fillstuff works with multiple rooks/queens or bishops/queens in one run) and kings during generation time, should be much better suited

Thanks Gerd!
I feel encouraged to dare an redesign.

I guess. But of course, I am not a GPU expert.

I guess more than me.

--
Srdja

johnhamlen · Post by **johnhamlen** » Tue Jun 12, 2012 12:12 am

Daniel Shawul wrote: Well that is not a piece list.

LOL. Okay then, it's a list of squares that the pieces are on

. Wow, it's a tough crowd in tonight!

....

Daniel Shawul wrote: You should be able to loop over pieces of certain kind without testing any flag. If that was the case, why not do what TSCP does with only board[64]? The comparison of piece lists with bitboards should be without degrading performance. A set of bitboards for knight,bisthop,etc do allow you to loop over each kind of piece separately. Also chars infact do waste a lot of space, when 4bits per square would be enough to store the piece type. So you see if you go this road , you should call this method a bitboard too

Cheers Daniel, I will. And I take your point re: my 0x88 plan morphing into bitboards anyway. Just don't think my old neurons are ready to cope with a new (to me) way of movegen (SIMD bitboards) + a new architecture (GPUs) + a new programming language (OpenCL or CUDA or C++ AMP). So I'll be starting with 0x88 to get some experience and then - after finding that, even at 100% utilisation, 900 iterations to get a single move list(!) - use the experience gained to move to faster bitboards.

johnhamlen · Post by **johnhamlen** » Tue Jun 12, 2012 12:35 am

smatovic wrote:Hi John,

maybe you want to take a look at "Zeta CL"

http://zeta-chess.blogspot.de/

Code is published under GPL:

https://github.com/smatovic/Zeta/

Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).

I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.

I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.

Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:

http://zeta-chess.blogspot.de/2012/03/p ... llers.html

--
Srdja

Gosh, thanks Srdja. Your Zeta Chess blog is very interesting indeed, especially your documenting what did and didn't work and the associated performance figures. Very interesting that magic bitboards gave the best performance. My understanding (up to this point) is that the lookup tables would be too large to fit in local memory and hence kill the performance because of the global memory accesses. Yes, Daniel has been very helpful in sharing his experience. However, my aims - in the first instance - aren't to create a complete chess engine running on the GPU, I want to see if I can run Monte-Carlo game playouts in the most SIMD-friendly way possible.

Cheers,
John

Daniel Shawul · Post by **Daniel Shawul** » Tue Jun 12, 2012 1:00 am

My understanding (up to this point) is that the lookup tables would be too large to fit in local memory and hence kill the performance because of the global memory accesses.

Yes it is and that is why I told you to try the kindergarten approach 1.5kb for me.

Yes, Daniel has been very helpful in sharing his experience. However, my aims - in the first instance - aren't to create a complete chess engine running on the GPU, I want to see if I can run Monte-Carlo game playouts in the most SIMD-friendly way possible.

You are missing the whole point it has become annoying. Monte carlo is a search. After a couple of make moves, all of your threads would be working on completely different postiions. Some would go deeper and some would stop prematurely. That is divergence. And yet you are worried about a SIMD friendly move generation, branchless move generator. This is ridiclous because optimization for GPUs start with memory. GPGPUs are not SIMD they are SIMT, and they can do far more than matrix manuplations. Only warps are exectued in a SIMD fashion. I consider each core like a single processing unit capable of doing a search by itself. If you want to use say 32 cores for generating moves on a single position, go kogge-stone or whatever. Doesn't interest me a bit as you would be wasting a possible 32x speed up to begin with (i.e if there was enough L1). But by now you have gathered enough suggestions to start coding. So do that,type something and see how it goes ..

Daniel Shawul · Post by **Daniel Shawul** » Tue Jun 12, 2012 1:12 am

johnhamlen wrote:
Daniel Shawul wrote: Well that is not a piece list.
LOL. Okay then, it's a list of squares that the pieces are on . Wow, it's a tough crowd in tonight! ....

What are you missing? It is a piece _array_ not a list. Like I said it can be done better if I knew that is what you wanted. Nibbles of board[64] for a total of 32 bytes, i.e same as quad bb. When a piece is captured, you check if a piece is captured ( -1 entries in your array ). I directly check if the square is empty. So what is the difference ? 64 vs 32 ... ?

Daniel Shawul wrote: Cheers Daniel, I will. And I take your point re: my 0x88 plan morphing into bitboards anyway. Just don't think my old neurons are ready to cope with a new (to me) way of movegen (SIMD bitboards) + a new architecture (GPUs) + a new programming language (OpenCL or CUDA or C++ AMP). So I'll be starting with 0x88 to get some experience and then - after finding that, even at 100% utilisation, 900 iterations to get a single move list(!) - use the experience gained to move to faster bitboards.

Look man do what suits you. I provided you with the reality of GPU computing for chess but you seem to have difficulty dealing with it. Searching for excuses ,not to accept something delivered to you probably in a cold manner, is not a solution. I have coded a monte-carlo chess & hex and UCT search too, so once you start coding and getting results, maybe I will learn something that I missed...

Daniel Shawul · Post by **Daniel Shawul** » Tue Jun 12, 2012 1:26 am

Hey Srdja
I agree with reducing the memory usage, BUT I do not think your code is ugly at all. Trying to avoid branching code is a waste of time and many people seem to miss that... GPU computation is memory bound (200x slower) even with the new L1 and L2. Infact it is encouraged to compute / recompute stuff rather thand do a single uncache probe... So all this stuff about branching code, parallel move generation doesn't make that much sense to me, when you have far bigger issues to deal with. This is not CPU computation with wide registers (SWAR), there you have a couple of threads (say 4) where you can do vector computations (instead of scalar) to get better performance. The caching is really good there with few threads. For GPUs you have thousands_ of threads to work with. Unless the proposition is to launch few threads say 32x less than what is possible and hope to gain something from parallel move generation, evaluation etc, it doesn't make sense. AFAIK you and I have been trying to use each core for doing real search. So unless you think it is impossible toreach full occupancy with the current method used, there is no need to waste cores for that purpose.
cheers
Daniel

johnhamlen · Post by **johnhamlen** » Tue Jun 12, 2012 1:41 am

Sorry Daniel, I didn't mean to annoy/upset/frustrate you in any way, but clearly I have and I apologise

. I sincerely appreciated all your points and have learned a lot. Good night and have a great week.

John

smatovic · Post by **smatovic** » Tue Jun 12, 2012 3:05 am

just for the record:

i retested my old kogge-stone implementation, now with only one thread,
looks like it is a bit faster:

Code: Select all

            // every direction
            for&#40;i=0;i<8;i++) &#123;

                pro = ~bbBlockers;
                gen = SetMaskBB&#91;pos&#93;;
                r = shift&#91;&#40;piece>>1&#41;*8+i&#93;;
                pro &= avoidWrap&#91;&#40;piece>>1&#41;*8+i&#93;;

                // do kogge stone
                gen |= pro & (&#40;gen << r&#41; | &#40;gen >> &#40;64-r&#41;));
                pro &=       (&#40;pro << r&#41; | &#40;pro >> &#40;64-r&#41;));
                gen |= pro & (&#40;gen << 2*r&#41; | &#40;gen >> &#40;64-2*r&#41;));
                pro &=       (&#40;pro << 2*r&#41; | &#40;pro >> &#40;64-2*r&#41;));
                gen |= pro & (&#40;gen << 4*r&#41; | &#40;gen >> &#40;64-4*r&#41;));

                // Shift one for Captures
                bbTemp |= (&#40;gen << r&#41; | &#40;gen >> &#40;64-r&#41;)) & avoidWrap&#91;&#40;piece>>1&#41;*8+i&#93;;
            &#125;

instead of

Code: Select all

            // Knight and King
            bbTemp  = ( &#40;piece>>1&#41; == KNIGHT || &#40;piece>>1&#41; == KING&#41;? AttackTables&#91;&#40;som*7*64&#41;+(&#40;piece>>1&#41;*64&#41;+pos&#93; &#58; 0;

            // Sliders
            // rook or queen
            bbTemp  |= ( &#40;piece>>1&#41; == ROOK || &#40;piece>>1&#41; == QUEEN&#41;?      ( RAttacks&#91;RAttackIndex&#91;pos&#93; + ((&#40;bbBlockers & RMask&#91;pos&#93;) * RMult&#91;pos&#93;) >> RShift&#91;pos&#93;)&#93; ) &#58; 0;
            // bishop or queen
            bbTemp  |= ( &#40;piece>>1&#41; == BISHOP || &#40;piece>>1&#41; == QUEEN&#41;?    ( BAttacks&#91;BAttackIndex&#91;pos&#93; + ((&#40;bbBlockers & BMask&#91;pos&#93;) * BMult&#91;pos&#93;) >> BShift&#91;pos&#93;)&#93; ) &#58; 0;

            // Pawn attacks and forward step
            bbTemp  |= ( &#40;piece>>1&#41; == PAWN&#41; ? &#40;PawnAttackTables&#91;som*64+pos&#93; & bbOpposite&#41;  | &#40;PawnAttackTables&#91;&#40;som+2&#41;*64+pos&#93; & ~bbBlockers&#41;                        &#58; 0 ;

thanks again for your suggestions, good to know that there is much place for improvement.....

--
Srdja

smatovic · Post by **smatovic** » Tue Jun 12, 2012 3:16 am

Hey Daniel,

I agree with reducing the memory usage, BUT I do not think your code is ugly at all.

Thanks for this one.
I guess the code of an GPU-Chess-Programm must be a bit arbitrary

AFAIK you and I have been trying to use each core for doing real search. So unless you think it is impossible toreach full occupancy with the current method used, there is no need to waste cores for that purpose.

I totally agree with you.
I ve tested different approaches to get multible threads working on the same item but the one-thread-one-board design outperforms them.

...actually i am happy with the performance of Zetas Move Generator...looking forward to debug and enhance the search-algorithm....

--
Srdja

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jun 12, 2012 8:31 am

smatovic wrote:just for the record:

i retested my old kogge-stone implementation, now with only one thread,
looks like it is a bit faster:

Code: Select all

            // every direction
            for&#40;i=0;i<8;i++) &#123;

                pro = ~bbBlockers;
                gen = SetMaskBB&#91;pos&#93;;
                r = shift&#91;&#40;piece>>1&#41;*8+i&#93;;
                pro &= avoidWrap&#91;&#40;piece>>1&#41;*8+i&#93;;

                // do kogge stone
                gen |= pro & (&#40;gen << r&#41; | &#40;gen >> &#40;64-r&#41;));
                pro &=       (&#40;pro << r&#41; | &#40;pro >> &#40;64-r&#41;));
                gen |= pro & (&#40;gen << 2*r&#41; | &#40;gen >> &#40;64-2*r&#41;));
                pro &=       (&#40;pro << 2*r&#41; | &#40;pro >> &#40;64-2*r&#41;));
                gen |= pro & (&#40;gen << 4*r&#41; | &#40;gen >> &#40;64-4*r&#41;));

                // Shift one for Captures
                bbTemp |= (&#40;gen << r&#41; | &#40;gen >> &#40;64-r&#41;)) & avoidWrap&#91;&#40;piece>>1&#41;*8+i&#93;;
            &#125;

instead of

Code: Select all

            // Knight and King
            bbTemp  = ( &#40;piece>>1&#41; == KNIGHT || &#40;piece>>1&#41; == KING&#41;? AttackTables&#91;&#40;som*7*64&#41;+(&#40;piece>>1&#41;*64&#41;+pos&#93; &#58; 0;

            // Sliders
            // rook or queen
            bbTemp  |= ( &#40;piece>>1&#41; == ROOK || &#40;piece>>1&#41; == QUEEN&#41;?      ( RAttacks&#91;RAttackIndex&#91;pos&#93; + ((&#40;bbBlockers & RMask&#91;pos&#93;) * RMult&#91;pos&#93;) >> RShift&#91;pos&#93;)&#93; ) &#58; 0;
            // bishop or queen
            bbTemp  |= ( &#40;piece>>1&#41; == BISHOP || &#40;piece>>1&#41; == QUEEN&#41;?    ( BAttacks&#91;BAttackIndex&#91;pos&#93; + ((&#40;bbBlockers & BMask&#91;pos&#93;) * BMult&#91;pos&#93;) >> BShift&#91;pos&#93;)&#93; ) &#58; 0;

            // Pawn attacks and forward step
            bbTemp  |= ( &#40;piece>>1&#41; == PAWN&#41; ? &#40;PawnAttackTables&#91;som*64+pos&#93; & bbOpposite&#41;  | &#40;PawnAttackTables&#91;&#40;som+2&#41;*64+pos&#93; & ~bbBlockers&#41;                        &#58; 0 ;

thanks again for your suggestions, good to know that there is much place for improvement.....

--
Srdja

That shows that magic bitboards were inefficient for you, since Kogge-Stone is not intended to generate attacks for one piece only, but multiple sliding pieces, that is rooks/ queens or bishops/queens in one run for the appropriate ray-directions. Also better unroll the loop to avoid the variable shift rotates, but use fixed shifts and avoid wrap mask ands.

So for your loop over pieces (instead of directions) framework, I agree with Daniel to go with 1.5K Kindergarten bitboards.

Gerd

Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards

Re: Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards vs Kogge-Stone

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards vs Kogge-Stone