Is there such a thing as branchless move generation?

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jun 12, 2012 8:58 am

Daniel Shawul wrote:Hey Srdja
I agree with reducing the memory usage, BUT I do not think your code is ugly at all. Trying to avoid branching code is a waste of time and many people seem to miss that... GPU computation is memory bound (200x slower) even with the new L1 and L2. Infact it is encouraged to compute / recompute stuff rather thand do a single uncache probe... So all this stuff about branching code, parallel move generation doesn't make that much sense to me, when you have far bigger issues to deal with. This is not CPU computation with wide registers (SWAR), there you have a couple of threads (say 4) where you can do vector computations (instead of scalar) to get better performance. The caching is really good there with few threads. For GPUs you have thousands_ of threads to work with. Unless the proposition is to launch few threads say 32x less than what is possible and hope to gain something from parallel move generation, evaluation etc, it doesn't make sense. AFAIK you and I have been trying to use each core for doing real search. So unless you think it is impossible toreach full occupancy with the current method used, there is no need to waste cores for that purpose.
cheers
Daniel

I mean this inner loop in generation, and sorry, it looks ugly for my taste

It is not only about branches bit to avoid domove/legaltest/undomove by predetermining pinned pieces and attacks.

Code: Select all

            while&#40; bbMoves ) &#123;

                // pop 1st bit
                to = (&#40;Square&#41;&#40;BitTable&#91;(&#40;bbMoves & -bbMoves&#41; * 0x218a392cd3d5dbf&#41; >> 58&#93;) );
                bbMoves &= &#40;bbMoves-1&#41;;

                cpt = to; // TODO&#58; en passant
                pieceto = ( &#40;piece>>1&#41; == PAWN && ( &#40;som == WHITE && &#40;to>>3&#41; == 7&#41; || &#40;som == BLACK && &#40;to>>3&#41; == 0&#41; ) ) ? &#40;QUEEN<<1 | som&#41;&#58; piece; // pawn promotion

                piececpt = (&#40;board&#91;0&#93;>>cpt&#41; &1&#41; + 2*(&#40;board&#91;1&#93;>>cpt&#41; &1&#41; + 4*(&#40;board&#91;2&#93;>>cpt&#41; &1&#41; + 8*(&#40;board&#91;3&#93;>>cpt&#41; &1&#41;;

                // make move and store in global
                move = (&#40;Move&#41;pos | &#40;Move&#41;to<<6 | &#40;Move&#41;cpt<<12 | &#40;Move&#41;piece<<18 | &#40;Move&#41;pieceto<<22 | &#40;Move&#41;piececpt<<26 );

                // TODO&#58; pseudo legal move gen&#58; 2x speedup?
                // domove
                domove&#40;board, pos, to, cpt, piece, pieceto, piececpt&#41;;

                //get king position
                bbTemp = board&#91;0&#93; ^ &#40;board&#91;1&#93; | board&#91;2&#93; | board&#91;3&#93;);
                bbTemp = &#40;som == BLACK&#41;? board&#91;0&#93; &#58; bbTemp ;
                bbTemp = bbTemp & board&#91;1&#93; & board&#91;2&#93; & ~board&#91;3&#93;; // get king
                kingpos = &#40;Square&#41;&#40;BitTable&#91;(&#40;bbTemp & -bbTemp&#41; * 0x218a392cd3d5dbf&#41; >> 58&#93;);

                kic = 0;
                // king in check?
                kic = PieceInCheck&#40;board, kingpos, som, RAttacks, BAttacks&#41;;

                if &#40;kic == 0&#41; &#123;

                    // copy move to global
                    global_pid_moves&#91;pid*max_depth*256+sd*256+n&#93; = move;
                    // Movecounters
                    n++;
                    COUNTERS&#91;totalThreads*3+pid&#93;++;
                &#125;

                // undomove
                undomove&#40;board, pos, to, cpt, piece, pieceto, piececpt&#41;;
            &#125;

For storing some intermediate results inside the local scope of a routine, say 32 bitboards as local variables is a bottleneck on a GPU?

Gerd

Daniel Shawul · Post by **Daniel Shawul** » Tue Jun 12, 2012 3:46 pm

Maybe he could have taken the legality test out side the inner loop since it is better to do pseudo-legal move generation first. There is the majic multiplier and then piece extraction from the quad bb which all seem born out of need and can't do anything about.

For storing some intermediate results inside the local scope of a routine, say 32 bitboards as local variables is a bottleneck on a GPU?

Number of registers used are the bottleneck. The nvcc compiler does so much optimization with the registers it is very hard to free up a single register. Temporaries are quickly reused and manual optimization for me did not work, still stuck at 45 registers or so for chess i.e including move generation,make and everything else. Hex is far lower about 12 register IIRC so it is the nature of the game IMO. If you do a completely "disjoint" set of operations that don't share registers in the same kernel, the register usage will not increase. For example I have a move generator and number of moves counter, which almost does the same calculations but doesn't save the move. The two consume about 30 registers each but my register usage still remains the same. So I would be curious to see how much can be saved by taking out the legality testing outside if any.

Anyway Kogge-stone is good iff you are going to make the fills in the 8 directions using a core each. As Srdja mentioned he tested that and the focus is on using each core to do its own move generation and search. Well with the the trend I saw from the new device Kepler, maybe that approach of unrolling loops and working with kogge stone will be better. Too many number of cores, too little memory, lower clock rates and addition of dynamic parallelism that unrolls loops automatically and stuff like that. My favourite for general computing is still the fermi ...

smatovic · Post by **smatovic** » Tue Jun 12, 2012 4:35 pm

Maybe he could have taken the legality test out side the inner loop since it is better to do pseudo-legal move generation first.

Thanks,
i will test an pseudo-legal move generator in future..

Daniel Shawul · Post by **Daniel Shawul** » Tue Jun 12, 2012 6:37 pm

smatovic wrote:
Maybe he could have taken the legality test out side the inner loop since it is better to do pseudo-legal move generation first.
Thanks,
i will test an pseudo-legal move generator in future..

Well it may help some but for me chess on gpu is totally dominated by memory constraints, and optimizations like avoiding branching code is at the bottom of the to do list. You can see some claims that code is accelerated by 50% or so on gpu through reduction of branching but that is not game tree search. My hex engine ( a cleverly selected game I should say ) was kind of a success because it doesn't suffer from memory constraints, and very suitable to gpu kind of computaitons.
In Hex, even the warp divergence in _search_ is minimal and I don't do too many tests while making moves. The MC search goes down to the end of the game (exactly 64 half plies), and each thread has equal number of moves at every ply. So for hex it helps some to avoid branching code. But for chess it is a completely different story for me. First there is the long pipeline latency similar to PIVs (24 cycles). If you don't have enough warps (6 out of 24 for 25% occupancy), your add,sub and other trivial operation take 24 cycles instead of 4cycles. I had difficulty running enough number of warps to hide that latency. Second,there is the global memory latency that requires far more number of warps to hide than that of the pipeline's. My solution for that was to consult the hashtables as rarely as possible. Each warp (32 threads) take one node from the tree and diverge from there onwards by making random moves. So I went through all sorts of trouble to avoid consulting global mem, and some may think it is completely insane what I did because I computed a lot to avoid un cached probes. A routine to count number of moves as expensives as move generation, keeping a bit set of moves generated so far and other things could all have been avoided if I just had a little bit extra space to store generated moves. Well maybe I was too parnanoid but that explains my position on why I didn't bother with branchless optimization in chess.

diep · Post by **diep** » Sat Jun 16, 2012 10:40 am

johnhamlen wrote:I'm starting to write the prototype of a move generator to run on GPUs. It's for monte-carlo game simulations so I can allow myself certain simplifications without affecting performance too much e.g. no underpromotions and no ep-captures, but to fit nicely into a GPU compute environment, there are two horrible constraints:
1) Low memory footprint (bye-bye big lookup tables), and
2) The code should be branchless.

Sadly, branchless code and avoiding big lookup tables seem to be rather incompatible.

Has anyone tried writing a completely branchless move generator? It's an interesting thought experiment e.g. some how treating non-sliding pieces as sliding pieces of range 1 etc. - but has it ever been done? Is it even possible to generate non-sliding, sliding, promotion and pawn moves with the same instructions with the only difference being the data/masks etc. I think it is, but haven't proved it with any code yet!

Many thanks,
John

1) peanuts
2) peanuts

it eats a few more instructions, that's all.
note gpu's have a lot of conditional move type instructions.
so extremely simple branches are very fast.

also not, though not needed for move generation, you can use branches provided that all cores take the same path.

the fast move generation that i put in GPL for CPU, already is using in move generator a trick that is fast for cpu's, to avoid a branch for move generation.

you just need a few more of those tricks that's all.
tricky part is get maximum speed out of it and proving that you can't do it a lot faster. I have no answer to that as of yet.

diep · Post by **diep** » Sat Jun 16, 2012 10:46 am

Daniel Shawul wrote:
smatovic wrote:
Maybe he could have taken the legality test out side the inner loop since it is better to do pseudo-legal move generation first.
Thanks,
i will test an pseudo-legal move generator in future..
Well it may help some but for me chess on gpu is totally dominated by memory constraints, and optimizations like avoiding branching code is at the bottom of the to do list. You can see some claims that code is accelerated by 50% or so on gpu through reduction of branching but that is not game tree search. My hex engine ( a cleverly selected game I should say ) was kind of a success because it doesn't suffer from memory constraints, and very suitable to gpu kind of computaitons.
In Hex, even the warp divergence in _search_ is minimal and I don't do too many tests while making moves. The MC search goes down to the end of the game (exactly 64 half plies), and each thread has equal number of moves at every ply. So for hex it helps some to avoid branching code. But for chess it is a completely different story for me. First there is the long pipeline latency similar to PIVs (24 cycles). If you don't have enough warps (6 out of 24 for 25% occupancy), your add,sub and other trivial operation take 24 cycles instead of 4cycles. I had difficulty running enough number of warps to hide that latency. Second,there is the global memory latency that requires far more number of warps to hide than that of the pipeline's. My solution for that was to consult the hashtables as rarely as possible. Each warp (32 threads) take one node from the tree and diverge from there onwards by making random moves. So I went through all sorts of trouble to avoid consulting global mem, and some may think it is completely insane what I did because I computed a lot to avoid un cached probes. A routine to count number of moves as expensives as move generation, keeping a bit set of moves generated so far and other things could all have been avoided if I just had a little bit extra space to store generated moves. Well maybe I was too parnanoid but that explains my position on why I didn't bother with branchless optimization in chess.

In 90s we just didn't store the moves at all. You generate 1 move and then search it. Worked for cpu's with 128 bytes RAM. The 128 bytes victory goes to Frans Morsch if i remember well

Doing anything in a cache is not clever. part of L1 for fast hashtable, that should be the max you want to store and what's closer to the root is allowed to probe the RAM, nothing else you'd want to store in the RAM...

Algorithmically the hashtable is always the problem to solve well.

diep · Post by **diep** » Sat Jun 16, 2012 10:57 am

Edmund wrote:Some additional thought.
Instead of looping over the 64 squares in the outher loop, you can also keep piecelists and loop over the maximum of 16 pieces.

In total you would then increment the inner part of the loop
16*8*7=896 times
instead of the 64*8*7=3584

The additional effort needed during move and unmove (updating of piecelists) should be compensated by the time saved in movegeneration.

Regarding Svens argument; IIRC conditions are not a problem for GPUs, rather different threads taking different routes at conditions is. As such loops with fixed number of iterations should be fine. If I am wrong you would have to unroll the 896 iterations, if the compiler can't do that automatically.

You can loop usually over less of course using this method - you can use the maximum of pieces in this computation unit.

diep · Post by **diep** » Sat Jun 16, 2012 11:07 am

Daniel Shawul wrote:Well 0x88 is simply a bad idea on gpus because of maintaing an additional piece list that you don't need and ofcourse an int board[64] for each thread you spawn (could be thousands). With the small L1 cache,computing stuff is always better than carrying luggage around. Ofcouse if you are planning not to use each thread to do an independent search then that is fine. But do realize you are loosing big before you even begin coding. Sorry if I come strong but I saw the thread going in the wrong direction (from _my perspective_) so I thought I would throw in my opinion.
The OP posted here not on TLCV chat plus lots of garbage there.

Yes storing anything not needed is no good, i agree with that.

The problem is of course that you want to generate the best move and have some sort of decent best move. You're allowed to waste massive cycles onto that each move. Yes in fact you can do a full evaluation for each move in fact, as we can then say something about how many cycles a node you're gonna burn up, even slowing down another factor 2, just to each time try the best move is worth it.

You always generate of course moves in the same order. So then picking the best move based upon the sorting you created, then just count down until you skip the move you just searched there.

Deep Blue/Thought was supposed to just store the move number just searched.

All you need to store is the move itself. Could be done 16 bits so to speak, or a 8 bits number.

8 bits number should work

That's what i did do in the move generator i built for Diep's old GUI. Storing moves really had to get compressed there, as otherwise book wouldn't fit well on a floppy disk.

On average i used 4 to 5 bits to store best move first converted from internal datastructure where we know of course maximum number of semi-legal moves is under 220 or so, so we can always start at 8 bits storage.

If you're order of trying always is in the same manner, the big slowdown of course is the check whether it's a killermove

This should get things down on GPU to 5k - 10k clocks a node.

At Tesla's here i have at 448 cores @ 1.15 Ghz roughly each one that translates to roughly: 448 * 1.15 G / 5k = 448 * 1.15M / 5 = 103M nps

diep · Post by **diep** » Sat Jun 16, 2012 11:23 am

johnhamlen wrote:I'm starting to write the prototype of a move generator to run on GPUs. It's for monte-carlo game simulations so I can allow myself certain simplifications without affecting performance too much e.g. no underpromotions and no ep-captures, but to fit nicely into a GPU compute environment, there are two horrible constraints:
1) Low memory footprint (bye-bye big lookup tables), and
2) The code should be branchless.

Sadly, branchless code and avoiding big lookup tables seem to be rather incompatible.

Has anyone tried writing a completely branchless move generator? It's an interesting thought experiment e.g. some how treating non-sliding pieces as sliding pieces of range 1 etc. - but has it ever been done? Is it even possible to generate non-sliding, sliding, promotion and pawn moves with the same instructions with the only difference being the data/masks etc. I think it is, but haven't proved it with any code yet!

Many thanks,
John

hi John,

On Monte Carlo, this is easy on GPU to do. What you're looking at is about a 5k-10k clocks per node. I realize you'll be disappointed with that number as in 90s already Schach 3.0 was under 600 clocks at a P5. But losing factor 10 to that is realistic.

See it as that you get free mobility evaluation during move generation

The Monte Carlo then should hardly lose anything and i had calculated a parallel worst case overhead of factor 2 using the chosen algorithm i had written on paper to parallellize.

That's a 'scaling overhead', not the speedup. With Monte Carlo you should have no troubles parallellizing, so for initial framework you want under 10k clocks a node and if not then keep optimizing, as effectively that's already factor 20 slower than you can do simple searching at a CPU.

Kind Regards,
Vincent

p.s. be careful to not cut'n paste from guys i wouldn't even assign a tictactoe programming contest to, as it might have you make datastructure in wrong manner. you optimize for storing the minimum and only when you can prove on paper you need something, you use it.

Each core needs the global bound and each compute unit you can see whether you can get away not storing the starting position. Then from that starting position each core might want to just record the moves played in its datastructure.

Go from there, store the minimum in the local registers.

diep · Post by **diep** » Sat Jun 16, 2012 11:56 am

johnhamlen wrote:
Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
Agreed. Totally wrong if I was calculating for a doubly linked list. Woodpusher used a singularly linked list before swapping to bitboards, but I was trying to think of the most memory-efficient representation which (I think) is an array of 2 x 16 entries: one byte for each of the 16 possible pieces for white and black.

drop the idea of using linked lists on gpu man. you will need 0 pointers if you want to be fast.

Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards

Re: Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?