Is there such a thing as branchless move generation?

johnhamlen · Post by **johnhamlen** » Thu Jun 07, 2012 9:50 pm

I'm starting to write the prototype of a move generator to run on GPUs. It's for monte-carlo game simulations so I can allow myself certain simplifications without affecting performance too much e.g. no underpromotions and no ep-captures, but to fit nicely into a GPU compute environment, there are two horrible constraints:
1) Low memory footprint (bye-bye big lookup tables), and
2) The code should be branchless.

Sadly, branchless code and avoiding big lookup tables seem to be rather incompatible.

Has anyone tried writing a completely branchless move generator? It's an interesting thought experiment e.g. some how treating non-sliding pieces as sliding pieces of range 1 etc. - but has it ever been done? Is it even possible to generate non-sliding, sliding, promotion and pawn moves with the same instructions with the only difference being the data/masks etc. I think it is, but haven't proved it with any code yet!

Many thanks,
John

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Jun 07, 2012 10:44 pm

Hi John,

Direction-wise move generation can be done almost branchless without any lookups but pure SIMD register computation with fill algorithms aka Kogge-Stone for sliding piece attacks.

Another related cpw link:
Hashing Multiple Bits

Cheers,
Gerd

johnhamlen · Post by **johnhamlen** » Fri Jun 08, 2012 12:18 am

Cheers Gerd, you're a star. What a wonderful resource your CPW is!

Those fill algorithms look very promising. Now "all" I need to do is figure out how to get rid of any need for explicit "case (PieceType) ..." type code and I'm sorted!

All the best,
John

Edmund · Post by **Edmund** » Fri Jun 08, 2012 12:57 am

I would go for the 0x88 board approach.

make an array that for each piece stores whether it can slide
make an array that for each piece for each direction stores the move increment

Loop over every square
Loop over all 8 directions

int move_is_legal = (from == correct piece color) && direction is legal //(e.g. a rook can only have 4 directions, the other 4 entries in the direction array could be 0)

Loop over all 7 moves per direction {

store the move in the movelist

move_is_legal &= (slide == 0 || canslide) && (to is on the board) && (to is not own piece)

increment movecounter iif move_is_legal

move_is_legal &= (to is opponent piece)

}

johnhamlen · Post by **johnhamlen** » Fri Jun 08, 2012 10:21 am

Thanks Edmond. This is great for clarifying my thoughts.

I was getting a bit bogged down thinking about what I know (Woodpusher uses 1990s vintage bidboards) and marrying this to what looks elegant and I really wanted to try (0x88).

I'll kick off with "pure" 0x88 as I can get my head around it quicker than the Kogge-Stone magic, and delay wrestling with the latter until after I find 0x88 is horribly slow

.

I've got two computer science degrees under my belt, so should be blasé about GPU architectures, but I still find the idea of single card scheduling instructions for 100s (and now 1000s!) of little compute cores somewhat magical!

So looking forward to experimenting.

Thanks again, John

Sven · Post by **Sven** » Fri Jun 08, 2012 11:21 am

johnhamlen wrote:Thanks Edmond. This is great for clarifying my thoughts.

I was getting a bit bogged down thinking about what I know (Woodpusher uses 1990s vintage bidboards) and marrying this to what looks elegant and I really wanted to try (0x88).

I'll kick off with "pure" 0x88 as I can get my head around it quicker than the Kogge-Stone magic, and delay wrestling with the latter until after I find 0x88 is horribly slow .

I've got two computer science degrees under my belt, so should be blasé about GPU architectures, but I still find the idea of single card scheduling instructions for 100s (and now 1000s!) of little compute cores somewhat magical! So looking forward to experimenting.

Thanks again, John

But please keep in mind that loops are usually not branchless, unless the compiler can unroll them ... For that very reason the 0x88 approach might not be what you are looking for: exactly the "0x88" test implies branching code in order to stop before dropping off the board.

If 100% branchless move generation code is really your requirement then you will have a hard time with any of today's existing board representation + move generation concepts. For instance, how to avoid a loop if you have the bitboard of all white pawns and need to generate all single step moves for each of these pawns? At any point where the number of steps to perform depends on the board state you will most probably need some condition.

I can imagine an approach based on tables of function pointers, where each function is responsible for generating exactly N moves and takes as input a bitboard where exactly N bits are set. The calling function uses a (branchless) popcnt() to index into the function pointer table. With C++ one could use templates to do this but it is quite a lot of work.

I am not sure yet whether you actually need that "100% branchless" requirement, or whether you only want to come as close as possible to that goal.

Sven

johnhamlen · Post by **johnhamlen** » Fri Jun 08, 2012 12:31 pm

Hi Sven.

Many thanks for your thoughts. Yes, your reservations mirror those that I had - and was the reason for posting in the first place to see it better brains than mine had already solved this little puzzle.

I agree with your point about looping and branches, though Edmund's pseudo-code goes a long way to fixing this for non-pawn pieces. If I understand it correctly, this would mean not testing if you are off the board on every iteration of following the ray, but rather iterating 7 times in every direction whether you stepped off the board on the previous iteration or not.

Haven't thought it through properly yet, but if this is the case then it would break the "pure" 0x88 model, but hopefully can be rescued by some clever modulo arithmetic?

With further small lookup arrays as proposed by Edmund I can see pawn promotions being taken care of by just assuming that every move is a promotion, it just being that pawns and pieces normally promote to themselves. I can't (at the moment!) see how pawn capture moves can be generated in this branchless SIMD framework, but looking forward to thinking about it!

Re: function pointers. Unfortunately with GPUs, the C programmer (ie me) is limited to OpenCL (AMD or Nvidia) or CUDA (Nvidia) and neither support function pointers.

100% branchless is more of an aspiration than a requirement. With branchless code, all the little SIMD cores can be kept busy at the same time. As soon as you hit a IF (X) THEN run A; ELSE run B section of code then execution of the B code is halted while the A code is executed and vice versa. This means cores are sitting around doing nothing. The problem escalates exponentially once you start having branches within branches until you get to the situation where there are sections of the code where only a small percentage of the available cores are doing any useful work.

Of course it's all a trade-off as it might be better to accept the hit of underutilisation described above to avoid the redundancy of exploring "rays" from a pawn! Don't know yet

All the best,
John

Edmund · Post by **Edmund** » Fri Jun 08, 2012 12:32 pm

Some additional thought.
Instead of looping over the 64 squares in the outher loop, you can also keep piecelists and loop over the maximum of 16 pieces.

In total you would then increment the inner part of the loop
16*8*7=896 times
instead of the 64*8*7=3584

The additional effort needed during move and unmove (updating of piecelists) should be compensated by the time saved in movegeneration.

Regarding Svens argument; IIRC conditions are not a problem for GPUs, rather different threads taking different routes at conditions is. As such loops with fixed number of iterations should be fine. If I am wrong you would have to unroll the 896 iterations, if the compiler can't do that automatically.

Edmund · Post by **Edmund** » Fri Jun 08, 2012 12:41 pm

johnhamlen wrote: I agree with your point about looping and branches, though Edmund's pseudo-code goes a long way to fixing this for non-pawn pieces. If I understand it correctly, this would mean not testing if you are off the board on every iteration of following the ray, but rather iterating 7 times in every direction whether you stepped off the board on the previous iteration or not.

Haven't thought it through properly yet, but if this is the case then it would break the "pure" 0x88 model, but hopefully can be rescued by some clever modulo arithmetic?

Detecting stepping off the board is trivial in the 0x88 model. The &= with the legality of the last move in a certain sliding direction is only needed to capture the case where the last move was illegal due to an occupied square. All further sliding moves must be set illegal then too.

johnhamlen wrote:With further small lookup arrays as proposed by Edmund I can see pawn promotions being taken care of by just assuming that every move is a promotion, it just being that pawns and pieces normally promote to themselves. I can't (at the moment!) see how pawn capture moves can be generated in this branchless SIMD framework, but looking forward to thinking about it!

Pawn caputres are no big deal.
Just add before incrementing the movecounter:
move_is_legal &= (piece is no pawn) || (direction is vertical && to is empty) || (direction is diagonal && to square is opponent piece)

Daniel Shawul · Post by **Daniel Shawul** » Fri Jun 08, 2012 1:12 pm

Listen to Sven. You are getting wrong advice here. I don't know if you just want to write a 100% branchless code or write a real chess engine ,but that is the least of your concerns. Say I give you a 100% branchless code, then what? The threads are going to execute different search threads anyway. 0x88 requires piece lists too, have you thought about where you are going to store those and the generated moves. This is just a bad advice sorry. Try bitboards (kogge-stone or not) doesn't really matter but atleast you won't spend your time optimizing something you don't need later. You are not going to do any real SIMD instruction on them. Plus there is no requirement that the code be branchless (even if it is possible). Warps don't just sit idle like cpus do when a branchless code is encountered. Other warps are scheduled for execution thus you can get your occupancy higher this way.

Is there such a thing as branchless move generation?

Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?