Is there such a thing as branchless move generation?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Is there such a thing as branchless move generation?

Post by Edmund »

Daniel Shawul wrote:Listen to Sven. You are getting wrong advice here. I don't know if you just want to write a 100% branchless code or write a real chess engine, but that is the least of your concerns. Say I give you a 100% branchless code, then what? The threads are going to execute different search threads anyway. 0x88 requires piece lists too; have you thought about where you are going to store those and the generated moves? This is just bad advice, sorry. Try bitboards (Kogge-Stone or not, it doesn't really matter), but at least you won't spend your time optimizing something you don't need later. You are not going to do any real SIMD instructions on them. Plus there is no requirement that the code be branchless (even if it is possible). Warps don't just sit idle the way cpus do when a branch is encountered; other warps are scheduled for execution, so you can get your occupancy higher this way.
Hey Daniel, nice to hear your thoughts. Would love to talk to you in the TLCV Chat again some time.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Is there such a thing as branchless move generation?

Post by Daniel Shawul »

Well, 0x88 is simply a bad idea on gpus because of maintaining an additional piece list that you don't need and of course an int board[64] for each thread you spawn (could be thousands). With the small L1 cache, computing stuff is always better than carrying luggage around. Of course, if you are planning not to use each thread to do an independent search then that is fine. But do realize you are losing big before you even begin coding. Sorry if I come on strong, but I saw the thread going in the wrong direction (from _my perspective_) so I thought I would throw in my opinion.
The OP posted here, not on TLCV chat, plus there's lots of garbage there.
User avatar
johnhamlen
Posts: 25
Joined: Sat Feb 18, 2012 10:56 am
Location: United Kingdom

Re: Is there such a thing as branchless move generation?

Post by johnhamlen »

Edmund wrote: Pawn captures are no big deal.
Just add before incrementing the movecounter:
move_is_legal &= (piece is no pawn) || (direction is vertical && to is empty) || (direction is diagonal && to square is opponent piece)
Thanks Edmund, very neat and obvious (in a good way!). I guess I hadn't fully gotten my head around the move_is_legal flag use. Powerful. Cheers.
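Just to check I've got it, here is how I read that in 0x88 terms - purely a sketch with made-up names (board, side, dir, is_pawn), not code from any real engine:

Code: Select all

#include <cstdint>

// Branchless 0x88 pawn filter: every condition is evaluated to 0 or 1 and combined
// with & and |, so there is no data-dependent branch. board[] holds 0 for empty
// squares, positive codes for white pieces and negative for black; side is +1 or -1.
// The result is &='d into the overall move_is_legal flag, which is assumed to
// already cover the generic "target is empty or holds an enemy piece" condition.
inline int pawn_filter(const int8_t board[128], int to, int dir, int side, int is_pawn)
{
    int sq          = to & 0x7F;   // keep the read in bounds even for off-board 'to'
    int to_empty    = (board[sq] == 0);
    int to_enemy    = (board[sq] * side < 0);
    int is_vertical = (dir == 16) | (dir == -16) | (dir == 32) | (dir == -32);
    int is_diagonal = (dir == 15) | (dir == -15) | (dir == 17) | (dir == -17);

    return !is_pawn
         | (is_vertical & to_empty)
         | (is_diagonal & to_enemy);
}

The move then gets written to the list unconditionally and the move counter advances by the final 0/1 flag, exactly as you describe.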
User avatar
johnhamlen
Posts: 25
Joined: Sat Feb 18, 2012 10:56 am
Location: United Kingdom

Re: Is there such a thing as branchless move generation?

Post by johnhamlen »

Daniel Shawul wrote:Listen to Sven. You are getting wrong advice here. I don't know if you just want to write a 100% branchless code or write a real chess engine, but that is the least of your concerns. Say I give you a 100% branchless code, then what? The threads are going to execute different search threads anyway. 0x88 requires piece lists too; have you thought about where you are going to store those and the generated moves? This is just bad advice, sorry. Try bitboards (Kogge-Stone or not, it doesn't really matter), but at least you won't spend your time optimizing something you don't need later. You are not going to do any real SIMD instructions on them. Plus there is no requirement that the code be branchless (even if it is possible). Warps don't just sit idle the way cpus do when a branch is encountered; other warps are scheduled for execution, so you can get your occupancy higher this way.
Hi Daniel. Really pleased to see you chipping in as, from looking at other posts on the forum, you and Srdja Matovic are actively coding chess on GPUs. I was hoping that you and/or Srdja would join.

I must say that I'm still very pleased with Edmund's (and everyone else's) advice, but I probably need to explain a bit more background so you know where I'm coming from, and also to answer your questions in the process. I posted a longer question before this one but deleted it because it was just too big a brain dump of all the questions in my head for anyone to have a realistic chance of answering ... so no-one did :-). This branchless movegen query was just one of them!
Daniel Shawul wrote: I don't know if you just want to write a 100% branchless code or write a real chess engine, but that is the least of your concerns.
Yes and yes. I'm looking to run branchless code on the GPU to run Monte-Carlo simulations (which is why I'm so pleased you've joined the thread!) and - if successful and promising - I'll write a chess engine to run on the host CPU(s).
Daniel Shawul wrote: Say I give you a 100% branchless code, then what? The threads are going to execute different search threads anyway.
I definitely don't want anyone to give me any branchless code, especially not you, because I'd bet that your code is going to be rather elegant as you've already been coding in this MC/chess/GPU space for a while. Nothing worse than seeing really good code and then trying to clear it out of your mind and do your own thing :-(. The search threads and all intelligence will be on the host.
Daniel Shawul wrote: 0x88 requires piece lists too; have you thought about where you are going to store those and the generated moves?
Only need to store one movelist per core, or just one move if I can figure out how to generate one random move rather than generate all moves and pick one from the list. So I'm hoping that it/they will fit in register/local/L1 memory (?). (Still learning about how GPU memory models work.)
Daniel Shawul wrote: Try bitboards (Kogge-Stone or not, it doesn't really matter), but at least you won't spend your time optimizing something you don't need later. You are not going to do any real SIMD instructions on them.
The plan at the moment is to forget about the host search and just see if I can write and optimise an MC simulation engine that will run on a GPU. So the idea is indeed to see if I can run fully SIMD and keep all units occupied until the final gather operation. Re: bitboards. My ancient chess program (Woodpusher) uses these, so they are within my comfort zone, but even though things seem to have moved on a lot in this space, they still appear to rely rather a lot on big table lookups, which - from what I've learned so far about GPU computing - seems to be a real no-no as far as performance is concerned because of the huge latency of global memory access. So I was very happy to get the following response from my old friend Gerd:
Gerd Isenberg wrote:Direction-wise move generation can be done almost branchless without any lookups but pure SIMD register computation with fill algorithms aka Kogge-Stone for sliding piece attacks.
However, it's not immediately obvious to me how the Kogge-Stone methods work and how I can map them to branchless code. I feel that I need to walk before I run, so Edmund's post was very welcome! I haven't written a line of chess code since 1997 (Woodpusher) or a line of Monte-Carlo simulation code since 2004 (DumbGo), so 0x88 appears to be a nice introduction back into the world of coding, and I can see how it maps easily to a pure, branchless SIMD model. If I can get that running, I can then use that as a stepping stone (pardon the pun) to Kogge-Stone. What do you think?

Many thanks,
John
User avatar
johnhamlen
Posts: 25
Joined: Sat Feb 18, 2012 10:56 am
Location: United Kingdom

Re: Is there such a thing as branchless move generation?

Post by johnhamlen »

Edmund wrote:Some additional thought.
Instead of looping over the 64 squares in the outer loop, you can also keep piece lists and loop over a maximum of 16 pieces.

In total you would then execute the inner part of the loop
16*8*7 = 896 times
instead of 64*8*7 = 3584 times.

The additional effort needed during move and unmove (updating the piece lists) should be compensated for by the time saved in move generation.

Regarding Sven's argument: IIRC conditions themselves are not a problem for GPUs; rather, the problem is different threads taking different routes at a condition. As such, loops with a fixed number of iterations should be fine. If I am wrong, you would have to unroll the 896 iterations, if the compiler can't do that automatically.
Good thinking. Thanks. I believe you're right regarding the looping. Fixed length loops won't mess up any massively parallel goodness, and unrolling 896 iterations won't be practical because of the limited size of GPU instruction caches. Code "kernels" are limited in size. Hopefully Daniel or another GPU programmer will be able to confirm or refute this.
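To make sure I have the shape of it right, this is roughly what I have in mind - a sketch only, with invented names, and with the per-piece-type masks (knight jumps, the pawn filter from your earlier post, king range) left out for brevity:

Code: Select all

#include <cstdint>

// Fixed-trip-count 0x88 generator: 16 piece slots x 8 directions x 7 steps = 896
// iterations for every thread, so the loop structure itself never diverges.
// board[] holds signed piece codes (0 = empty), side is +1 or -1, and empty
// piece-list slots are assumed to point at an empty square.
static const int DIRS[8] = { 1, -1, 16, -16, 15, -15, 17, -17 };

int generate(const int8_t board[128], const uint8_t piece_sq[16], int side,
             uint16_t *moves)   // caller provides at least 16*8*7 slots
{
    int count = 0;
    for (int p = 0; p < 16; ++p) {
        int from  = piece_sq[p] & 0x7F;
        int alive = (board[from] * side > 0);        // 0 for empty/captured slots
        for (int d = 0; d < 8; ++d) {
            int to   = from;
            int open = 1;                            // cleared once the ray is blocked
            for (int step = 0; step < 7; ++step) {
                to += DIRS[d];
                int on_board = ((to & 0x88) == 0);
                int sq       = to & 0x7F;
                int empty    = (board[sq] == 0);
                int enemy    = (board[sq] * side < 0);
                int valid    = alive & open & on_board & (empty | enemy);
                moves[count] = (uint16_t)((from << 8) | sq);   // written unconditionally
                count += valid;                      // only real moves advance the counter
                open  &= on_board & empty;           // stop extending past any blocker
            }
        }
    }
    return count;
}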

Cheers,
John
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Is there such a thing as branchless move generation?

Post by Daniel Shawul »

Yes and yes. I'm looking to run branchless code on the GPU to run Monte-Carlo simulations (which is why I'm so pleased you've joined the thread!) and - if successful and promising - I'll write a chess engine to run on the host CPU(s).
Communication between host and device is very slow, so it is much better to do everything on the device, or in some cases on the host (e.g. hash table generation). You should really consider doing everything on the gpu, as the latency there will probably not change anytime soon.
I definitely don't want anyone to give me any branchless code, especially not you, because I'd bet that your code is going to be rather elegant as you've already been coding in this MC/chess/GPU space for a while. Nothing worse than seeing really good code and then trying to clear it out of your mind and do your own thing. The search threads and all intelligence will be on the host.
You sound like a really nice guy but my code is bad too, and I don't want to spoil the fun for anyone else. I tried to do everything on the gpu (so did Srdja) for reasons I mentioned above.
Only need to store one movelist per core, or just one move if I can figure out how to generate one random move rather than generate all moves and pick one from the list. So I'm hoping that it/they will fit in register/local/L1 memory (?). (Still learning about how GPU memory models work.)
I think you should try it first, picking up one of the better gpus with a larger cache size. It is hard to fit everything in the L1. Bitboards are better because they are compact. To run a search of any sort you need at least an int board[64] (and piece lists for 0x88). The most compact would be a quad bitboard, but that requires more computation to extract, which again is nothing compared to what it saves. I don't use that, but any bitboard representation would be much better than a piece list representation. Also if you do a copy of the position later (for instance copy-make is better for me than do-undo), bitboards are better. The number of threads you can spawn is limited by how many you can fit in the L1, or else you face the consequences of the slow global memory. If that is the case even with global memory, you can imagine what it would be with a cpu-gpu shared search.
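To give an idea of what "more computation to extract" means, here is a generic sketch of one possible quad-bitboard encoding (not something I use myself; the names are only for illustration):

Code: Select all

#include <cstdint>

typedef uint64_t U64;

// One way to spread a 4-bit piece code per square over four bitboards,
// so a whole position fits in 4 x 8 = 32 bytes.
struct QuadBB {
    U64 plane[4];
};

// Reassemble the 4-bit piece code of a square: one shift/mask per plane, no branches.
inline int piece_code(const QuadBB &qb, int sq) {
    return (int)( ((qb.plane[0] >> sq) & 1)
                | ((qb.plane[1] >> sq) & 1) << 1
                | ((qb.plane[2] >> sq) & 1) << 2
                | ((qb.plane[3] >> sq) & 1) << 3 );
}

// The set of squares holding a given code is an AND over planes or their complements.
inline U64 squares_with_code(const QuadBB &qb, int code) {
    U64 s = ~0ULL;
    for (int i = 0; i < 4; ++i) {
        U64 want = (U64)0 - (U64)((code >> i) & 1);   // all-ones if bit i of code is set
        s &= qb.plane[i] ^ ~want;                     // plane if set, its complement if clear
    }
    return s;
}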
The plan at the moment is to forget about the host search and just see if I can write and optimise an MC simulation engine that will run on a GPU. So the idea is indeed to see if I can run fully SIMD and keep all units occupied until the final gather operation. Re: bitboards. My ancient chess program (Woodpusher) uses these, so they are within my comfort zone, but even though things seem to have moved on a lot in this space, they still appear to rely rather a lot on big table lookups, which - from what I've learned so far about GPU computing - seems to be a real no-no as far as performance is concerned because of the huge latency of global memory access. So I was very happy to get the following response from my old friend
Try kindergarten bitboards; they are very compact and fit into the constant cache (8kb per mp), not more than 3kb in my case. They are as fast as registers when stored there. Magic move generation requires bigger tables so I didn't try that, but it is much better than rotated bitboards in this regard.
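To show why the tables stay so small, here is the rank case of that scheme as a sketch (illustrative names; files and diagonals use one extra multiply to gather their occupancy onto the same 6-bit index, so the whole thing stays in the low-kilobyte range):

Code: Select all

#include <cstdint>

typedef uint64_t U64;

// 64 inner-occupancy patterns x 8 rook files, 1 byte each = 512 bytes for the rank case.
static uint8_t firstRankAttacks[64][8];

void initFirstRankAttacks() {
    for (int occ6 = 0; occ6 < 64; ++occ6) {
        for (int f = 0; f < 8; ++f) {
            uint8_t occ = (uint8_t)(occ6 << 1);   // the six inner squares carry the occupancy
            uint8_t att = 0;
            for (int t = f + 1; t < 8; ++t) { att |= (uint8_t)(1 << t); if (occ & (1 << t)) break; }
            for (int t = f - 1; t >= 0; --t) { att |= (uint8_t)(1 << t); if (occ & (1 << t)) break; }
            firstRankAttacks[occ6][f] = att;
        }
    }
}

// Rank attacks of a rook/queen on 'sq', given the full board occupancy.
U64 rankAttacks(U64 occ, int sq) {
    int file   = sq & 7;
    int rankx8 = sq & 56;                            // rank * 8
    int occ6   = (int)((occ >> (rankx8 + 1)) & 63);  // occupancy of the six inner rank squares
    return (U64)firstRankAttacks[occ6][file] << rankx8;
}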
However, it's not immediately obvious to me how the Kogge-Stone methods work and how I can map them to branchless code. I feel that I need to walk before I run, so Edmund's post was very welcome! I haven't written a line of chess code since 1997 (Woodpusher) or a line of Monte-Carlo simulation code since 2004 (DumbGo), so 0x88 appears to be a nice introduction back into the world of coding, and I can see how it maps easily to a pure, branchless SIMD model. If I can get that running, I can then use that as a stepping stone (pardon the pun) to Kogge-Stone. What do you think?
Honestly, I think you should forget about branchless code. Even if you manage to do that (e.g. Hex and reversi have branchless move generators), the search diverges. Also, you should look into how the gpu handles branching code: another warp gets scheduled for execution. This is different from the way it is handled on cpus, where branching code incurs wasted time. If your occupancy is above 25% (which is relatively easy to achieve), you can assume your branching code executes as fast as the branchless version. GPU code is usually memory bound, and anything you do to make computation faster (in your case branchless code) will not benefit you unless you solve the memory problem first.
cheers
Daniel
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Is there such a thing as branchless move generation?

Post by Gerd Isenberg »

johnhamlen wrote:So I was very happy to get the following response from my old friend Gerd:
Gerd Isenberg wrote:Direction-wise move generation can be done almost branchless without any lookups but pure SIMD register computation with fill algorithms aka Kogge-Stone for sliding piece attacks.
However, it's not immediately obvious to me how the Kogge-Stone methods work and how I can map them to branchless code.
First, thank you for your kind words on cpw, founded by Mark Lefler, where I happen to have become an addicted editor.

The point with fill-stuff is that it is SIMD-friendly and one may compute vectors of bitboards simultaneously, i.e. a quad-bitboard within two SSE 128-bit xmm or one AVX 256-bit ymm register(s) - for instance white/black rooks/queens and kings for the four orthogonal directions, and white/black bishops/queens and kings for the diagonal directions - for all sliding and pseudo attacks, to determine pinned pieces etc., and finally to store legal (one condition per direction) move-targets for each direction as a kind of fixed-size unsorted move-list. Maybe better to look at dumb7fill first, which is not much worse than its parallel-prefix Kogge-Stone counterpart.
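For a single bitboard and a single direction, the two fill flavours look like this in their usual textbook form (north only; the other rays use the corresponding shift and, where there is an east/west component, an A/H-file wrap mask):

Code: Select all

#include <cstdint>

typedef uint64_t U64;

// Occluded fill toward the north: sliders in 'gen', empty squares in 'pro'.
// dumb7fill: seven identical steps, one square at a time, fixed trip count.
U64 northFillDumb7(U64 gen, U64 pro) {
    U64 flood = gen;
    for (int i = 0; i < 7; ++i) {
        gen    = (gen << 8) & pro;
        flood |= gen;
    }
    return flood;
}

// Kogge-Stone: the same result in three parallel-prefix steps of 1, 2 and 4 squares.
U64 northFillKS(U64 gen, U64 pro) {
    gen |= pro & (gen <<  8);
    pro &=       (pro <<  8);
    gen |= pro & (gen << 16);
    pro &=       (pro << 16);
    gen |= pro & (gen << 32);
    return gen;
}

// The attack set is the occluded fill shifted one more step along the ray; it
// includes the first blocker, so own pieces still need to be masked off afterwards.
U64 northAttacks(U64 rooks, U64 empty) {
    return northFillKS(rooks, empty) << 8;
}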

Of course, picking the moves from these move-lists requires an appropriate state machine (big jump-table on state) and conditional branches - bitscan of the target, determining the source - and you don't want to sacrifice the exponential speedup of good move-ordering for the linear speedup of almost branchless and parallel generation. I use such a thing within a quad-bitboard color-flipper, white to move only, boosting the hash-probe prefetch.

Cheers,
Gerd
User avatar
johnhamlen
Posts: 25
Joined: Sat Feb 18, 2012 10:56 am
Location: United Kingdom

Re: Is there such a thing as branchless move generation?

Post by johnhamlen »

Daniel Shawul wrote: Communication between host and device is very slow, so it is much better to do everything on the device, or in some cases on the host (e.g. hash table generation). You should really consider doing everything on the gpu, as the latency there will probably not change anytime soon.
Many thanks. Yes, even PCI-E 3.0 isn't the quickest thing in the world. I guess I was hoping that, as most of the grunt work is done in the MC simulation runs, the ratio of time spent churning away on the GPU to time spent getting results back to the CPU would still be pretty high. But you're right, I should definitely plan to move the gather, and anything else I can, across to the GPU.
Daniel Shawul wrote: I think you should try it first, picking up one of the better gpus with a larger cache size. It is hard to fit everything in the L1.
Will do. I have a GTX 550 Ti at the moment. It's a pretty entry-level card but reports a CL_DEVICE_LOCAL_MEM_SIZE of 48 KB, while the top-end cards report no more. Sorry for the newbie question, but what number(s) should I be looking at to determine how much fast memory is available per compute core? Actually, hold that thought ... I'll post a separate thread for that as it might get some interesting CUDA vs OpenCL and AMD vs Nvidia discussion going! Watch this space...
Daniel Shawul wrote: Bitboards are better because they are compact. To run a search of any sort you need at least an int board[64] (and piece lists for 0x88). The most compact would be a quad bitboard, but that requires more computation to extract, which again is nothing compared to what it saves. I don't use that, but any bitboard representation would be much better than a piece list representation.
Is it significantly less? Thinking aloud:
14 x 64-bit bitboards = 112 bytes
vs.
128 x 8-bit piece-on-square + 32 x 8-bit square-of-piece (16 per side) = 160 bytes
Daniel Shawul wrote: Also if you do a copy of the position later (for instance copy-make is better for me than do-undo), bitboards are better. The number of threads you can spawn is limited by how many you can fit in the L1, or else you face the consequences of the slow global memory. If that is the case even with global memory, you can imagine what it would be with a cpu-gpu shared search.
Agreed. When I was coding DumbGo I was so happy I didn't have to worry about writing an unmakemove() function! Do you have a magic formula for calculating the L1 cache available to each core? Surfing around various pages - including the excellent http://www.codeproject.com/Articles/122 ... ory-Spaces - isn't giving me a straight answer :-)
Daniel Shawul wrote: Try kindergarten bitboards; they are very compact and fit into the constant cache (8kb per mp), not more than 3kb in my case. They are as fast as registers when stored there. Magic move generation requires bigger tables so I didn't try that, but it is much better than rotated bitboards in this regard.
Cheers Daniel. I will check these out!
Daniel Shawul wrote: Honestly, I think you should forget about branchless code. Even if you manage to do that (e.g. Hex and reversi have branchless move generators), the search diverges. Also, you should look into how the gpu handles branching code: another warp gets scheduled for execution. This is different from the way it is handled on cpus, where branching code incurs wasted time. If your occupancy is above 25% (which is relatively easy to achieve), you can assume your branching code executes as fast as the branchless version. GPU code is usually memory bound, and anything you do to make computation faster (in your case branchless code) will not benefit you unless you solve the memory problem first.
Thanks again Daniel. Great advice.
Cheers
John
User avatar
johnhamlen
Posts: 25
Joined: Sat Feb 18, 2012 10:56 am
Location: United Kingdom

Re: Is there such a thing as branchless move generation?

Post by johnhamlen »

Gerd Isenberg wrote: First, thank you for your kind words on cpw, founded by Mark Lefler, where I happen to have become an addicted editor.
You're most welcome. The whole of the CC community appreciates your efforts on CPW. If I keep momentum going, and have something useful to add, I'll certainly contribute some copy to the site.

I remember Mark well from Vancouver TWENTY-ONE years ago. Gosh we're getting old :-).
Gerd Isenberg wrote: The point with fill-stuff is that it is SIMD-friendly and one may compute vectors of bitboards simultaneously, i.e. a quad-bitboard within two SSE 128-bit xmm or one AVX 256-bit ymm register(s) - for instance white/black rooks/queens and kings for the four orthogonal directions, and white/black bishops/queens and kings for the diagonal directions - for all sliding and pseudo attacks, to determine pinned pieces etc., and finally to store legal (one condition per direction) move-targets for each direction as a kind of fixed-size unsorted move-list. Maybe better to look at dumb7fill first, which is not much worse than its parallel-prefix Kogge-Stone counterpart.
Interesting. I haven't written a line of GPU code yet, but I do know that vector arithmetic is their bread and butter, and that ulongN is a built-in data type in OpenCL, where N is the length of the vector, so it could be very handy for this sort of thing.
Gerd Isenberg wrote: Of course, picking the moves from these move-lists requires an appropriate state machine (big jump-table on state) and conditional branches - bitscan of the target, determining the source - and you don't want to sacrifice the exponential speedup of good move-ordering for the linear speedup of almost branchless and parallel generation. I use such a thing within a quad-bitboard color-flipper, white to move only, boosting the hash-probe prefetch.
I think the good news here is that with Monte-Carlo simulations, one only needs a single, randomly chosen move from the move list. So one could keep it branchless using equivalents of ROL/ROR and BSF/BSR if the GPU supports them.
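Thinking aloud, something like this might do - a sketch only, where __builtin_popcountll is the GCC/Clang builtin (on the GPU I'd expect to swap in whatever popcount/bitscan intrinsics OpenCL or CUDA expose), and the random number comes from whatever per-thread generator the simulation already carries:

Code: Select all

#include <cstdint>

typedef uint64_t U64;

// Pick the k-th set bit of bb (0-based). A fixed 64-iteration loop: 'hit' is 1
// exactly once, when the running count of earlier set bits equals k.
int nth_set_bit(U64 bb, int k) {
    int seen = 0, square = 0;
    for (int sq = 0; sq < 64; ++sq) {
        int bit = (int)((bb >> sq) & 1);
        int hit = bit & (seen == k);
        square += hit * sq;          // the chosen square is accumulated exactly once
        seen   += bit;
    }
    return square;
}

// Choose one random target square from a bitboard of move destinations.
// Returns 0 when 'targets' is empty, so the caller still has to test for that case.
int random_target(U64 targets, uint64_t rnd) {
    int n = __builtin_popcountll(targets);            // number of candidate moves
    int k = (int)(rnd % (uint64_t)(n | (n == 0)));    // avoid a division by zero
    return nth_set_bit(targets, k);
}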

Thanks again Gerd and have a great week.
John
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Is there such a thing as branchless move generation?

Post by Daniel Shawul »

I have a GTX 550 Ti at the moment. It's a pretty entry-level card but reports a CL_DEVICE_LOCAL_MEM_SIZE of 48 KB, while the top-end cards
Last time I checked, NVIDIA Teslas have 64 KB that you can divide into 48 KB + 16 KB of shared memory and L1 cache. If you just want the device to do most of the caching of global memory for you, use the 48 KB for that; otherwise 48 KB of shared memory per multiprocessor is what you get.
Is it significantly less? Thinking aloud:
14 x 64-bit bitboards = 112 bytes
vs.
128 x 8-bit piece-on-square + 32 x 8-bit square-of-piece (16 per side) = 160 bytes
Well, I use 8 bitboards (2 for the white/black pieces and 6 for knights, bishops, ...). With the most compact format, i.e. a quad bitboard, only 4 bitboards = 32 bytes are required. Also, since you are using a 0x88 representation and piece lists, you need 128 x 32-bit + 16 x 32-bit = 576 bytes. That is a significant difference. You are going to allocate that much for each thread you spawn if you want them to do different work. Also, you should be careful about using char tables on gpus because of bank conflicts. Use int whenever possible. Some have tried to use multiple threads (it could be a warp) to do a single search (i.e. use the threads for fast move generation or evaluation). If that is what you want to do then SIMD-friendly move generation makes sense, otherwise the discussion should be about a SIMD-friendly search...
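In struct form the comparison looks roughly like this (field names are only illustrative, and the piece list here has 16 entries as in the estimate above; a both-sides list would be 32 entries):

Code: Select all

#include <cstdint>

// 8-bitboard layout: 2 colour sets + 6 piece types = 8 x 8 = 64 bytes per position.
// A quad bitboard packs the same information into 4 x 8 = 32 bytes at the cost of
// a little decoding work per square.
struct BitboardPos {
    uint64_t colour[2];    // all white pieces, all black pieces
    uint64_t type[6];      // pawns, knights, bishops, rooks, queens, kings
};

// 0x88 board plus piece list, with int entries to avoid char bank conflicts:
// 128 x 4 + 16 x 4 = 576 bytes per position, copied for every thread doing its own work.
struct Ox88Pos {
    int32_t board[128];    // piece code or 0, including the off-board "ghost" half
    int32_t piece_sq[16];  // square of each piece slot
};

static_assert(sizeof(BitboardPos) == 64,  "8 bitboards");
static_assert(sizeof(Ox88Pos)     == 576, "0x88 board plus piece list");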
Agreed. When I was coding DumbGo I was so happy I didn't have to worry about writing an unmakemove() function! Do you have a magic formula for calculating the L1 cache available to each core? Surfing around various pages - including the excellent http://www.codeproject.com/Articles/122 ... ory-Spaces - isn't giving me a straight answer
There is no magic formula. Once you write your kernel and determine the amount of registers, shared memory and other resources consumed, you can determine what limits your occupancy. Usually it is register usage that is the problem. For CUDA, there is an Excel spreadsheet (the occupancy calculator) you can use to tune your usage for maximum occupancy.