Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I know I started this topic saying Tomasulo's algorithm is probably how I'd implement Futures... but the more I write code for it, the more inefficient that approach feels. My brain keeps drifting toward a message-passing scheme for updating blocked tasks, which seems like it would be faster.

Ah well, I probably should just write both methodologies and then measure.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Dann Corbit »

It seems logical to me that at some point the GPU card vendors will add a hardware instruction for it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Dann Corbit wrote: Thu Jul 25, 2019 12:02 am It seems logical to me that at some point the GPU card vendors will add a hardware instruction for it.
On the one hand, I think it's unlikely. High-level constructs are probably best left to developers to figure out the details. On the other hand: the AMD Vega ISA has thread-barriers (not to be confused with memory barriers... also in the ISA), semaphores, and "wakeup" instructions at the assembly level (!!). Sooo... anything is possible.

Honestly, for the kind of work I'm doing, what I really want are those Acquire/Release semantics (which NVidia is allegedly working on). Who knows how many years will go by before they're implemented (let alone widespread), so it's probably best to just work with what we've got today. A half-barrier should be more efficient than a full barrier.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I'm not very studied in multi-threaded / parallel programming, but I keep coming up with this pattern. I bet it's already been discovered by other people, but on the off chance it hasn't, here's the pattern.

I call it "last one out, turn off the lights". Let's say you have a room (aka: a resource) with a number of agents concurrently using it. First example: malloc + reference counting. Once every entity has ref-counted-down a malloc'd region, exactly one thread should delete the object. Second example: message passing. You want every entity to receive a message before executing the next step.

The code to do so seems to be:

Code: Select all

uint32_t count = blah; // Reference count in malloc, for example
uint32_t lights_out_flag = 1; // Separate flag that other threads watch

// Whenever an entity "uses" (releases) the resource...
uint32_t oldCount = atomicSub(&count, 1); // Can be a fully relaxed ordering
if (oldCount == 1) { // This thread is the "last one out"
	// Turn out the lights
	__threadfence(); // Make all prior writes visible before the task runs
	lights_out_task(); // In the case of reference-counting, this would be a "free()"
	__threadfence(); // Publish the task's writes before clearing the flag
	lights_out_flag = 0;
}
The important thing is to have a separate flag on which to establish the memory-ordering / happens-before relationship. You don't want all your threads performing the expensive __threadfence() (remember: this flushes the incoherent L1 cache on GPUs; __threadfence() is extremely expensive, costing thousands of clock ticks).

The decrement to count = 0 happens before lights_out_task(), which happens before lights_out_flag = 0.
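As a side note, here is a minimal sketch of the other half of the pattern, assuming the flag is what other threads poll to learn the lights are out (the function name and the volatile-read approach are my assumptions, not part of the pattern above):

Code: Select all

__device__ void wait_for_lights_out(volatile uint32_t* lights_out_flag) {
	// Spin on a plain (cheap) read; no fence inside the loop.
	while (*lights_out_flag != 0) { }
	// One acquire-style fence after the flag flips, so the writes made
	// by lights_out_task() are guaranteed visible to this thread.
	__threadfence();
}
This is the payoff of the separate flag: only the "last one out", and any thread that actually needs the result, pays for a fence.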

------------------------

Oh, and my malloc/free goes through a thread-local buffer first, to minimize global transactions. The "lights-out task" (a global free) only runs once 2048 pointers have accumulated, and likewise a global malloc gathers 2048 pointers at a time, all to amortize that very expensive __threadfence() routine.
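A minimal sketch of how that batching could look, reusing the "last one out" pattern with a count-up instead of a count-down. Everything here (the names, the two-counter commit scheme, and the assumption that no new frees arrive mid-flush) is my illustration, not the actual code; the real local cache uses locks, as described later in this thread:

Code: Select all

#define FREE_BATCH 2048

struct LocalFreeBuffer {
	uint32_t ptrs[FREE_BATCH]; // 24-bit "pointers" stored in 32 bits
	uint32_t reserved;         // slots handed out so far
	uint32_t committed;        // slots fully written so far
};

__device__ void global_free_batch(uint32_t* ptrs, uint32_t n); // the "lights-out task"

__device__ void local_free(LocalFreeBuffer* buf, uint32_t ptr) {
	uint32_t slot = atomicAdd(&buf->reserved, 1); // claim a slot
	buf->ptrs[slot] = ptr;
	__threadfence_block();                        // publish the store...
	uint32_t done = atomicAdd(&buf->committed, 1) + 1; // ...then count it
	if (done == FREE_BATCH) {                     // last one out
		global_free_batch(buf->ptrs, FREE_BATCH); // one expensive global step
		__threadfence();
		buf->committed = 0;
		buf->reserved = 0;                        // reopen the buffer
	}
}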

----------------------

I still have some unresolved questions: like whether the atomicSub() can be fully relaxed, or whether it needs to be ordered. I'll have to think about it further. Still, this general pattern seems extremely useful when working with a large number of threads. If you want to cooperatively schedule one thread to do something after all other threads have completed a task (either for work-efficiency reasons or for code-correctness reasons), this pattern will inevitably be used.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by grahamj »

The example in B.5 of the CUDA C Programming Guide seems similar.
Graham Jones, www.indriid.com
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

The more I write this, the more I realize that I don't know how schedulers work.

My code, as it stands, is close to the old Linux 2.4 O(n) scheduler. But I keep hesitating, because I know I will have ~1 million tasks in my data structures, and running an O(n) algorithm over that is just nutty. Linux's current scheduler is the "Completely Fair Scheduler", which is complicated and does a bunch of things I don't care about. It's also tree-based and is therefore O(log(n)).

Some research shows that back in Linux 2.6.1, Linux's scheduler was an O(1) algorithm: https://github.com/bdaehlie/linux-cpu-s ... eduler.pdf

This seems closer to what I want. I don't need the "expired" lists (because I don't care about timeslices; all "tasks" end rather quickly in my code). I think writing SIMD versions of the various data structures (SIMD linked list, SIMD bitmasks, etc.) and implementing the O(1) scheme on the GPU is the best hope for this Alpha-Beta program.
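For reference, a minimal sketch of the O(1) trick as I understand it from the Linux design: a bitmask records which priority levels have runnable tasks, and a find-first-set instruction (__ffs on GPUs) locates the highest-priority non-empty queue in constant time. The queue layout and helper below are my assumptions, and this assumes a single scheduler thread per workgroup (so no atomics):

Code: Select all

#define NUM_PRIORITIES 32

struct LocalScheduler {
	uint32_t nonEmptyMask;              // bit p set => queue[p] has runnable tasks
	uint32_t queueHead[NUM_PRIORITIES]; // per-priority task lists; 0 == empty
};

__device__ uint32_t next_task(uint32_t taskId); // assumed list-traversal helper

__device__ uint32_t pop_highest_priority(LocalScheduler* s) {
	int p = __ffs(s->nonEmptyMask);     // 1-based lowest set bit (priority 0 = highest)
	if (p == 0) return 0;               // no runnable tasks anywhere
	p -= 1;                             // convert to a 0-based priority level
	uint32_t task = s->queueHead[p];    // pop the head of that queue
	s->queueHead[p] = next_task(task);
	if (s->queueHead[p] == 0)
		s->nonEmptyMask &= ~(1u << p);  // queue drained: clear its bit
	return task;                        // O(1): no scan over the task set
}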

The major qualm is that I want to enable speculative execution of Beta-blocked tasks, which means some Beta-blocked tasks will go from the blocked state (on a WaitQueue) directly into the runnable state. I guess that's similar to Linux's "interruptible" waits, however. So maybe everything will work out if I study the 2.6.1 Linux scheduler enough.

I think I want to do fancier things with priorities. I really want two priority schemes: a "work-efficient" scheduler and a "memory-efficient" scheduler. But running two schedulers simultaneously with different data structures is confusing to my brain. Tasks at "DepthRemaining 0" will be more memory-efficient: they won't have any new children to spawn.

The main issue is that the "memory-efficient" scheduler will execute Beta-blocked tasks speculatively, to try to minimize memory usage. A speculative Beta-blocked task at Depth-Remaining 2 will be more memory-efficient than a work-efficient AlphaBeta task at Depth-Remaining 3.

Speculative execution, however, is wasteful: there is only a chance that the work needs to be done, while the other tasks I've identified must run regardless. But the work-efficient scheduler will provably run out of memory (it needs an exponential amount of memory). So balancing these two schemes is difficult. A dynamic scheduler that switches between the two would probably be best, but it requires morphing the data structures in a very costly manner.

Hmm... the minimum-viable product is the memory-efficient scheduler. So that's probably what I should write first.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

A dynamic scheduler which switches between the two schemes would probably be best, but it requires morphing the data-structures in a very costly manner.
Actually... is it costly? It would be O(n), and only done sometimes. It wouldn't even have to be a global data-morph: maybe the schedulers (all 256 independent wavefronts) could individually choose memory-efficient vs. work-efficient modes.

Hmm... that probably would work. As long as it's a sufficiently rare operation, it may not be that bad, especially if there isn't any global synchronization. A "memory-efficient" scheduler wouldn't be work-stealing (why steal work from the global work-queues if your local queues are nearly out of memory?).
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Most of my recent coding efforts have been relatively straightforward SIMD data structures. Since this is C++, I'm templating these data structures as necessary. They are specialized for speed (for example, my SIMD-malloc structure only works for a single, templated size). A "Local" data structure only works within a workgroup (I've configured 256 SIMD threads per workgroup: 4x wavefronts). "Local" data structures cannot communicate outside of their workgroup, but can use barriers, __shared__ memory, and block-level thread fences for superior data-sharing.

1. SIMDAllocArray: global, lock-free, single-size SIMD-malloc -- a single-size lock-free bitset with "local" caching (4096 pointers cached per workgroup). Current sizes I'm using are 2 bytes (for Futures), 64 bytes (for movelists), and 512 bytes (for linked lists, tasks, and other larger groups of data). 24-bit "pointers", for a max of 16 million elements per SIMDAllocArray. The local cache uses locks (with only 4 wavefronts per local cache, I don't expect much thread contention), while the global data structure is entirely lock-free: atomicAnd / atomicOr plus memory fences. (A sketch of the bitset-claim idea appears after this list.)

2. Local SIMD linked list -- a simple use of SIMDAllocArray to create a linked list. 64 elements per node, so a single pointer dereference lets all 64 threads of a wavefront select an item and work on it.

3. MoveList -- struct MoveChunk { uint16_t moves[32]; }; struct MoveList { MoveChunk theArray[8]; uint8_t length; };. A very simple data structure for move lists, supporting 32 moves (100 bytes used) up to 256 moves (548 bytes) while reducing the number of linked-list traversals needed. Inspired by page-descriptor tables.

4. Local SIMD hash table -- currently working on a linear-probing local SIMD hash table. 64 adds / removes per wavefront.

SIMDAllocArray is fundamental to #2 (linked lists) and #3 (MoveLists). TaskParameters are also put onto the global heap to be shared between different workgroups.
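As promised above, here is a sketch of the bitset-claim idea at the heart of SIMDAllocArray. This is my reconstruction from the description (atomicOr to claim a slot, atomicAnd to release it), not the actual templated code; the word-by-word scan strategy is an assumption:

Code: Select all

// Each bit marks one fixed-size slot in the pool; a set bit means "in use".
__device__ uint32_t bitset_alloc(uint32_t* bits, uint32_t numWords) {
	for (uint32_t w = 0; w < numWords; ++w) {
		uint32_t freeBits = ~bits[w];
		while (freeBits != 0) {
			uint32_t b = __ffs(freeBits) - 1;           // candidate free slot
			uint32_t old = atomicOr(&bits[w], 1u << b); // try to claim it
			if ((old & (1u << b)) == 0)
				return w * 32 + b; // success: the 24-bit "pointer" is a slot index
			freeBits = ~old;       // another thread won the race; rescan this word
		}
	}
	return 0xFFFFFFFFu; // pool exhausted
}

__device__ void bitset_free(uint32_t* bits, uint32_t slot) {
	atomicAnd(&bits[slot / 32u], ~(1u << (slot % 32u))); // clear the slot's bit
}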

------------

I've settled on the WaitQueues approach, which uses #2 (SIMD linked lists) to represent Futures. Whenever a task completes and its FutureValue is assigned to its associated FutureSymbol, all tasks waiting on that FutureValue are notified (O(n) time, where n is the number of tasks that need to be notified).
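A minimal sketch of that notify step, assuming 64-entry linked-list nodes as described in #2 (the node layout, the INVALID_TASK sentinel, and the mark_runnable hook are my assumptions):

Code: Select all

#define INVALID_TASK 0xFFFFFFFFu

struct WaitQueueNode {
	uint32_t taskIds[64]; // one waiting task per SIMD lane
	uint32_t next;        // 24-bit "pointer" to the next node; 0 == end of list
};

__device__ void mark_runnable(uint32_t taskId, int16_t value); // assumed scheduler hook

// Called by a full wavefront: each of the 64 lanes handles one slot per node.
__device__ void notify_all(WaitQueueNode* pool, uint32_t head, int16_t futureValue) {
	unsigned lane = threadIdx.x % 64u; // lane index within the wavefront
	for (uint32_t node = head; node != 0; node = pool[node].next) {
		uint32_t task = pool[node].taskIds[lane]; // one dereference serves 64 threads
		if (task != INVALID_TASK)
			mark_runnable(task, futureValue); // O(n) total over waiting tasks
	}
}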

I really don't want to write a SIMD hash table, but it's the only way to organize the set of WaitQueues any particular workgroup is interested in. If workgroup #5 is interested in futures F1, F5, F20, and F24, it's a waste to allocate a full-size 16-million-entry array (2^24 possible futures). Local schedulers will only be able to handle 4096 tasks, and those tasks wait on at most 16 futures each (65536 futures worst-case). So a SIMD hash table of size 65536 per workgroup (x256 workgroups) just makes sense.

16-million-entry arrays aren't necessarily bad (only ~64MB), but with 256 workgroups in play, that 64MB array balloons into 16GB of RAM. All of these data structures carry a x256 or x65536 multiplier, because so many threads (or workgroups) will have data structures of their own. So I really can't afford MB-sized data structures.

I'm having some difficulty with the SIMD-delete operator. Deleting from a linear-probing table normally requires shifting data backwards, but a SIMD-delete removes (up to) 64 elements from the hash table in parallel. I've got a "temporarily deleted" (tombstone) state to handle this case, but testing all of the edge cases is taking a lot of brain power and time.
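A sketch of the tombstone idea for a single lane, with placeholder sentinel values; a real SIMD version runs this across 64 lanes at once, which is exactly where the edge cases live:

Code: Select all

#define EMPTY     0u
#define TOMBSTONE 1u // "temporarily deleted": probes must continue past it

__device__ uint32_t hash32(uint32_t key); // the hash function described below

__device__ bool simd_delete(uint32_t* table, uint32_t capacity, uint32_t key) {
	uint32_t i = hash32(key) % capacity;
	for (uint32_t probes = 0; probes < capacity; ++probes) {
		uint32_t seen = table[i];
		if (seen == EMPTY) return false; // hit a never-used slot: key absent
		if (seen == key)
			// CAS so two lanes deleting the same key can't both "succeed".
			return atomicCAS(&table[i], key, TOMBSTONE) == key;
		i = (i + 1) % capacity; // keep probing past tombstones and other keys
	}
	return false;
}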

-------

My hash function is a "secret sauce" that I'm quite proud of discovering. GPUs have a single-cycle bit-reversal instruction. So my hash function is XOR -> 32-bit multiply (by an odd number) -> bit-reversal -> XOR -> 32-bit multiply (by an odd constant). GPUs should be able to execute all of that in under 15 cycles, and it's provable that the sequence is 1-to-1 and invertible (all 4 billion 32-bit inputs map uniquely onto the 4 billion 32-bit outputs).

I don't know if anyone else has discovered this methodology, but XOR -> multiply -> bit-reverse -> XOR -> multiply really does have great statistical properties for fixed-width hashing (32-bit or 64-bit), as long as those multiplies are by odd numbers of course (the bottom bit of the constant set to 1).

CPU programmers can get a similar effect with the bswap instruction, by the way (a byte-level rather than bit-level reversal).
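For concreteness, here is the recipe as a minimal sketch; the specific constants are placeholders I chose (only their oddness matters for invertibility):

Code: Select all

__device__ uint32_t hash32(uint32_t x) {
	x ^= 0x9E3779B9u; // XOR with an arbitrary constant
	x *= 0x85EBCA6Bu; // 32-bit multiply by an odd constant (invertible mod 2^32)
	x = __brev(x);    // single-cycle bit reversal
	x ^= 0xC2B2AE35u; // second XOR
	x *= 0x27D4EB2Fu; // second odd multiply
	return x;         // every step is a bijection, so the whole hash is 1-to-1
}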
duncan
Posts: 12038
Joined: Mon Jul 07, 2008 10:50 pm

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by duncan »

dragontamer5788 wrote: Thu Jun 20, 2019 9:50 pm Introduction

A few weeks ago, I theorized that GPUs can effectively perform work-efficient alpha-beta search. It led to a long discussion elsewhere on the internet, which pretty much ended with "If you think it's possible, then you should do it". A couple of weeks later, here I am today ready to learn from the Chess community and begin tackling this project.
[...]
May I ask if you have a target date in your head, when you think you may have a chance to solve it?
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

duncan wrote: Tue Aug 13, 2019 11:27 pm May I ask if you have a target date in your head, when you think you may have a chance to solve it?
Hmmm... well... I've already mathed out the original specification and found it unreasonable in practice. There is certainly an O(#children^(ply/2)) portion of parallel work that can be lazily evaluated, but executing all of it in parallel requires O(#children^(ply/2)) memory (for chess, with roughly 35 children per node, a 10-ply search would mean on the order of 35^5, or about 52 million, simultaneous tasks).

If you're talking about my current attempt at the problem (which will have some speculative execution, and therefore a degree of work-inefficiency): maybe a few months for an initial prototype over a simplified minimax tree, with a chess port over the next few months after that.

A lot of the data structures I want to use don't have any GPU implementation. Like, I want a fixed-size open-addressing SIMD hash table.

--------

EDIT: Lol, I just checked NVidia and... look at this:

https://docs.nvidia.com/cuda/cuda-toolk ... w-features
thrust::event and thrust::future<T>, uniquely-owned asynchronous handles consisting of a state (ready or not ready), content (some value; for thrust::future only), and an optional set of objects that should be destroyed only when the future's value is ready and has been consumed.

...

Multiple thrust::events and thrust::futures can be combined with thrust::when_all.
Well, it seems like NVidia's libThrust may have already built one of my major "innovations". So maybe I'd move a lot faster if I used libThrust + NVidia's kit of tools...

Ah well, too late for that now; I already wrote that portion of the code a few weeks ago. Hmm, I'm not sure if it's "task-based", but NVidia's Thrust would at least be useful as a prototyping framework...