Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Philipp Bach wrote: Mon Jun 24, 2019 7:41 pm
* Node Type P: Alpha = -infinity and Beta=infinity. All children of Node-type P can be evaluated in parallel. max(score(child1), score(child2), score(child3)...) is returned. Therefore, all children can be searched in parallel.
Isn't the returned score from child1 usually used as bound for child2 etc...?
If you evaluated all childs in parallel each would still have the bounds Alpha = -infinity and Beta=infinity, so it would effectively degenerate to a minimax search?
Usually. Yes. But my search is different.

Does Child2 really need Child1's calculation to perform its search? ONLY the Type-C nodes under Child2 need Child1's calculation. All other nodes need to be explored regardless of the value of eval(child1).

All the Type-P nodes and Type-A nodes under Child2 can be searched in parallel, while eval(Child1) is being calculated. Type-C nodes must wait if we wish for work-efficiency (and I want work-efficiency).
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I think drawing out the tree of Alpha-Beta calls will help.

Notation: Position "P" is the root. P1 is the 1st child of the root. P17 is the 7th child of the 1st child of the root. F1 is the "future" (potentially evaluated in parallel) of Minimax(P1). Let's assume there are only 9 children (1-9) per node to simplify notation. P1754 is the 4th child of the 5th child of the 7th child of the 1st child of the root. F1754 is minimax(P1754).

Here's the example code for Negamax Alpha-Beta from Chessprogramming.org for reference. I've modified it slightly to make this post easier to follow... I don't care about depth, and Position p is explicitly added so we can talk about the various nodes.

FutureExpression is a tree of futures composed of Max() functions and Negate() functions. A FutureSymbol is a value that will be returned by a Spawn eventually.

Code: Select all

FutureExpression alphaBeta(Position p, FutureExpression alpha, FutureExpression beta) {
   for ( all moves "M" of Position p )  {
      FutureExpression score = Spawn(-alphaBeta(p.M, -beta, -alpha));
      if( score >= beta )
         return beta;   //  fail-hard beta-cutoff
      alpha = max(score, alpha);
   }
   return alpha;
}
F = AlphaBeta(P, -Inf, +Inf); // Spawns 9 children, P1 through P9.
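For readers who want something executable, here is a plain sequential version of the same fail-hard routine (a minimal Python sketch, my own simplification: the game tree is a nested list, and a leaf is an integer score from the side-to-move's point of view):

```python
INF = float('inf')

def alpha_beta(node, alpha, beta):
    """Fail-hard negamax alpha-beta. A leaf is an int score (from the
    side-to-move's view); an interior node is a list of child nodes."""
    if isinstance(node, int):
        return node
    for child in node:
        score = -alpha_beta(child, -beta, -alpha)
        if score >= beta:
            return beta            # fail-hard beta-cutoff
        alpha = max(score, alpha)
    return alpha
```

For example, `alpha_beta([[3, 5], [2, 9]], -INF, INF)` returns 3, and the beta-cutoff inside the second child means the leaf 9 is never visited.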

What does AlphaBeta do inside of the root node?

Code: Select all

F1 = AlphaBeta(P1, -Inf, +Inf); // Child1 is a TypeP node.
// Beta-cutoff is impossible, because Beta = +Inf. All children can be evaluated in parallel.
F2 = AlphaBeta(P2, -Inf, [F1, Negate] ); // F1 is "lazy", do not "wait" for F1 unless absolutely necessary.
F3 = AlphaBeta(P3, -Inf, [F1, F2, Max, Negate] ); 
F4 = AlphaBeta(P4, -Inf, [F1, F2, Max, F3, Max, Negate] ); 
F5 = AlphaBeta(P5, -Inf, [F1, F2, Max, F3, Max, F4, Max, Negate] ); // This statement "lazy evaluates" -(Max(F4, Max(F3, Max(F2, F1)))).
...
F9 = AlphaBeta(P9, -Inf, [F1, F2, Max, F3, Max, F4, Max, ... F8, Max, Negate]);
Note that F itself = [F1, F2, Max, F3, Max, F4, Max, ... F8, Max, F9, Max]. You can evaluate a FutureExpression by imagining a stack and pushing values onto it. "Max" takes the max of the previous two elements on the stack. So [F1, F2, Max] becomes Max(F1, F2), and [F1, F2, Max, F3, Max] becomes Max(F3, Max(F1, F2)). Etc. etc.

Okay, so nifty. We can evaluate different parts of the tree in parallel, as long as we construct this tree of F-statements to "track" how we get back. But how does Beta-cutoff work? All we gotta do is look at the next "level" of the tree. Let's take F5 and see how F51 through F59 are generated.

Code: Select all

Remember: F5 = AlphaBeta(P5, -Inf, [F1, F2, Max, F3, Max, F4, Max, Negate] );
* Alpha = -Inf.
* Beta = [F1, F2, Max, F3, Max, F4, Max, Negate]

F51 = AlphaBeta(P51, [F1, F2, Max, F3, Max, F4, Max, Negate, Negate], +Inf);

// Potential Beta Cutoff! if(F51 >= [F1, F2, Max, F3, Max, F4, Max, Negate, Negate]), then BetaCutoff.
// We must block. We must wait for F51, F1, F2, F3, F4 before we can continue this path

// If F51 >= Beta is false, then F52 is evaluated as follows:

F52 = AlphaBeta(P52, [F1, F2, Max, F3, Max, F4, Max, Negate, Negate], F51);
F52 through F59 are all potentially going to block as well. But most importantly, F52 will not spawn until the Beta-cutoff test F51 >= [F1, F2, Max, F3, Max, F4, Max, Negate, Negate] has been evaluated. This clearly leads to work-efficiency.

For good measure, let's go down the list one more time, and see how F51 is evaluated.

Code: Select all

F511 = AlphaBeta(P511, -Inf, +[F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate]);

// Hey, look at that. P51's Beta is +Inf, so a beta-cutoff is impossible: F511 through F519 can all be evaluated in parallel.

F512 = AlphaBeta(P512, -Inf, +[F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, Negate]);
F513 = AlphaBeta(P513, -Inf, +[F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, F512, Max, Negate]);
F514 = AlphaBeta(P514, -Inf, +[F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, F512, Max, F513, Max, Negate]);
...
F519 = AlphaBeta(P519, -Inf, +[F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, F512, Max, F513, Max, ... F518, Max, Negate]);

F51 = [F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, F512, Max, F513, Max, ... F518, Max, F519, Max]
We now have the FutureExpression for F51: [F1, F2, Max, F3, Max, F4, Max, Negate, Negate, Negate, F511, Max, F512, Max, F513, Max, ... F518, Max, F519, Max], representing Max(F519, ... Max(F512, Max(F511, -(-(-Max(F4, Max(F3, Max(F1, F2))))))))

The triple-negate comes from the Negamax framework, negating each time a child is visited. We now see that the "Beta-Cutoff" blocking F52 will require F1, F2, F3, F4, F511, F512, F513... F519 to be calculated. But we've still generated a LOT of nodes that can be processed in parallel, so there's a lot of work the GPU can do.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Daniel Shawul »

dragontamer5788 wrote: Mon Jun 24, 2019 7:52 pm
Philipp Bach wrote: Mon Jun 24, 2019 7:41 pm
* Node Type P: Alpha = -infinity and Beta=infinity. All children of Node-type P can be evaluated in parallel. max(score(child1), score(child2), score(child3)...) is returned. Therefore, all children can be searched in parallel.
Isn't the returned score from child1 usually used as bound for child2 etc...?
If you evaluated all childs in parallel each would still have the bounds Alpha = -infinity and Beta=infinity, so it would effectively degenerate to a minimax search?
Usually. Yes. But my search is different.

Does Child2 really need Child1's calculation to perform its search? ONLY the Type-C nodes under Child2 need Child1's calculation. All other nodes need to be explored regardless of the value of eval(child1).

All the Type-P nodes and Type-A nodes under Child2 can be searched in parallel, while eval(Child1) is being calculated. Type-C nodes must wait if we wish for work-efficiency (and I want work-efficiency).
You seem to have some serious misunderstanding of node types. Just because you classify a node as a CUT/ALL node or whatever scheme you use, it doesn't mean it will abide by your classification. In fact, it is common that cut/all nodes are flipped when a fail high/low occurs. The node type classification can be used to increase the efficiency of speculative parallel algorithms, though. If you get the classification of nodes accurately more often than not, say using some heuristic, you can use that to delay parallelization (e.g. in the case of CUT nodes) or start parallelization at ALL nodes right away, even without searching the first move.

For you to understand this, I suggest you start from the very first parallel alpha-beta splitting algorithm: PV-split. This one does splitting only at PV nodes, i.e. [-inf, inf] windows for the first move. People in the 70s understood that you need to establish a proper bound for an efficient search before you parallelize; otherwise, you would end up doing a minimax search, because your bound is [-inf, inf] when you decide to parallelize right away at PV nodes ...
EDIT: Okay, lemme give a quickie explanation of why I think it's possible.

The short-version of my hypothesis is that... PV-nodes are inherently parallel (all children of PV-nodes must be searched, as you have [-infinity, infinity] as alpha-beta). ALL-nodes are parallel in classical Alpha-Beta search (in Alpha-Beta, all children of an ALL node are searched. Other algorithms, such as aspiration-windows, reduce the amount of work by visiting fewer than all children in an ALL-node). CUT-nodes are the only sequential node in Alpha-Beta search.
Are you serious ??
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

I'm beginning to realize the importance of my previous post (viewtopic.php?f=7&t=71058&start=20#p802522). However, upon re-reading it, it seems like I've made quite a few minor errors, and the overall wording seems subpar. I'm going to remake that post, except this time, I'll pay more attention to negates!

Once again: P represents the root node. P1 represents the 1st child of the root node. P15 represents the 5th child of the 1st child of the root node. F represents the "FutureSymbols" that keep track of the Spawned tasks. All spawned tasks will eventually report a 16-bit integer for F, representing centipawns.

Starting at the root (Node P).

Code: Select all

Evaluation of: F = AlphaBeta(P, -Inf, +Inf);
Alpha: -Infinity
Beta: +Infinity

F1 = - AlphaBeta(P1, -Inf, +Inf);
F2 = - AlphaBeta(P2, -Inf, [F1, Negate]);
F3 = - AlphaBeta(P3, -Inf, [F1, F2, Max, Negate]);
F4 = - AlphaBeta(P4, -Inf, [F1, F2, Max, F3, Max, Negate]);
F5 = - AlphaBeta(P5, -Inf, [F1, F2, Max, F3, Max, F4, Max, Negate]);
...
F9 = - AlphaBeta(P9, -Inf, [F1, F2, Max, F3, Max, F4, Max, F5, Max, F6, Max, F7, Max, F8, Max, Negate]);

F = [F1, F2, Max, F3, Max, F4, Max, F5, Max, F6, Max, F7, Max, F8, Max, F9, Max];
I'll take this time to explain how to evaluate a FutureExpression list: [F1, F2, Max, F3, Max, F4, Max, Negate] (this is the beta of P5). Max pops the last two values on the stack, and then pushes the larger value back onto the stack. Negate pops the last value on the stack, and then pushes the negation of that value. (-10 becomes +10, and +10 becomes -10). Overall, the FutureExpression [F1, F2, Max, F3, Max, F4, Max, Negate] will become -Max(F4, Max(F3, Max(F2, F1))) through this simple "stack machine".
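That stack machine is small enough to write out. A minimal Python sketch (the Max and Negate tokens follow this thread; representing the already-resolved FutureSymbols as a plain dict is my own simplification):

```python
MAX, NEGATE = "Max", "Negate"

def evaluate(expr, values):
    """Evaluate a postfix FutureExpression once every FutureSymbol
    appearing in it has resolved to a number (given in `values`)."""
    stack = []
    for token in expr:
        if token == MAX:                 # pop two, push the larger
            b, a = stack.pop(), stack.pop()
            stack.append(max(a, b))
        elif token == NEGATE:            # pop one, push its negation
            stack.append(-stack.pop())
        else:                            # a FutureSymbol: push its value
            stack.append(values[token])
    assert len(stack) == 1
    return stack[0]

# The beta of P5 from the post: [F1, F2, Max, F3, Max, F4, Max, Negate]
vals = {"F1": 10, "F2": -30, "F3": 25, "F4": 5}
beta = evaluate(["F1", "F2", MAX, "F3", MAX, "F4", MAX, NEGATE], vals)
# -Max(F4, Max(F3, Max(F2, F1))) = -Max(5, Max(25, Max(-30, 10))) = -25
```

The same evaluator handles every Beta expression in this thread, since they are all built from FutureSymbols, Max, and Negate.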

Let's look at F2's calculation in more detail. Remember: F1 is executing in parallel with F2; therefore, F1 is fully unknown. Despite F1 being an unknown value, we can continue to traverse the Alpha-Beta tree.

Code: Select all

Evaluation of: F2 = - AlphaBeta(P2, -Inf, [F1, Negate]);
Alpha: -Infinity
Beta: [F1, Negate]

F21 = - AlphaBeta(P21, [F1, Negate, Negate], +Inf);

F21.wait(); // Assume F21 resulted in -20 Centipawns.
Beta.wait(); // Assume Beta == [F1, Negate] evaluated to +88 Centipawns.

if (F21 >= Beta)
	return Beta; 

F22 = - AlphaBeta(P22, -88 Centipawns, 20 Centipawns);

F22.wait(); // Assume F22 resulted in 153 Centipawns

if (F22 >= 88 Centipawns) 
	return  88 Centipawns
	
F2 = -88 Centipawns
Assuming P22 achieves the Beta-cutoff, sequential Alpha-Beta would only evaluate the P21 and P22 trees. Young Brothers Wait, however, is extremely work-inefficient here: because P21 does not achieve a Beta-cutoff, it spawns all of the sibling trees (P22, P23, P24, P25, P26, P27, P28, and P29) in parallel, and it was unnecessary to evaluate P23 through P29.

My algorithm properly inserts a wait-state here, to ensure that work-efficiency remains on par with Sequential Alpha-Beta pruning. It is absolutely essential to wait and perform this node one-step-at-a-time if we wish to achieve parity with the Sequential version of the algorithm.
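To put a number on that work-efficiency claim, here is a toy node-count comparison under my own assumptions (uniform random trees; "eager" means unconditionally searching every child of every node, which is exactly the waste the wait-state avoids):

```python
import random

def build(depth, branch, rng):
    """A uniform game tree: leaves are random int scores."""
    if depth == 0:
        return rng.randint(-100, 100)
    return [build(depth - 1, branch, rng) for _ in range(branch)]

def count_ab(node, alpha, beta):
    """Return (negamax value, leaves visited) for sequential alpha-beta."""
    if isinstance(node, int):
        return node, 1
    visited = 0
    for child in node:
        score, n = count_ab(child, -beta, -alpha)
        score, visited = -score, visited + n
        if score >= beta:
            return beta, visited      # siblings after the cutoff are skipped
        alpha = max(alpha, score)
    return alpha, visited

rng = random.Random(1)
tree = build(6, 4, rng)
_, ab_leaves = count_ab(tree, float("-inf"), float("inf"))
eager_leaves = 4 ** 6                 # searching all children everywhere
# ab_leaves is well below eager_leaves: those are the cutoffs the
# wait-state preserves, and eager sibling-spawning throws away.
```

Even with random (i.e. bad) move ordering, sequential alpha-beta visits far fewer leaves than the eager all-siblings strategy.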

I call F2 a Type-C node. It is forced to execute sequentially because of the potential of a Beta-cutoff. There is a superficial similarity to CUT-nodes.

Let's continue seeing how things are evaluated, looking first at F21.

Code: Select all

Evaluation of: F21 = - AlphaBeta(P21, [F1, Negate, Negate], +Inf);
Alpha: [F1, Negate, Negate]
Beta: +Inf

F211 = - AlphaBeta(P211, -Inf, [F1, Negate, Negate, Negate]);
F212 = - AlphaBeta(P212, -Inf, [F1, Negate, Negate, Negate, F211, Max, Negate]);
F213 = - AlphaBeta(P213, -Inf, [F1, Negate, Negate, Negate, F211, Max, F212, Max, Negate]);
F214 = - AlphaBeta(P214, -Inf, [F1, Negate, Negate, Negate, F211, Max, F212, Max, F213, Max, Negate]);
F215 = - AlphaBeta(P215, -Inf, [F1, Negate, Negate, Negate, F211, Max, F212, Max, F213, Max, F214, Max, Negate]);
...
F219 = - AlphaBeta(P219, -Inf, [F1, Negate, Negate, Negate, F211, Max, F212, Max, F213, Max, F214, Max, F215, Max, F216, Max, F217, Max, F218, Max, Negate]);

F21 = [F1, Negate, Negate, F211, Max, F212, Max, F213, Max, F214, Max, F215, Max, F216, Max, F217, Max, F218, Max, F219, Max, Negate]
Notice that F211 through F219 can all be evaluated in parallel. Despite the "earlier" P2 node being fully sequential, its first child P21 has already created 9 nodes to work on in parallel. Now true, the "Beta" has become incredibly complex, but it is still composed only of FutureSymbols, Negate, and Max. I've named P21 a "Type-A" node. There is a superficial similarity to an ALL node, because ALL of its children are evaluated. The only condition needed to discover a Type-A node is to see that Beta is +Infinity.

We can see that F215 will be a Type-C node, then F2151 will be a Type-A node, then F21513 will be a Type-C node, etc. Every even ply can execute in parallel, while every odd ply must execute sequentially if we wish to keep work-efficiency.

---------

The final node type is "Type P", which is superficially similar to a PV-node. The nodes P, P1, P11, P111, P1111, P11111... are evaluated with [-Inf, +Inf] as the Alpha-Beta bounds.

Based on Knuth's analysis, Type-P and Type-A nodes roughly constitute the sqrt(n) nodes of the minimax tree; Type-C nodes constitute all the others. I assert that there are enough Type-P and Type-A nodes that this algorithm will be worthwhile to pursue and execute on a GPU with "only" 16000 threads.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Daniel Shawul »

You keep on ignoring the multiple elephants in the room and re-post the same thing again !?

A few words from people who actually know what they are doing, i.e. the Cilkchess authors:
Parallel search of game-trees is difficult because the most
efficient algorithms for game-tree search are inherently
serial.
We obtain parallelism by performing the tests in
parallel, but those tests may not all be necessary in a serial
execution order. In order to get any parallelism, we must
take the risk of performing extra work that a good serial
program would avoid.
See, they spelled it out for you: you need to do some speculative search to get any parallelism.

And here are theorems for work-efficiency of Jamboree search (similar to YBW).

Best-ordered tree: The first move is always the best
Theorem 1 For uniform best-ordered trees of degree d and height h the following hold:
The total work performed is O(d^h/2), which is the same as serial a-b search would perform. That is,
the work efficiency is 1.
So the answer to your 'unicorn' project using YBW is ...drumroll.... a best-ordered tree ...,
a case where we don't even need to search at all.

Worst-ordered tree: Worst move considered first, second-worst is second etc..
Theorem 2 Surprisingly, for worst-ordered uniform game trees, the
speedup of Jamboree search over serial a-b search turns out
to be under 1. For large d and h, the constants work out so that the total
work performed is approximately three times as much as
the serial a-b search would perform (thus the efficiency is 1/3)
So with worst-ordered trees, it is proven that you would do at least 3x more work no matter how many processors you use.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »


While I appreciate the link, Mr. Shawul, your attitude is needlessly berating and your posts are full of simple errors in logic. I don't believe I want to read any more of your posts. Fortunately, phpBB has a simple software solution that will allow me to ignore them. Feel free to respond; I won't be able to see your response, however.

Request to everyone else: If Mr. Shawul does say something useful, please quote it so that I will be able to see it.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Oh, one tidbit about my methodology. Based on Knuth's Analysis of Alpha-Beta, it is clear that a complex game (like Chess) is unnecessary for studying Alpha-Beta search.

You can see on page 299 that Knuth uses the decimal digits of pi as his "game tree" when he experimented with Alpha-Beta search. Pi isn't very easy to calculate, however, so I'm instead using the following methodology:

Hash(Position) == base-case evaluation of a node. When doing Alpha-Beta by hand, I multiply the node's digit-path (e.g. 11111 for P11111) by 31415, and then take the 5th and 4th digits (counting from the right) as the value of the position.

So Eval(P11111) == 5th & 4th digit of (349052065) == 52.

Decimal values between 0 and 49 are positive. Values from 50 to 99 are negative (99 is -1, 98 is -2, etc. etc.). This "hand analysis" system for Alpha-Beta search allows easy computation of various positions by hand. I presume 3 nodes per ply should be sufficient for hand-drawing purposes.

For example, AlphaBeta(P1111, -inf, +inf), with Depth-limit 5 will evaluate to:

Code: Select all

F11111 = -AlphaBeta(P11111, -inf, +inf) == -Eval(P11111) == -52
F11112 = -AlphaBeta(P11112, -inf, +inf) == -Eval(P11112) == -83
F11113 = -AlphaBeta(P11113, -inf, +inf) == -Eval(P11113) == -14

Return max(F11111 , F11112, F11113)
Therefore: AlphaBeta(P1111, -inf, +inf) = -52 == positive 48. (The ten's complement of the 2-digit decimal number). This is all basic Alpha-Beta stuff, but I figure I might as well share my methodology, in case anyone out there is having difficulty understanding Alpha-Beta.
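The methodology is easy to mechanize for checking hand-work. A Python sketch (the function names are mine; the constant 31415 and the digit positions are from the post):

```python
def eval_node(digit_path):
    """Hand-evaluation hash from the post: multiply the node's digit-path
    by 31415 and read off the 5th and 4th digits (counting from the right)."""
    product = digit_path * 31415
    return (product // 1000) % 100    # digits 5 and 4 as a 2-digit number

def decode(two_digits):
    """0-49 encode positive scores; 50-99 encode negatives
    (ten's complement: 99 is -1, 98 is -2, ...)."""
    return two_digits if two_digits < 50 else two_digits - 100

# Eval(P11111): 11111 * 31415 == 349052065, whose 5th and 4th digits are "52"
assert eval_node(11111) == 52
assert decode(52) == -48              # "52" represents -48
```

The other two leaves from the worked example check out the same way: `eval_node(11112)` gives 83 and `eval_node(11113)` gives 14.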
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Assuming 35 moves available per node and "infinite CPUs" operating at each Ply, here's the amount of work generated per Ply. I'm changing my node-notation to include a period between nodes: P.31.7 represents the 7th child of the 31st child of the root.

Code: Select all

+--------------------------------------------------------------------------------+
| <-inf, +inf> | <-inf, X> | <X, +inf> |  Ply  |   # Active   |  # Beta-Blocked  |
| #Type-P      | #Type-C   | #Type-A   |       |              |                  |
+--------------+-----------+-----------+-------+--------------+------------------+
| 1            | 0         | 0         | 0     | 1            | 0                |
| 1            | 34        | 0         | 1     | 35           | 0                |
| 1            | 34        | 34        | 2     | 69           | 34               |
| 1            | 1224      | 34        | 3     | 1259         | 68               |
| 1            | 1224      | 1224      | 4     | 2449         | 1292             |
| 1            | 42874     | 1224      | 5     | 44099        | 2516             |
| 1            | 42874     | 42874     | 6     | 47162        | 45390            |
| 1            | 1457750   | 42874     | 7     | 1500625      | 88264            |
+--------------+-----------+-----------+-------+--------------+------------------+
The formulas I used to generate the above table:

Code: Select all

Recurrence Formulas:
#Type-P(Ply) : 1
#Type-C(Ply) : #Type-A(Ply-1) * 35 + #Type-P(Ply-1) * 34
#Type-A(Ply) : #Type-C(Ply-1)
#Active(Ply) : #Type-P(Ply) + #Type-C(Ply) + #Type-A(Ply)
#Beta-Blocked(Ply) : For(i, 0, Ply-1) summation(#Type-C(i))
The reasoning is as follows:

* The root node is of Type-P: <-inf, +inf> bounds on Alpha-Beta.

* Type-P nodes spawn 35 children: 1 of them is Type-P with <-inf, +inf> bounds, while the 34 siblings have bounds <-inf, [F1, Negate]>, <-inf, [F1, F2, Max, Negate]>, <-inf, [F1, F2, Max, F3, Max, Negate]>... These 34 siblings are all Type-C nodes.

* Type-C nodes each spawn two tasks: a Beta-blocked continuation, and a single Type-A node of the form <[F1], +inf>. Basically, the Type-A child's alpha is the negation of the Type-C node's beta.

* TypeA nodes all spawn 35 children: all 35 of them are TypeC nodes.

* The "infinitely parallel" machine will reach max-depth in the same number of steps as Ply. Of course, there are no modern CPUs, or modern GPUs, with 1,500,625 cores available to execute Ply7 in parallel in a single step. In practice, a highly-parallel system would want to prioritize reaching deeper depths, while allowing parallelism to earlier-parts of the tree. (Ex: The 16384 threads of the Vega64 will become saturated on Ply5).

* The #Active amount of work at any given Ply is "limited" by roughly sqrt(branch-factor)^Ply, i.e. 35^(Ply/2). Any practical problem, even with tens-of-thousands of threads, will become swamped with work if you increase the Ply even just a little bit.

* The Beta-blocked nodes will execute sequentially, one-at-a-time. There seems to be no opportunity for parallelism (at least with my algorithm) to parallelize the Beta-blocked nodes. The Beta-blocked nodes need to be executed with the absolute highest priority. Fortunately, the #Beta-Blocked tasks are always smaller than the #Active tasks.

Both the #Beta-Blocked and #Active node counts grow at approximately the rate of sqrt(branching-factor)^Ply (in this case, sqrt(35)^Ply, roughly 5.92x per ply).

Note: with less-than-perfect move ordering, these Beta-Blocked nodes may need to explore extremely deep trees of their own. For example, the node P.35.1 will spawn alphaBeta(P.35.1, [max(F.1, F.2, F.3 ... F.34)], +infinity), but will be Beta-Blocked on Beta-Block(F.35.1 >= max(F.1, F.2, F.3 ... F.34)). When the code finally sequentially executes P.35.1, it will perform a sequential Alpha-Beta across the P.35.2 (both Alpha and Beta will be mapped to centipawns at this point, since F.1 through F.34 have finished executing).
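As a sanity check, the spawning rules in the bullets above can be transcribed directly (a Python sketch; it reproduces the early plies of the table):

```python
def node_counts(max_ply, branch=35):
    """Per-ply counts under the spawning rules in this post: each Type-P
    spawns 1 P + (branch-1) C; each Type-C spawns 1 A (plus one
    Beta-blocked continuation); each Type-A spawns `branch` C nodes."""
    P, C, A, blocked = [1], [0], [0], [0]
    for ply in range(1, max_ply + 1):
        P.append(1)
        C.append(A[ply - 1] * branch + (branch - 1))
        A.append(C[ply - 1])                      # one A per previous-ply C
        blocked.append(blocked[ply - 1] + C[ply - 1])
    active = [p + c + a for p, c, a in zip(P, C, A)]
    return P, C, A, active, blocked

P, C, A, active, blocked = node_counts(5)
# Ply 3: 1 P, 1224 C, 34 A -> 1259 active; Ply 5: 44099 active, 2516 blocked
```

This matches the table rows for the early plies, which is the region a 16K-thread GPU would actually saturate.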
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by Daniel Shawul »

Here is a formula you can use to count your pv/cut/all nodes, without using recurrence formulas, for any depth d
PV = 1
CUT=b^ceil(d/2)-1
ALL=b^floor(d/2)-1
beta-blocked=Sum(cut)_d=0..n
I have perft ("node counting", if you don't know the term) formulas even for complicated trees: less-optimal move ordering, the null-move tree, the LMR tree, etc.
See my paper for the formulas.

Nonetheless, you are hopelessly lost on the actual job you set out to do, i.e. parallelizing in a "work-efficient" manner.
There seems to be no opportunity for parallelism (at least with my algorithm) to parallelize the Beta-blocked nodes
That is one step towards some enlightenment, but it is not absolutely true. The fact of the matter is that you can parallelize even CUT (beta-blocked) nodes if you are willing to take a risk. In fact, after the 4th move or so is searched, this "expected CUT" node is probably a "true ALL" node, and we can parallelize it fine.

As to your 'unicorn' work-efficient algorithm, you are not going to find it, because the tree is inherently sequential at every node type, not only at beta-blocked (cut) nodes.

I can prove to you that your search for a work-efficient algorithm is impossible. Here it goes:
-----
[PV] = You can't just search all moves with the open window [-inf, inf], so you have to search sequentially; huge search overhead, and hence it doesn't satisfy your work-efficiency requirement => impossible
[CUT/beta-blocked] = even you understood that you had to go sequential => impossible
[ALL] = great potential for parallelization, but you will be searching with a wider window than if you went sequentially; moreover, an expected ALL node could be a real CUT node => impossible
-----

So absolute work-efficiency is impossible at any node type unless you go sequential -- which defeats the purpose.

Daniel
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Post by dragontamer5788 »

Note: with less-than-perfect move ordering, these Beta-Blocked nodes may need to explore extremely deep trees of their own. For example, the node P.35.1 will spawn alphaBeta(P.35.1, [max(F.1, F.2, F.3 ... F.34)], +infinity), but will be Beta-Blocked on Beta-Block(F.35.1 >= max(F.1, F.2, F.3 ... F.34)). When the code finally sequentially executes P.35.1, it will perform a sequential Alpha-Beta across the P.35.2 (both Alpha and Beta will be mapped to centipawns at this point, since F.1 through F.34 have finished executing).
I didn't finish my thought here, lol.

Okay, P.35.2+ would have a best-case sequential search of size 1. In the best case, the Beta-block formula (F.35.1 >= max(F.1, F.2, F.3 ... F.34)) turns out to be true, so P.35.2 onward would be cut off immediately. However, there is a worst-case sequential search of size 1,457,750 (!!!), if perfectly bad move ordering happened.

The average case is somewhere in between, but will probably have a large number of sequential-steps.

------

Code: Select all

+--------------------------------------------------------------------------------+
| <-inf, +inf> | <-inf, X> | <X, +inf> |  Ply  |   # Active   |  # Beta-Blocked  |
| #Type-P      | #Type-C   | #Type-A   |       |              |                  |
+--------------+-----------+-----------+-------+--------------+------------------+
| 1            | 1457750   | 42874     | 7     | 1500625      | 88264            |
+--------------------------------------------------------------------------------+
Amdahl's law would suggest that there is a best-case speedup of 1500625 / 88264: roughly 1/2 of the average number of children. However, the amount of parallel work is more than just #Active(Ply); it is actually the summation of #Active() across all Plies. In either case, the algorithm I'm describing has a best-case ~1750% speedup on an infinitely parallel machine with no synchronization overhead.

The worst-case speedup would be when every #Beta-blocked node has to fully explore its own tree. I haven't math'd it out all the way, but we're looking at a 1.0001x or smaller speedup here, assuming no search overhead. The average case (random move ordering) seems similar, because the average amount of work done inside the #Beta-blocked nodes seems to be quite high: roughly 35^(0.76 * depth) average case according to Knuth, with 35^(depth/2) best case and 35^depth worst case.

With the parallel portion of the work consisting of only about 2 * 35^(depth/2) work to do, the speedup will tend towards 1 (aka: no speedup) as depth increases. The "sequential portion" grows exponentially with depth (average-case exponent 0.76 * depth), while the "parallel portion" only grows with exponent depth/2.

-----------

Interesting notes:

1. This methodology seems to be "more beneficial" at shallow depths, and odd-depths (ex: depth 3, 5, 7). That is, a better #parallel-work to #sequential-work ratio.

2. Perhaps another algorithm (ex: MCTS with virtual-loss) can be used to spawn many minimax searches of shallow depth to take advantage of this GPU algorithm. Only ~64 parallel minimax searches need to take place to fully utilize an AMD Vega64 compute unit.

3. Alternatively, the CPU could be used to "finish off" the tree, once the GPU has fully explored the parallel-portion.

4. Under the scheme described here, each node is provably visited only once. There are a very large number of memory reads, however: every task will be checking the Beta-Blocked tasks and seeing which futures are available.
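One possible way to avoid those repeated memory reads is dependency counting: each blocked task records how many unresolved FutureSymbols it still needs, and the last symbol to resolve wakes it, so nothing ever polls. A single-threaded Python sketch (the scheduler design is my own assumption, not something from this thread):

```python
class FutureSymbol:
    """A value that some spawned task will eventually report."""
    def __init__(self):
        self.value = None
        self.resolved = False
        self.waiters = []            # blocked tasks still needing this symbol

    def resolve(self, value, ready_queue):
        """Record the value; wake any task whose last dependency this was."""
        self.value, self.resolved = value, True
        for task in self.waiters:
            task.pending -= 1
            if task.pending == 0:
                ready_queue.append(task)

class BlockedTask:
    """A continuation that becomes runnable once all its symbols resolve.
    (Assumes it is constructed before any of its symbols resolve.)"""
    def __init__(self, symbols, action):
        self.action = action
        self.pending = len(symbols)
        for s in symbols:
            s.waiters.append(self)

# The beta-cutoff test "F21 >= -F1" from the earlier trace, blocked on both:
F1, F21 = FutureSymbol(), FutureSymbol()
results = []
BlockedTask([F1, F21], lambda: results.append(F21.value >= -F1.value))

ready = []
F1.resolve(-88, ready)               # the cutoff test is still blocked on F21
F21.resolve(-20, ready)              # last dependency: the test becomes ready
for task in ready:
    task.action()
# results == [False]: -20 >= 88 fails, so no beta-cutoff; F22 may spawn
```

On a GPU the waiter lists would need atomic decrements rather than this single-threaded loop, but the counting idea carries over.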