DTS-ification of YBWC

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

DTS-ification of YBWC

Post by mcostalba »

Here we do a DTS-ification of YBWC so that a slave thread when finishes the search instead of waiting, tries to join an active split point.

We present two patches to be applied above Stockfish 1.7.1, the first does some small rework of current SMP code and is a prerequisite for the second one that does the actual work of DTS-ification of YBWC.

The patch was started by me, but was then deeply reworked and improved by Joona that can be considered the author.

1) PREREQUISITE PATCH to be applied as first

Code: Select all

Subject: [PATCH 1/2] Prerequisite patch

Retire splitPoint->cpus field. It is needed by next
patch that does the actual work of late-joining.

---

diff --git a/src/search.cpp b/src/search.cpp
index 0fa7824..e6de59f 100644
--- a/src/search.cpp
+++ b/src/search.cpp
@@ -1905,7 +1905,6 @@ namespace {
     /* Here we have the lock still grabbed */
 
     sp->slaves[threadID] = 0;
-    sp->cpus--;
 
     lock_release(&(sp->lock));
   }
@@ -2028,7 +2027,6 @@ namespace {
     /* Here we have the lock still grabbed */
 
     sp->slaves[threadID] = 0;
-    sp->cpus--;
 
     lock_release(&(sp->lock));
   }
@@ -2700,10 +2698,18 @@ namespace {
             threads[threadID].state = THREAD_AVAILABLE;
         }
 
-        // If this thread is the master of a split point and all threads have
+        // If this thread is the master of a split point and all slaves have
         // finished their work at this split point, return from the idle loop.
-        if (waitSp != NULL && waitSp->cpus == 0)
+        int i = 0;
+        for ( ; waitSp && i < ActiveThreads && !waitSp->slaves&#91;i&#93;; i++) &#123;&#125;
+
+        if &#40;i == ActiveThreads&#41;
         &#123;
+            // Because sp->slaves&#91;&#93; is reset under lock protection,
+            // be sure sp->lock has been released before to return.
+            lock_grab&#40;&&#40;waitSp->lock&#41;);
+            lock_release&#40;&&#40;waitSp->lock&#41;);
+
             assert&#40;threads&#91;threadID&#93;.state == THREAD_AVAILABLE&#41;;
 
             threads&#91;threadID&#93;.state = THREAD_SEARCHING;
@@ -2882,9 +2888,8 @@ namespace &#123;
   // data that must be copied to the helper threads &#40;the current position and
   // search stack, alpha, beta, the search depth, etc.), and we tell our
   // helper threads that they have been assigned work. This will cause them
-  // to instantly leave their idle loops and call sp_search_pv&#40;). When all
-  // threads have returned from sp_search_pv &#40;or, equivalently, when
-  // splitPoint->cpus becomes 0&#41;, split&#40;) returns true.
+  // to instantly leave their idle loops and call sp_search&#40;). When all
+  // threads have returned from sp_search&#40;) then split&#40;) returns true.
 
   bool ThreadsManager&#58;&#58;split&#40;const Position& p, SearchStack* sstck, int ply,
              Value* alpha, const Value beta, Value* bestValue,
@@ -2931,7 +2936,6 @@ namespace &#123;
     splitPoint->master = master;
     splitPoint->mp = mp;
     splitPoint->moves = *moves;
-    splitPoint->cpus = 1;
     splitPoint->pos = &p;
     splitPoint->parentSstack = sstck;
     for &#40;int i = 0; i < ActiveThreads; i++)
@@ -2943,17 +2947,19 @@ namespace &#123;
     // If we are here it means we are not available
     assert&#40;threads&#91;master&#93;.state != THREAD_AVAILABLE&#41;;
 
+    int workersCnt = 1; // At least the master is included
+
     // Allocate available threads setting state to THREAD_BOOKED
-    for &#40;int i = 0; i < ActiveThreads && splitPoint->cpus < MaxThreadsPerSplitPoint; i++)
+    for &#40;int i = 0; i < ActiveThreads && workersCnt < MaxThreadsPerSplitPoint; i++)
         if &#40;thread_is_available&#40;i, master&#41;)
         &#123;
             threads&#91;i&#93;.state = THREAD_BOOKED;
             threads&#91;i&#93;.splitPoint = splitPoint;
             splitPoint->slaves&#91;i&#93; = 1;
-            splitPoint->cpus++;
+            workersCnt++;
         &#125;
 
-    assert&#40;splitPoint->cpus > 1&#41;;
+    assert&#40;workersCnt > 1&#41;;
 
     // We can release the lock because slave threads are already booked and master is not available
     lock_release&#40;&MPLock&#41;;
@@ -2974,8 +2980,7 @@ namespace &#123;
     // which it will instantly launch a search, because its state is
     // THREAD_WORKISWAITING.  We send the split point as a second parameter to the
     // idle loop, which means that the main thread will return from the idle
-    // loop when all threads have finished their work at this split point
-    // &#40;i.e. when splitPoint->cpus == 0&#41;.
+    // loop when all threads have finished their work at this split point.
     idle_loop&#40;master, splitPoint&#41;;
 
     // We have returned from the idle loop, which means that all threads are
diff --git a/src/thread.h b/src/thread.h
index 8697784..c70800c 100644
--- a/src/thread.h
+++ b/src/thread.h
@@ -55,7 +55,7 @@ struct SplitPoint &#123;
   Depth depth;
   bool mateThreat;
   Value beta;
-  int ply, master, slaves&#91;MAX_THREADS&#93;;
+  int ply, master;
   SearchStack sstack&#91;MAX_THREADS&#93;&#91;PLY_MAX_PLUS_2&#93;;
 
   // Const pointers to shared data
@@ -67,8 +67,8 @@ struct SplitPoint &#123;
   volatile Value alpha;
   volatile Value bestValue;
   volatile int moves;
-  volatile int cpus;
   volatile bool stopRequest;
+  volatile int slaves&#91;MAX_THREADS&#93;;
 &#125;;
 
 // ThreadState type is used to represent thread's current state
-- 
1.6.5.1.1367.gcd48



2) MAIN PATCH to be applied as second

Code: Select all

Subject&#58; &#91;PATCH 2/2&#93; Joona's late joining

In YBWC when one slave threads finishes it waits for
some master to pick it up. In DTS instead actively
searches for some work to do.

This patch is a DTS-ification of YBWC in which the slave
patch when finishes try to join an active split point instead
of going to waiting.

---

diff --git a/src/search.cpp b/src/search.cpp
index e6de59f..2bb8620 100644
--- a/src/search.cpp
+++ b/src/search.cpp
@@ -78,11 +78,13 @@ namespace &#123;
     int64_t nodes_searched&#40;) const;
     void get_beta_counters&#40;Color us, int64_t& our, int64_t& their&#41; const;
     bool available_thread_exists&#40;int master&#41; const;
+    bool HMC_criterion&#40;const SplitPoint *ancestor, const SplitPoint *sp&#41; const;
     bool thread_is_available&#40;int slave, int master&#41; const;
     bool thread_should_stop&#40;int threadID&#41; const;
     void wake_sleeping_threads&#40;);
     void put_threads_to_sleep&#40;);
     void idle_loop&#40;int threadID, SplitPoint* waitSp&#41;;
+    bool try_late_join&#40;int threadID, SplitPoint *idleSp&#41;;
     bool split&#40;const Position& pos, SearchStack* ss, int ply, Value* alpha, const Value beta, Value* bestValue,
                Depth depth, bool mateThreat, int* moves, MovePicker* mp, int master, bool pvNode&#41;;
 
@@ -1904,6 +1906,9 @@ namespace &#123;
 
     /* Here we have the lock still grabbed */
 
+    // No more work to do, so we are leaving.
+    // Prevent other threads from joining
+    sp->allowLateJoin = false;
     sp->slaves&#91;threadID&#93; = 0;
 
     lock_release&#40;&&#40;sp->lock&#41;);
@@ -2026,6 +2031,9 @@ namespace &#123;
 
     /* Here we have the lock still grabbed */
 
+    // No more work to do, so we are leaving.
+    // Prevent other threads from joining
+    sp->allowLateJoin = false;
     sp->slaves&#91;threadID&#93; = 0;
 
     lock_release&#40;&&#40;sp->lock&#41;);
@@ -2639,6 +2647,64 @@ namespace &#123;
     &#125;
   &#125;
 
+  bool ThreadsManager&#58;&#58;try_late_join&#40;int threadID, SplitPoint *idleSp&#41; &#123;
+
+    int spIdx = 0, master = 0;
+    Depth maxDepth = Depth&#40;0&#41;;
+
+    // Find active SplitPoint with highest possible depth
+    for &#40;int i = 0; i != threadID && i < ActiveThreads; i++)
+    &#123;
+        int actSp = threads&#91;i&#93;.activeSplitPoints;
+
+        for &#40;int j = 0; j < actSp; j++)
+        &#123;
+            SplitPoint* sp = &SplitPointStack&#91;i&#93;&#91;j&#93;;
+            if (   sp->depth > maxDepth
+                && sp->allowLateJoin
+                && (!idleSp || HMC_criterion&#40;idleSp, sp&#41;))
+            &#123;
+                maxDepth = sp->depth;
+                master = i;
+                spIdx = j;
+            &#125;
+        &#125;
+    &#125;
+
+    if &#40;maxDepth == Depth&#40;0&#41;)
+        return false;
+
+    // This will be our new split point
+    SplitPoint* splitPoint = &SplitPointStack&#91;master&#93;&#91;spIdx&#93;;
+
+    // Grab the SplitPoint lock to prevent splitpoint from dying
+    lock_grab&#40;&&#40;splitPoint->lock&#41;);
+
+    // Recheck under lock protection
+    if (   splitPoint->allowLateJoin
+        && (!idleSp || HMC_criterion&#40;idleSp, splitPoint&#41;))
+    &#123;
+        /* nothing */
+    &#125;
+    else
+    &#123;
+        lock_release&#40;&&#40;splitPoint->lock&#41;);
+        return false;
+    &#125;
+
+    SearchStack* sstck = splitPoint->parentSstack;
+    int ply = splitPoint->ply;
+
+    // At last copy the data and start the thread !
+    memcpy&#40;splitPoint->sstack&#91;threadID&#93; + ply - 1, sstck + ply - 1, 4 * sizeof&#40;SearchStack&#41;);
+
+    threads&#91;threadID&#93;.splitPoint = splitPoint;
+    splitPoint->slaves&#91;threadID&#93; = 1;
+    threads&#91;threadID&#93;.state = THREAD_WORKISWAITING;
+
+    lock_release&#40;&&#40;splitPoint->lock&#41;);
+    return true;
+  &#125;
 
   // idle_loop&#40;) is where the threads are parked when they have no work to do.
   // The parameter "waitSp", if non-NULL, is a pointer to an active SplitPoint
@@ -2695,7 +2761,11 @@ namespace &#123;
 
             assert&#40;threads&#91;threadID&#93;.state == THREAD_SEARCHING&#41;;
 
-            threads&#91;threadID&#93;.state = THREAD_AVAILABLE;
+            // If is not the master of this split point then try to
+            // late join some other sp immediately. Otherwise mark thread
+            // as available for future splits.
+            if &#40;ActiveThreads <= 2 || !try_late_join&#40;threadID, waitSp&#41;)
+                threads&#91;threadID&#93;.state = THREAD_AVAILABLE;
         &#125;
 
         // If this thread is the master of a split point and all slaves have
@@ -2705,11 +2775,6 @@ namespace &#123;
 
         if &#40;i == ActiveThreads&#41;
         &#123;
-            // Because sp->slaves&#91;&#93; is reset under lock protection,
-            // be sure sp->lock has been released before to return.
-            lock_grab&#40;&&#40;waitSp->lock&#41;);
-            lock_release&#40;&&#40;waitSp->lock&#41;);
-
             assert&#40;threads&#91;threadID&#93;.state == THREAD_AVAILABLE&#41;;
 
             threads&#91;threadID&#93;.state = THREAD_SEARCHING;
@@ -2854,15 +2919,24 @@ namespace &#123;
     if &#40;ActiveThreads == 2&#41;
         return true;
 
-    // Apply the "helpful master" concept if possible. Use localActiveSplitPoints
-    // that is known to be > 0, instead of threads&#91;slave&#93;.activeSplitPoints that
-    // could have been set to 0 by another thread leading to an out of bound access.
-    if &#40;SplitPointStack&#91;slave&#93;&#91;localActiveSplitPoints - 1&#93;.slaves&#91;master&#93;)
+    // Helpful master concept.
+    // Check if 'slave'-candidate is an ancestor of master.
+    const SplitPoint* ancestor = &&#40;SplitPointStack&#91;slave&#93;&#91;localActiveSplitPoints - 1&#93;);
+    SplitPoint* sp = threads&#91;master&#93;.splitPoint;
+    if &#40;HMC_criterion&#40;ancestor, sp&#41;)
         return true;
 
     return false;
   &#125;
 
+  bool ThreadsManager&#58;&#58;HMC_criterion&#40;const SplitPoint *ancestor, const SplitPoint *sp&#41; const &#123;
+
+    for (; sp != NULL; sp = sp->parent&#41;
+        if &#40;sp == ancestor&#41;
+            return true;
+
+    return false;
+  &#125;
 
   // available_thread_exists&#40;) tries to find an idle thread which is available as
   // a slave for the thread with threadID "master".
@@ -2923,6 +2997,8 @@ namespace &#123;
     // Pick the next available split point object from the split point stack
     splitPoint = &SplitPointStack&#91;master&#93;&#91;threads&#91;master&#93;.activeSplitPoints&#93;;
 
+    assert&#40;splitPoint->allowLateJoin == false&#41;;
+
     // Initialize the split point object
     splitPoint->parent = threads&#91;master&#93;.splitPoint;
     splitPoint->stopRequest = false;
@@ -2976,6 +3052,9 @@ namespace &#123;
             threads&#91;i&#93;.state = THREAD_WORKISWAITING; // This makes the slave to exit from idle_loop&#40;)
         &#125;
 
+    // Allow threads becoming idle to late join
+    splitPoint->allowLateJoin = true;
+
     // Everything is set up. The master thread enters the idle loop, from
     // which it will instantly launch a search, because its state is
     // THREAD_WORKISWAITING.  We send the split point as a second parameter to the
diff --git a/src/thread.h b/src/thread.h
index c70800c..8ac1138 100644
--- a/src/thread.h
+++ b/src/thread.h
@@ -68,6 +68,7 @@ struct SplitPoint &#123;
   volatile Value bestValue;
   volatile int moves;
   volatile bool stopRequest;
+  volatile bool allowLateJoin;
   volatile int slaves&#91;MAX_THREADS&#93;;
 &#125;;
 
-- 
1.6.5.1.1367.gcd48
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS-ification of YBWC

Post by bob »

I do not understand this misuse of YBW. YBW simply says you never start a parallel search at any node until at _least_ one branch has been completely searched at that node. It says nothing about waiting for completion or anything else. It has been around since before it was called YBW or as you use, YBWC which is short for "Young Brothers Wait Criterion". DTS included that concept although it could violate it, so that instead of waiting with nothing to do, I would try to make a guess on where ALL nodes would occur and split even before YBWC was sastisfied. As a simple example, if a node at ply=N is an ALL node, then N+2 or N-2 is likely an ALL node as well. More detailed analysis can improve the reliability of that guess.

Crafty has been searching using the dynamic tree splitting approach, without the ability to look at anything but the current node, since day 1. The primary distinction of DTS is that "nobody waits". There are other details in the implementation we did, but the basic idea was that when a CPU goes idle, it immediately starts to help someone that is still busy. Nothing more.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Fixing YBWC implementation

Post by Gian-Carlo Pascutto »

I agree the naming here is very confusing and what is proposed has nothing to do with DTS.

I thought it was standard in YBW to just have the slave threads loop all the splitblocks continuously until work comes available. I'm amazed (and scared :-)) Stockfish didn't have this yet.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: DTS-ification of YBWC

Post by zamar »

bob wrote: Crafty has been searching using the dynamic tree splitting approach, without the ability to look at anything but the current node, since day 1. The primary distinction of DTS is that "nobody waits". There are other details in the implementation we did, but the basic idea was that when a CPU goes idle, it immediately starts to help someone that is still busy. Nothing more.
Have you measured the time threads are idle in Crafty's YBW-implementation? I bet it's very small (excluding ultra-fast games), almost meaningless. I believe the true advantage to change from YBW to DTS comes from the ability to choose the most attractive point for split, not avoiding processing staying idle for a small time.

BTW, the patch Marco posted just makes our YBW implementation a bit more DTS-like. It's very far from true DTS implementation.
Joona Kiiski
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Fixing YBWC implementation

Post by zamar »

Gian-Carlo Pascutto wrote:I agree the naming here is very confusing and what is proposed has nothing to do with DTS.

I thought it was standard in YBW to just have the slave threads loop all the splitblocks continuously until work comes available. I'm amazed (and scared :-)) Stockfish didn't have this yet.
And surprisingly this change didn't make stockfish any stronger, so the change won't enter the development branch.
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Fixing YBWC implementation

Post by bob »

Gian-Carlo Pascutto wrote:I agree the naming here is very confusing and what is proposed has nothing to do with DTS.

I thought it was standard in YBW to just have the slave threads loop all the splitblocks continuously until work comes available. I'm amazed (and scared :-)) Stockfish didn't have this yet.
I can't say for sure how it has been implemented in any specific program.

But the term "YBWC" (Young Brothers Wait Criterion) was coined by (I believe) Feldman, based on the original concept in the PVS (Principle Variation Search) algorithm developed somewhere in the mix with Newborn, although whether he worked with anyone else I do not know. The idea was to satisfy alpha/beta by first establishing abound before doing a parallel search, otherwise you search without a bound and do significant extra work. The term YBW seemed to be a reasonable one for the idea used in PVS where you search down the left side of the tree first, and split at the tip, then back up one and split there since you now have a bound from the first move, etc. YBW was an enhancement that said one could search at _any_ node once one branch was completed. The concept did not require that everyone sit at one node together, although Feldman's implementation may well have done that...

Today, rather than establishing a bound, searching the first "brother" really is a way of weakly proving that this is an ALL node since most CUT nodes fail on the first move. Of course you do establish a bound along the left-edge, but most searches are null-window and no tighter bounds are possible.

DTS was simply a major improvement over the PVS algorithm from Newborn, and then the EPVS algorithm we developed in Cray Blitz to prevent the PVS approach where everyone sat at a single node until it was completed. That's where the "D" in "DTS" comes from, - Dynamic -> anywhere, anytime, and in CB we didn't fully enforce YBW at all, we could split anywhere. We preferred to split at YBWC-satisfied nodes, but rather than have a CPU sit idle, we would split at what we predicted would be ALL nodes if necessary.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS-ification of YBWC

Post by bob »

zamar wrote:
bob wrote: Crafty has been searching using the dynamic tree splitting approach, without the ability to look at anything but the current node, since day 1. The primary distinction of DTS is that "nobody waits". There are other details in the implementation we did, but the basic idea was that when a CPU goes idle, it immediately starts to help someone that is still busy. Nothing more.
Have you measured the time threads are idle in Crafty's YBW-implementation? I bet it's very small (excluding ultra-fast games), almost meaningless. I believe the true advantage to change from YBW to DTS comes from the ability to choose the most attractive point for split, not avoiding processing staying idle for a small time.

BTW, the patch Marco posted just makes our YBW implementation a bit more DTS-like. It's very far from true DTS implementation.
When you run on 16 or 32 cores, the idle time begins to add up. That's why I added the "thread group" idea years ago, so that too many don't try to split at the same point, which is deadly in endgames or along tactical lines where you have to escape check.

As I said previously, DTS offers _two_ advantages over the recursive approach I currently use.

(1) Crafty doesn't split until the YBWC is satisfied. This can leave processors idle, since this is tested at the current node only, and at each succeeding node so long as idle processors exist. DTS can split anywhere, anytime, so that this idle time does not accumulate.

(2) DTS gives you a chance to find a _much better_ split point. I'd always prefer splitting close to the root if YBWC is satisfied there, or even at the root as I do in Crafty. But I can't split there until I back up in the search to get there. Which means that since I spend most of the time deep in the tree, I have to split at points where there is not a lot of work to do. I limit this by not splitting too near the end of the normal search, but this then increases the effect covered in (1) above. So in Crafty I end up doing a lot of splits, deep in the tree, where the split overhead becomes a significant cost. At the root, there is almost zero cost. DTS can split at the root even if the current node is at ply=20, which is a plus.

This becomes more important as more cores are used. When I make these kinds of changes (and I do modify this code from time to time) I can test it on our 70 node x 8 core cluster to see if it plays better than the old version, which high accuracy. So I don't have to "guess" whether it helps or not. A 10% speedup is not going to be 10 Elo. So it takes a ton of games so measure such. Also slower games are different than faster games when dealing with SMP issues, so long games are also important.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Fixing YBWC implementation

Post by mcostalba »

Gian-Carlo Pascutto wrote:I agree the naming here is very confusing and what is proposed has nothing to do with DTS.

I thought it was standard in YBW to just have the slave threads loop all the splitblocks continuously until work comes available. I'm amazed (and scared :-)) Stockfish didn't have this yet.
IMHO to loop continuously doesn't make sense becuase if a thread is available will be picked up by the master just a moment before a new split point (activated by the above master) become available.

Instead has a sense to check for split points only once as soon as the thread becomes available.

Anyhow the point is that with a QUAD we are not able to spot any difference form standard version.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Fixing YBWC implementation

Post by Gian-Carlo Pascutto »

Maybe I didn't explain it very well but I meant that you want to exhaust the splitblock list before signaling that you are going idle, and that you then busy-wait on valid splitblocks or wait on a condition before going through the splitblock list.

What I understood is that you currently always signal that you're going idle after you finish a splitblock, without looking for any other pending stuff.

Neither of this has much to do with DTS vs YBW.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: DTS-ification of YBWC

Post by zamar »

bob wrote: When you run on 16 or 32 cores, the idle time begins to add up.
I have no experience of these systems, so you are likely right here. When these machines become more common we will pay more attention to SMP-code and consider switching to DTS. With Quad or Octa it seems that simple YBW is more than sufficient.
(1) Crafty doesn't split until the YBWC is satisfied. This can leave processors idle, since this is tested at the current node only, and at each succeeding node so long as idle processors exist. DTS can split anywhere, anytime, so that this idle time does not accumulate.

(2) DTS gives you a chance to find a _much better_ split point. I'd always prefer splitting close to the root if YBWC is satisfied there, or even at the root as I do in Crafty. But I can't split there until I back up in the search to get there. Which means that since I spend most of the time deep in the tree, I have to split at points where there is not a lot of work to do. I limit this by not splitting too near the end of the normal search, but this then increases the effect covered in (1) above. So in Crafty I end up doing a lot of splits, deep in the tree, where the split overhead becomes a significant cost. At the root, there is almost zero cost. DTS can split at the root even if the current node is at ply=20, which is a plus.

This becomes more important as more cores are used. When I make these kinds of changes (and I do modify this code from time to time) I can test it on our 70 node x 8 core cluster to see if it plays better than the old version, which high accuracy. So I don't have to "guess" whether it helps or not. A 10% speedup is not going to be 10 Elo. So it takes a ton of games so measure such. Also slower games are different than faster games when dealing with SMP issues, so long games are also important.
Agreed.
Joona Kiiski