Implementation of multithreaded search in Jazz

Evert · Post by **Evert** » Sat Apr 20, 2013 12:39 pm

Because I would have found something like this useful (and couldn't find it) for implementing a parallel search in Jazz, I decided to write down the steps I took to implement it. Much of the implementation is based on the implementation in Viper and Stockfish. The description below has many details that are specific to Jazz, but hopefully it'll still be useful to someone who wants to implement something similar.

First, some preliminary considerations about data structures. Because I originally designed Jazz as a library of chess-related functions rather than as a chess program (or engine) there is almost no global state anywhere in the program. Most functions take either a board_t struct describing a chess position, or a gamestate_t struct that holds information about the current game (including the board position, move history, repetition table, transposition table, killer lists, thinking time etc.). The only data that is actually global are "read only" tables for the move generator and evaluation function. This is fortunate, because it means that I had to change very little in the structure of the program itself.

Jazz uses pseudo-legal move generation, and I generate all moves up front in one go.
The move from the hash table, if it is legal, is then tried first and if it generates a cut-off, the search returns immediately. If it doesn't, the remaining moves are scored and sorted according to their characteristics (killer moves, winning captures, some history-like heuristics, static ordering based on piece square tables). We then loop until we've tried all moves in the list. The move list is a structure that stores the move list, as well as the score (for each move), the permutation, the number of moves and the current move number.

For the parallel search I use two extra data structures: a "split point", which holds information about the position where the parallel search is started and that threads use to communicate, and a "thread", which holds information about each thread used in the search:

Code: Select all

typedef struct split_t &#123;
   struct gamestate_t *game;
   movelist_t *movelist;
   int static_score;
   volatile uint32_t workers;
   volatile int alpha;
   volatile int beta;
   volatile int score;
   volatile int draft;
   volatile int depth;
   volatile move_t best_move;
   volatile bool stop;
   struct split_t *parent;
   lock_t lock;
&#125; split_t;

The "game" element is used to pass the state of the board to the different worker threads. The "movelist" is the above mentioned move list at the split point. The different threads will pull moves from this list until it is empty.
"static_score" is the static evaluation at the split point and is passed around
because it is used for some reduction/pruning decisions. "Workers" is a bitfield that is used to keep track of which threads are working on this split point (the corresponding bit is set to 1). Variables "alpha", "beta", "score", "draft", "depth" and "best_move" are used to control the search and return the result. "stop" becomes true in the case of a beta-cutoff, "parent" can be used to backup a split point to implement a "helpful master" (not currently used, but inherited from the Viper implementation). Finally, "lock" is a spinlock (or mutex) that is set when the split point needs to be updated.

The thread-specific data looks like this:

Code: Select all

typedef struct thread_t &#123;
   struct gamestate_t *game;     /* Work space for each thread */
   split_t *splitpoint;          /* Current split point */
   int thread_id;                /* Thread ID number */
   volatile int active_splits;   /* Number of active splits for this thread */
   volatile uint8_t state;       /* The current state &#40;flag&#41; of the thread */
   volatile bool must_die;       /* Thread will exit if this becomes true &#40;FIXME&#58; should be a state flag?) */
   pthread_t pt;                 /* pthreads thread ID &#40;for pthread_join&#41; */
   char barrier&#91;128&#93;;            /* Make sure data for different threads is on different cache lines */
&#125; thread_t;

The different thread states are:

Code: Select all

#define THREAD_WORKING     0x01  /* Thread is currently searching */
#define THREAD_SLEEPING    0x02  /* Thread is sleeping */
#define THREAD_WAITING     0x04  /* Thread is waiting to be assigned work */
#define THREAD_SCHEDULED   0x08  /* Thread has been assigned work but has not started searching yet */

When the parallel search is initialised, the worker threads have their state set to "SLEEPING" and execute the following function (waiting loop):

Code: Select all

static void *thread_func&#40;void *arg&#41;
&#123;
   /* Initialisation */
   thread_t *self = &#40;thread_t *&#41;arg;

   while (!self->must_die&#41; &#123;
      if &#40;self->state == THREAD_SLEEPING&#41; &#123;
         assert&#40;self->thread_id != 0&#41;;
         pthread_mutex_lock&#40;&sleep_mutex&#41;;
         pthread_cond_wait&#40;&sleep_condition, &sleep_mutex&#41;;
         pthread_mutex_unlock&#40;&sleep_mutex&#41;;
         self->state = THREAD_WAITING;
      &#125;

      /* Wait for work */
      while &#40;self->state == THREAD_WAITING && !self->must_die&#41;;

      /* If we're scheduled to do work, start working */
      while &#40;self->state == THREAD_SCHEDULED&#41;;

      if &#40;self->state == THREAD_WORKING&#41; &#123;
         /* Call SMP search */
         search_split&#40;self->game, self->splitpoint&#41;;

         /* Mark thread as waiting for work. */
         self->state = THREAD_WAITING;
      &#125;
   &#125;
   return NULL;
&#125;

When the threads are sleeping they wait on a condition variable to wake them up, which is done at the start of the search. This isn't crucial, but it's nice to not use the CPU when we're not thining. The rest should be fairly straightforward.

The search_split() function is a stripped-down version of the normal search
function:

Code: Select all

void search_split&#40;gamestate_t *game, split_t *split&#41;
&#123;
   /* Initialisation */
   bool check = in_check&#40;0&#41;;
   int me = game->side_to_move;

   /* Lock the split point */
   acquire_lock&#40;&split->lock&#41;;

   assert&#40;split->workers & &#40;1<<game->thread_id&#41;);

   int depth = split->depth;
   int draft = split->draft;
   int static_score = split->static_score;
   int num_moves = split->movelist->num_moves;

   game->num_moves&#91;depth&#93; = split->movelist->num_moves;
   int alpha = split->alpha;
   int beta  = split->beta;
   bool pv_node = beta > alpha+1;

   while&#40;split->movelist->cur_move < num_moves&#41; &#123;
      if &#40;thread_should_stop&#40;game->thread_id&#41;) break;
      if &#40;alpha >= beta&#41; break;
      move_t move = get_next_move&#40;split->movelist&#41;;
      int move_score = get_move_score&#40;split->movelist&#41;;

      /* Release the lock while we don't need it */
      release_lock&#40;&split->lock&#41;;

      int local_extension = check*PLY; /* Check extension */

      /* Play move and evaluate the resulting position recursively -
       * assuming the move was legal.
       */
      playmove&#40;game, move&#41;;
      if (!check && move_leaves_player_in_check&#40;game, move, me&#41;) &#123;  /* Illegal move! */
         takeback&#40;game&#41;;
         acquire_lock&#40;&split->lock&#41;;
         continue;
      &#125;
      assert&#40;!player_in_check&#40;game, me&#41;);
      moves_searched++;

      int score = -alphabeta&#40;game, depth+1, draft+local_extension-PLY, -alpha-1, -alpha&#41;;

      /* Research moves that were reduced but now fail high &#40;improve alpha&#41;
       * on normal depth &#40;confirm the fail high&#41;.
       */
      if &#40;score > alpha && local_extension < 0&#41; &#123;
         score = -alphabeta&#40;game, depth+1, draft-PLY, -beta, -alpha&#41;;
      &#125;

      takeback&#40;game&#41;;

      /* Acquire the lock so we can update the split point */
      acquire_lock&#40;&split->lock&#41;;

      /* TODO&#58; keep track of maximum nodes searched in a sub tree and the
       * move that goes with it.
       * For updating ALL-nodes.
       */

      if &#40;abort_search&#41;
         stop_split_search&#40;split&#41;;

      if (!thread_should_stop&#40;game->thread_id&#41;) &#123;
         if &#40;score > split->score&#41; &#123;
            split->score = score;

            if &#40;score > split->alpha&#41; &#123;
               split->alpha = score;

               /* Store best move */
               split->best_move = move;

               /* Beta cutoff, terminate all workers */
               if &#40;split->alpha >= split->beta&#41; &#123;
                  stop_split_search&#40;split&#41;;
                  break;
               &#125;
            &#125;
         &#125;

         alpha = split->alpha;
      &#125;
   &#125;

   /* We're done with this split point */
   assert&#40;split->workers & &#40;1<<game->thread_id&#41;);
   split->workers ^= 1<<game->thread_id;

   /* Update node counters */
   split->game->positions_evaluated    += game->positions_evaluated;
   split->game->moves_searched         += game->moves_searched;
   split->game->positions_in_hashtable += game->positions_in_hashtable;
   split->game->branches_pruned        += game->branches_pruned;
   split->game->branches_pruned_1st    += game->branches_pruned_1st;

   release_lock&#40;&split->lock&#41;;
&#125;

Finally, the split() function from which the split point is created is

Code: Select all

void split&#40;gamestate_t *game, movelist_t *movelist, int static_score, int *alpha, int *beta, int depth, int draft, int *score, move_t *best_move&#41;
&#123;
   int me = game->thread_id;
   int n;

   if &#40;threads_created <= 1&#41; return;

   /* We have to manipulate the split point stack */
   acquire_lock&#40;&split_mutex&#41;;

   /* Maximum number of split points reached? */
   if &#40;thread&#91;me&#93;.active_splits == MAX_SPLIT&#41; goto done;

   /* Test if there is an idle thread waiting */
   if (!thread_available&#40;me&#41;) goto done;

   /* There are threads available, initialise a split point */
   split_t *sp = &splitpoint&#91;me&#93;&#91;thread&#91;me&#93;.active_splits&#93;;
   thread&#91;me&#93;.active_splits++;

   /* Copy data */
   TRACE&#40;"Creating split point, ply = %d depth = %d &#91;a,b&#93;=&#91;%d, %d&#93;\n", depth, draft/PLY, *alpha, *beta&#41;;
   copy_game_state_to_splitpoint&#40;game, sp->game&#41;;
   sp->movelist = movelist;
   sp->parent = thread&#91;me&#93;.splitpoint;
   sp->alpha = *alpha;
   sp->beta  = *beta;
   sp->score = *score;
   sp->depth = depth;
   sp->draft = draft;
   sp->best_move = *best_move;
   sp->stop  = false;
   sp->workers = 0;

   thread&#91;me&#93;.splitpoint = sp;

   /* Assign all workers */
   int num_workers = 0;
   for &#40;n=0; n<threads_created; n++) &#123;
      sp->workers |= &#40;1<<me&#41;;
      if &#40;thread_can_help&#40;n, me&#41; || n == me&#41; &#123;
         TRACE&#40;"Scheduling work for thread %d\n", n&#41;;
         thread&#91;n&#93;.state = THREAD_SCHEDULED;
         thread&#91;n&#93;.splitpoint = sp;
         copy_game_state_from_splitpoint&#40;sp->game, thread&#91;n&#93;.game&#41;;
         sp->workers |= &#40;1<<n&#41;;
         num_workers++;
      &#125;
   &#125;

   /* We should have more than one worker available, otherwise we would have exited
    * this function at the top.
    */
   assert&#40;num_workers > 1&#41;;

   /* We can release the split lock since workers have already been marked as
    * "scheduled"
    */
   release_lock&#40;&split_mutex&#41;;

   /* Now kick all workers into action */
   TRACE&#40;"Starting threads\n");
   for &#40;n=0; n<threads_created; n++) &#123;
      if &#40;sp->workers & &#40;1<<n&#41;) &#123;
         thread&#91;n&#93;.state = THREAD_WORKING;
      &#125;
   &#125;

   /* Enter thread loop for the master thread */
   TRACE&#40;"Entering loop (#%d&#41;\n", me&#41;;
   search_split&#40;thread&#91;me&#93;.game, thread&#91;me&#93;.splitpoint&#41;;
   assert&#40; &#40;sp->workers & &#40;1<<me&#41;) == 0&#41;;
   while &#40;sp->workers&#41;;
   TRACE&#40;"Exit loop (#%d&#41;\n", me&#41;;

   /* Update variables from the split point and return */
   acquire_lock&#40;&split_mutex&#41;;
   *alpha = sp->alpha;
   *beta  = sp->beta;
   *score = sp->score;
   *best_move = sp->best_move;
   thread&#91;me&#93;.active_splits--;
   thread&#91;me&#93;.splitpoint = sp->parent;
   TRACE&#40;"Joined split point, ply = %d depth = %d &#91;a,b&#93;=&#91;%d, %d&#93;\n", depth, draft/PLY, *alpha, *beta&#41;;

   game->positions_evaluated    += sp->game->positions_evaluated;
   game->moves_searched         += sp->game->moves_searched;
   game->positions_in_hashtable += sp->game->positions_in_hashtable;
   game->branches_pruned        += sp->game->branches_pruned;
   game->branches_pruned_1st    += sp->game->branches_pruned_1st;

done&#58;
   release_lock&#40;&split_mutex&#41;;
   return;
&#125;

This is called from the main search after the first move has been searched:

Code: Select all

   /* Search remaining moves if no cut-off yet */
   if &#40;alpha < beta&#41; &#123;
      /* Get the static evaluation score - if we haven't calculated it
       * already and are not in check.
       */
      if (!in_check&#40;0&#41; && static_score == -CHECKMATE&#41;
         static_score = static_evaluation&#40;game, me, alpha, beta&#41;;

      sort_moves&#40;movelist.perm, movelist.num_moves, movelist.score&#41;;
      get_next_move&#40;&movelist&#41;;  /* Skip the first move, already searched */

      assert&#40;best == movelist.perm&#91;0&#93;);

      if (!pv_node && split_ok && draft > 4*PLY&#41;
         split&#40;game, &movelist, static_score, &alpha, &beta, depth, draft, &best_score, &hash_move&#41;;

      while&#40;movelist.cur_move < movelist.num_moves && alpha < beta&#41; &#123;
         /* Search moves */
      &#125;
   &#125;

The loop over all moves is only entered if the split was not taken (otherwise the move list will be empty, or a cut-off has occurred so alpha>=beta) and actually contains another call to split() at the end of the loop, in case the split can be taken after some of the moves have been searched (because threads became available).

the function that checks for available threads is largely taken from Viper:

Code: Select all

static bool thread_can_help&#40;int helper, int master&#41;
&#123;
   if &#40;helper == master&#41; return false;                            /* Can't help self         */
   if &#40;thread&#91;helper&#93;.state != THREAD_WAITING&#41; return false;      /* Can't help if busy      */

   int active_splits = thread&#91;helper&#93;.active_splits;
   if &#40;active_splits == 0&#41; return true;            /* Can always help if idle */

   /* If we're "the other thread" we can always help. */
   if &#40;threads_created == 2&#41; return true;

   /* TODO&#58; if the helper has active splits, but they are on the same branch as the
    * master is searching, then it can still help.
    */
   if &#40;splitpoint&#91;helper&#93;&#91;active_splits - 1&#93;.workers & &#40;1<<master&#41;)
      return true;

   return false;
&#125;

/* Returns true if there are any threads available to help the master thread */
static bool thread_available&#40;int master&#41;
&#123;
   int n;
   for &#40;n=0; n<threads_created; n++) &#123;
      if &#40;thread_can_help&#40;n, master&#41;)
         return true;
   &#125;
   return false;
&#125;

For completeness sake, here are the initialisation and some utility functions:

Code: Select all

static thread_t thread&#91;MAX_THREADS&#93;;
static int threads_created = 0;

static split_t splitpoint&#91;MAX_THREADS&#93;&#91;MAX_SPLIT&#93;;

/* Condition variable for threads to go to sleep */
pthread_cond_t sleep_condition;
pthread_mutex_t sleep_mutex;
lock_t split_mutex;

/* Return the total number of cores on the current machine.
 * NB&#58; returns the number of "logical" cores on hyper-threading machines.
 */
int get_number_of_cores&#40;void&#41;
&#123;
#ifdef WIN32
    SYSTEM_INFO sysinfo;
    GetSystemInfo&#40;&sysinfo&#41;;
    return sysinfo.dwNumberOfProcessors;
#else
    return sysconf&#40;_SC_NPROCESSORS_ONLN&#41;;
#endif
&#125;

/* Kick all threads into action */
void wake_all_threads&#40;void&#41;
&#123;
   pthread_mutex_lock&#40;&sleep_mutex&#41;;
   pthread_cond_broadcast&#40;&sleep_condition&#41;;
   pthread_mutex_unlock&#40;&sleep_mutex&#41;;
&#125;

/* Tell all threads to go to sleep */
void sleep_all_threads&#40;void&#41;
&#123;
   int n;
   for &#40;n=1; n<threads_created; n++)
      thread&#91;n&#93;.state = THREAD_SLEEPING;
&#125;

/* Initialise data structures for the specified number of threads */
void init_threads&#40;int num_threads&#41;
&#123;
   int n;
   if &#40;num_threads > MAX_THREADS&#41; num_threads = MAX_THREADS;

   if &#40;threads_created&#41;
      kill_threads&#40;);

   threads_created = 0;
   if &#40;num_threads < 2&#41; return;

   /* Initialise global condition and mutex variables */
   pthread_cond_init&#40;&sleep_condition, NULL&#41;;
   pthread_mutex_init&#40;&sleep_mutex, NULL&#41;;

   init_lock&#40;&split_mutex&#41;;

   /* Initialise split points */
   for &#40;n=0; n<num_threads; n++) &#123;
      for &#40;int k=0; k<MAX_SPLIT; k++) &#123;
         splitpoint&#91;n&#93;&#91;k&#93;.game = create_game&#40;);

         init_lock&#40;&splitpoint&#91;n&#93;&#91;k&#93;.lock&#41;;
      &#125;
   &#125;

   /* Set options for created threads&#58; run them detached from the main thread */
   pthread_attr_t attributes;
   pthread_attr_init&#40;&attributes&#41;; 
   pthread_attr_setdetachstate&#40;&attributes, PTHREAD_CREATE_DETACHED&#41;; 

   /* Initialise threads */
   memset&#40;&thread, 0, sizeof thread&#41;;
   for &#40;n=0; n<num_threads; n++) &#123;
      gamestate_t *game = create_game&#40;);
      thread&#91;n&#93;.thread_id = n;
      thread&#91;n&#93;.must_die = false;
      thread&#91;n&#93;.game = game;

      if &#40;n == 0&#41; continue;

      thread&#91;n&#93;.state = THREAD_SLEEPING;
      if &#40;pthread_create&#40;&thread&#91;n&#93;.pt, NULL, thread_func, &thread&#91;n&#93;) != 0&#41; &#123;
         printf&#40;"Failed to launch thread %d\n", n&#41;;
         exit&#40;0&#41;;
      &#125;
   &#125;
   thread&#91;0&#93;.state = THREAD_WORKING;
   thread&#91;0&#93;.must_die = false;
   threads_created = num_threads;
&#125;

/* Terminate all threads */
void kill_threads&#40;void&#41;
&#123;
   int n;

   for &#40;n=1; n<MAX_THREADS; n++) &#123;
      thread&#91;n&#93;.must_die = true;
   &#125;

   wake_all_threads&#40;);

   for &#40;n=1; n<threads_created; n++) &#123;
      pthread_join&#40;thread&#91;n&#93;.pt, NULL&#41;;
      thread&#91;n&#93;.game->transposition_table = NULL;     /* The transposition table is shared */
      thread&#91;n&#93;.game->eval_hash = NULL;     /* The transposition table is shared */
      end_game&#40;thread&#91;n&#93;.game&#41;;
   &#125;
   threads_created = 0;
&#125;

/* Set status word&#58; terminate search at the secified split point, in case of a
 * beta-cutoff.
 */
void stop_split_search&#40;split_t *split&#41;
&#123;
   split->stop = true;
&#125;

Most data is local to each thread. The only shared resources are the main transposition table and the evaluation hash table (both implemented using a lock-less scheme).

I'm sure it's not the best implementation in the world and there is much that can be improved, but at least it seems to be working so far and hopefully this write-up will be useful to someone later on.

lucasart · Post by **lucasart** » Tue Apr 23, 2013 5:46 am

Thank you very much for this long and detailed write up. This kind of post really stands out from the average forum spam. You might be disappointed to see no replies yet. But I can tell you that I have read this with great interest, and put it on my todo list.

Implementing SMP is a really daunting task, that I've always postponed. But eventually, I'll have to get to work and do an SMP version of DiscoCheck. Computers, Tablets and Phones now have more and more cores, and it's a waste not to use them.

velmarin · Post by **velmarin** » Tue Apr 23, 2013 8:41 am

lucasart wrote:Thank you very much for this long and detailed write up. This kind of post really stands out from the average forum spam. You might be disappointed to see no replies yet. But I can tell you that I have read this with great interest, and put it on my todo list.

Implementing SMP is a really daunting task, that I've always postponed. But eventually, I'll have to get to work and do an SMP version of DiscoCheck. Computers, Tablets and Phones now have more and more cores, and it's a waste not to use them.

So I like more and is according to their intelligence,

Otherwise agree, thank you very much Everd, for days to save that page, see the separate code is easier for neophytes like me.

I hope not Spam ....,
Thank you very much.

edwardyu · Post by **edwardyu** » Tue Apr 23, 2013 10:33 am

Thanks for the detail !
I try to implement lazy SMP (split at the root) usng Microsoft's VS2010 C++ PPL (Parallel Programming Library). I use some code like :

Code: Select all

parallel_invoke&#40;
&#91;&&#93; &#123;
    	    Board* spboard;           
          Board spboardref;
					spboardref = board;
					spboard = &spboardref;
         	PPL_searchRoot&#40;2, spboard, alpha2, beta, size, m_depth, movetab, incheck, 
         	                   m_bestmove2,
         	                   old_alpha,  root_nodes, 
                             nUnchanged, pv_i2 //, m_timeout, m_nodes   
         	                   ); 
         	                  &#125;,
&#91;&&#93; &#123;         	                  
          Board* spboard;           
          Board spboardref;
					spboardref = board;
					spboard = &spboardref;
         	PPL_searchRoot&#40;1, spboard, alpha, beta, size, m_depth, movetab, incheck, 
         	                   m_bestmove,
         	                   old_alpha,  root_nodes, 
                             nUnchanged, pv_i //, m_timeout, m_nodes   
         	                   ); 
         	                  &#125;
               );

but the resulting speed is ALWAYS much slower than the single core version !!! I try Intel's TBB (Thread Building Block) and the result is worser than MS's VS2010. I have asked the experts in the MS Forum but they said the parallel overhead is larger than the gain by multi-core computing. Have you tried or heard any engine writer using this latest C++ X11 high-level parallel programming?

hgm · Post by **hgm** » Tue Apr 23, 2013 11:16 am

I had a similar experience when I tried to implement SMP by running two (otherwise unmodified) processes with the hash table in a shared memory area. This caused such a large slowdown that none of the tricks I later added to prevent them from doing the same thing could overcome it. Even my most advanced solution was slower than searching with the single process.

I never found out why that was. (But I spent only 3 days on it; after that the engine had to play in the tourney I was preparing it for (the Olympiad 2010), and I never got back to it afterwards.)

lucasart · Post by **lucasart** » Tue Apr 23, 2013 2:16 pm

edwardyu wrote:Thanks for the detail !
I try to implement lazy SMP (split at the root) usng Microsoft's VS2010 C++ PPL (Parallel Programming Library). I use some code like :
Code: Select all
parallel_invoke&#40;
&#91;&&#93; &#123;
    	    Board* spboard;           
          Board spboardref;
					spboardref = board;
					spboard = &spboardref;
         	PPL_searchRoot&#40;2, spboard, alpha2, beta, size, m_depth, movetab, incheck, 
         	                   m_bestmove2,
         	                   old_alpha,  root_nodes, 
                             nUnchanged, pv_i2 //, m_timeout, m_nodes   
         	                   ); 
         	                  &#125;,
&#91;&&#93; &#123;         	                  
          Board* spboard;           
          Board spboardref;
					spboardref = board;
					spboard = &spboardref;
         	PPL_searchRoot&#40;1, spboard, alpha, beta, size, m_depth, movetab, incheck, 
         	                   m_bestmove,
         	                   old_alpha,  root_nodes, 
                             nUnchanged, pv_i //, m_timeout, m_nodes   
         	                   ); 
         	                  &#125;
               );  
but the resulting speed is ALWAYS much slower than the single core version !!! I try Intel's TBB (Thread Building Block) and the result is worser than MS's VS2010. I have asked the experts in the MS Forum but they said the parallel overhead is larger than the gain by multi-core computing. Have you tried or heard any engine writer using this latest C++ X11 high-level parallel programming?

This simplistic approach cannot work. You cannot search all the root moves independantly. The key thing is that once the first one has been searched, alpha is raised, and so on. You can't just treat the problem as N independant tasks (where N is the number of root moves). Unless of course you're doing a minmax and not using alpha/beta pruning, but I don't think you'll go far without alpha/beta prunuing.

hgm · Post by **hgm** » Tue Apr 23, 2013 3:03 pm

lucasart wrote:This simplistic approach cannot work. You cannot search all the root moves independantly. The key thing is that once the first one has been searched, alpha is raised, and so on. You can't just treat the problem as N independant tasks (where N is the number of root moves). Unless of course you're doing a minmax and not using alpha/beta pruning, but I don't think you'll go far without alpha/beta prunuing.

This is only partly true: if you use an aspiration search, the alpha you start with could be high enough that most moves fail low against it, even without raising it to the level of the best move.

Even when you search with fully open window the first move you search will take only about 30-40% of the total time of the iteration. Which typically raises alpha to its final value (as usually this move remains the best), allowing the remaining moves to be searched in 60-70% of the time. Even if you would search all these other moves also with a fully open window, they would each also take about 30-40% of the total time. So if you have as many cores as moves, it could still reduce the total time to 30% of what a single-threaded search would need.

lucasart · Post by **lucasart** » Tue Apr 23, 2013 3:29 pm

hgm wrote:This is only partly true: if you use an aspiration search, the alpha you start with could be high enough that most moves fail low against it, even without raising it to the level of the best move.

You're taking an edge case. What you should look at is the general case instead. In the general case, the first move is best, and the score is within the aspiration window's bounds.

hgm wrote: Even when you search with fully open window the first move you search will take only about 30-40% of the total time of the iteration. Which typically raises alpha to its final value (as usually this move remains the best), allowing the remaining moves to be searched in 60-70% of the time. Even if you would search all these other moves also with a fully open window, they would each also take about 30-40% of the total time. So if you have as many cores as moves, it could still reduce the total time to 30% of what a single-threaded search would need.

I don't know about other programs, but my engine spends much more time on the first root move than on all the others on average. In particular, I use LMR at the root, exactly like in any other node. But yes, perhaps in a really basic alpha/beta full depth searcher there could be a small gain.

Evert · Post by **Evert** » Tue Apr 23, 2013 3:32 pm

lucasart wrote:Thank you very much for this long and detailed write up. This kind of post really stands out from the average forum spam. You might be disappointed to see no replies yet. But I can tell you that I have read this with great interest, and put it on my todo list.

Thanks! Glad you found it useful.

I wasn't necessarily expecting a lot of discussion/replies, but it's good to hear that people find it useful (or not, as the case may be).

Implementing SMP is a really daunting task, that I've always postponed. But eventually, I'll have to get to work and do an SMP version of DiscoCheck. Computers, Tablets and Phones now have more and more cores, and it's a waste not to use them.

Indeed. I actually found the process less daunting than I had initially thought, but part of that is due to Jazz' lack of global variables. I'm very glad I did that.

I had to make two changes: the first was to store score and permutation data in the move list (they weren't in there before) and the second was to do away with the 64k repetition hash table (I didn't bench it, but building it up from scratch at each split point seemed like a certain way to make things extremely slow). Both of these changes are transparent and I implemented (and tested) both of these before adding the parallel search.

My initial trouble was largely due to a bug I mentioned in the other thread ("parallel search slowdown?"): setting the window as (1-alpha, -alpha). I now assert(alpha<beta) at the top of my search...

I did run into two points after the above post: the first is to check (in the normal search) whether the thread has been cancelled (due to a cutoff higher up in the tree); this can be handled the same as with the check for the clock. Without this you waste time waiting for threads to finish their search and back up to the split point before they realise that a cut-off has occurred.
The second concerns time control and is something I need to go and fix properly: I use node counts to decide when to check the clock, specifically (slightly simplified, but you get the idea)

Code: Select all

static const int clock_positions = 0x00007FFF;
if (&#40;game->moves_searched & clock_positions&#41;==0&#41;
   abort_search |= check_clock&#40;game&#41;

This works poorly for a parallel search because workers may terminate before reaching the required number of positions and the master is unlikely to satisfy the condition when its node-counter is updated with data from child nodes. The result is that multi-core Jazz tends to lose many games on time. There are several ways to fix it of course, but I haven't decided on the one I like most...

Evert · Post by **Evert** » Tue Apr 23, 2013 3:41 pm

edwardyu wrote:but the resulting speed is ALWAYS much slower than the single core version !!! I try Intel's TBB (Thread Building Block) and the result is worser than MS's VS2010. I have asked the experts in the MS Forum but they said the parallel overhead is larger than the gain by multi-core computing. Have you tried or heard any engine writer using this latest C++ X11 high-level parallel programming?

I'm afraid I can offer little help here, I'm not even sure how to read your code.

Note that I don't split at the root (Jazz has a separate tree-search/root-search) but I don't think there's any real reason I couldn't do that, but it seemed a minor gain compared to splitting during the search (someone correct me if I'm wrong).

A few general points then: Make sure that the cost of creating a split point is as small as possible. If there's bits of state that you don't really need, don't copy them. A second point is to create the worker threads ahead of time instead of at the split point. If the search is deep enough overhead from launching the threads should be small, but at shallower depths it could hurt badly. I don't know how easy it is to control that using compiler-generated multi-threaded code. I considered using OpenMP rather than pthreads directly, but decided against it in the end. More manual control over what is happening seemed like a good thing.
I may explore OpenMP as an option in the future if and when I'm more comfortable with parallel search.

EDIT: one more thing I wanted to say: print the tree of the parallel search and compare it to the single threaded case. This is most useful for a low depth search (the lowest that makes a difference, if you use a condition on the split depth). That way you can monitor exactly what is being searched and verify that it makes sense. Also, add loads of assertions to the code to help catch bugs.

Implementation of multithreaded search in Jazz

Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz

Re: Implementation of multithreaded search in Jazz