I tried to work out how it "should" work first, then I checked my understanding against the implementations in Viper and Stockfish, so it's possible there are some peculiarities that I missed. I'll give a quick overview of what I've implemented, but the first issue I need some help with/clarification on is more general and is at the end (so feel free to skip ahead to that point).
At startup, I launch N-1 threads (using pthreads), so I have N threads in total (including the main thread). The extra threads have a status flag, which is initialised to WAITING. During the search, after the first move has been searched, I call a "split" function (if constraints on remaining depth are satisfied), passing in the current position, move history, search bounds, move list, remaining depth and ply number. This function checks (behind a mutex) whether any threads are marked WAITING. If there are, it copies the position data to a "split point" structure, records which threads are available to help at that split point, and sets their status to SCHEDULED. The mutex is then released, the scheduled threads are switched to WORKING, and the thread that entered the split function joins the parallel search at the split point.
In the parallel search function, threads pull a move from the move list (under a mutex), make the move, and call the normal search function. Search bounds and the best move are updated at the split point under a mutex. Once a thread is done with a split point (either because of a cut-off or because all moves have been searched), it unregisters from the split point and changes its status back to WAITING.
Once all threads have unregistered from the split point, the master thread returns with the results from the parallel search. The normal search then finishes the node, taking care of TT and killer updates.
It works properly in the sense that it runs (in debug mode) without triggering any assertions, meaning the board and move lists are passed correctly and if I track nodes searched the different threads all pick up work as they should. However, it does not work properly in the sense that the search is slower than the single-threaded search.
The way I measure the performance is simply to run the benchmark command with a given number of active threads, then look at the total time spent and/or the NPS reported. The node count changes, of course (it goes up by ~10% compared to the single-threaded search), but the time to complete the benchmark search increases by ~25-30% in the multi-threaded case. I can roughly break even if I set the maximum split depth large enough (>8-10 ply from the horizon) and let the program search deep enough, but I haven't been able to do any better so far.
Looking at a profile report suggests that the problem is in the idle loop, which looks like this:
Code:
static void thread_loop(thread_t *self, split_t *sp)
{
    while (!self->must_die) {
        if (self->state == THREAD_SLEEPING) {
            assert(self->thread_id != 0);
            pthread_mutex_lock(&sleep_mutex);
            pthread_cond_wait(&sleep_condition, &sleep_mutex);
            pthread_mutex_unlock(&sleep_mutex);
            self->state = THREAD_WAITING;
        }
        /* Wait for work */
        //while (!sp && self->state == THREAD_WAITING && !self->must_die);
        /* If we're scheduled to do work, start working */
        //while (self->state == THREAD_SCHEDULED);
        if (self->state == THREAD_WORKING) {
            /* Call SMP search */
            search_split(self->game, self->splitpoint);
            /* Mark thread as waiting for work. */
            self->state = THREAD_WAITING;
        }
        /* If the current thread is the master of a split point, wait for
         * all helpers to finish, then return.
         */
        if (sp) {
            //while (sp->workers);
            if (sp->workers == 0)
                return;
        }
    }
}
To get a better feel for what is going on, I commented out the split call in the main search, so the code never enters the parallel search, and restricted the setup to 1 extra thread (so 2 threads in total). With this setup the extra thread is just launched and then busy-waits until the program exits. The thing is: if I run the benchmark test, the code is still 25-30% slower than when no threads are running in the background.
I've verified this on two test systems: a dual-core i7 MacBook (running OS X) and a quad-core i7 desktop (running Linux), running one extra thread (which is never called on to do any work) each time. No other programs are using a significant amount of CPU time on either system at the time of the test.
The conclusion is that the thread running in the background is taking processor time away from the main thread, even though nothing else is running and there are idle cores available (on the quad at least; the dual only has two cores of course, although both machines have hyper-threading).
Is this normal? Am I missing something?
For completeness, the threads are started with
Code:
pthread_create(&thread[n].pt, NULL, thread_func, &thread[n]);
Code:
static void *thread_func(void *arg)
{
    thread_loop(arg, NULL);
    return NULL;
}