Parallel search slowdown?

Evert · Post by **Evert** » Fri Apr 12, 2013 5:55 pm

I decided to try to add a parallel search to Jazz, but I'm running into a few issues (it's ok, I didn't expect it to work immediately).

I tried to work out how it "should" work first, then I checked my understanding against the implementations in Viper and Stockfish, so it's possible there are some peculiarities that I missed. I'll give a quick overview of what I've implemented, but the first issue I need some help with/clarification on is more general and is at the end (so feel free to skip ahead to that point).

At startup, I launch N-1 threads (using pthreads) so I have N threads in total (including the main thread). The extra threads have a status flag, which is initialised to "WAITING". During the search, after the first move has been searched, I call a "split" function (if constraints on remaining depth are satisfied), passing in the current position, move history, search bounds, move list, remaining depth, ply number. This function checks (behind a mutex) if there are threads that are marked as WAITING. If there are the function first copies data about the position to a "split point" structure, it records what threads are available to do work, records those in the split point, and sets their status to SCHEDULED. The mutex is released, the scheduled threads are switched to WORKING and the thread that entered the split function joins the parallel search at the split point.

In the parallel search function, threads pull a move from the move list (under a mutex), make the move and call the normal search function. Search bounds and best move are updated at the split point under a mutex. Once a thread is done with a split point (either because of a cut-off or because all moves have been searched) it unregisters from the split point and changes its status to WAITING.

Once all threads have unregistered from the split point, the master thread returns with the results from the parallel search. The normal search then finishes the thread, taking care of TT and killer updates.

It works properly in the sense that it runs (in debug mode) without triggering any assertions, meaning the board and move lists are passed correctly and if I track nodes searched the different threads all pick up work as they should. However, it does not work properly in the sense that the search is slower than the single-threaded search.

The way I measure the performance is simply to run the benchmark command with a given number of active threads. I then look at the total time spent and/or NPS reported. The node count changes, of course (it goes up by ~10% compared to the single threaded search). The time to complete the benchmark search increases by ~25-30% in the multi-threaded case. I can make it break even (roughly) if I set the maximum split depth large enough (>8-10 ply from the horizon) and let the program search deep enough, but I haven't been able to do any better so far.

Looking at a profile report suggests that the problem is in the idle loop, which looks like this:

Code: Select all

static void thread_loop(thread_t *self, split_t *sp)
{
   while (!self->must_die) {
      if (self->state == THREAD_SLEEPING) {
         assert(self->thread_id != 0);
         pthread_mutex_lock(&sleep_mutex);
         pthread_cond_wait(&sleep_condition, &sleep_mutex);
         pthread_mutex_unlock(&sleep_mutex);
         self->state = THREAD_WAITING;
      }

      /* Wait for work */
      //while (!sp && self->state == THREAD_WAITING && ! self->must_die);

      /* If we're scheduled to do work, start working */
      //while (self->state == THREAD_SCHEDULED);

      if (self->state == THREAD_WORKING) {

         /* Call SMP search */
         search_split(self->game, self->splitpoint);

         /* Mark thread as waiting for work. */
         self->state = THREAD_WAITING;
      }

      /* If the current thread is the master of a split point, wait for all helpers
       * to finish, then return.
       */
      if (sp) {
         //while (sp->workers);
         if (sp->workers == 0)
            return;
      }
   }
}

In other words, it busy-waits until the thread is assigned work.

To get a better feel for what is going on, I commented out the split call in the main search, so the code never enters the parallel search. I restricted the parallel search to 1 extra thread (so 2 threads in total). With this setup the parallel thread is just launched and then busy-waits until the program exits. The thing is: if I run the benchmark test the code is still 25-30% slower than if no threads are running in the background.

I've verified this on two test systems: a dual-core i7 MacBook (running OS X) and a quad-core i7 desktop (running Linux), running one extra thread (which is never called on to do any work) each time. No other programs are using a significant amount of CPU time on either system at the time of the test.

The conclusion is that the thread running in the background is taking away processor time from the main thread, even if nothing else is running and there are idle cores available (on the quad, the dual only has two cores of course, although both have hyper-threading).

Is this normal? Am I missing something?

For completeness, the threads are stared with

Code: Select all

pthread_create(&thread[n].pt, NULL, thread_func, &thread[n]

where

Code: Select all

static void *thread_func(void *arg)
{
   thread_loop(arg, NULL);
   return NULL;
}

bob · Post by **bob** » Fri Apr 12, 2013 6:57 pm

One possible quirk that I don't remember specific details about.

Some systems only use ONE process, and it interleaves execution of the created threads, using that one process. It will NEVER run faster for that reason. There is a parameter that you have set to NULL, the thread attribute value. You can set various attributes (I use this in Crafty, which runs just fine on my macbook dual-core i7)...

Declaration of the attributes variable:
data.c: pthread_attr_t attributes;

initialize the attributes variable and set detached mode:
init.c: pthread_attr_init(&attributes);
init.c: pthread_attr_setdetachstate(&attributes, PTHREAD_CREATE_DETACHED);

create thread:
iterate.c: pthread_create(&pt, &attributes, ThreadInit, (void *) proc);

To verify you have a problem related to this, on your macbook i7, start top in one terminal window. Run your program in another window, no threads. You should see a 25% load average (since the i7 has hyper threading it appears as 4 cpus to top). Now kill that and re-start, but set it to use 2 threads and start it searching. Now the load average should quickly reach 50%. If it sticks at 25% you are running one real process which won't help you. The code I gave should fix this.

Evert · Post by **Evert** » Fri Apr 12, 2013 7:52 pm

Ok, I tried that, but unfortunately it doesn't seem to make any difference... single threaded the CPU use is reported as 100%, with two threads it's reported as 200% in both cases (with the background thread idling in the background).

I'll keep the code anyway (and maybe experiment a bit with other possible options).

bob · Post by **bob** » Fri Apr 12, 2013 8:32 pm

Evert wrote:Ok, I tried that, but unfortunately it doesn't seem to make any difference... single threaded the CPU use is reported as 100%, with two threads it's reported as 200% in both cases (with the background thread idling in the background).

I'll keep the code anyway (and maybe experiment a bit with other possible options).

Sounds like the second thread never does anything useful. You might try printing out a message each time a thread actually starts to search, just to make sure that is happening. The way you are doing it at present, this doesn't happen very many times during an iteration so the output will not be overwhelming.

syzygy · Post by **syzygy** » Fri Apr 12, 2013 9:05 pm

On your Linux system you could try using i7z to check the frequencies at which the cores are running.

bob · Post by **bob** » Fri Apr 12, 2013 9:11 pm

syzygy wrote:On your Linux system you could try using i7z to check the frequencies at which the cores are running.

Ugh. Good catch that has burned me previously.

Using one core, it will overclock (turbo-boost). Using two it generally will not. I've always disabled turboboost when possible (not possible on a mac it seems). That will certainly screw up parallel search timing analysis.

Evert · Post by **Evert** » Fri Apr 12, 2013 9:43 pm

bob wrote: Sounds like the second thread never does anything useful. You might try printing out a message each time a thread actually starts to search, just to make sure that is happening. The way you are doing it at present, this doesn't happen very many times during an iteration so the output will not be overwhelming.

Well, in that test it wasn't supposed to do anything useful, that was just the overhead of having the second thread running in the background.

However, doing actual work and logging what happens, I get the following:

Code: Select all

Started 2 threads
Creating split point, ply = 0 depth = 5 [a,b]=[-13, 28]
Scheduling work for thread 0
> 0 0
Scheduling work for thread 1
> 1 0
Starting threads
< 0 0
< 1 0
Entering loop (#0)
Thread 1 accepted work
Thread 0 accepted work
Thread 1 contributed 19 nodes
Thread 0 contributed 0 nodes
Exit loop (#0)
Joined split point, ply = 0 depth = 5 [a,b]=[-13, 28]
Creating split point, ply = 1 depth = 5 [a,b]=[18, 63]
Scheduling work for thread 0
> 0 0
Scheduling work for thread 1
> 1 0
Starting threads
< 0 0
< 1 0
Entering loop (#0)
Thread 1 accepted work
Thread 0 accepted work
Thread 1 contributed 26 nodes
Thread 0 contributed 0 nodes
Exit loop (#0)
Joined split point, ply = 1 depth = 5 [a,b]=[18, 63]
Creating split point, ply = 0 depth = 6 [a,b]=[-18, 37]
Scheduling work for thread 0
> 0 0
etc.

The interesting thing here is that thread 0 (the main thread) apparently doesn't do anything at all! This isn't completely true since there are points where it does contribute some work, but it's always much less than thread 1. Looks like I should take another look at my split_search()...

Rein Halbersma · Post by **Rein Halbersma** » Fri Apr 12, 2013 9:57 pm

I have little to no experience with parallel search myself, but I'm just curious to the large number of mutex calls in your description. Is it really necessary to pull a move under a mutex, since you already copied the position and history and other stuff? Why not copy the successor position directly? And do other programs also communicate an updated alpha to sibling nodes or do they only broadcast a beta cutoff?

Perhaps there is a lot of locking overhead. To test this, I would be curious to know the speed penalty of a single threaded parallel search (that does no split point copying, but that does do all the mutex locking) compared to the serial search.

Evert · Post by **Evert** » Fri Apr 12, 2013 9:59 pm

syzygy wrote:On your Linux system you could try using i7z to check the frequencies at which the cores are running.

Good point.

They idle at 1600 MHz (using the "ondemand" governor) but switch to about 3700 MHz once the program starts. Then there is some variation on the level of 3650-3700 MHz until they clock back down to 1600 once the program finishes.

Evert · Post by **Evert** » Fri Apr 12, 2013 10:22 pm

Rein Halbersma wrote:I have little to no experience with parallel search myself, but I'm just curious to the large number of mutex calls in your description. Is it really necessary to pull a move under a mutex, since you already copied the position and history and other stuff?

Well, there are two places where mutexes (mutices?) are used: there is one mutex in the split() function for when the split point is created and threads are assigned; I don't see a way around that since manipulating the split point stack and assigning the threads should not be interrupted by another thread coming in. The second mutex is specific for each split point and is used whenever the splitpoint gets updated.

I did it this way at least partly because that's what Viper and Stockfish (at least the old version I have here) do, so there may be other ways to do it. I don't really see a way to avoid it though (if I did I wouldn't have done it).

Pulling the move from the move list is done under a mutex, but it wouldn't have to be if "my_move_number = split_point->current_move_number++" would be an atomic operation, but I don't think it is. It wouldn't be an issue if the worst that can happen is that two threads search the same move (since the hash table will help), but if the current move is incremented twice without being read in between a move will be skipped in the search, which could be a problem (perhaps not in practice? It'd be a form of random pruning, but most moves are bad anyway and it may be rare enough that it doesn't matter too much).

Why not copy the successor position directly? And do other programs also communicate an updated alpha to sibling nodes or do they only broadcast a beta cutoff?

Since we're past the first move, all moves are searched with a null-window, so updated alpha = beta cutoff. I don't think it's necessary to update alpha under a mutex though.

Perhaps there is a lot of locking overhead. To test this, I would be curious to know the speed penalty of a single threaded parallel search (that does no split point copying, but that does do all the mutex locking) compared to the serial search.

I'll try to reduce the amount of locking and see what happens. As I said though, the profiler suggests that most time is wasted in the idle loop, not in the split_search. Perhaps I'm not interpreting that correctly (ie, perhaps what's happening is that due to the locking I never do anything useful in the split_search and waste all my time in the idle loop).

Parallel search slowdown?

Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?