Parallel search slowdown?

syzygy · Post by **syzygy** » Fri Apr 12, 2013 10:56 pm

Evert wrote:
syzygy wrote:On your Linux system you could try using i7z to check the frequencies at which the cores are running.
Good point.

They idle at 1600 MHz (using the "ondemand" governor) but switch to about 3700 MHz once the program starts. Then there is some variation on the level of 3650-3700 MHz until they clock back down to 1600 once the program finishes.

Just to be sure: this clock frequency is independent of whether you run with 1 thread or 2 threads, right? (Except that with 2 threads you see 2 cores at about 3700 Mhz and with 1 thread only 1 core.) Then turboboost is not the culprit and you probably have a bug somewhere (for lack of a better explanation).

Pulling the move from the move list is done under a mutex, but it wouldn't have to be if "my_move_number = split_point->current_move_number++" would be an atomic operation, but I don't think it is.

With gcc you can use the __sync_fetch_and_add() builtin for this. Alternatively, use a spinlock instead of a mutex.

From your initial post:

This function checks (behind a mutex) if there are threads that are marked as WAITING.

Maybe you do this already, but it is more efficient to first check for idle threads (e.g. by checking an idle thread counter) and only if there are, take the mutex and check again.

jdart · Post by **jdart** » Fri Apr 12, 2013 11:05 pm

Your implementation looks reasonable at first glance.

The other thing that can hurt multi-threaded performance is memory contention. You do not want two threads accessing the same global memory to the extent possible, or memory structures that lie in the same cache line for the processor (the hashtable is an unavoidable exception to this). But this won't totally kill your performance at least on 2-4 threads, it will just make scaling worse.

Really, the best thing is to measure where the bottleneck is. If you have an Intel box (not AMD) then Intel's VTune Analyzer is a great tool for this. It is free for non-commercial use on Linux. You can get a limited-time free trial version for Windows. That will tell you pretty rapidly if you are waiting too often on a mutex, or have memory issues, or whatever.

Also you need to realize that once you have multiple threads then the time to complete a search is going to vary somewhat randomly. So if you are using this to measure progress you need to average over multiple invocations.

jdart · Post by **jdart** » Fri Apr 12, 2013 11:09 pm

> In other words, it busy-waits until the thread is assigned work.

You should really use a synchronization object and wake up the thread when it is assigned work.

Also a sophisticated implementation will use the main thread to aid slave threads once the main thread's work is finished, although that will not give you a large amount of extra performance.

--Jon

Evert · Post by **Evert** » Sat Apr 13, 2013 10:32 am

syzygy wrote: Just to be sure: this clock frequency is independent of whether you run with 1 thread or 2 threads, right? (Except that with 2 threads you see 2 cores at about 3700 Mhz and with 1 thread only 1 core.) Then turboboost is not the culprit and you probably have a bug somewhere (for lack of a better explanation).

Correct.

With gcc you can use the __sync_fetch_and_add() builtin for this. Alternatively, use a spinlock instead of a mutex.

Thanks! I'll give that a shot.

Maybe you do this already, but it is more efficient to first check for idle threads (e.g. by checking an idle thread counter) and only if there are, take the mutex and check again.

I actually don't do that, but I'll make a note to do that. I suspect it's not what's causing the problem now though...

Evert · Post by **Evert** » Sat Apr 13, 2013 10:37 am

jdart wrote:Your implementation looks reasonable at first glance.

Damn. I was hoping for something obvious...

Really, the best thing is to measure where the bottleneck is. If you have an Intel box (not AMD) then Intel's VTune Analyzer is a great tool for this. It is free for non-commercial use on Linux.

I can't seem to find a download link for the non-commercial version on the Intel developer website. I had some trouble finding the link for their free non-commercial Linux compiler a while back too, so I may just need to look a bit better...

How much better is it than compiling with gcc -p -pg?

Also you need to realize that once you have multiple threads then the time to complete a search is going to vary somewhat randomly. So if you are using this to measure progress you need to average over multiple invocations.

Sure. Actually the search time for single-threaded search varies slightly as well (much less so, obviously).

EDIT: by the way, I tried using spinlocks instead of mutexes (in Linux, pthreads on OS X doesn't implement spinlocks) but it didn't make any difference. I also read that mutexes may behave first like spinlocks on modern systems before actually sleeping the thread, which may be why...

Rein Halbersma · Post by **Rein Halbersma** » Sat Apr 13, 2013 10:43 am

Evert wrote: I can't seem to find a download link for the non-commercial version on the Intel developer website. I had some trouble finding the link for their free non-commercial Linux compiler a while back too, so I may just need to look a bit better...

http://software.intel.com/en-us/non-com ... evelopment

Evert · Post by **Evert** » Sat Apr 13, 2013 10:51 am

Thanks!
For some reason I couldn't find it after being logged in (maybe because the system assumes that because I have an academic licence for their FORTRAN compiler I'll want one for their other products as well?).

jdart · Post by **jdart** » Sun Apr 14, 2013 1:34 am

How much better is it than compiling with gcc -p -pg?

It can do what gnu profiling does but does more also. It makes heavy use of Intel-specific performance counters built into their chips. For best results you have to run it on a Sandy Bridge or Ivy Bridge CPU because some of the counters are not there in earlier versions.

--Jon

bob · Post by **bob** » Sun Apr 14, 2013 7:24 am

Evert wrote:
syzygy wrote: Just to be sure: this clock frequency is independent of whether you run with 1 thread or 2 threads, right? (Except that with 2 threads you see 2 cores at about 3700 Mhz and with 1 thread only 1 core.) Then turboboost is not the culprit and you probably have a bug somewhere (for lack of a better explanation).
Correct.

With gcc you can use the __sync_fetch_and_add() builtin for this. Alternatively, use a spinlock instead of a mutex.
Thanks! I'll give that a shot.

Maybe you do this already, but it is more efficient to first check for idle threads (e.g. by checking an idle thread counter) and only if there are, take the mutex and check again.
I actually don't do that, but I'll make a note to do that. I suspect it's not what's causing the problem now though...

MUTEXes like to block and introduce lots of wait/idle time. A spin lock (you can borrow the code from Crafty, I borrowed it from the Linux kernel) will solve that cleanly, if that is an issue. The more times you acquire a lock, the worse the posix thread mutexes perform.

Evert · Post by **Evert** » Thu Apr 18, 2013 3:29 pm

Ok, one mystery solved.

I tried spinlocks in place of mutexes, but it didn't seem to make any difference. I looked at the output from the vtune analyser, which didn't make things any clearer.

Then I finally found a bug (I'd like to say the bug, but I can't just yet): when I call the split function I make a copy of the move list at the split point and all workers then pull moves from this list to search. Once the move list is empty the split function returns the score to the alpha-beta search, which either finalises the search (update PV, killer tables and TT) or continues the search if the split was unsuccessful. It does this test by checking if there are moves left in the move list. However, since I copied the move list when entering the split point, it doesn't "see" that any moves have been searched at all and it starts searching them all again.

Replacing the copy of the move list by just passing around a pointer resolves that and I now get a neat speed-up by doing a parallel search. The bad news is that the score that is backed up is completely bogus...

Still, it's progress (of sorts).

Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?

Re: Parallel search slowdown?