Crazy SMP

Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Crazy SMP

Post by Joost Buijs »

hgm wrote:Indeed, this puzzled me too. But I figured it might switch to 34x when you use all 8 hyper threads,

Anyway, the speed differences are not shocking, and the drop from 3,000 to 2,600 knps when going from 2 to 4 cores must be mainly due to other factors. 4 x 2.6 = 10.4, so in principle I should still be able to get a 3x speedup with 4 cores. If I manage to do that before the Olympiad starts next Monday, I would be pretty happy.
Hopefully you manage to get it working before the Olympiad.

Maybe I will drop by one day to see what is going on there in Leiden.
Since I have to travel by public transportation, it takes me at least 1.5 hours to get there (when there are no delays), which holds me back a little.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crazy SMP

Post by bob »

hgm wrote:Thanks for the feedback, everyone. I gather the NPS slowdown is not normal, so I looked a bit into that. The laptop is an Intel i3, which is not supposed to have turbo boost. But I know that there are chipsets for laptops that throttle down memory access (because memory is often poorly cooled in laptops). I use no locks (and did not even bother to validate TT probes through lockless hashing).

When I repeated the 1-core search while a second, independent version of the engine was analyzing, the NPS suffered 14-16% on the laptop! So I took the .exe to my i7 machine (which is also not supposed to have turbo boost), and I got the following results:

Code:

depth 1-core 1-core 2-core 4-core
             loaded
 6        12     12      7      6
 7        20     20     10     11
 8        34     34     18     17
 9        67     65     37     29
10       131    132     90     62
11       354    352    205    146
12      1165   1156    692    427
13      2252   2248   2042   1633
14      5099   5174   8484   7124
15     16136  16559  23551  18495

NPS(12-ply)   2,939  3,034  2,600
Now there is basically no difference between the 1-core NPS and the NPS when I run a second instance of the engine ('loaded'), nor when I switch to 2-core SMP. Because of that the speedup looks much nicer: the 2-core searches take about 60% of the time of a 1-core search. This seems to agree with what I have heard from others for this simple unsynchronized approach. (At depth > 12 the PVs are completely different, so I am not sure how to compare the numbers there in a sensible way.)

Of course this unsynchronized approach scales very poorly: when I try 4 cores, the speedup compared to 2 cores is much smaller than when going from 1 to 2 cores. This is partly due to a 13% drop in NPS (per core). I had noticed such an NPS drop before when I ran more than 2 independent games in parallel, so I suppose there is not much that can be done about it on the hardware I have. If it is due to the memory bottleneck, not probing the TT in QS might prevent it, but on a single thread that hurt time-to-depth as well. Perhaps a small in-cache private TT for QS only could be a compromise.
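
Such a QS-only table could be as simple as the sketch below (names and sizes invented for illustration): a per-process direct-mapped table small enough to stay in the CPU caches, so that QS probes never touch the shared TT or the memory bus.

Code:

#include <stdint.h>

#define QS_TT_BITS 12                /* 4096 entries x 16 bytes = 64 KB */

typedef struct {
    uint64_t key;                    /* position hash                   */
    int16_t  score;
    uint8_t  flag;                   /* bound type                      */
    uint8_t  pad[5];                 /* pad the entry to 16 bytes       */
} QSEntry;

/* One table per process, not shared: probing it causes no
   cache-line traffic between the cores. */
static QSEntry qsTT[1 << QS_TT_BITS];

static inline QSEntry *QsProbe(uint64_t key)
{
    return &qsTT[key & ((1 << QS_TT_BITS) - 1)];
}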

I already have a lot of patches to better 'synchronize' the search processes, and these did speed up the 2-core search in earlier tests. But at that time it was all for naught, because I could not get the basics right, and the unsynchronized 2-core search slowed things down by nearly a factor of 2, which even the smartest algorithm I could think of couldn't earn back. (And I did not have more than 2 cores at the time.) So I will now start to apply those, to see how much I can improve the scaling.

They basically work by labeling the hash entries as 'busy' when the search enters the node, rather than waiting with the update until the search on it has completed. When the search encounters a 'busy' node that is not a 'must have' in the parent (i.e. an early move that is likely to be a cut move), it immediately returns with an invalid result, so that the parent can postpone the search of that move until after it has searched another move. Only when there is nothing else left to search do the remaining moves become 'must haves', and the search will try them again, to help in their search.
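
To make that concrete, a minimal sketch of such a busy-flag scheme (helper names and the INVALID sentinel are invented for illustration, not the engine's actual code; negamax sign handling is assumed to live inside SearchMove()):

Code:

#include <stdint.h>

#define INVALID 32767     /* sentinel: "this node is busy elsewhere" */

typedef struct {
    uint64_t key;
    int16_t  score;
    int8_t   depth;
    uint8_t  busy;        /* set while some process searches this node */
} TTEntry;

/* Assumed helpers. TTProbe() always returns a slot; SearchMove()
   returns INVALID when the child node is busy and not a must-have. */
extern TTEntry *TTProbe(uint64_t key);
extern void     TTStore(TTEntry *e, int score, int depth); /* clears busy */
extern int      GenMoves(uint64_t key, int *moves);
extern int      SearchMove(uint64_t key, int move, int alpha, int beta,
                           int depth, int mustHave);

int Search(uint64_t key, int alpha, int beta, int depth)
{
    int moves[256], deferred[256], nDef = 0;
    TTEntry *e = TTProbe(key);

    e->busy = 1;                        /* label the node busy on entry */

    int n = GenMoves(key, moves), best = -INVALID;

    /* First pass: a move whose node is already being searched elsewhere
       comes back as INVALID, and is deferred instead of searched twice. */
    for (int i = 0; i < n; i++) {
        int s = SearchMove(key, moves[i], alpha, beta, depth, 0);
        if (s == INVALID) { deferred[nDef++] = moves[i]; continue; }
        if (s > best) best = s;
        if (best > alpha) alpha = best;
        if (alpha >= beta) goto done;   /* cutoff: deferred moves unneeded */
    }

    /* Second pass: nothing else to do, so the deferred moves become
       'must haves' and we enter their sub-trees to help search them. */
    for (int i = 0; i < nDef && alpha < beta; i++) {
        int s = SearchMove(key, deferred[i], alpha, beta, depth, 1);
        if (s > best) best = s;
        if (best > alpha) alpha = best;
    }

done:
    TTStore(e, best, depth);            /* publish result, clear busy flag */
    return best;
}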

Another inconvenience is that this SMP code is Windows-specific, while in the meantime I have moved my development to Linux. Launching the slave processes is simple enough on Linux (through pipe(2) and fork(2)), but I would have to figure out how I can share memory between processes on Linux to make it work there as well.
If you use fork(), nothing is shared, and you have to resort to the System V stuff like shmget() (allocate a block of shared memory) and shmat() (attach a block of shared memory to the current process).
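
A minimal sketch of that pattern (the 64 MB size is just an example):

Code:

#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t size = 64 * 1024 * 1024;          /* e.g. a 64 MB TT */

    /* Allocate a System V shared-memory segment. */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    /* Attach it; the mapping survives fork(), so parent and
       child see the same physical memory. */
    char *tt = shmat(id, NULL, 0);
    if (tt == (void *)-1) { perror("shmat"); return 1; }

    /* Mark for removal; it goes away when the last process detaches. */
    shmctl(id, IPC_RMID, NULL);

    memset(tt, 0, size);

    if (fork() == 0) {                       /* slave process */
        strcpy(tt, "stored by the slave");
        _exit(0);
    }
    wait(NULL);
    printf("master sees: %s\n", tt);         /* prints the slave's write */
    return 0;
}

On Linux, mmap() with MAP_SHARED | MAP_ANONYMOUS before the fork() gives the same effect without the System V bookkeeping.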

Or you can switch to threads which will share everything but local data automatically...
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Crazy SMP

Post by hgm »

Yes, I would use fork(). (CreateProcess() in Windows API.) The program I am converting makes rather heavy use of global variables, and it would be a lot of work to put all of those in arrays indexed by thread number. By making them separate processes I can be sure nothing is shared that I don't want shared.

What I did is make a pipeline of processes, created by the 'cores N' command: when N > 1 it forks off a daughter, and sends it 'cores N-1'. All commands not having to do with timing (level, time, otim), pondering (easy, hard) or interface control (post, nopost) are passed along the pipeline, so that they reach all processes. That includes input moves. Only the first process in the pipeline (the 'master' process) communicates with the GUI; when the master process starts thinking it sends an 'analyze' command along the pipeline, when it stops thinking, it sends 'exit'. If it moves as a result of thinking it sends that move to the GUI (of course), but also along the pipeline, to keep all slaves in sync. Apart from analyzing the slaves are always kept in force mode, so commands that would make them play black or white are either not relayed at all (such as 'go') or relayed, followed by 'force' (such as 'new').
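
The forking end of that is quite small. A rough sketch of the 'cores N' handler (invented names, error handling omitted; the real command loop would call Relay() for every command that must travel down the pipeline):

Code:

#include <stdio.h>
#include <unistd.h>

static FILE *toSlave;            /* pipe to the next process, if any */

void Relay(const char *line)     /* pass a command down the pipeline */
{
    if (toSlave) { fputs(line, toSlave); fflush(toSlave); }
}

void Cores(int n)                /* handler for the 'cores N' command */
{
    if (n <= 1) return;          /* last process in the pipeline */
    int fd[2];
    char buf[32];
    pipe(fd);
    if (fork() == 0) {           /* daughter process */
        dup2(fd[0], 0);          /* reads its commands from the pipe */
        close(fd[0]); close(fd[1]);
        toSlave = NULL;          /* may fork its own daughter later */
        return;                  /* back into the normal command loop */
    }
    close(fd[0]);
    toSlave = fdopen(fd[1], "w");
    snprintf(buf, sizeof buf, "cores %d\n", n - 1);
    Relay(buf);                  /* the daughter forks off the rest */
}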

That requires only a minimal modification of the protocol driver. The only other thing I have to change is the allocation of TT memory. In Windows I use CreateFileMapping(INVALID_HANDLE_VALUE, ...) for that, mapping the system's paging file. I will look into the shmget() function; thanks for the pointer.
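
For reference, that Windows allocation looks roughly like this (a sketch; the mapping name "Local\EngineTT" and the size are made up, and error handling is minimal):

Code:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD size = 64u << 20;                  /* e.g. a 64 MB TT */

    /* INVALID_HANDLE_VALUE backs the mapping by the system paging
       file instead of a real file; naming it lets slave processes
       open the very same mapping. */
    HANDLE h = CreateFileMappingA(INVALID_HANDLE_VALUE, NULL,
                                  PAGE_READWRITE, 0, size,
                                  "Local\\EngineTT");
    if (!h) { printf("CreateFileMapping: %lu\n", GetLastError()); return 1; }

    void *tt = MapViewOfFile(h, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (!tt) { printf("MapViewOfFile: %lu\n", GetLastError()); return 1; }

    /* A slave started with CreateProcess() would instead do:
       HANDLE h = OpenFileMappingA(FILE_MAP_ALL_ACCESS, FALSE,
                                   "Local\\EngineTT");
       and then MapViewOfFile() it the same way. */
    return 0;
}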
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crazy SMP

Post by bob »

Joost Buijs wrote:It seems the turbo scheme for the i7-2600 is 4,3,2,1 (x additional bus clocks ???); this translates to 3.8, 3.7, 3.6, 3.5 GHz.
Strange that there is already a 100 MHz boost with all 4 cores under load; this makes it effectively a 3500 MHz processor.
These things are a bit odd. I have a dual Xeon 2660 system with a base clock frequency of 2.6 GHz, 10 cores per chip.

Running Crafty with all 20 cores busy, it consistently runs at 2.9 GHz all day long. However, if I run code that uses AVX instructions, it throttles right back to 2.6 GHz.

And with Crafty, it runs at 2.9 GHz with 20 cores, but using just one core it ramps up to 3.3 GHz, 700 MHz above the rated clock speed. Makes it a pain to benchmark.
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Crazy SMP

Post by hgm »

I wonder how SMP schemes solve the following issue:

Now and then a search process wants to search a late move that is already being searched by another process. To avoid walking the same tree, I then postpone the search of that move and start searching another late move. In due time I will come back to the move, to pick up the result the other process found for it from the hash. Only if no other moves are left to work on will I enter the sub-tree of the move already being searched, to help in that search.

But occasionally the move that was being searched by the other process will fail high, and cause a beta cutoff. That means I have been wasting my time searching the other move, but that cannot be helped. What can be helped, however, is continuing to waste time finishing that search, only to find out afterwards (when I retrieve the score of the postponed move from the TT) that it was all in vain.

So it seems any shrinking of the search window in a node on the current branch due to activity of other search agents should be noticed as soon as possible, so that the window can be adapted to it, and I won't miss cutoffs that the original window would miss. How is this typically done? Is it a good idea to have each search process keep a stack with pointers to all hash entries of moves that it postponed? And then check all of those periodically to see if they have in the meantime received a score, deleting them from the stack when they received a score that did not affect the window, and adapting the window in the current node when they had a score that did?

Such an adjustment of window bounds could give rise to an immediate cutoff without further searching of moves, e.g. when alpha gets adjusted to above beta. So it seems I should do that before every recursive call to Search(). I suppose the stack of postponed moves never gets very large, and that when I check them before every call of Search(), the relevant hash entries would persist in the L1 cache in the shared state (as they came in that way in the original hash probe, by reading after another process had written to them to mark them 'busy'). So checking them should be pretty fast, similar to testing for repeats. Only when the process searching them finishes that search and stores a score there would the entry be yanked from my L1, and have to be transferred from the other process's cache. But sooner or later I would have to do that anyway, to fetch the score. (Unless the search I started to do instead would give me a cutoff by itself, which is unlikely for a late move.)
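
A sketch of that bookkeeping (names invented; bound types and synchronization of the entry fields are glossed over):

Code:

#include <stdint.h>

typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  busy;       /* still being searched by another process */
} TTEntry;

typedef struct {
    TTEntry *entry;      /* hash entry of a move we postponed */
    int      move;
} Postponed;

static Postponed postponed[64];
static int nPostponed;

/* Called before every recursive Search(); returns 1 when a result
   produced by another process causes an immediate cutoff here. */
int PollPostponed(int *alpha, int beta)
{
    for (int i = 0; i < nPostponed; i++) {
        TTEntry *e = postponed[i].entry;
        if (e->busy) continue;                    /* not finished yet  */
        int s = -e->score;                        /* negamax: child's
                                                     score seen from
                                                     the current node  */
        postponed[i--] = postponed[--nPostponed]; /* drop from stack   */
        if (s > *alpha) {
            *alpha = s;                           /* shrink the window */
            if (*alpha >= beta) return 1;         /* immediate cutoff  */
        }
    }
    return 0;
}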
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crazy SMP

Post by bob »

hgm wrote:I wonder how SMP schemes solve the following issue:

Now and then a search process wants to search a late move that is already being searched by another process. To avoid walking the same tree, I then postpone the search of that move and start searching another late move. In due time I will come back to the move, to pick up the result the other process found for it from the hash. Only if no other moves are left to work on will I enter the sub-tree of the move already being searched, to help in that search.

But occasionally the move that was being searched by the other process will fail high, and cause a beta cutoff. That means I have been wasting my time searching the other move, but that cannot be helped. What can be helped, however, is continuing to waste time finishing that search, only to find out afterwards (when I retrieve the score of the postponed move from the TT) that it was all in vain.

So it seems any shrinking of the search window in a node on the current branch due to activity of other search agents should be noticed as soon as possible, so that the window can be adapted to it, and I won't miss cutoffs that the original window would miss. How is this typically done? Is it a good idea to have each search process keep a stack with pointers to all hash entries of moves that it postponed? And then check all of those periodically to see if they have in the meantime received a score, deleting them from the stack when they received a score that did not affect the window, and adapting the window in the current node when they had a score that did?

Such an adjustment of window bounds could give rise to an immediate cutoff without further searching of moves, e.g. when alpha gets adjusted to above beta. So it seems I should do that before every recursive call to Search(). I suppose the stack of postponed moves never gets very large, and that when I check them before every call of Search(), the relevant hash entries would persist in the L1 cache in the shared state (as they came in that way in the original hash probe, by reading after another process had written to them to mark them 'busy'). So checking them should be pretty fast, similar to testing for repeats. Only when the process searching them finishes that search and stores a score there would the entry be yanked from my L1, and have to be transferred from the other process's cache. But sooner or later I would have to do that anyway, to fetch the score. (Unless the search I started to do instead would give me a cutoff by itself, which is unlikely for a late move.)
In general, and not in the lazy SMP way of course, I handle this directly. A fail high instantly ends everything at the current ply, whatever is happening. If it is an LMR search that fails high, the entire LMR search is aborted, we return to the same point, and re-search to normal depth, again in parallel.
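
In thread terms that is just a stop flag per split point that every helper polls at each node; a sketch (assumed names, not Crafty's actual code):

Code:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct SplitPoint {
    atomic_bool stop;             /* set by whichever thread fails high */
    int alpha, beta, depth;
} SplitPoint;

/* Assumed helpers: NextMove() hands out the next untried move (or -1
   when the list is exhausted); SearchMove() searches one move. */
extern int NextMove(SplitPoint *sp);
extern int SearchMove(SplitPoint *sp, int move);

int HelpAtSplitPoint(SplitPoint *sp)
{
    int move, best = sp->alpha;
    while ((move = NextMove(sp)) >= 0) {
        if (atomic_load(&sp->stop))        /* someone failed high:     */
            return best;                   /* abandon this ply at once */
        int score = SearchMove(sp, move);
        if (score >= sp->beta) {
            atomic_store(&sp->stop, true); /* fail high: everyone quits;
                                              an aborted reduced search
                                              is re-done at full depth  */
            return score;
        }
        if (score > best) best = score;
    }
    return best;
}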