Lazy SMP, part 2

dchoman · Post by **dchoman** » Sat Jan 12, 2013 6:16 pm

I was motivated by the previous "Lazy SMP" thread to start working on an smp version of EXchess. So far, I've implemented the simplest possible smp search... just parallel threads searching the same alpha-beta window from the root position and communicating only through the various hash tables. I had understood that this approach was not very effective, but I thought it was a good place to start. To my surprise, it worked much better than I expected with very little effort.

Does anyone have experience with this simple form of SMP? If so, what kind of results did you get? My early results are below, and I would be interested to know if they are atypical.

Below are some preliminary results for elo improvement for 1 thread, 2 threads, and 4 threads versus a gauntlet of 10 opponents. For comparison, I've included the 1 thread version run with twice the search time to give a baseline for expected improvement for an exact doubling of search speed. So they are named as follows:

vsmp1 through vsmp 4 --> 1 through 4 threads respectively
vsmp1x2 --> 1 thread version run with twice search time

Code: Select all

Rank Name              Elo    +    - games score oppo. draws 
   1 EXchess vsmp4     140   16   15  1456   69%     1   22% 
   3 EXchess vsmp1x2   105   18   18  1000   64%    -1   24% 
   5 EXchess vsmp2      70   18   18  1000   60%    -1   24% 
   7 EXchess vsmp1       0    7    7  6900   50%    -1   24%

So far, I haven't run that many games, and I'll do that once I've refined the implementation a little more, but I was shocked at the amount of improvement for just hash-table communication between threads! Clearly 2 cores with this technique is not as good as a pure speed doubling, but given that a full-blown smp implementation might give 1.8 or 1.9 speed increase, I was pretty happy with this simple approach.

One additional fact that surprised me is that the time-to-depth improvement was not nearly as good as the actual playing results. Below are the time-to-depth results for various numbers of cores.

Code: Select all

Results for time to complete iteration depth 12 for 317 positions
------------------------------------------------------------------
EXchess_vsmp1: Total NPS = 845331  <depth> = 12 <time to depth> = 0.305s
EXchess_vsmp2: Total NPS = 1637837 <depth> = 12 <time to depth> = 0.233s
EXchess_vsmp4: Total NPS = 3037585 <depth> = 12 <time to depth> = 0.185s

So even 4 cores does not give even a factor of 2 improvement in time to depth, but 4 cores with this technique clearly plays better than a pure x2 search speed improvement (see playing results above). This seems contradictory, and I am trying to understand it.

A couple of ideas that might explain this...
(1) My implementation actually has half of the threads searching at the specified iteration depth, and the other half search at iteration depth+1, so the average depths in the above table are an under-estimate of the quality of the search for the smp2 and smp4 cases which have some threads returning hash table results from 1 ply deeper in some positions.

(2) EXchess prunes extensively, and this is influenced to a great extent by the history table, reply table, and killer moves. All of these things are independent between the threads, so different threads can, in principle, prune quite differently. The results are communicated back through the various hash tables, including the combination move hash table that I described in a post a few months ago.... so perhaps I am seeing some additional benefit from the threads searching different moves and thus being less likely for all of them to prune a key move from a position?

- Dan

Michel · Post by **Michel** » Sat Jan 12, 2013 7:52 pm

Does anyone have experience with this simple form of SMP? If so, what kind of results did you get? My early results are below, and I would be interested to know if they are atypical.

In Toga the speedup is about 40 Elo/CPU doubling (at the CCRL 40/40).

dchoman · Post by **dchoman** » Sat Jan 12, 2013 9:44 pm

Michel wrote:
Does anyone have experience with this simple form of SMP? If so, what kind of results did you get? My early results are below, and I would be interested to know if they are atypical.
In Toga the speedup is about 40 Elo/CPU doubling (at the CCRL 40/40).

Thanks! Looking at this list, it looks like several top programs gain a similar amount from their SMP implementations (about 80-100 elo going from single cpu to 4 cpu, so about 40-50 per cpu doubling), so perhaps this simple approach is not as bad as I had previously thought.

By the way, the elo gains I report in my post above are for ultra-fast bullet... 6s+0.1s per game.

Sven · Post by **Sven** » Sat Jan 12, 2013 11:43 pm

dchoman wrote:One additional fact that surprised me is that the time-to-depth improvement was not nearly as good as the actual playing results. Below are the time-to-depth results for various numbers of cores.
Code: Select all
Results for time to complete iteration depth 12 for 317 positions
------------------------------------------------------------------
EXchess_vsmp1: Total NPS = 845331  <depth> = 12 <time to depth> = 0.305s
EXchess_vsmp2: Total NPS = 1637837 <depth> = 12 <time to depth> = 0.233s
EXchess_vsmp4: Total NPS = 3037585 <depth> = 12 <time to depth> = 0.185s
So even 4 cores does not give even a factor of 2 improvement in time to depth, but 4 cores with this technique clearly plays better than a pure x2 search speed improvement (see playing results above). This seems contradictory, and I am trying to understand it.

When trying to understand your problem, the first thing I did was to explain to myself why your technique with 4 cores gives less than factor 2 improvement in <time to depth>, without considering playing strength at all. Here is my result:

<time to depth> = <total nodes searched> / <total NPS>.
Nodes searched to finish depth 12:

Code: Select all

EXchess_vsmp1: Total nodes = ( 845331 * 0.305) = 257826 (100%)
EXchess_vsmp2: Total nodes = (1637837 * 0.233) = 381616 (148%)
EXchess_vsmp4: Total nodes = (3037585 * 0.185) = 561953 (218%)

"vsmp1x2" will search twice as many nodes with the same NPS when given twice as much time, which implies that your 1 thread version with double NPS would roughly have the following <time to depth> value:

Code: Select all

EXchess vsmp1x2: Total NPS = ( 845331 * 2) = 1690662 <depth> = 12 <time to depth> = (0.305s / 2) = 0.1525s

Now to address the "contradictory" part (which is what you mainly asked for) I would like to understand more about your first point:

dchoman wrote:(1) My implementation actually has half of the threads searching at the specified iteration depth, and the other half search at iteration depth+1, so the average depths in the above table are an under-estimate of the quality of the search for the smp2 and smp4 cases which have some threads returning hash table results from 1 ply deeper in some positions.

Can you explain what that means? My first guess was that 50% of your threads search until they finish depth 12 while the other 50% stop only after finishing depth 13. But this does not sound plausible for me since the 50% "depth 12" threads would stop working at some time while the other 50% would continue. So what is it that you are actually doing?

I am asking since I somehow got the impression that your <time to depth> calculation might be incorrect for your smp2/smp4 versions if they do not stop searching after having reached depth 12. If this is the case then you also have an explanation for the drastically increasing number of nodes needed to "finish depth 12", as shown in my table above.

Sven

dchoman · Post by **dchoman** » Sun Jan 13, 2013 1:49 am

dchoman wrote:(1) My implementation actually has half of the threads searching at the specified iteration depth, and the other half search at iteration depth+1, so the average depths in the above table are an under-estimate of the quality of the search for the smp2 and smp4 cases which have some threads returning hash table results from 1 ply deeper in some positions.
Can you explain what that means? My first guess was that 50% of your threads search until they finish depth 12 while the other 50% stop only after finishing depth 13. But this does not sound plausible for me since the 50% "depth 12" threads would stop working at some time while the other 50% would continue. So what is it that you are actually doing?

I am asking since I somehow got the impression that your <time to depth> calculation might be incorrect for your smp2/smp4 versions if they do not stop searching after having reached depth 12. If this is the case then you also have an explanation for the drastically increasing number of nodes needed to "finish depth 12", as shown in my table above.

Sven

Good question. The way I have implemented this, I am only concerned with the result returned by the main search thread.

Let's take a 2 thread example. If the main search thread launches to search iteration depth=12 with alpha=-MATE and beta=+MATE, then before beginning that search, it signals the second thread to also do exactly the same search but with iteration depth=13. From that point forward, the first thread ignores the second thread, except for the information the second thread is putting into the various hash tables. If the second thread happened to complete before the first thread, the outcome of that search would also be ignored (I can obviously improve on that!! but at the moment, this is what I am doing). However, it is likely the second thread will still be running when the first thread completes. When the first thread does complete, it signals the second thread to also terminate and I then consider iteration 12 to be done.

So the time-to-depth statistic I collected is the time required for the first thread to complete iteration 12. The other threads likely did not complete their iterations, although they may have. In the 4 thread case, I have the third thread also search iteration 12, rather than 13, but the overall idea is still the same... I am currently ignoring any returned results from threads other than the main one if they should happen to complete before it.

One of my planned improvements is to make use of returned results from threads that complete their iterations early.

- Dan

jd1 · Post by **jd1** » Sun Jan 13, 2013 6:51 am

Hi,

That sounds like a simple way to improve a shared hash smp implementation. I will try it in Toga, it is simple to implement. One question, how do you ensure the alpha, beta window for the "depth+1" thread is the same as the main thread? I was thinking that if you rely on the previous iteration's best value for the window this may not work?

Thanks,
Jerry

dchoman · Post by **dchoman** » Sun Jan 13, 2013 1:17 pm

jd1 wrote:Hi,

That sounds like a simple way to improve a shared hash smp implementation. I will try it in Toga, it is simple to implement. One question, how do you ensure the alpha, beta window for the "depth+1" thread is the same as the main thread? I was thinking that if you rely on the previous iteration's best value for the window this may not work?

Thanks,
Jerry

I just force alpha and beta to be the same value at the root for all threads, and then I let them go and don't worry about it until that iteration is complete. At the start of the next iteration, I set them at the root again. The additional depth in some of the threads noticeably speeds up the main thread by 10% or so. There is probably a more optimum configuration for how to assign depths to the other threads, but I haven't had enough time to explore all the options.

- Dan

Joerg Oster · Post by **Joerg Oster** » Sun Jan 13, 2013 7:13 pm

Just out of curiosity, I did a similar test with Stockfish. Same tc, 3 different opponents.

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 SF-1T-TCx2     78   18   18   699   65%   -45   32% 
   3 SF-4T          74   18   18   699   65%   -45   35% 
   5 SF-1T         -17   18   17   699   53%   -45   32%

SF-1T: Single-Thread
SF-4T: Four-Threads
SF-1T-TCx2: Single version with doubled time

About 80-90 Elo gain for the four-thread version over single-thread version.
Too less games, I know, but it already gives a hint.

I hope you/others find this interesting.

dchoman · Post by **dchoman** » Sun Jan 13, 2013 8:48 pm

Thanks! This is very interesting! I would have expected a significant gain for 4 processors relative to 1 processor at TCx2 search. Not sure what SMP implementation Stockfish is using, but I assume it is state-of-the-art. Perhaps the simple SMP version I am using at the moment is good enough? Not sure if that would be true for more processors or non-bullet time controls.

- Dan

jd1 · Post by **jd1** » Mon Jan 14, 2013 4:59 am

Thanks Dan.

Just to make sure I understand your method correctly, at every iteration you start the even threads at depth+1, and stop them when the iteration finishes.

This is then quite different to the usual shared hash table smp implementation, which launches x threads at the beginning of the search, as http://chessprogramming.wikispaces.com/ ... sh%20Table

I am testing right now.

Jerry

Lazy SMP, part 2

Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2