So, it seems that instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hash table is already mostly filled out. This is not trivial to implement in SF (at least for me), but the idea seems quite powerful, especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like: create an additional split point at the root, and have the helper threads check in every X nodes or so to see if there is actual work to be done. If not, they continue searching at their parallel split point.
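In pseudo-Python, the polling idea might look like this. The names, the poll interval, and the whole driver loop are hypothetical; this is a sketch of the control flow, not anything resembling SF's actual threading code:

```python
from collections import deque

def helper_loop(real_work, total_nodes, poll_interval):
    """Instead of parking, a helper keeps searching a speculative split
    point at the root, but every `poll_interval` nodes it checks whether
    a real split point has work for it, and joins it if so."""
    spec_nodes = real_nodes = 0
    joined = None                      # real split point joined, if any
    for n in range(1, total_nodes + 1):
        if joined is not None:
            real_nodes += 1            # helping at the real split point
        else:
            spec_nodes += 1            # speculative search keeps the TT warm
            if n % poll_interval == 0 and real_work:
                joined = real_work.popleft()
    return spec_nodes, real_nodes, joined

# work appears while the helper is spinning: it switches at the next poll
spec, real, where = helper_loop(deque(["split_point_A"]), 4096, 1024)
```

The point of the poll interval is that the speculative search never blocks the real one: the helper only loses at most X nodes of latency before joining useful work.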
Another possibility is the concept of a second-best-move search. We know we are searching the PV moves very deeply, so instead, search the second-best move relatively deeply (deeper than it would have been searched in the normal search). An easy way to approximate this is to use the excluded-move feature to remove the best move from the candidates. Perhaps also search the second-best move with the PV-search pruning parameters. If the score turns out to be very bad after X plies or so, bail out and try another split point with the best move excluded.
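A toy sketch of the excluded-move idea. The move scores, the bail margin, and the helper names are all made up here; SF's real excluded-move machinery lives in its singular-extension search, and this only illustrates the shape of the logic:

```python
# made-up scores standing in for a real search of each root move
MOVE_SCORES = {"e2e4": 50, "d2d4": 42, "g1f3": 30}

def search(scores, excluded=None):
    """Toy 'search': best (move, score), optionally excluding one move,
    mirroring the excluded-move trick used to probe the second-best."""
    cands = {m: s for m, s in scores.items() if m != excluded}
    best = max(cands, key=cands.get)
    return best, cands[best]

BAIL_MARGIN = 100  # hypothetical: give up if 2nd best trails by this much

def second_best_probe(scores):
    best, best_score = search(scores)
    second, second_score = search(scores, excluded=best)  # best excluded
    if best_score - second_score > BAIL_MARGIN:
        return None   # score is super bad: bail, try another split point
    return second     # worth searching this one deeply too
```

Here `second_best_probe(MOVE_SCORES)` picks out d2d4, while a position with one crushing move returns None and the helper would move on.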
There are lots of possibilities! Implementing them is a fair bit of work, and especially the testing is non-trivial, but it seems there is a lot to be gained here for high-core-count computers.
But why not just implement lazy SMP in SF? It has been pointed out by many people already (including me at the SF forum, referring to the CCRL rating list) that there is no serious evidence that SF's current SMP implementation is better than lazy SMP.
So far, what we see is that Cheng's scaling is not much worse than SF's. It's not better either. SF gets 152 elo from 4 threads, where Cheng got 132, if I understood correctly. So SF is still a bit better, especially considering that gaining 100 elo at SF's level is not the same as gaining 100 elo at Cheng's level (e.g. the elo gain from halving time-to-depth decreases with absolute elo).
But, admittedly, lazy SMP wins the tradeoff. It's not much worse, yet the implementation is significantly simpler. If someone is ready to do the work (not just talk) and rewrite SF's SMP implementation to use Cheng's lazy SMP, and that proves to be close enough to the original, then the patch will be welcome. But unless someone does the work, it's just hot air. You can say all you want on the SF forum or here, it won't happen until someone (you?) writes the code.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mar wrote: I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy SMP wouldn't suffer from 2 limitations of the current implementation:
1. it takes some time to fully kick in because of the min split depth parameter
2. not enough workload, especially for threads >= 8, due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with Stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
I must confess I have not really understood the time control of the tests. Is it 1s/game or 1s/move? Note that the CCRL uses a much longer time control either way (but the error bars are much larger also). So I don't think anything has been proved.
But unless someone does the work, it's just hot air. You can say all you want on the SF forum or here, it won't happen until someone (you?) writes the code.
Well, I won't write code for SF. You know perfectly well why!
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Rank Name      Elo    +    -  games  score  oppo.  draws
   1 sf6_8c      8    4    3  11154    53%     -8    63%
   2 sf6_4c     -8    3    4  11154    47%      8    63%

Rank Name      Elo    +    -  games  score  oppo.  draws
   1 sf6_8c      9    7    8   1465    53%     -9    63%
   2 sf6_16c    -9    8    7   1465    47%      9    63%
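As a sanity check on the numbers above, the standard logistic elo model relates a rating gap to an expected score (the testing tool's draw model may scale things slightly differently, which is why the table's 53% doesn't match exactly):

```python
import math

def expected_score(elo_diff):
    """Logistic elo model: expected score for the side `elo_diff` stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse: elo difference implied by a match score (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# the 8c-vs-4c table above lists +8 vs -8, i.e. a 16 elo gap
print(round(expected_score(16) * 100, 1))   # ≈ 52.3% (the table shows 53%)
```

So a 53% score over 11154 games corresponds to roughly a 21 elo performance gap in the plain logistic model; the listed ±16 is the same order of magnitude.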
So 8 cores barely bested 4 cores, while 16 cores were beaten by 8 cores?
What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.
The same hardware as here: a 16-core Dell PowerEdge T620. Both hyperthreading and turbo boost are enabled. The computer runs Fedora 19. The hash table size is 128 MB and pondering is disabled.
However, I don't think these tests can show that SF's parallelization needs serious reworking, because the time control is 1s + 0.08s/move, which is a lot faster than what people generally are interested in. This time control was chosen to provide some data related to this claim:
At least lazy SMP wouldn't suffer from 2 limitations of the current implementation:
1. it takes some time to fully kick in because of the min split depth parameter
Though the claim is not completely wrong, I must admit that this doesn't seem to be a real problem.
Rank Name      Elo    +    -  games  score  oppo.  draws
   1 SF6-8T     71   14   13    500    72%    -71    39%
   2 SF6-1T    -71   13   14    500    28%     71    39%
No gain from 4 to 8 threads. I know this is to be expected, but it is still a bit disappointing.
My idea to improve on this is to allow 2 threads to kick in earlier, and then to continuously add more threads with increasing depth. I don't know if this will work, but I think it is worth a try.
The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.
The question is not WHERE, but WHEN.
The min split depth for 4 threads can certainly be smaller than for 32 threads. But instead of running with only 1 thread until we reach the point where we start to split with 32 threads, why not start earlier with a 2-thread search, then use 4 threads, and so on?
Doing this, at least some of the available capacity would be used.
I don't claim this is better, but it seems more logical to me.
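The "add threads gradually with depth" idea could be expressed as a schedule like the following. The base depth and the doubling rule are invented for illustration; whether any such schedule actually gains elo is exactly what would need testing:

```python
def allowed_threads(depth, max_threads, base_depth=4):
    """Hypothetical schedule: 1 thread below base_depth, then 2 threads,
    doubling every 2 plies of depth until max_threads is reached."""
    if depth < base_depth:
        return 1
    return min(max_threads, 2 ** (1 + (depth - base_depth) // 2))

# e.g. on a 32-thread machine: depth 4 -> 2 threads, depth 6 -> 4,
# depth 12 and beyond -> all 32
```

The schedule replaces the single min-split-depth gate with a ramp, so shallow iterations still get some parallelism without paying full 32-way split overhead.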
The min split depth is a compromise between keeping the processors busy and limiting split overhead. Splitting is not free. In Crafty, the cost of splitting is paid by the individual processors as they split. And the issue is, you do NOT want to spend more time splitting than actually searching, for obvious reasons. There are points where it is actually reasonable to use just one CPU, once you factor in costs and gains.

Reducing min_split_depth is NOT going to improve search speed; it is going to hurt search speed, because the overhead goes up quickly. It is also not a function of the time control being used; it is solely a function of the time it takes to actually do a split.
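The cost argument is easy to put in numbers. With made-up constants (real split costs depend entirely on the engine and the hardware), splitting a shallow subtree loses outright:

```python
SPLIT_COST = 50_000   # hypothetical: cycles to copy state and set up a split
NODE_COST = 2_000     # hypothetical: cycles to search one node

def worth_splitting(est_subtree_nodes, helpers):
    """Split only if the parallel saving outweighs the fixed setup cost.
    Assumes (optimistically) perfect speedup among helpers + master."""
    serial = est_subtree_nodes * NODE_COST
    parallel = serial / (helpers + 1) + SPLIT_COST
    return parallel < serial

# a 10-node subtree: setup cost dwarfs the work, so don't split;
# a 1000-node subtree: splitting pays for itself easily
```

Near the leaves the estimated subtree is tiny, which is precisely why a depth floor exists: below it, the split costs more than it saves regardless of the time control.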
If you allow turbo boost, your numbers are worthless. That single CPU will run quite a bit faster than any one of the CPUs when you run 8 threads. My iMac goes from 3.1 GHz to 4.0 GHz with turbo boost enabled. But I have a simple utility to disable it when I am testing.

The test quoted above (1, 4 and 8 threads) is particularly bad, because one thread will max out turbo boost, 4 threads will come close, but running 8 will see turbo boost effectively disabled. You can't compare the numbers until you turn it off.
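On Linux with the intel_pstate driver (which the Fedora box above likely uses), turbo boost can be checked, and as root disabled, via sysfs: `echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo`. A small check along those lines (the sysfs path assumes intel_pstate; other cpufreq drivers expose different knobs):

```python
from pathlib import Path

# intel_pstate's sysfs knob: "1" means turbo boost is disabled
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

def parse_no_turbo(text):
    """Interpret the no_turbo file contents."""
    return text.strip() == "1"

def turbo_disabled():
    """True/False if the knob is readable, None if it isn't present
    (non-Intel box, different driver, or non-Linux system)."""
    if not NO_TURBO.exists():
        return None
    return parse_no_turbo(NO_TURBO.read_text())
```

Running such a check before a 1-thread-vs-N-thread match at least documents whether the single-threaded side was enjoying a higher clock.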
Unfortunately most of the data published is bogus, because of turbo-boost...