Lazy SMP in Cheng

Joerg Oster · Post by **Joerg Oster** » Tue Feb 03, 2015 12:42 pm

petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 

I ran the same experiment; here are the results.
(I know, it is only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44%

4 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38%

8 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39%

No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing.

My idea to improve on this is to let 2 threads kick in earlier, and then keep adding more threads as the depth increases. I don't know if this will work, but I think it is worth a try.

petero2 · Post by **petero2** » Wed Feb 04, 2015 12:18 am

Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
Thank you, Peter.
Yes, 4-core performance is very good, but what does it look like with 8 and 16 cores?

I got the following result:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63%

zullil · Post by **zullil** » Wed Feb 04, 2015 1:23 am

petero2 wrote: I got the following result:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63%

So 8 cores barely bested 4 cores, while 16 cores was beaten by 8 cores?

What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.

lucasart · Post by **lucasart** » Wed Feb 04, 2015 2:27 am

IMHO, you need to change the min split depth, before running these tests at super fast tc.

bob · Post by **bob** » Wed Feb 04, 2015 3:04 am

Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
I ran the same experiment; here are the results.
(I know, it is only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44% 
4 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38% 
8 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39% 
No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing.

My idea to improve on this is to let 2 threads kick in earlier, and then keep adding more threads as the depth increases. I don't know if this will work, but I think it is worth a try.

The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.

gladius · Post by **gladius** » Wed Feb 04, 2015 3:05 am

zullil wrote:
petero2 wrote: I got the following result:
Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63% 
So 8 cores barely bested 4 cores, while 16 cores was beaten by 8 cores?

What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.

It scales well up to 8 cores, but past that, yes, it does need some thought about how to best use those other 8 cores.

I'll repost some of my thinking on this from the fishcooking forum:

So, it seems instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hashtable was already mostly filled out. This is not trivial to implement in SF (at least for me :)), but the idea seems quite powerful. Especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like create an additional split point at the root, and have the nodes check in every X nodes or so to see if there is actual work to be done. If not, they can continue searching in their parallel split point.
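The "check in every X nodes" idea above can be sketched roughly as follows. This is only a toy illustration, not Stockfish code: `helper_search`, `CHECK_INTERVAL`, and the `work_available` flag are all hypothetical names, and the node visit is a stand-in for real search work.

```python
import threading

# Hypothetical value for "X": how many nodes a helper visits between checks.
CHECK_INTERVAL = 1024


def helper_search(work_available: threading.Event,
                  node_budget: int,
                  check_interval: int = CHECK_INTERVAL) -> int:
    """Spin on the shared root split point until real work shows up.

    Returns the number of nodes visited before the helper noticed that
    regular split-point work was available (or the budget ran out).
    """
    nodes = 0
    while nodes < node_budget:
        nodes += 1  # stand-in for searching one node of the shared tree
        # Only poll the flag every `check_interval` nodes to keep it cheap.
        if nodes % check_interval == 0 and work_available.is_set():
            break
    return nodes
```

A larger interval makes the polling cheaper but delays how quickly an idle helper switches back to real work, so the value would need tuning.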

Another possibility is to have the concept of second-best move search. We know we are searching the PV moves very deeply, so instead, search the second best move relatively deeply (deeper than it would have been searched in normal search). An easy way to approximate this is by using the exclude move feature, to remove the best move from the possibilities. Also, perhaps search the second best with the PV search pruning parameters. If the score turns out super bad after X ply or something bail, and try another split point with the best move excluded.
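As a rough sketch of the exclude-move approximation described above: the main search picks the best root move, and a helper repeats the root search with that move excluded, which makes it work on the second-best move. Everything here is a toy assumption (the function name `best_root_move`, the move strings, and the static scores standing in for a real search).

```python
from typing import Callable, FrozenSet, Iterable, Optional


def best_root_move(moves: Iterable[str],
                   score: Callable[[str], int],
                   excluded: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the highest-scoring root move that is not excluded."""
    candidates = [m for m in moves if m not in excluded]
    if not candidates:
        return None
    return max(candidates, key=score)


# Toy scores standing in for search results (hypothetical values).
toy_scores = {"e2e4": 30, "d2d4": 25, "g1f3": 20}

main_move = best_root_move(toy_scores, toy_scores.__getitem__)
# The helper excludes the main line's best move, so it sees the second best.
helper_move = best_root_move(toy_scores, toy_scores.__getitem__,
                             excluded=frozenset({main_move}))
```

If the helper's line scores very badly, the post suggests bailing out and trying another split point, still with the best move excluded; that part is not modeled here.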

There are lots of possibilities! Implementing them is a fair bit of work, and especially testing is non-trivial, but it seems there is a lot to be gained here for high core computers.

mathmoi · Post by **mathmoi** » Wed Feb 04, 2015 5:54 am

Hi Martin,

As you know, I just recently found out that lazy SMP is a valid alternative to YBWC. Since then I have read a bit about it on CCC.

The idea is that if you have, say, two threads searching at depth d, you also launch two threads at depth d+1. When one of the threads searching at depth d finishes its search, you report the best move at this depth, stop all threads, and start them again at depths d+1 and d+2 for the next iteration.

Why do you stop all threads at this point? You already have two threads searching at d+1. Why not keep these two running, stop the ones running at depth d, and start them at d+2? Is it because the TT will make sure the threads will pick up where they were when they were first stopped?
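The depth staggering described in this post can be sketched as a small helper. It only illustrates the two-at-d, two-at-d+1 example; the function name `assign_depths` and the "first half, rounded up" rule for other thread counts are assumptions, not something taken from Cheng or any other engine.

```python
from typing import List


def assign_depths(num_threads: int, depth: int) -> List[int]:
    """Give the first half of the threads `depth`, the rest `depth + 1`.

    With an odd thread count the extra thread goes to the shallower depth
    (an arbitrary choice made for this sketch).
    """
    at_depth = (num_threads + 1) // 2
    return [depth if i < at_depth else depth + 1 for i in range(num_threads)]


# One iteration at depth d = 10, then the restart for the next iteration.
first_iteration = assign_depths(4, 10)
next_iteration = assign_depths(4, 11)
```

In this scheme the threads restarted at d+1 would rely on the shared transposition table to reuse what was already searched at that depth.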

mar · Post by **mar** » Wed Feb 04, 2015 7:22 am

mathmoi wrote:Why do you stop all threads at this point?

Because you have to continue aspiration/ID loop.

Michel · Post by **Michel** » Wed Feb 04, 2015 10:04 am

gladius wrote: So, it seems instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hashtable was already mostly filled out. This is not trivial to implement in SF (at least for me :)), but the idea seems quite powerful. Especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like create an additional split point at the root, and have the nodes check in every X nodes or so to see if there is actual work to be done. If not, they can continue searching in their parallel split point.

Another possibility is to have the concept of second-best move search. We know we are searching the PV moves very deeply, so instead, search the second best move relatively deeply (deeper than it would have been searched in normal search). An easy way to approximate this is by using the exclude move feature, to remove the best move from the possibilities. Also, perhaps search the second best with the PV search pruning parameters. If the score turns out super bad after X ply or something bail, and try another split point with the best move excluded.

There are lots of possibilities! Implementing them is a fair bit of work, and especially testing is non-trivial, but it seems there is a lot to be gained here for high core computers.

But why not just implement lazy SMP in SF? It has been pointed out by many people already (including me at the SF forum, referring to the CCRL rating list) that there is no serious evidence that SF's current SMP implementation is better than lazy SMP.

Joerg Oster · Post by **Joerg Oster** » Wed Feb 04, 2015 11:30 am

bob wrote:
Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
I ran the same experiment; here are the results.
(I know, it is only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44% 
4 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38% 
8 Threads against 1, tc=1+0.08
Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39% 
No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing.

My idea to improve on this is to let 2 threads kick in earlier, and then keep adding more threads as the depth increases. I don't know if this will work, but I think it is worth a try.
The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.

The question is not WHERE, but WHEN.

Min Split Depth for 4 threads can certainly be smaller than for 32 threads. But instead of running with only 1 thread until we reach the point where we start to split with 32 threads, why not start earlier with a 2-thread search, then use 4 threads, and so on?
Doing this, at least some of the available capacity would be used.

I don't claim this is better, but it seems more logical to me.
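One way to picture the "2 threads, then 4, and so on" ramp is a schedule that maps the current iteration depth to an active thread count. This is a minimal sketch under made-up assumptions: the function name `threads_for_depth`, the starting depth of 4, and doubling every 2 plies are all hypothetical and untested in any engine.

```python
def threads_for_depth(depth: int,
                      max_threads: int,
                      first_split_depth: int = 4,
                      depth_step: int = 2) -> int:
    """Return how many threads to use at a given iteration depth.

    Below `first_split_depth` the search stays single-threaded; from there
    the thread count doubles every `depth_step` plies, capped at
    `max_threads`.
    """
    if depth < first_split_depth:
        return 1
    doublings = 1 + (depth - first_split_depth) // depth_step
    return min(max_threads, 2 ** doublings)


# Ramp for a 32-thread machine at a few sample depths.
ramp = [threads_for_depth(d, 32) for d in (3, 4, 6, 8, 10, 12, 14)]
```

Whether such a ramp actually gains Elo would have to be measured; the sketch only shows the scheduling idea.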
