Lazy SMP in Cheng

Joerg Oster
Posts: 683
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: Lazy SMP in Cheng

Post by Joerg Oster » Tue Feb 03, 2015 11:42 am

petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
I ran the same experiment; here are the results.
(I know, only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44% 
4 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38% 
8 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39% 

No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing. :)
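
A quick way to read "no gain" off the tables above is to check whether the error bars overlap. The snippet below is only a rough sketch: the helper names are made up for illustration, and since each rating comes from a separate 500-game match against the 1-thread version, an overlap check is an informal sanity test rather than a proper significance test.

```python
def interval(elo, plus, minus):
    """Return (low, high) bounds of a rating from its +/- error margins."""
    return (elo - minus, elo + plus)


def overlaps(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Ratings of the multi-threaded side, copied from the tables above.
sf6_2t = interval(49, 13, 13)  # 2 threads vs 1
sf6_4t = interval(76, 13, 13)  # 4 threads vs 1
sf6_8t = interval(71, 14, 13)  # 8 threads vs 1

print("2T vs 4T overlap:", overlaps(sf6_2t, sf6_4t))
print("4T vs 8T overlap:", overlaps(sf6_4t, sf6_8t))
```

The 2-thread and 4-thread intervals do not overlap, while the 4-thread and 8-thread intervals do, which matches the impression that the step from 4 to 8 threads is not measurable at this game count.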

My idea to improve on this is to let 2 threads kick in earlier and then keep adding more threads as depth increases. I don't know if this will work, but I think it is worth a try.
Jörg Oster

petero2
Posts: 580
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden

Re: Lazy SMP in Cheng

Post by petero2 » Tue Feb 03, 2015 11:18 pm

Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
Thank you, Peter.
Yes, 4-core performance is very good, but how does it look like with 8 and 16 cores?
I got the following result:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63% 

zullil
Posts: 5562
Joined: Mon Jan 08, 2007 11:31 pm
Location: PA USA
Full name: Louis Zulli

Re: Lazy SMP in Cheng

Post by zullil » Wed Feb 04, 2015 12:23 am

petero2 wrote: I got the following result:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63% 
So 8 cores barely bested 4 cores, while 16 cores was beaten by 8 cores?

What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.

lucasart
Posts: 3036
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: Lazy SMP in Cheng

Post by lucasart » Wed Feb 04, 2015 1:27 am

IMHO, you need to change the min split depth before running these tests at super fast tc.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

bob
Posts: 20417
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Lazy SMP in Cheng

Post by bob » Wed Feb 04, 2015 2:04 am

Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
I ran the same experiment; here are the results.
(I know, only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44% 
4 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38% 
8 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39% 

No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing. :)

My idea to improve on this is to let 2 threads kick in earlier and then keep adding more threads as depth increases. I don't know if this will work, but I think it is worth a try.
The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.

gladius
Posts: 538
Joined: Tue Dec 12, 2006 9:10 am

Re: Lazy SMP in Cheng

Post by gladius » Wed Feb 04, 2015 2:05 am

zullil wrote:
petero2 wrote: I got the following result:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_8c     8    4    3 11154   53%    -8   63% 
   2 sf6_4c    -8    3    4 11154   47%     8   63% 

Rank Name      Elo    +    - games score oppo. draws 
   1 sf6_8c      9    7    8  1465   53%    -9   63% 
   2 sf6_16c    -9    8    7  1465   47%     9   63% 
So 8 cores barely bested 4 cores, while 16 cores was beaten by 8 cores?

What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.
It scales well up to 8 cores, but past that, yes, it does need some thought about how to best use those other 8 cores.

I'll repost some of my thinking on this from the fishcooking forum:

So, it seems instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hashtable was already mostly filled out. This is not trivial to implement in SF (at least for me :), but the idea seems quite powerful. Especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like create an additional split point at the root, and have the nodes check in every X nodes or so to see if there is actual work to be done. If not, they can continue searching in their parallel split point.

Another possibility is to have the concept of a second-best move search. We know we are searching the PV moves very deeply, so instead, search the second-best move relatively deeply (deeper than it would have been searched in a normal search). An easy way to approximate this is to use the exclude move feature to remove the best move from the possibilities. Also, perhaps search the second-best move with the PV search pruning parameters. If the score turns out super bad after X ply or so, bail and try another split point with the best move excluded.

There are lots of possibilities! Implementing them is a fair bit of work, and especially testing is non-trivial, but it seems there is a lot to be gained here for high core computers.
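
To make the two ideas above a bit more concrete, here is a minimal sketch of how an idle helper thread might choose its work. Everything in it is an assumption for illustration: `pick_helper_task`, the task names, and the move strings are hypothetical and are not Stockfish code. It only shows the decision order (real split-point work first, otherwise a root search with the best move excluded, otherwise park).

```python
def pick_helper_task(open_split_points, root_moves, best_move):
    """Choose work for an idle helper thread (hypothetical policy).

    Returns a (task_name, payload) tuple.
    """
    if open_split_points:
        # Real work exists: help at the first available split point.
        return ("join_split_point", open_split_points[0])
    # No split point to join: search the root with the best move excluded,
    # so the shared hash table still gets filled with useful entries.
    others = [m for m in root_moves if m != best_move]
    if others:
        return ("search_root_excluding_best", others)
    # Only one legal root move and nothing to help with.
    return ("park", None)


print(pick_helper_task([], ["e2e4", "d2d4", "g1f3"], "e2e4"))
print(pick_helper_task(["sp7"], ["e2e4", "d2d4"], "e2e4"))
```

In a real engine the helper would also need to poll periodically (every X nodes, as suggested above) and abandon the speculative search as soon as a proper split point opens up.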

mathmoi
Posts: 265
Joined: Mon Mar 13, 2006 4:23 pm
Location: Québec

Re: Lazy SMP in Cheng

Post by mathmoi » Wed Feb 04, 2015 4:54 am

Hi Martin,

As you know, I only recently found out that lazy SMP is a valid alternative to YBWC. Since then I have read a bit about it on CCC.

The idea is that if you have, say, two threads searching at depth d, you also launch two threads at depth d+1. When one of the threads searching at depth d finishes its search, you report the best move at this depth. Then you stop all threads and start them again at depth d+1 and d+2 for the next iteration.

Why do you stop all threads at this point? You already have two threads searching at d+1. Why not keep these two running, stop the ones running at depth d, and start them at d+2? Is it because the TT will make sure the threads pick up where they were when they were first stopped?

mar
Posts: 1987
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP in Cheng

Post by mar » Wed Feb 04, 2015 6:22 am

mathmoi wrote:Why do you stop all threads at this point?
Because you have to continue aspiration/ID loop.
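
The depth assignment described in the question can be sketched roughly as follows. This is only an illustration of the bookkeeping, not Cheng's code: `assign_depths` and `schedule` are hypothetical names, and the actual search, stop signal, and shared TT are left out. Each iterative-deepening step restarts all threads, with half of them one ply ahead.

```python
def assign_depths(num_threads, base_depth):
    """Even-indexed threads search base_depth, odd-indexed ones base_depth + 1."""
    return [base_depth + (i % 2) for i in range(num_threads)]


def schedule(num_threads, start_depth, max_depth):
    """List (depth, per-thread depths) for each iterative-deepening step."""
    return [(d, assign_depths(num_threads, d))
            for d in range(start_depth, max_depth + 1)]


# With 4 threads: two search depth d, two search depth d + 1.
for depth, plan in schedule(num_threads=4, start_depth=5, max_depth=7):
    print(depth, plan)
```

Restarting everything at each step keeps a single ID/aspiration loop in charge; the work already done by the d+1 threads is not lost, since it remains in the shared hash table.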

Michel
Posts: 2038
Joined: Sun Sep 28, 2008 11:50 pm

Re: Lazy SMP in Cheng

Post by Michel » Wed Feb 04, 2015 9:04 am

gladius wrote:
So, it seems instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hashtable was already mostly filled out. This is not trivial to implement in SF (at least for me :), but the idea seems quite powerful. Especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like create an additional split point at the root, and have the nodes check in every X nodes or so to see if there is actual work to be done. If not, they can continue searching in their parallel split point.

Another possibility is to have the concept of a second-best move search. We know we are searching the PV moves very deeply, so instead, search the second-best move relatively deeply (deeper than it would have been searched in a normal search). An easy way to approximate this is to use the exclude move feature to remove the best move from the possibilities. Also, perhaps search the second-best move with the PV search pruning parameters. If the score turns out super bad after X ply or so, bail and try another split point with the best move excluded.

There are lots of possibilities! Implementing them is a fair bit of work, and especially testing is non-trivial, but it seems there is a lot to be gained here for high core computers.
But why not just implement lazy SMP in SF? It has been pointed out by many people already (including me at the SF forum, referring to the CCRL rating list) that there is no serious evidence that SF's current SMP implementation is better than lazy SMP.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.

Joerg Oster
Posts: 683
Joined: Fri Mar 10, 2006 3:29 pm
Location: Germany

Re: Lazy SMP in Cheng

Post by Joerg Oster » Wed Feb 04, 2015 10:30 am

bob wrote:
Joerg Oster wrote:
petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
I ran the same experiment; here are the results.
(I know, only 500 games each; nevertheless, it gives an impression.)

2 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-2T    49   13   13   500   66%   -49   44% 
   2 SF6-1T   -49   13   13   500   34%    49   44% 
4 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-4T    76   13   13   500   73%   -76   38% 
   2 SF6-1T   -76   13   13   500   27%    76   38% 
8 Threads against 1, tc=1+0.08

Rank Name     Elo    +    - games score oppo. draws 
   1 SF6-8T    71   14   13   500   72%   -71   39% 
   2 SF6-1T   -71   13   14   500   28%    71   39% 

No gain from 4 to 8 Threads. I know this is to be expected, but also a bit disappointing. :)

My idea to improve on this is to let 2 threads kick in earlier and then keep adding more threads as depth increases. I don't know if this will work, but I think it is worth a try.
The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.
Question is not WHERE, but WHEN.

Min Split Depth for 4 threads can certainly be smaller than for 32 threads. But instead of running with only 1 thread until we reach the point where we start to split with 32 threads, why not start earlier with a 2-thread search, then use 4 threads, and so on?
That way, at least some of the available capacity would be used.

I don't claim this is better, but it seems more logical to me.
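
As a rough illustration of the "WHEN" idea, here is one possible schedule that maps depth to the number of threads allowed to take part. The function name and the parameters `first_split_depth` and `depth_step` are made up for this sketch and would need tuning; nothing here is taken from the Stockfish sources.

```python
def threads_for_depth(depth, max_threads, first_split_depth=4, depth_step=2):
    """Hypothetical schedule: 1 thread below first_split_depth, then 2,
    doubling every depth_step plies until max_threads is reached."""
    if depth < first_split_depth:
        return 1
    doublings = 1 + (depth - first_split_depth) // depth_step
    return min(max_threads, 2 ** doublings)


# On a 32-thread machine: 1, 2, 4, 8, 16, 32 threads as depth grows.
for d in range(2, 14, 2):
    print(d, threads_for_depth(d, max_threads=32))
```

With the default values above, a 32-thread machine would already use 2 threads at depth 4 and 4 threads at depth 6, instead of waiting with a single thread until the split depth for 32 threads is reached.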
Jörg Oster
