So, it seems that instead of parking threads when they have no work to do, it's better to spin them on the same search tree, since the hash table is already mostly filled out. This is not trivial to implement in SF (at least for me), but the idea seems quite powerful, especially with the quite skinny tree that SF currently searches in endgames. I think the idea would be something like: create an additional split point at the root, and have the helper threads check in every X nodes or so to see if there is actual work to be done. If not, they continue searching at their parallel split point.
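In pseudo-Python, the polling idea might look like this. The names, the poll interval, and the whole driver loop are hypothetical; this is a sketch of the control flow, not anything resembling SF's actual threading code:

```python
from collections import deque

def helper_loop(real_work, total_nodes, poll_interval):
    """Instead of parking, a helper keeps searching a speculative split
    point at the root, but every `poll_interval` nodes it checks whether
    a real split point has work for it, and joins it if so."""
    spec_nodes = real_nodes = 0
    joined = None                      # real split point joined, if any
    for n in range(1, total_nodes + 1):
        if joined is not None:
            real_nodes += 1            # helping at the real split point
        else:
            spec_nodes += 1            # speculative search keeps the TT warm
            if n % poll_interval == 0 and real_work:
                joined = real_work.popleft()
    return spec_nodes, real_nodes, joined

# work appears while the helper is spinning: it switches at the next poll
spec, real, where = helper_loop(deque(["split_point_A"]), 4096, 1024)
```

The point of the poll interval is that the speculative search never blocks the real one: the helper only loses at most X nodes of latency before joining useful work.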
Another possibility is the concept of a second-best-move search. We know we are searching the PV moves very deeply, so instead, search the second-best move relatively deeply (deeper than it would have been searched in the normal search). An easy way to approximate this is to use the excluded-move feature to remove the best move from the candidates. Perhaps also search the second-best move with the PV-search pruning parameters. If the score turns out to be very bad after X plies or so, bail out and try another split point with the best move excluded.
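A toy sketch of the excluded-move idea. The move scores, the bail margin, and the helper names are all made up here; SF's real excluded-move machinery lives in its singular-extension search, and this only illustrates the shape of the logic:

```python
# made-up scores standing in for a real search of each root move
MOVE_SCORES = {"e2e4": 50, "d2d4": 42, "g1f3": 30}

def search(scores, excluded=None):
    """Toy 'search': best (move, score), optionally excluding one move,
    mirroring the excluded-move trick used to probe the second-best."""
    cands = {m: s for m, s in scores.items() if m != excluded}
    best = max(cands, key=cands.get)
    return best, cands[best]

BAIL_MARGIN = 100  # hypothetical: give up if 2nd best trails by this much

def second_best_probe(scores):
    best, best_score = search(scores)
    second, second_score = search(scores, excluded=best)  # best excluded
    if best_score - second_score > BAIL_MARGIN:
        return None   # score is super bad: bail, try another split point
    return second     # worth searching this one deeply too
```

Here `second_best_probe(MOVE_SCORES)` picks out d2d4, while a position with one crushing move returns None and the helper would move on.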
There are lots of possibilities! Implementing them is a fair bit of work, and especially the testing is non-trivial, but it seems there is a lot to be gained here for high-core-count computers.
But why not just implement lazy SMP in SF? It has been pointed out by many people already (including me at the SF forum, referring to the CCRL rating list) that there is no serious evidence that SF's current SMP implementation is better than lazy SMP.
So far, what we see is that Cheng's scaling is not much worse than SF's. It's not better either. SF gets 152 elo from 4 threads, where Cheng got 132, if I understood correctly. So SF is still a bit better, especially considering that gaining 100 elo at SF's level is not the same as gaining 100 elo at Cheng's level (e.g. the elo gain from halving time-to-depth decreases with absolute elo).
But, admittedly, lazy SMP wins the tradeoff. It's not much worse, yet the implementation is significantly simpler. If someone is ready to do the work (not just talk) and rewrite SF's SMP implementation to use Cheng's lazy SMP, and that proves to be close enough to the original, then the patch will be welcome. But unless someone does the work, it's just hot air. You can say all you want on the SF forum or here, it won't happen until someone (you?) writes the code.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mar wrote: I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy SMP wouldn't suffer from 2 limitations of the current implementation:
1. it takes some time to fully kick in because of the min split depth parameter
2. not enough workload, especially for threads >= 8, due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with Stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
I must confess I have not really understood the time control of the tests. Is it 1s/game or 1s/move? Note that the CCRL uses a much longer time control either way (but the error bars are much larger also). So I don't think anything has been proved.
But unless someone does the work, it's just hot air. You can say all you want on the SF forum or here, it won't happen until someone (you?) writes the code.
Well, I won't write code for SF. You know perfectly well why!
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Rank Name      Elo    +    -  games  score  oppo.  draws
   1 sf6_8c      8    4    3  11154    53%     -8    63%
   2 sf6_4c     -8    3    4  11154    47%      8    63%

Rank Name      Elo    +    -  games  score  oppo.  draws
   1 sf6_8c      9    7    8   1465    53%     -9    63%
   2 sf6_16c    -9    8    7   1465    47%      9    63%
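As a sanity check on the numbers above, the standard logistic elo model relates a rating gap to an expected score (the testing tool's draw model may scale things slightly differently, which is why the table's 53% doesn't match exactly):

```python
import math

def expected_score(elo_diff):
    """Logistic elo model: expected score for the side `elo_diff` stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse: elo difference implied by a match score (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# the 8c-vs-4c table above lists +8 vs -8, i.e. a 16 elo gap
print(round(expected_score(16) * 100, 1))   # ≈ 52.3% (the table shows 53%)
```

So a 53% score over 11154 games corresponds to roughly a 21 elo performance gap in the plain logistic model; the listed ±16 is the same order of magnitude.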
So 8 cores barely bested 4 cores, while 16 cores were beaten by 8 cores?
What hardware was this on? If this result is representative, it seems SF's parallelization needs some serious reworking.
The same hardware as here: a 16-core Dell PowerEdge T620. Both hyperthreading and turbo boost are enabled. The computer runs Fedora 19. The hash table size is 128 MB and pondering is disabled.
However, I don't think these tests can show that SF's parallelization needs serious reworking, because the time control is 1s + 0.08s/move, which is a lot faster than what people generally are interested in. This time control was chosen to provide some data related to this claim:
At least lazy SMP wouldn't suffer from 2 limitations of the current implementation:
1. it takes some time to fully kick in because of the min split depth parameter
Though the claim is not completely wrong, I must admit that this doesn't seem to be a real problem.
Rank Name      Elo    +    -  games  score  oppo.  draws
   1 SF6-8T     71   14   13    500    72%    -71    39%
   2 SF6-1T    -71   13   14    500    28%     71    39%
No gain from 4 to 8 threads. I know this is to be expected, but it is still a bit disappointing.
My idea to improve on this is to allow 2 threads to kick in earlier, and then to continuously add more threads with increasing depth. I don't know if this will work, but I think it is worth a try.
The problem is, WHERE do you "add them"? YBW is still necessary for any good parallel search.
The question is not WHERE, but WHEN.
The min split depth for 4 threads can certainly be smaller than for 32 threads. But instead of running with only 1 thread until we reach the point where we start to split with 32 threads, why not start earlier with a 2-thread search, then use 4 threads, and so on?
Doing this, at least some of the available capacity would be used.
I don't claim this is better, but it seems more logical to me.
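The "add threads gradually with depth" idea could be expressed as a schedule like the following. The base depth and the doubling rule are invented for illustration; whether any such schedule actually gains elo is exactly what would need testing:

```python
def allowed_threads(depth, max_threads, base_depth=4):
    """Hypothetical schedule: 1 thread below base_depth, then 2 threads,
    doubling every 2 plies of depth until max_threads is reached."""
    if depth < base_depth:
        return 1
    return min(max_threads, 2 ** (1 + (depth - base_depth) // 2))

# e.g. on a 32-thread machine: depth 4 -> 2 threads, depth 6 -> 4,
# depth 12 and beyond -> all 32
```

The schedule replaces the single min-split-depth gate with a ramp, so shallow iterations still get some parallelism without paying full 32-way split overhead.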
The min split depth is a compromise between keeping the processors busy and limiting split overhead. Splitting is not free. In Crafty, the cost of splitting is paid by the individual processors as they split. And the issue is, you do NOT want to spend more time splitting than actually searching, for obvious reasons. There are points where it is actually reasonable to use just one CPU, once you factor in costs and gains.

Reducing min_split_depth is NOT going to improve search speed; it is going to hurt search speed, because the overhead goes up quickly. It is also not a function of the time control being used; it is solely a function of the time it takes to actually do a split.
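The cost argument is easy to put in numbers. With made-up constants (real split costs depend entirely on the engine and the hardware), splitting a shallow subtree loses outright:

```python
SPLIT_COST = 50_000   # hypothetical: cycles to copy state and set up a split
NODE_COST = 2_000     # hypothetical: cycles to search one node

def worth_splitting(est_subtree_nodes, helpers):
    """Split only if the parallel saving outweighs the fixed setup cost.
    Assumes (optimistically) perfect speedup among helpers + master."""
    serial = est_subtree_nodes * NODE_COST
    parallel = serial / (helpers + 1) + SPLIT_COST
    return parallel < serial

# a 10-node subtree: setup cost dwarfs the work, so don't split;
# a 1000-node subtree: splitting pays for itself easily
```

Near the leaves the estimated subtree is tiny, which is precisely why a depth floor exists: below it, the split costs more than it saves regardless of the time control.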
If you allow turbo boost, your numbers are worthless. That single CPU will run quite a bit faster than any one of the CPUs when you run 8 threads. My iMac goes from 3.1 GHz to 4.0 GHz with turbo boost enabled. But I have a simple utility to disable it when I am testing.

The test quoted above (1, 4 and 8 threads) is particularly bad, because one thread will max out turbo boost, 4 threads will come close, but running 8 will see turbo boost effectively disabled. You can't compare the numbers until you turn it off.
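On Linux with the intel_pstate driver (which the Fedora box above likely uses), turbo boost can be checked, and as root disabled, via sysfs: `echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo`. A small check along those lines (the sysfs path assumes intel_pstate; other cpufreq drivers expose different knobs):

```python
from pathlib import Path

# intel_pstate's sysfs knob: "1" means turbo boost is disabled
NO_TURBO = Path("/sys/devices/system/cpu/intel_pstate/no_turbo")

def parse_no_turbo(text):
    """Interpret the no_turbo file contents."""
    return text.strip() == "1"

def turbo_disabled():
    """True/False if the knob is readable, None if it isn't present
    (non-Intel box, different driver, or non-Linux system)."""
    if not NO_TURBO.exists():
        return None
    return parse_no_turbo(NO_TURBO.read_text())
```

Running such a check before a 1-thread-vs-N-thread match at least documents whether the single-threaded side was enjoying a higher clock.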
Unfortunately most of the data published is bogus, because of turbo-boost...