Best Stockfish NPS scaling yet

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

syzygy wrote:I think the point is that the longer / deeper you search, the smaller the effect of inherent synchronisation / serialisation within the search algorithm becomes.

For example, with YBWC there is a clear synchronisation point as you get close to completing the search of the PV node at depth d. For a brief moment all but one thread will have run out of work and must wait for that last thread to finish. But the amount of "lost" search time due to this serialisation does not continue to grow for a single depth as the remaining depth of the tree gets bigger and bigger. In other words, the total time lost (at the PV nodes at all depths) is more or less linear with the total depth of the search tree. At the same time, the total size of the search tree grows exponentially. The ratio between time lost and time spent crunching nodes therefore goes to 0.
I think this is true ONLY if you don't use YBW. Minimum split depth is a killer. The tree grows exponentially, yes. But as it grows, the number of nodes <= min_split_depth ALSO grows exponentially, and those hurt. If you toss YBW and do something like DTS, you can squeeze out everything BUT the lock/synchronization overhead. Which will never go to zero, but can get reasonably close if you are careful with locks.
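
To make the point concrete, here is a minimal sketch in C of the split test being discussed; the names (MIN_SPLIT_DEPTH, idle_thread_available, can_split) are hypothetical and not taken from Crafty, Stockfish, or any other engine.

Code:

#include <stdio.h>

#define MIN_SPLIT_DEPTH 6   /* illustrative value */

/* Placeholder: a real engine would consult its idle-thread list here. */
static int idle_thread_available(void) { return 1; }

/* YBW split test: split only if (a) there is enough remaining depth to be
 * worth the setup cost, (b) the first ("eldest brother") move has already
 * been searched, and (c) another thread is idle and can take the work. */
static int can_split(int depth, int moves_searched)
{
    return depth >= MIN_SPLIT_DEPTH
        && moves_searched >= 1
        && idle_thread_available();
}

int main(void)
{
    /* Every node with depth < MIN_SPLIT_DEPTH is forced to run serially,
     * and the number of such nodes grows right along with the tree. */
    printf("depth 4, 2 moves searched: can split? %d\n", can_split(4, 2));
    printf("depth 8, 1 move searched:  can split? %d\n", can_split(8, 1));
    return 0;
}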
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Best Stockfish NPS scaling yet

Post by syzygy »

bob wrote:
syzygy wrote:I think the point is that the longer / deeper you search, the smaller the effect of inherent synchronisation / serialisation within the search algorithm becomes.

For example, with YBWC there is a clear synchronisation point as you get close to completing the search of the PV node at depth d. For a brief moment all but one thread will have run out of work and must wait for that last thread to finish. But the amount of "lost" search time due to this serialisation does not continue to grow for a single depth as the remaining depth of the tree gets bigger and bigger. In other words, the total time lost (at the PV nodes at all depths) is more or less linear with the total depth of the search tree. At the same time, the total size of the search tree grows exponentially. The ratio between time lost and time spent crunching nodes therefore goes to 0.
I think this is true ONLY if you don't use YBW. Minimum split depth is a killer. The tree grows exponentially, yes. But as it grows, the number of nodes <= min_split_depth ALSO grows exponentially, and those hurt.
Minimum split depth only hurts if there are no split nodes at higher depth left. In YBW that is only necessarily the case when you get close to finishing searching the "current" PV node. In all other cases it is sufficient to ensure the presence of a sufficient number of (candidate) split nodes. Of course not all implementations will succeed here, but in principle it can be done.
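
As a rough sketch of what "ensuring the presence of candidate split nodes" could look like (hypothetical names and structures, locking omitted, not the code of any particular engine): busy threads register every node that satisfies the YBW criterion and has enough remaining depth, and an idle thread joins the candidate with the most depth left, so it only has to wait when the list is empty.

Code:

#include <stdio.h>

#define MAX_CANDIDATES 64
#define MIN_SPLIT_DEPTH 6

struct candidate {
    int node_id;   /* placeholder handle for the split node */
    int depth;     /* remaining depth at that node          */
};

static struct candidate candidates[MAX_CANDIDATES];
static int num_candidates = 0;

/* Called by busy threads at nodes whose first move has been searched. */
void register_candidate(int node_id, int depth)
{
    if (depth >= MIN_SPLIT_DEPTH && num_candidates < MAX_CANDIDATES) {
        candidates[num_candidates].node_id = node_id;
        candidates[num_candidates].depth = depth;
        num_candidates++;
    }
}

/* Called by an idle thread: join the candidate with the most depth left. */
int pick_candidate(void)
{
    int best = -1, best_depth = 0;
    for (int i = 0; i < num_candidates; i++) {
        if (candidates[i].depth > best_depth) {
            best_depth = candidates[i].depth;
            best = candidates[i].node_id;
        }
    }
    return best;   /* -1 means no candidate left: the serialisation case */
}

int main(void)
{
    register_candidate(1, 12);
    register_candidate(2, 7);
    printf("idle thread joins node %d\n", pick_candidate());   /* node 1 */
    return 0;
}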
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Best Stockfish NPS scaling yet

Post by kbhearn »

I see 3 situations in a PVS YBW implementation where splitting at/near min depth would become a significant fraction of the splits:

1) the start of the iteration when higher nodes do not yet satisfy YBW criteria
2) the end of the search when there are no higher nodes left
(both of these diminish of course as the total iteration gets longer)

and
3) after a scout search fails all the way back to the PV, which mimics the same condition as (1).

It intuitively feels like this should still lead to shrinking split overhead the longer the search runs, but perhaps some fixed percentage may remain...
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Best Stockfish NPS scaling yet

Post by syzygy »

kbhearn wrote:I see 3 situations in a PVS YBW implementation where splitting at/near min depth would become a significant fraction of the splits:

1) the start of the iteration when higher nodes do not yet satisfy YBW criteria
2) the end of the search when there are no higher nodes left
(both of these diminish of course as the total iteration gets longer)
I would say both are manifestations of the serialisation point where the search completes searching the PV node at d+1 ply from the root (thereby completing the search of the first move of the PV node at d ply from the root) and starts searching the remaining moves of the PV node at d ply from the root.

At the start of the iteration, d is equal to the iteration depth and there is no parallelism. As d decreases, parallelism increases. The "serial loss" for each value of d seems to be bounded by a constant.
and
3) after a scout search fails all the way back to the PV, which mimics the same condition as (1).
Yes, it is as if the iteration was restarted.
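
A back-of-the-envelope version of this "serial loss bounded by a constant" argument, assuming the loss at each ply is at most some constant $c$ and the tree grows with effective branching factor $b$ to depth $D$ (all three symbols are assumptions of this sketch, not measured values):

$$\frac{\text{total serial loss}}{\text{total nodes}} \;\lesssim\; \frac{c\,D}{\sum_{i=0}^{D} b^{i}} \;=\; \frac{c\,D\,(b-1)}{b^{D+1}-1} \;\longrightarrow\; 0 \quad \text{as } D \to \infty.$$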
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:I think the point is that the longer / deeper you search, the smaller the effect of inherent synchronisation / serialisation within the search algorithm becomes.

For example, with YBWC there is a clear synchronisation point as you get close to completing the search of the PV node at depth d. For a brief moment all but one thread will have run out of work and must wait for that last thread to finish. But the amount of "lost" search time due to this serialisation does not continue to grow for a single depth as the remaining depth of the tree gets bigger and bigger. In other words, the total time lost (at the PV nodes at all depths) is more or less linear with the total depth of the search tree. At the same time, the total size of the search tree grows exponentially. The ratio between time lost and time spent crunching nodes therefore goes to 0.
I think this is true ONLY if you don't use YBW. Minimum split depth is a killer. The tree grows exponentially, yes. But as it grows, the number of nodes <= min_split_depth ALSO grows exponentially, and those hurt.
Minimum split depth only hurts if there are no split nodes at higher depth left. In YBW that is only necessarily the case when you get close to finishing searching the "current" PV node. In all other cases it is sufficient to ensure the presence of a sufficient number of (candidate) split nodes. Of course not all implementations will succeed here, but in principle it can be done.
Not quite. You ALWAYS run into min split depth on the left-hand side of the tree (all the PV nodes). And you run into them at any point where a processor is in that part of the tree and another processor goes idle and the first is unable to do a split. Some quick math will show you that a program spends FAR more of its time inside the min split depth window than it does in nodes where it can split instantly. Just compute how large the search space of the last 6 plies + q-search is, compared to the first N-6 plies. Major difference.
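
A rough version of that computation, assuming a uniform tree of depth $D$ with effective branching factor $b$ and ignoring the q-search (which only makes the imbalance larger): the fraction of nodes whose remaining depth is less than $k$ plies is

$$\frac{\sum_{i=D-k+1}^{D} b^{i}}{\sum_{i=0}^{D} b^{i}} \;\approx\; 1 - b^{-k},$$

which for $b = 2$ and $k = 6$ is already about 98% of all nodes.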

I agree it can be done, as DTS does it quite nicely. But even DTS will have problems on the PC platform since there is a measurable cost to split, something we didn't have on the Cray, or at least it was not measurable on the Cray.

If you look at current Crafty's output, when it says 95% utilization on a 12-core box, that means 5% of 12 cores was wasted waiting on a split operation by one of the busy processors. If you take 12 cores as 1200% possible usage, this represents 60% of one processor lost, which is not bad, but it is not really great. The new code I am running looks absolutely perfect with 4 CPUs. I am seeing 100% utilization with a low of 99%. I need to test some on 8, 12 and 16 cores when I have time to see how that turns out. I doubt it will remain 100% beyond 8, however, although I would happily accept it if it does.
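
For reference, the arithmetic behind that figure: 95% utilization of 12 cores means $0.05 \times 12 = 0.6$ cores idle on average, i.e. roughly 60% of one processor lost over the course of the search.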
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

kbhearn wrote:I see 3 situations in a PVS YBW implementation where splitting at/near min depth would become a significant fraction of the splits:

1) the start of the iteration when higher nodes do not yet satisfy YBW criteria
2) the end of the search when there are no higher nodes left
(both of these diminish of course as the total iteration gets longer)

and
3) after a scout search fails all the way back to the PV, which mimics the same condition as (1).

It intuitively feels like this should still lead to shrinking split overhead the longer the search runs, but perhaps some fixed percentage may remain...
There's another case.

Think about the search space that the last 6 plies include (total nodes, including the q-search, which is probably 50% of total nodes searched on average), assuming a minimum split depth of 6 plies. Compare the total nodes there to the total nodes in the complete N-6 tree. Those nodes with remaining depth < 6 plies really dominate the tree. Any time a processor goes idle and another processor is in that part of the tree, it can't split. This can add up quickly.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Best Stockfish NPS scaling yet

Post by syzygy »

bob wrote:
syzygy wrote:Minimum split depth only hurts if there are no split nodes at higher depth left. In YBW that is only necessarily the case when you get close to finishing searching the "current" PV node. In all other cases it is sufficient to ensure the presence of a sufficient number of (candidate) split nodes. Of course not all implementations will succeed here, but in principle it can be done.
Not quite. You ALWAYS run into min split depth on the left-hand side of the tree (all the PV nodes). And you run into them at any point where a processor is in that part of the tree and another processor goes idle and the first is unable to do a split.
I'm assuming the split points have been created already.

In my engine basically any node is a (potential) split node. Creating a split node does not involve copying the position but merely saving the values of alpha and beta and the moves from the root to the current node so that other threads have enough information to recreate that position.

So it's just fine that threads are spending a lot of time near the leaves. It's even better, as it means that other threads can perform the split and briefly lock the split node (only for copying the moves) without the master thread noticing anything.
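
A minimal sketch in C of that kind of lightweight split point (hypothetical types and names, not the actual code of the engine described above; locking and move distribution are omitted): the master saves only the window and the move path from the root, and a helper replays the path on its own private board to reach the split position before taking moves to search.

Code:

#include <stdint.h>
#include <string.h>

#define MAX_PLY 128

typedef uint32_t Move;          /* placeholder move encoding */

struct split_point {
    int  alpha, beta;           /* current search window            */
    int  depth;                 /* remaining depth at the node      */
    int  ply;                   /* number of moves from the root    */
    Move path[MAX_PLY];         /* moves leading to this node       */
};

/* Master thread: no position copy, just the bounds plus the path from the
 * root. Cheap enough that almost every node can be a potential split point. */
void create_split_point(struct split_point *sp, int alpha, int beta,
                        int depth, const Move *root_path, int ply)
{
    sp->alpha = alpha;
    sp->beta  = beta;
    sp->depth = depth;
    sp->ply   = ply;
    memcpy(sp->path, root_path, (size_t)ply * sizeof(Move));
}

int main(void)
{
    Move pv[3] = {1, 2, 3};     /* dummy moves for illustration */
    struct split_point sp;
    create_split_point(&sp, -50, 50, 10, pv, 3);
    return 0;
}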
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:Minimum split depth only hurts if there are no split nodes at higher depth left. In YBW that is only necessarily the case when you get close to finishing searching the "current" PV node. In all other cases it is sufficient to ensure the presence of a sufficient number of (candidate) split nodes. Of course not all implementations will succeed here, but in principle it can be done.
Not quite. You ALWAYS run into min split depth on the left-hand side of the tree (all the PV nodes). And you run into them at any point where a processor is in that part of the tree and another processor goes idle and the first is unable to do a split.
I'm assuming the split points have been created already.

In my engine basically any node is a (potential) split node. Creating a split node does not involve copying the position but merely saving the values of alpha and beta and the moves from the root to the current node so that other threads have enough information to recreate that position.

So it's just fine that threads are spending a lot of time near the leaves. It's even better, as it means that other threads can perform the split and briefly lock the split node (only for copying the moves) without the master thread noticing anything.
Doesn't sound like a half-bad idea. Someone mentioned this years ago, but I never had much luck. The overhead of creating these split points was not minuscule for me. For example, when splitting at the root, creating split points farther out in the tree is almost always pointless until right at the end of the search. In any case, that code by itself is an example of overhead since many of them are wasted effort. Or at least they were for me.