## SMP speed up

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

Rein Halbersma wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
Could it be that your example of Amdahl's law reflects the YBW paradigm for the average chess positions's branching factor of 35? If this relationship for up to N=16 can be extrapolated, the upper bound on the speedup is equal to S=35 for N=infinity. That would be very poor scaling behavior for the DTS algorithm for large N, considering that chess search trees have ample parallelism. Didn't Cilkchess show linear speedup for a much larger range of processors?

Rein
There are implementation details involved. Some split at a single point with all processors. This has significant bearing on speedup. Obviously 256 cpus is not going to help as much as expected because there are no positions that have 256 legal moves to search. This can be reduced. In Crafty, I allow no more than 4. And occasionally get burned after a check where there is only two legal moves, the first satisfies YBW, the second can't be split among 4 threads. There is work left to do there. Probably splitting when in check is bad. But at times critical, as when searching down the left-hand-side (PV) of the tree, you either split when you back up to such a position, or have everyone sit until you get further down into the tree. There you might pick a bad split point, where we know that the PV node is ALL.

Overhead piles up everywhere, and it is quite difficult to reduce it even by a small amount. It is annoying to see nps go up by 8x, and only search some odd position 2x faster. But it happens. Not often, thankfully, but it happens.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

michiguel wrote:
Rein Halbersma wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
Could it be that your example of Amdahl's law reflects the YBW paradigm for the average chess positions's branching factor of 35? If this relationship for up to N=16 can be extrapolated, the upper bound on the speedup is equal to S=35 for N=infinity. That would be very poor scaling behavior for the DTS algorithm for large N, considering that chess search trees have ample parallelism. Didn't Cilkchess show linear speedup for a much larger range of processors?

Rein
If Amdahl's law applies and the formula is correct, 35 means that 1/35 is the fraction of the program that cannot be made parallel. In this case, 1/35 ~ 0.028 or 2.8%. In my humble experience of making Gaviota parallel (which is not efficient BTW), I observed that non-parallel fraction of the program is bigger at low depths. This is because a big chunk of non-parallelism comes from the fact that the PV and its last x plies never split. The deeper you search, the less significant this becomes. I measured it and it could go from 15% at the beginning to 1% or less, but It never goes to zero. So, how well these algorithms scale depend on the depth reached. If I am not mistaken, The DTS experiments may have been done at a reasonably "low" depths by today's standards. Maybe it would be even better today.

To make things more complicated, and maybe the cause of a non-Amdahl behavior is that, as Bob pointed out, each processor added makes the tree bigger (this may end up looking like it is a non parallel fraction of the tree, but the concept is different). OTOH, there is a positive effect because there is a chance to find the solution faster by peaking to successful late moves earlier, like making a "fast-forward" of the process.

That is why the problem may look like Amdahl, but the k=35 may not mean what we think it means but an average of all the issues at stake.

Miguel
Since we are on a different tangent now, some inefficient points...

(1) split cost. Non-trivial copying a lot of data to various split blocks. Totally unavoidable, one just has to try to limit the number of splits.

(2) search overhead. Every time you do a split, and one of the threads gets a fail high, some (or all) of what the other threads have done is wasted. How much overhead is difficult to project. If you generate 40 moves, and for some reason the fail-high move is at slot 20, you search move one serially to satisfy the YBWC, then you split and start the search. If any moves past slot 20 are searched, partially or completely, by the time move 20 fails high, that is wasted effort.

(3) synchronization. Locks are generally used. It is possible to write lockless code, and various algorithms are given in parallel programming books, assuming you can guarantee the order of memory reads and writes. This is not true for all processors today, and there is no guarantee it will be true even for Intel in the future. So it is dangerous. But in any case, locks are sequential, by definition, and factor directly into Amdahl's law, where the first two items do not necessarily.

(4) hardware. If you share things, and they end up close in memory, they can share a cache block. And you can generate a ton of cache coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools. AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues such as memory conflicts, bus conflicts, does each processor have its own cache (L1 and L2) or a local L1 and a shared L2, or a local L1 and a L2 shared between 2 or 4 cores? Each changes performance by adding some unknown serialization.

There's a lot to this, as can be seen.

Then there is non-deterministic behaviour, where one thread just has time to store something in hash before another thread needs it. The next time it is just too late and the tree changes if you repeat the search. One thread "gets ahead" since moves don't have to be searched one at a time, and you get the occasionally infamous "super-linear speedup". And of course different threads update things in different orders. Which further alters the speedup data.

Milos
Posts: 4089
Joined: Wed Nov 25, 2009 12:47 am

### Re: SMP speed up

bob wrote:(4) hardware. If you share things, and they end up close in memory, they can share a cache block. And you can generate a ton of cache coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools. AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues such as memory conflicts, bus conflicts, does each processor have its own cache (L1 and L2) or a local L1 and a shared L2, or a local L1 and a L2 shared between 2 or 4 cores? Each changes performance by adding some unknown serialization.
This is not a valid point any more. All AMD SMP processors and new Intels (i3, i5, i7) have separate (non-shared) L2 cache (in addition to local L1 cache).

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

Milos wrote:
bob wrote:(4) hardware. If you share things, and they end up close in memory, they can share a cache block. And you can generate a ton of cache coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools. AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues such as memory conflicts, bus conflicts, does each processor have its own cache (L1 and L2) or a local L1 and a shared L2, or a local L1 and a L2 shared between 2 or 4 cores? Each changes performance by adding some unknown serialization.
This is not a valid point any more. All AMD SMP processors and new Intels (i3, i5, i7) have separate (non-shared) L2 cache (in addition to local L1 cache).
Eh? Intel certainly claims Nehalem has a shared L3 cache (the 4 core version shares L2 between all cores) there are a ton of existing core-2 processors out there that do have a shared L2 cache. As a note, L2 on Nehalem is tiny compared to core-2 boxes, so L3 is important.

And apparently, once again, you didn't grasp the significance of the cache coherency comment. Shared L2 had _less_ coherency traffic than private L2 caches. with 8 cores, on core 2 you have 4 shared L2 caches (2 per chip). You have 4 caches chatting back and forth about ownership, etc. With nehalem, and 8 private L2,s you have 2x the cache coherency traffic. That adds delays. As any architecture person will tell you. And it adds more "hidden" loss to the parallel algorithm if you do a lot of shared memory modification, such as in a shared split block, shared locks, shared hash table, etc.

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

### Re: SMP speed up

bob wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
OK, a few quick tests:

Code: Select all

``````Ncpus      S/U
1             1
2             1.95
4             3,68
8             6.67
16           11.2
``````
Not bad, assuming you are talking about the DTS algorithm and data from C90 Cray Blitz. It is too high early for current YBW in Crafty, although it drops into line around 16.
Of course, it should have a different constant for YBW, not 35.

however, for simplicity, my original formula is easier to calculate, and works better for the above numbers... However, this is a non-linear function being approximated by (now) a second degree polynomial, so it should be able to produce more accuracy with tweaking for the current implementation. And a 3rd degree would be better. Etc...
As I said, what I show is not an approximation. It is a hyperbolic equation that models the Amdah'ls hypothesis.

Miguel

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

michiguel wrote:
bob wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
OK, a few quick tests:

Code: Select all

``````Ncpus      S/U
1             1
2             1.95
4             3,68
8             6.67
16           11.2
``````
Not bad, assuming you are talking about the DTS algorithm and data from C90 Cray Blitz. It is too high early for current YBW in Crafty, although it drops into line around 16.
Of course, it should have a different constant for YBW, not 35.

however, for simplicity, my original formula is easier to calculate, and works better for the above numbers... However, this is a non-linear function being approximated by (now) a second degree polynomial, so it should be able to produce more accuracy with tweaking for the current implementation. And a 3rd degree would be better. Etc...
As I said, what I show is not an approximation. It is a hyperbolic equation that models the Amdah'ls hypothesis.

Miguel
In chess, I think the true serial code is probably on the order of 0.1% at worst. The problem is the 92% fail high number. That means that If you satisfy the YBWC (Young Brothers Wait Criterion for those not familiar with the acronym) then 92% of the time you can be sure this is not a CUT node since you just searched that first move without a fail high. But 8% of the time, you do a split and waste some time. This is the killer provides most of the loss. If you look at some of the data I have posted previously, the number of nodes is noticable.
For example, two different positions. Look at the node counts. The first two lines are from the same position, with 1 cpu and then with 8. both to the _same_ fixed depth. The first position actually has fewer nodes in the parallel search. lucky hash information, or lucky out-of-order searching to prune away parts of the tree that normally would have to be searched. For an example, suppose you have a position where at the root, move one appears to be best, but the real best move is down at slot 9. Move 2 is very complicated and produces a big tree. Then you work your way down to 3, 4, and finally get to 9, fail high, and then move. With the parallel search, we search the first move, then the next 8 in parallel. If the mate is easy to find, we get that fail high back way before the 2nd complex move is searched, and after finding the mate move, we have to finish the 2nd move to make sure all are done, and we get a quick cutoff. The second position adds a few million extra nodes, which guarantees no optimal speedup can happen:

Code: Select all

``````log.001&#58;              time=1&#58;33  mat=0  n=282932478  fh=87%  nps=3.0M
log.002&#58;              time=14.58  mat=0  n=201479975  fh=88%  nps=13.8M

log.001&#58;              time=28.34  mat=1  n=92890080  fh=88%  nps=3.3M
log.002&#58;              time=6.12  mat=1  n=99881486  fh=88%  nps=16.3M

``````
SIgnificant variability. One tree is smaller than normal and can potentially produce a super-linear speedup (> 8 although the case above s only 6x). The other tree is larger and produces a speedup a little less than 5. Note that both are short searches for a smp search, and deeper would probably improve the speedups...

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

### Re: SMP speed up

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
OK, a few quick tests:

Code: Select all

``````Ncpus      S/U
1             1
2             1.95
4             3,68
8             6.67
16           11.2
``````
Not bad, assuming you are talking about the DTS algorithm and data from C90 Cray Blitz. It is too high early for current YBW in Crafty, although it drops into line around 16.
Of course, it should have a different constant for YBW, not 35.

however, for simplicity, my original formula is easier to calculate, and works better for the above numbers... However, this is a non-linear function being approximated by (now) a second degree polynomial, so it should be able to produce more accuracy with tweaking for the current implementation. And a 3rd degree would be better. Etc...
As I said, what I show is not an approximation. It is a hyperbolic equation that models the Amdah'ls hypothesis.

Miguel
In chess, I think the true serial code is probably on the order of 0.1% at worst. The problem is the 92% fail high number. That means that If you satisfy the YBWC (Young Brothers Wait Criterion for those not familiar with the acronym) then 92% of the time you can be sure this is not a CUT node since you just searched that first move without a fail high. But 8% of the time, you do a split and waste some time. This is the killer provides most of the loss. If you look at some of the data I have posted previously, the number of nodes is noticable.
For example, two different positions. Look at the node counts. The first two lines are from the same position, with 1 cpu and then with 8. both to the _same_ fixed depth. The first position actually has fewer nodes in the parallel search. lucky hash information, or lucky out-of-order searching to prune away parts of the tree that normally would have to be searched. For an example, suppose you have a position where at the root, move one appears to be best, but the real best move is down at slot 9. Move 2 is very complicated and produces a big tree. Then you work your way down to 3, 4, and finally get to 9, fail high, and then move. With the parallel search, we search the first move, then the next 8 in parallel. If the mate is easy to find, we get that fail high back way before the 2nd complex move is searched, and after finding the mate move, we have to finish the 2nd move to make sure all are done, and we get a quick cutoff. The second position adds a few million extra nodes, which guarantees no optimal speedup can happen:

Code: Select all

``````log.001&#58;              time=1&#58;33  mat=0  n=282932478  fh=87%  nps=3.0M
log.002&#58;              time=14.58  mat=0  n=201479975  fh=88%  nps=13.8M

log.001&#58;              time=28.34  mat=1  n=92890080  fh=88%  nps=3.3M
log.002&#58;              time=6.12  mat=1  n=99881486  fh=88%  nps=16.3M

``````
SIgnificant variability. One tree is smaller than normal and can potentially produce a super-linear speedup (> 8 although the case above s only 6x). The other tree is larger and produces a speedup a little less than 5. Note that both are short searches for a smp search, and deeper would probably improve the speedups...
I derived an equation based of the following model: every time an extra processor is added, it will work on a fraction of the tree that will become bigger by x % (on average) than the original processor would take in a non-parallel fashion. For instance, when CPU=1 and the tree is 1000 knodes, if x = 50, increasing number of CPUs will make the tree grow up to 1500 knodes. I addition there is a fraction of the time that is not parallel (Amdahl's serial fraction).

The equation is

Code: Select all

``````               ts + tp
S = ----------------------------------
ts + tp * &#91;&#40;n-1&#41; * f + 1&#93; / n^2``````
n = number of cpus
S = speed up
ts = serial time
tp = parallelizable time
f = inefficiency factor added for each processor, for instance if x = 50% from the example above then f = 1.5

With this equation and ts/tp = 0.001 (which is 0.1%) and f = 1.3, the type of behavior is curved, but very closed to linear up to 64 cpus.

This indicates that, as I suspected when I asked the question, the inefficiency is not dominated by Amdahl's law until a very very high numbers of processors is used. Possibly, the behavior of DTS is close to Amdalh's because f may be smaller.

If you have current data on Crafty it would be interesting to see if the equation above, thought simplistic, fits it.

This is very relevant because it may mean that there is hope that smp programs could scale better than we think at really high number of cpus. Otherwise, if Amdahl dominates early on, the whole thing is doom to failure. See below that if the serial time increases, the efficiency degrades rapidly. If "f" increases, the efficiency is killed at the beginning, but it scales linearly. If ts increases, the efficiency is not hurt at low n, but destroyed at higher n.

Miguel

Code: Select all

``````ts = 0.1
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.70
4	 3.16
8	 6.09
16	11.92
32	23.36
64	45.48

ts = 0.5
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.70
4	 3.13
8	 5.97
16	11.42
32	21.45
64	38.64

ts = 2
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.68
4	 3.04
8 	5.57
16 	9.90
32	16.49
64	24.87``````

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
OK, a few quick tests:

Code: Select all

``````Ncpus      S/U
1             1
2             1.95
4             3,68
8             6.67
16           11.2
``````
Not bad, assuming you are talking about the DTS algorithm and data from C90 Cray Blitz. It is too high early for current YBW in Crafty, although it drops into line around 16.
Of course, it should have a different constant for YBW, not 35.

however, for simplicity, my original formula is easier to calculate, and works better for the above numbers... However, this is a non-linear function being approximated by (now) a second degree polynomial, so it should be able to produce more accuracy with tweaking for the current implementation. And a 3rd degree would be better. Etc...
As I said, what I show is not an approximation. It is a hyperbolic equation that models the Amdah'ls hypothesis.

Miguel
In chess, I think the true serial code is probably on the order of 0.1% at worst. The problem is the 92% fail high number. That means that If you satisfy the YBWC (Young Brothers Wait Criterion for those not familiar with the acronym) then 92% of the time you can be sure this is not a CUT node since you just searched that first move without a fail high. But 8% of the time, you do a split and waste some time. This is the killer provides most of the loss. If you look at some of the data I have posted previously, the number of nodes is noticable.
For example, two different positions. Look at the node counts. The first two lines are from the same position, with 1 cpu and then with 8. both to the _same_ fixed depth. The first position actually has fewer nodes in the parallel search. lucky hash information, or lucky out-of-order searching to prune away parts of the tree that normally would have to be searched. For an example, suppose you have a position where at the root, move one appears to be best, but the real best move is down at slot 9. Move 2 is very complicated and produces a big tree. Then you work your way down to 3, 4, and finally get to 9, fail high, and then move. With the parallel search, we search the first move, then the next 8 in parallel. If the mate is easy to find, we get that fail high back way before the 2nd complex move is searched, and after finding the mate move, we have to finish the 2nd move to make sure all are done, and we get a quick cutoff. The second position adds a few million extra nodes, which guarantees no optimal speedup can happen:

Code: Select all

``````log.001&#58;              time=1&#58;33  mat=0  n=282932478  fh=87%  nps=3.0M
log.002&#58;              time=14.58  mat=0  n=201479975  fh=88%  nps=13.8M

log.001&#58;              time=28.34  mat=1  n=92890080  fh=88%  nps=3.3M
log.002&#58;              time=6.12  mat=1  n=99881486  fh=88%  nps=16.3M

``````
SIgnificant variability. One tree is smaller than normal and can potentially produce a super-linear speedup (> 8 although the case above s only 6x). The other tree is larger and produces a speedup a little less than 5. Note that both are short searches for a smp search, and deeper would probably improve the speedups...
I derived an equation based of the following model: every time an extra processor is added, it will work on a fraction of the tree that will become bigger by x % (on average) than the original processor would take in a non-parallel fashion. For instance, when CPU=1 and the tree is 1000 knodes, if x = 50, increasing number of CPUs will make the tree grow up to 1500 knodes. I addition there is a fraction of the time that is not parallel (Amdahl's serial fraction).

The equation is

Code: Select all

``````               ts + tp
S = ----------------------------------
ts + tp * &#91;&#40;n-1&#41; * f + 1&#93; / n^2``````
n = number of cpus
S = speed up
ts = serial time
tp = parallelizable time
f = inefficiency factor added for each processor, for instance if x = 50% from the example above then f = 1.5

With this equation and ts/tp = 0.001 (which is 0.1%) and f = 1.3, the type of behavior is curved, but very closed to linear up to 64 cpus.

This indicates that, as I suspected when I asked the question, the inefficiency is not dominated by Amdahl's law until a very very high numbers of processors is used. Possibly, the behavior of DTS is close to Amdalh's because f may be smaller.

If you have current data on Crafty it would be interesting to see if the equation above, thought simplistic, fits it.

This is very relevant because it may mean that there is hope that smp programs could scale better than we think at really high number of cpus. Otherwise, if Amdahl dominates early on, the whole thing is doom to failure. See below that if the serial time increases, the efficiency degrades rapidly. If "f" increases, the efficiency is killed at the beginning, but it scales linearly. If ts increases, the efficiency is not hurt at low n, but destroyed at higher n.

Miguel

Code: Select all

``````ts = 0.1
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.70
4	 3.16
8	 6.09
16	11.92
32	23.36
64	45.48

ts = 0.5
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.70
4	 3.13
8	 5.97
16	11.42
32	21.45
64	38.64

ts = 2
tp = 100
f = 1.3

n       S
1	 1.00
2	 1.68
4	 3.04
8 	5.57
16 	9.90
32	16.49
64	24.87``````
I have the test running on our E5345 (one node). Running 1, 2, 4, and 8 threads. For 2, 4 and 8 I am running each test twice. Will add more later... Will post the data later in the week, right now stuck on finishing up the new hash path stuff...

Dann Corbit
Posts: 12232
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

### Re: SMP speed up

There is a study that shows when processors scale to huge counts, that software transactions operate more efficiently than traditional synchronization metaphors like critical sections and semaphores.
See, for instance:
"Cache Sensitive Software Transactional" Memory by Robert Ennals

Now, I do not know if the general idea translates well to chess programs but it may have utility when the number of CPU threads skyrockets if any gating techniques are used at all.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: SMP speed up

Dann Corbit wrote:There is a study that shows when processors scale to huge counts, that software transactions operate more efficiently than traditional synchronization metaphors like critical sections and semaphores.
See, for instance:
"Cache Sensitive Software Transactional" Memory by Robert Ennals

Now, I do not know if the general idea translates well to chess programs but it may have utility when the number of CPU threads skyrockets if any gating techniques are used at all.
The problem is, the number of threads is not really going to "skyrocket". 64 CPU machines can be found. Going beyond that takes some money, because you have to do as Cray did and design one hell of a memory sub-system to keep the CPUs happy. Cray used a full crossbar for processors and memory banks. The PC platform uses a simple bus. Or a double-ended bus for AMD. Neither of which scales to huge numbers of processors very effectively.

Clusters are a completely different animal of course, but the programming methodology is completely different as well. Only idiots have race conditions on clusters, as one example. While they are commonplace on shared-memory multiprocessor boxes.