Real Speedup due to core doubling etc

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote: And after all your hard work you come with:

TTS (Time-To-Strength or Effective Speedup) = 1 + (N-1) * 0.7

So where is your "As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper."? Your crap "formula" has neither the appropriate asymptotic behavior nor the depth dependency for specific engines.

Then, TTD clearly is NOT a universal measure of TTS, however buggy the "crocks" you consider some 3 of the top 6 programs to be. Buggy, too selective, whatever: they seem to scale reasonably with the number of cores. You simply seem to have a mental rupture: if it cannot be, then it cannot be, and although I see a bunch of them, it cannot be because it cannot be, nobody understands parallel search, and that's it.
"where is the depth part?" I simply tested deeply enough to get good numbers. How simple was that? SMP performance doesn't continue to improve for each ply you go deeper, it just improves until you get beyond the "noise" point.

As far as the rest goes, I simply understand the alpha/beta model. And I don't accept handwaving nonsensical explanations. And this "widening" is EXACTLY that. Nothing more, nothing less. Just like Vincent's old super-linear speedup nonsense. It happens, but NOT with any regularity. Any good book explains why.
1) The dependence on depth is pretty significant even at high depths, so there is no such thing as your "deeply enough". There were posts here showing that. And it's not "noise", the speed-up is a monotonically increasing function of depth.

2) The results presented here (too lazy to dig them up) showed nothing like linear behavior for top engines up to 32 cores. It was clearly some power function like 1+(N-1)^a or even logarithmic, a+b*log(N). Maybe only DTS would scale this linearly to 16 cores, although it surely won't work linearly to 2,000 cores, and no top engine uses DTS.

3) Isn't your linear crap formula 1 + (N-1) * 0.7 showing some weird behavior, namely that with each doubling the scaling improves? Going from 2 to 4 cores the doubling gives 1.82 in TTS; going from 16 to 32 cores it gives 1.97. That's unreasonable: everybody knows (only you seem to miss it) that today's top engines don't gain much more than 1.4 (depending on depth too) in TTS going from 16 to 32 cores at reasonable depths.

4) So, at least you admit that widening happens, maybe not with regularity, but with both of the top 2 engines, and with 3 of the top 6 engines. And for them TTD is the wrong substitute for the real TTS.
Don't know what you base the "no top engine uses DTS" on. Several HAVE reported to use DTS. Notably Zappa, but there have been others discussed here.

NOBODY has any sort of formula for 2000 cores, so you might as well get off that bandwagon; it is stuck in the mud forever. There won't EVER be a shared-memory machine with 2000 cores unless you venture into the old Connection Machine architecture, which did NOT scale very well for any parallel algorithm.

In short, my simple linear approximation is more than accurate enough for its intended purpose, namely to estimate the effective speedup / gain produced by some number of processors over a single one. It is not based on a few positions. It is based on hundreds of positions, run dozens to hundreds of times, to produce good averages.

I have NEVER said "widening does not happen." I have certainly said "it does NOT happen INTENTIONALLY". Because if widening improves a parallel search it will ALSO improve a serial search just as much. That is a trivial point once you think about it. Ergo, "widening" is an accidental thing that simply points out a defect in the basic search....
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
[bob's post, quoted in full just above, snipped]
Zappa came 4th at the 2006 WCCC on 512 Itanium2 processors, behind the likes of Shredder and Junior, which used 4 or 8 cores. If Zappa was using DTS, then it performed poorly: by your linear approximation, it should have been 350 Elo points above the competition. Anyway, what I was saying is that no present-day top engine uses DTS.

Your linear approximation gives roughly the following series of Effective Speedup (TTS) per additional doubling (to 32 cores):
1.7, 1.8, 1.9, 1.95, 1.97
This is nonsensical: the scaling improves with each additional doubling.

A power-law scaling would give
1.7, 1.7, 1.7, 1.7, 1.7

What Vasik reported a while ago on the Rybka Forum, and what 32-core tests showed here recently, was roughly:
1.8, 1.7, 1.6, 1.5, 1.4
This is consistent with a diminishing TTS gain for each further doubling. It is also consistent with some sort of logarithmic scaling of the form 1+a*log(N), disregarding the depth dependency. And, again, it is consistent with what 512 or 2,000+ core machines have achieved.
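
(For reference, a minimal Python sketch that just recomputes those per-doubling ratios. The pure power-law form N^a and its exponent are illustrative assumptions, chosen only so that the 1 -> 2 core step matches 1.7; they are not a claim about any particular engine.)

Code:
import math

def linear_tts(n, slope=0.7):
    # Bob's linear approximation: speedup = 1 + (N-1) * 0.7
    return 1 + (n - 1) * slope

def power_tts(n, a=math.log(1.7, 2)):
    # Pure power-law form N**a, exponent chosen so the 1 -> 2 core step is 1.7 (assumption)
    return n ** a

cores = [1, 2, 4, 8, 16, 32]
for name, model in (("linear", linear_tts), ("power ", power_tts)):
    ratios = [round(model(b) / model(a), 2) for a, b in zip(cores, cores[1:])]
    print(name, ratios)

# linear [1.7, 1.82, 1.9, 1.95, 1.97]   ratio per doubling creeps toward 2.0
# power  [1.7, 1.7, 1.7, 1.7, 1.7]      ratio per doubling stays constant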
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Real Speedup due to core doubling etc

Post by Adam Hair »

bob wrote: "where is the depth part?" I simply tested deeply enough to get good numbers. How simple was that? SMP performance doesn't continue to improve for each ply you go deeper, it just improves until you get beyond the "noise" point.
Actually, it appears that SMP performance in this context does improve as you go deeper. Here is a graph of how Gaviota's speed up from 1 to 4 cores changes as the depth increases:

[graph: Gaviota's 1-to-4-core speedup as a function of search depth]

We think that the speed up from 1 to n cores asymptotically approaches n as the depth increases. The discussion can be found in this thread: http://talkchess.com/forum/viewtopic.php?t=48780
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Adam Hair wrote:
bob wrote: "where is the depth part?" I simply tested deeply enough to get good numbers. How simple was that? SMP performance doesn't continue to improve for each ply you go deeper, it just improves until you get beyond the "noise" point.
Actually, it appears that SMP performance in this context does improve as you go deeper. Here is a graph of how Gaviota's speed up from 1 to 4 cores changes as the depth increases:

[graph: Gaviota's 1-to-4-core speedup as a function of search depth]

We think that the speed up from 1 to n cores asymptotically approaches n as the depth increases. The discussion can be found in this thread: http://talkchess.com/forum/viewtopic.php?t=48780
That is minimal depth compared to my testing. My approach is to take an N-processor box and search a bunch of positions for 3 minutes using all cores. Then I start backing off the number of processors, but force the search to go to the same depth the max-core test reached. By the time this gets back to 1 core, we are talking maybe an hour per position.

I do not believe there is any improvement related to depth beyond that point. I don't test anything to depth 15 or 16, today. In fact, most tests I run are generally to depth 30.

I agree that at 5-10 plies depth makes a difference, just not when you get to that long test I described above, which is how I computed my speedup approximation every time I have tested/revised it...

I do not think this approaches N as depth increases. There is a very definite search overhead that NEVER goes away, because move ordering cannot be perfect. I'll try to run a detailed test on this myself since I can run multiple positions on the cluster using 12 cores. The speedup will NEVER get anywhere near 12x, however. There is simply no chance whatsoever, unless you can somehow provide perfect move ordering so that no split mistakes are made...
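
(A minimal sketch of the bookkeeping behind this kind of fixed-depth speedup measurement, as described above: every core count is run to the depth the max-core search reached, and the times are averaged over positions. The position names and timings below are placeholders, not measured data.)

Code:
# Time-to-depth speedup: each core count is forced to the depth reached by the
# max-core run, and its wall time is compared with the 1-core time for that
# same depth.  The times below are placeholders, not real measurements.
measurements = {
    # position_id: {cores: seconds to reach the common depth}
    "pos001": {1: 3600.0, 2: 2100.0, 4: 1150.0, 8: 620.0},
    "pos002": {1: 3300.0, 2: 1950.0, 4: 1060.0, 8: 590.0},
}

def average_speedups(measurements):
    core_counts = sorted({c for runs in measurements.values() for c in runs if c != 1})
    result = {}
    for cores in core_counts:
        ratios = [runs[1] / runs[cores] for runs in measurements.values()]
        result[cores] = round(sum(ratios) / len(ratios), 2)
    return result

print(average_speedups(measurements))
# -> {2: 1.7, 4: 3.12, 8: 5.7} for these placeholder times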
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
[nested quotes from the earlier exchange snipped]
Zappa came 4th at the 2006 WCCC on 512 Itanium2 processors, behind the likes of Shredder and Junior, which used 4 or 8 cores. If Zappa was using DTS, then it performed poorly: by your linear approximation, it should have been 350 Elo points above the competition. Anyway, what I was saying is that no present-day top engine uses DTS.

Your linear approximation gives roughly the following series of Effective Speedup (TTS) per additional doubling (to 32 cores):
1.7, 1.8, 1.9, 1.95, 1.97
This is nonsensical: the scaling improves with each additional doubling.

A power-law scaling would give
1.7, 1.7, 1.7, 1.7, 1.7

What Vasik reported a while ago on the Rybka Forum, and what 32-core tests showed here recently, was roughly:
1.8, 1.7, 1.6, 1.5, 1.4
This is consistent with a diminishing TTS gain for each further doubling. It is also consistent with some sort of logarithmic scaling of the form 1+a*log(N), disregarding the depth dependency. And, again, it is consistent with what 512 or 2,000+ core machines have achieved.
My first suggestion is to PLEASE study computer architecture a bit. Do you have ANY idea what that Itanium box looks like in terms of architecture? NUMA. With MASSIVE memory latency issues. Comparing that to today's 12/16/24-core boxes is a waste of time.

You fail to mention that although the per-doubling scaling keeps improving out there, it is WAY below the optimal speedup.

Better numbers:

cores   speedup   speedup over previous test
  2       1.7x         -
  4       3.1x        1.82
  8       5.9x        1.90
 16      11.5x        1.94

My formula is based on a lot of search-overhead measurement, which may or may not still be accurate today; I am going to test again to reconfirm the numbers. But over hundreds of positions, searched dozens of times and averaged together, adding a second CPU causes the search to visit nodes the serial search would avoid via alpha/beta, because of incorrect internal bounds caused by sub-optimal move ordering. In general, if one CPU searches 100M nodes, two CPUs should search 50M nodes each for a perfect speedup. But in reality each CPU wastes about 15% of its time because it adds extra nodes that the serial search would not search. So instead of searching 100M nodes with 2 cpus, we search about 115M.

The math goes like this.

1 cpu = 100M nodes at 1M nodes per second = 100 seconds.

2 cpus make things more interesting. The typical speedup is around 1.7x-1.8x; I chose the pessimistic number to under-estimate rather than over-estimate. Since both cpus stay busy the whole time, yet the speedup is not 2.0, something is going on. That something is that the search space is bigger. For a perfect speedup the two would search 100M nodes total, but in reality they end up searching around 117M nodes. At 2M nodes per second, that gives a search time of 58.5s and a speedup of about 1.7x.

4 cpus continue that trend. The tree continues to grow each time another cpu is added, so that even if they suffer no NPS reduction due to cache/memory conflicts, the growth in the search space eats up part of the NPS improvement from the additional cores.
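
(A small sketch that redoes this arithmetic. The 100M-node baseline and 1M NPS per core are the illustrative figures from the example above, and perfect NPS scaling is assumed; the second half only shows what total node counts the linear approximation would imply under those same assumptions.)

Code:
# A fixed-depth search the serial engine finishes in 100M nodes at 1M NPS.
# Adding cores multiplies NPS by N but also enlarges the tree (search overhead),
# so the speedup is less than N.
BASE_NODES = 100e6      # nodes for the 1-cpu search (illustrative)
NPS_PER_CORE = 1e6      # nodes per second per core (illustrative, perfect NPS scaling assumed)

def speedup_from_nodes(cores, total_nodes):
    serial_time = BASE_NODES / NPS_PER_CORE
    parallel_time = total_nodes / (cores * NPS_PER_CORE)
    return serial_time / parallel_time

print(round(speedup_from_nodes(2, 117e6), 2))   # -> 1.71, the ~1.7x quoted above

# Conversely, the total node counts that speedup = 1 + (N-1)*0.7 would imply,
# still assuming NPS scales perfectly with core count:
for n in (2, 4, 8, 16):
    implied_nodes = BASE_NODES * n / (1 + (n - 1) * 0.7)
    print(n, f"{implied_nodes / 1e6:.0f}M nodes")   # 118M, 129M, 136M, 139M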

You want to argue the statistical point that with more processors the scaling approaches optimal. But that's a twisted view: optimal = number of processors.

I don't see anything that suggests my formula is broken; it simply says that as the number of processors grows, the actual workload (overhead) grows at a constant rate, which seems perfectly reasonable for a chess engine. That single CPU (the one you had before adding the others that bring in overhead) is slowly averaged away, because it is the one perfect searcher while all the others add overhead. Eventually the overhead becomes the limiting factor for speedup each time. But it is STILL well below the optimal speedup.

You are simply thinking about the numbers wrong, and it really does NOT make much sense to compute the speedup and then compute the ratio of each value to the previous one. Since my linear approximation is a straight line, those ratios are simply going to follow that line as N increases. I don't see anything theoretically pro or con the point of view that for larger numbers of processors the overhead continues to increase at a constant rate. What do you see that suggests otherwise? I wrote a paper for the journal of parallel computing about 25-30 years ago addressing this "overhead", and the math suggests that a constant increase as you add processors is perfectly reasonable, assuming nothing happens to cause move ordering to break down (which would certainly increase overhead).

Think about it. If you plot

1, 1.7, 2.4, 3.1, 3.8, ... you'll see that the 1 at the front is the ONLY whole number in the formula. I'd think a statistician would recognize that 1 + (N-1) * 0.7 approaches N * 0.7 as N gets large. The only issue is whether that 0.7 changes. At least for reasonable N (1-16 for certain; some testing on the Cray C90/32 (32 cpus) stayed right in line with that) there's nothing to suddenly increase overhead, until something changes the tree in some important way. It becomes harder to use more CPUs as their number grows, because if you have fewer than 32 moves at a ply, 32 cpus at that ply are not going to help. But you CAN split elsewhere, and that might well tend to decrease that 0.7 number by increasing overhead. Still, for the hardware I have tested on (1-16 HEAVILY, 1-32 less rigorously, and 1-64 minimally) that simple approximation does quite well: somewhat pessimistic on the low end, "perhaps" somewhat optimistic beyond 32, and I would not make any claims beyond 32 since the 64-core testing I have done was minimal.

I suspect current hardware is not as good as what I used years ago, because when you put 6 or 8 cores on a single chip you build in a bottleneck as well: the L3 cache can't keep up with all of those cores, nor can a single memory bus handle the requests from all of them. On the first 8-cpu Opteron box I ran on, my NPS scaling was perfect. It has dropped off over time as the number of cores per chip has climbed. That being said, the 3.1x for 4 cores seems somewhat pessimistic today; the last time I tested 1, 2 and 4, the number came out north of 3.3x.

But my formula was never intended to provide an exact (or nearly exact) speedup. In fact, no such number exists, because it varies with hardware, chess position, memory used, etc. Giving a complex polynomial or exponential formula is not exactly useful when the question most people ask is "about how much faster will 8 cores be than 4?" The linear approximation is simple to compute in your head, and it gives a number more than good enough when "speedup" is not a constant under any circumstances anyway. It will never give the exact number, ever. It just gives a reasonable first-order approximation, which is all most want.

If you want exact numbers, run at least enough positions to hit all phases of the game, and run them enough times that the standard deviation is reasonable. For me, that is about 100 positions; with 2 cpus I run 'em all 4 times, with 4 cpus 8 times, and with 8 cpus 16 times. I run to a fixed depth, chosen so that the search takes at LEAST 3 minutes with the max number of processors. That averages away the rather significant non-determinism one sees in parallel speedup on alpha/beta trees.

If you want to estimate the computational cost, for 8 cpus that is

16 runs * 100 positions * 3 mins_per_position = 3+ days.

Of course, you need to run the others too. The 1-cpu test, assuming a 6x speedup for 8 cores, turns into

1 run * 100 positions * 3 * 6 minutes = 1+ day. The 2- and 4-cpu tests will burn another 2-3 days. One week of nonstop testing. And then you get a pretty accurate "average speedup" that says absolutely nothing about any specific position.

BTW, another flaw I saw in your analysis was in that 32-core test you mentioned.

The proper test method:

1. run a single cpu test to establish NPS.

2. run a max cpu test and take NPS.

The NPS from 2 divided by the NPS from 1 gives you the maximum possible speedup before you even START to factor in my "formula". If a 12-core box won't provide 12x the NPS, then it is impossible for it to provide a 12x speedup even if the software algorithm were perfect.

Other than the 32-CPU Cray (which has NO memory bottlenecks, BTW), I have not seen any recent machine where a program scales perfectly. More cores and more local/shared caches add their own overhead to the equation, outside the software being tested.

On my 2 x 6-core cluster, a single core runs Crafty at 4.9M nodes per second (this is 4-year-old hardware, not particularly fast). On this particular position, it settles into the 4.9M number after 10 seconds or so and sticks there beyond 2 minutes. For 12 cores, Crafty settles right at 46M nodes per second, same position. That is an effective scaling of 9.38x. My parallel speedup on this box can never exceed 9.38x, even if the parallel search is perfect and has no overhead (impossible, obviously). My formula is off here, unless you correct for the above problem: instead of calling this a 12-core box, call it a 9.4-core box, and suddenly it is right back in the right ballpark again...
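
(A tiny sketch of that correction, using the 4.9M and 46M NPS figures quoted above. Treating the NPS ratio as an "effective core count" and feeding it back into the linear approximation is the adjustment described in the paragraph; the exact numbers are illustrative.)

Code:
# NPS scaling puts a hard ceiling on speedup: if 12 cores only deliver ~9.4x
# the single-core NPS, no parallel algorithm can exceed ~9.4x on this box.
single_core_nps = 4.9e6
twelve_core_nps = 46e6

effective_cores = twelve_core_nps / single_core_nps      # ~9.39
print(round(effective_cores, 2))

# Feeding the "effective" core count back into the linear approximation:
estimated_speedup = 1 + (effective_cores - 1) * 0.7
print(round(estimated_speedup, 2))                       # ~6.87, versus 1 + 11*0.7 = 8.7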
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
[bob's reply, quoted in full from the post just above, snipped]
Let's start from your crap formula:

TTS (Time-To-Strength or Effective Speedup) = 1 + (N-1) * 0.7
by Bob Hyatt.

And let's begin with NPS speedup as the limiting value for TTS, on modern hardware.
Andreas Strangmüller with a Dual AMD Opteron 6376 (2 x 16 Cores)
http://www.talkchess.com/forum/viewtopic.php?t=52340

[graph: NPS versus number of threads for several engines on the Dual AMD Opteron 6376]

Increase in NPS from 16 to 32 threads:

Komodo TCECr      (11.94 -> 20.60 = +73%)
Zappa Mexico II   (11.37 -> 16.46 = +45%)
Stockfish 140513  ( 9.79 -> 14.21 = +45%)
Stockfish DD      ( 8.01 -> 10.18 = +27%)
Houdini 4 Pro     (11.34 -> 13.48 = +19%)

Your formula for TTS (not merely for NPS, which does not even count the additional search overhead) gives a 97% increase from 16 to 32 cores. This formula is NOT compatible with any available NPS data, and NPS is the limiting value of TTS.
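
(A quick recomputation of those percentages and of the 97% figure. The NPS pairs are taken verbatim from the list above, so the units cancel; the formula ratio comes from the linear approximation being discussed.)

Code:
# Percentage NPS increase from 16 to 32 threads, recomputed from the pairs above.
nps_16_to_32 = {
    "Komodo TCECr":     (11.94, 20.60),
    "Zappa Mexico II":  (11.37, 16.46),
    "Stockfish 140513": ( 9.79, 14.21),
    "Stockfish DD":     ( 8.01, 10.18),
    "Houdini 4 Pro":    (11.34, 13.48),
}
for engine, (nps16, nps32) in nps_16_to_32.items():
    print(f"{engine}: +{(nps32 / nps16 - 1) * 100:.0f}%")   # +73%, +45%, +45%, +27%, +19%

# Versus the 16 -> 32 core ratio implied by speedup = 1 + (N-1)*0.7:
formula_ratio = (1 + 31 * 0.7) / (1 + 15 * 0.7)
print(f"formula: +{(formula_ratio - 1) * 100:.0f}%")        # +97%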

Now let's talk about your linear formula for TTS.
Andreas Strangmüller with a Dual AMD Opteron 6376 (2 x 16 Cores)
http://www.talkchess.com/forum/viewtopic.php?t=52219

[graph: playing strength versus number of cores, from Andreas's game tests]

That's already a TTS graph: Andreas measured strength in many games, not some dubious TTD or the like.
Your imagined "linearity" in TTS is nowhere to be seen here.

I don't understand how you can neglect the empirical data and come up with weird diminishing-additional-overhead arguments, when your formula gives a 97% increase in TTS (not even NPS) going from 16 to 32 cores, together with your imagined linearity.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

1. Funny you say +I+ ignore empirical data when that is ALL +I+ use.

How did I derive my linear approximation? By running hundreds of tests. To try to discount MY linear approximation, you use OTHER programs than the one I specifically tested. And you question MY science?

My numbers are what they are. Others verified them here a few years ago when Vincent raised a similar issue. Martin Fierz (I believe that's how his last name is spelled) took the raw data and showed that, through 8 processors, my approximation was actually LOW for Crafty.

2. You have NEVER seen me claim my linear approximation is accurate to 2 decimal places or anything else. All I have ever called it is a "linear approximation to a non-linear set of data points that applies only to Crafty." So all your data is meaningless for Crafty.

3. I do not claim nor show any "diminishing overhead". In fact, my formula assumes each processor adds a constant overhead, not one that diminishes as more are added. No idea what you are talking about here. For large N, my formula basically reduces to 0.7 * N. Whether that works beyond 16 for me today is unknown. I have not claimed that it does.

So, how about STOPPING with the attempts to put words into my mouth? Again, here is my statement, which has not changed over recent years:

"For Crafty, a rough approximation of the speedup is given by the simple linear formula:

speedup = 1 + (N-1) * 0.7

This formula has been computed for 2, 4, 8 and 16 processors. Whether it holds beyond 16 has not been studied with any accuracy due to lack of availability of such hardware. But for up to 16, it is a reasonable estimate."

You can talk about 500 processors and such all you want. Those architectures are NOT the same as the classic SMP shared-memory machines we have in the PC world today. As for the NPS scaling you mentioned, I'm not that surprised: AMD cache coherency has lagged significantly behind Intel over the past 4-5 years, and their machines simply don't scale as well in the tests I have personally run.

Maybe one day you will actually show data that is relevant to this discussion, not just random data from other programs.