SMP speed up

bob · Post by **bob** » Wed Sep 15, 2010 6:12 pm

rbarreira wrote:
Dann Corbit wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 seems to answer very neatly for the 11.5 predicted by the forumla.
You were expecting an exact answer agreement? Of course, that would be an utterly ridiculous expectation (in fact we rightly view such data with mathematical suspicion).

If, for instance, a student dropped a penny and measured the force of gravity using the mass of the penny and the mass of the earth and came up with 6.673 x 10^(-11) m^3/(kg*s^2) it would seem a little far-fetched.
What part of my post made you think that I was complaining about 11.1 not being equal to 11.5?

Did you actually read/understand my post and the fact that the formula is predicing something different from DTS? According to Robert, the formula is supposed to predict Crafty's scaling, which is supposed to be worse than DTS scaling.

The fact that the formula predicts a scaling which is better than DTS (or about equal, if you want to ignore the small difference) is what I find strange given that Robert always says that DTS is quite a lot better than the other parallel algorithms.

Does DTS only pay off with more than 16 CPUs? Does the formula wrongly predict Crafty's scaling above 8 CPUs? Or maybe Crafty's parallel algorithm is already as good as DTS? All are possibilities, I'm just trying to find out the reason for the discrepancy.

Above 16, DTS is unknown for me. Some have used it on more recent hardware, but in every case, I am not aware of anyone doing everything I did in CB (one problem is memory bandwidth. C90 had tons, PC has very little in comparison, so that limits what you are willing to access and use to make split decisions) The fall-off from 4 to 16 with DTS was significant. And we really had little access to a dedicated C90 to test and tune and tweak. Can't use a "shared machine" since parallel performance is meaningless when others run at the same time.

Crafty has been better tuned and tested, because 8 cpu machines are everywhere. We have a cluster of 70 dual-quadcore boxes right now, and are in the process of buying either a bunch of dual-hexacore boxes or dual-octacore boxes, depending on delivery times. That will provide a ton of accurate 16-core data. It won't likely change the formula, since that is just a rough approximation to keep it simple. And if things get a bit worse than the formula predicts at 32-64, lowering the slope will make it less accurate at the more common core counts most will be using. It is just a rough estimate. When you get to 16, it could be off by 2 or 3 and still be useful.

I've never seen people lock on to a very rough approximation and complain that "Hey, this is a very rough approximation, it is off here, it is off here, it is off here..." Like I didn't already know it was "rough". But it does get you into the right ballpark, quickly and easily, which is why I reference it.

bob · Post by **bob** » Wed Sep 15, 2010 6:25 pm

Karlo Bala wrote:
michiguel wrote:This is a spin off of another thread.

I never gave a lot thought about it but the well know formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to the Amdahl's law, even if this applies to a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? still, it looks like it should either saturate quicker or the speed up with 2 cores should be higher than 1.7

Was this investigated?

Miguel
I think this is a realistic formula:

Speedup(1) = 1
Speedup(NCPUS) = 1.7*Speedup(NCPUS/2)

It looks wrong on the surface, because of the NCPUS/2. Each pair of cpus doesn't introduce the same overhead. What you have done is to create a formula, that is a rough approximation of my formula, which is a rough approximation of speedup to number of cores. And made it recursive to boot, so that the error compounds.

For 2, you get 1.7
for 4, you get 1.7*1.7 = 2.89 which is significantly low.
for 8 you get 1.7*1.7*1.7 = 4.9 which is lower still.

My formula is already low for 2 and 4, and is just a little low at 8. It might well be a little high at 16, and perhaps higher still at 32 and 64. But the data points are not on a straight line, and to use a straight line for approximation there is going to be error.

The problem with refining this is that if you want to test on a 64 cpu box, for example, you have to have a significant set of positions. And they need to run for at least a minute or two each, on 64 cpus. If the speedup is only 32x, that means each position has to run for an hour or so on a single cpu. If you want to get a really good measurement, you might run 1000 positions. And you run into trouble immediately once you think about it. 1000 minutes at 64 cpus is doable. Less than a day, but for simple math, lets round it up to a day (1440 minutes). And lets assume that at 64 we get 50% scaling and this scaling curve is a straight line, so that 32 cpus takes twice as long, 16 twice again, etc.

so 64 takes a day, 32 takes 2 days, 16 takes 4 days, 8 takes 8 days, 4 takes 16 days, 2 takes 32 days, and 1 takes 64 days. 127 days of test time where that machine can't be used for anything else? That was the killer on our C90 (16 cpu Cray) tests. We took positions that were searched for 4-5-6 minutes in a real game, and by the time we backed up to 1 cpu, we were burning hours. Times a bunch of positions. On a machine that sold for about 70 million bucks.

That's the problem with producing a lot of accurate data. For 1,2,4 and 8 I have done that kind of computation exhaustively because I have a ton of identical 8-core boxes (70 to be exact). But for 16, I have had to resort to Intel/AMD/others and if you go beyond the small form factor 4-socket machines, access is difficult and limited in time. Until we get one of our own here...

The gap between "desirable" and "practical" can be quite large once you start thinking about time requirements.

michiguel · Post by **michiguel** » Wed Sep 15, 2010 6:29 pm

rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.

The DTS data actually follows an Amdahl's behavior and close to a formula
speedup = n * (k + 1) / (k + n)
in which n is the number of cpus and k the fraction of the program that is not parallel. This type of hypebolic formulas are typical in science for saturation processes.

Wih k = 34 (1/34 of the program is not parallel), you get 1.9, 3.7, 6.7, and 11.2 for 2, 4, 8, and 16 processors respectively, where the reported data is 2, 3.7, 6.7 and 11.1. Very close.

Note that the reported data for DTS and 2 processors is way much better than 1.7!! it is 2.0. If it were 1.7, the degradation of efficiency should escalate much quicker with more cpus.

I think an approximation to the Amdahl behavior could be reached by multiplying "0.7" not each extra processor, but each extra "doubling of processors". Is it possible that the confusion come from this? (EDIT: Now I see that Karlo suggested this in mathematical form).

Miguel

rbarreira · Post by **rbarreira** » Wed Sep 15, 2010 7:03 pm

bob wrote:
Dann Corbit wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 seems to answer very neatly for the 11.5 predicted by the forumla.
You were expecting an exact answer agreement? Of course, that would be an utterly ridiculous expectation (in fact we rightly view such data with mathematical suspicion).
Forget that point. Remember, this is a linear approximation (emphasis on approximation) to non-linear data points, but it _must_ be accurate to 4 decimal places for some to be happy. For the purposes it was intended for, within 25% would be just fine. How fast using 4 cpus? 2.0-4.0 depending on the position. Occasionally not even 2.0. Average? 1.7-1.9 depending on positions? Lots of positions? Close to 1.8. Formula predicts 1.7. Less than 10% off? Not good enough, it seems. Never knew "linear approximations" had to have that kind of accuracy or I would never have even mentioned the rough guess...

If, for instance, a student dropped a penny and measured the force of gravity using the mass of the penny and the mass of the earth and came up with 6.673 x 10^(-11) m^3/(kg*s^2) it would seem a little far-fetched.

No one said it must be accurate to 4 decimal places. I know you like to use hyperbole, but no one here is stupid enough to read my post and think that I demand such accuracy. In any case, you already admitted that the error margin is something like 25%, in which case I'll just ignore the results of the formula for the purpose of comparison with DTS.

You just can't have your cake and eat it too, remember that. If you're going to throw out this formula during arguments (even arguments including the 64 CPU case in another thread), at least be prepared for people to be surprised that it has such a big error as 25% even on easier and more tested cases than the 64 CPU one.

In conclusion, this thread was created with a formula which doesn't produce any useful data. The speedup for 2 cores can be anywhere from 1.3 to 2 (average speedup, not the range over a set of positions), etc...

bob · Post by **bob** » Wed Sep 15, 2010 7:11 pm

michiguel wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
The DTS data actually follows an Amdahl's behavior and close to a formula
speedup = n * (k + 1) / (k + n)
in which n is the number of cpus and k the fraction of the program that is not parallel. This type of hypebolic formulas are typical in science for saturation processes.

Wih k = 34 (1/34 of the program is not parallel), you get 1.9, 3.7, 6.7, and 11.2 for 2, 4, 8, and 16 processors respectively, where the reported data is 2, 3.7, 6.7 and 11.1. Very close.

Note that the reported data for DTS and 2 processors is way much better than 1.7!! it is 2.0. If it were 1.7, the degradation of efficiency should escalate much quicker with more cpus.

I think an approximation to the Amdahl behavior could be reached by multiplying "0.7" not each extra processor, but each extra "doubling of processors". Is it possible that the confusion come from this? (EDIT: Now I see that Karlo suggested this in mathematical form).

Miguel

The problem is this. The data is a non-linear collection. Almost every parallel algorithm known shows a "bending down or flattening" as you add processors. Not all, and a couple of "grand challenge" applications produced speedups of over 1000 on a 1024 node machine at Sandia Labs a few years ago. But in general the curve takes off on a near 45-degree slope, following the perfect speedup curve well. But it begins to flatten out, dropping below the speedup curve.

If you fit a straight line thru that curve, part of the curve is going to lie above the linear approximation, part below. The question is, where do you want the most accuracy. For any sensible algorithm, you will reach a point where additional processors offer zero (I do tell my students that an algorithm that starts to actually worse is a horribly written program, because you can refuse to use too many if you want, inside the software). Now you have a problem. Going from 256 to 512 has a slope of zero, a horizontal line. DItto for 512 to 1024. So pick a point for max accuracy and run your linear approximation thru that. And look at how far _below_ the real curve you are at 2, 4, 8, ... If you want max accuracy at 65536, you are going to be _way_ off at 2, 4, 8...

My formula was tweaked using 1, 2, 4, 8 and 16. With a little 32 node data that suggested that it was "close". And with even less 64 node data still saying "close".

In call of duty, the online game, you have two common weapons of choice, a hand grenade and a rifle. If you are practicing tossing grenades, do you need to drop it into the target's pocket, or is within a 10' radius centered on him close enough? You want to hit him between the eyes? Why not use a more accurate tool like a sniper rifle?

I wanted this rough estimate to be easy to compute. The formula could be made more accurate. In fact, it could be made perfectly accurate with enough data points, and a sophisticated curve-fitting program. One could likely analyze the DTS data, to determine accurately how much overhead adding a single cpu produces. And one would probably find that as you add more processors, each new one adds even more overhead than the previous.

And then I would be getting complaints "Bob, this damned formula requires 5,000 measurements, and takes me 30 minutes to compute the speedup. Give me a break, I just want to know, roughly, if I am trying to choose between paying X dollars for 8 cores, or X + Y dollars for Y cores, about what kind of speed gain will I see?"

That's why my simple linear approximation is supposed to provide. It is really optimized from 1-2-4-8-16. The real data is not a straight line. The further out you take this, the farther it will be from that straight line approximation. One could retune that 0.7 multiplier to make it more accurate at 32 and 64, once significant data for those is available. But now you will be significantly under-estimating 2, 4, 8. The formula does what it was intended to do with respect to readily available hardware. Thru 16 it is quite good. at 32 and 64 it is still pretty good. But out there whether it is 45x or 40x or 50x is within a small margin of error overall, and it is an error that should be understood to be there by those familiar with parallel algorithms in general.

Some want to suggest that I am claiming a linear (sub-optimal but still linear) speedup. Would be nice, but nope. I'm just claiming that for 1 to 16 the numbers are pretty close, indeed, and for 32 and 64 they are close, but with even a 10% error you are talking about 6x at 64, which would be considered quite good for a linear fit to the usual parallel speedup curve.

bob · Post by **bob** » Wed Sep 15, 2010 7:26 pm

rbarreira wrote:
bob wrote:
Dann Corbit wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
11.1 seems to answer very neatly for the 11.5 predicted by the forumla.
You were expecting an exact answer agreement? Of course, that would be an utterly ridiculous expectation (in fact we rightly view such data with mathematical suspicion).
Forget that point. Remember, this is a linear approximation (emphasis on approximation) to non-linear data points, but it _must_ be accurate to 4 decimal places for some to be happy. For the purposes it was intended for, within 25% would be just fine. How fast using 4 cpus? 2.0-4.0 depending on the position. Occasionally not even 2.0. Average? 1.7-1.9 depending on positions? Lots of positions? Close to 1.8. Formula predicts 1.7. Less than 10% off? Not good enough, it seems. Never knew "linear approximations" had to have that kind of accuracy or I would never have even mentioned the rough guess...

If, for instance, a student dropped a penny and measured the force of gravity using the mass of the penny and the mass of the earth and came up with 6.673 x 10^(-11) m^3/(kg*s^2) it would seem a little far-fetched.
No one said it must be accurate to 4 decimal places. I know you like to use hyperbole, but no one here is stupid enough to read my post and think that I demand such accuracy. In any case, you already admitted that the error margin is something like 25%, in which case I'll just ignore the results of the formula for the purpose of comparison with DTS.

Note 1. I did not "admit that it is 25% off". I said that clearly it can't be accurate up and down the range since it is a linear fit to non-linear data. I don't have enough data to give an error margin. I know it is low at 2 and 4 and 8. And that was computed independently a few years ago by someone other than myself. Going to 1.8 gives a better fit at 2, pretty good at 4, somewhat optimistic at 8, and the optimism increases since this is known to be a curve that starts off at about a 45 degree slope and at some point, yet to be determined, becomes a line with 0 slope (no additional processors will help). So for the linear fit, where do you want the most accuracy. Way out at 65536 where the slope has been zero for the last 4 doublings? Or near the 2-4-8-16 that are far more prevalent today? If, at some point, the common choices are 8, 16, 32 and 64, I'd probably change that .7 to something smaller, knowing that increases error on the low end but reduces it farther out. At the moment, it is about as good as it can be. And running 64 thru that formula and then claiming that is impossible is simply a misuse of the formula. I believe the people doing this know that to be true, and just want to create a lengthy thread for no good reason, other than to argue.

You just can't have your cake and eat it too, remember that. If you're going to throw out this formula during arguments (even arguments including the 64 CPU case in another thread), at least be prepared for people to be surprised that it has such a big error as 25% even on easier and more tested cases than the 64 CPU one.

I have clearly stated, _many times_ "A rough approximation is." A high-school senior will recognize that y = M * X + c is a straight line. Anyone that knows anything about parallel algorithms knows that a speedup curve is never a straight line. Most algorithms reach a point of no return, where additional cpus add nothing. A very few are near-optimal for large numbers of processors (1024 at least). So, the logical person would think "OK, linear approximation to a non-linear curve, obviously there is error somewhere. And the word "approximation" should be a further hint.

But now we are having arguments about 32 and 64, where the people doing the arguing should either (a) know about the linear approximation and non-linear data error; or (b) should not be posting when they don't have grasp of the basic problem.

In conclusion, this thread was created with a formula which doesn't produce any useful data. The speedup for 2 cores can be anywhere from 1.3 to 2 (average speedup, not the range over a set of positions), etc...

Don't know where you get that from. I _know_ very accurate numbers for 1, 2, 4 and 8, as previously mentioned. And now have some pretty good numbers for 16, as I also mentioned. This straight line is about as good as it gets.

If it is "useless" to you, ignore it. It gives a reasonable approximation, however, which is all it was intended to do. One doesn't fit a straight line to non-linear data and expect high accuracy. Or at least rational people don't. But most want a straight line estimate because it is quite easy to calculate. If you want to recompute the 1-2-4-8 speedups again, to see how well the formula does, I think I have saved that quad dual-core opteron data that I had posted a few years back. It is a bunch of positions, run a bunch of times, raw crafty log files. It will show just how good that approximation is for 1-8, at least...

michiguel · Post by **michiguel** » Wed Sep 15, 2010 8:31 pm

bob wrote:
michiguel wrote:
rbarreira wrote:In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.

So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?

To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
The DTS data actually follows an Amdahl's behavior and close to a formula
speedup = n * (k + 1) / (k + n)
in which n is the number of cpus and k the fraction of the program that is not parallel. This type of hypebolic formulas are typical in science for saturation processes.

Wih k = 34 (1/34 of the program is not parallel), you get 1.9, 3.7, 6.7, and 11.2 for 2, 4, 8, and 16 processors respectively, where the reported data is 2, 3.7, 6.7 and 11.1. Very close.

Note that the reported data for DTS and 2 processors is way much better than 1.7!! it is 2.0. If it were 1.7, the degradation of efficiency should escalate much quicker with more cpus.

I think an approximation to the Amdahl behavior could be reached by multiplying "0.7" not each extra processor, but each extra "doubling of processors". Is it possible that the confusion come from this? (EDIT: Now I see that Karlo suggested this in mathematical form).

Miguel
The problem is this. The data is a non-linear collection. Almost every parallel algorithm known shows a "bending down or flattening" as you add processors. Not all, and a couple of "grand challenge" applications produced speedups of over 1000 on a 1024 node machine at Sandia Labs a few years ago. But in general the curve takes off on a near 45-degree slope, following the perfect speedup curve well. But it begins to flatten out, dropping below the speedup curve.

If you fit a straight line thru that curve, part of the curve is going to lie above the linear approximation, part below. The question is, where do you want the most accuracy. For any sensible algorithm, you will reach a point where additional processors offer zero (I do tell my students that an algorithm that starts to actually worse is a horribly written program, because you can refuse to use too many if you want, inside the software). Now you have a problem. Going from 256 to 512 has a slope of zero, a horizontal line. DItto for 512 to 1024. So pick a point for max accuracy and run your linear approximation thru that. And look at how far _below_ the real curve you are at 2, 4, 8, ... If you want max accuracy at 65536, you are going to be _way_ off at 2, 4, 8...

My formula was tweaked using 1, 2, 4, 8 and 16. With a little 32 node data that suggested that it was "close". And with even less 64 node data still saying "close".

In call of duty, the online game, you have two common weapons of choice, a hand grenade and a rifle. If you are practicing tossing grenades, do you need to drop it into the target's pocket, or is within a 10' radius centered on him close enough? You want to hit him between the eyes? Why not use a more accurate tool like a sniper rifle?

I wanted this rough estimate to be easy to compute. The formula could be made more accurate. In fact, it could be made perfectly accurate with enough data points, and a sophisticated curve-fitting program. One could likely analyze the DTS data, to determine accurately how much overhead adding a single cpu produces. And one would probably find that as you add more processors, each new one adds even more overhead than the previous.

And then I would be getting complaints "Bob, this damned formula requires 5,000 measurements, and takes me 30 minutes to compute the speedup. Give me a break, I just want to know, roughly, if I am trying to choose between paying X dollars for 8 cores, or X + Y dollars for Y cores, about what kind of speed gain will I see?"

For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel

That's why my simple linear approximation is supposed to provide. It is really optimized from 1-2-4-8-16. The real data is not a straight line. The further out you take this, the farther it will be from that straight line approximation. One could retune that 0.7 multiplier to make it more accurate at 32 and 64, once significant data for those is available. But now you will be significantly under-estimating 2, 4, 8. The formula does what it was intended to do with respect to readily available hardware. Thru 16 it is quite good. at 32 and 64 it is still pretty good. But out there whether it is 45x or 40x or 50x is within a small margin of error overall, and it is an error that should be understood to be there by those familiar with parallel algorithms in general.

Some want to suggest that I am claiming a linear (sub-optimal but still linear) speedup. Would be nice, but nope. I'm just claiming that for 1 to 16 the numbers are pretty close, indeed, and for 32 and 64 they are close, but with even a 10% error you are talking about 6x at 64, which would be considered quite good for a linear fit to the usual parallel speedup curve.

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 15, 2010 11:06 pm

michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel

Could it be that your example of Amdahl's law reflects the YBW paradigm for the average chess positions's branching factor of 35? If this relationship for up to N=16 can be extrapolated, the upper bound on the speedup is equal to S=35 for N=infinity. That would be very poor scaling behavior for the DTS algorithm for large N, considering that chess search trees have ample parallelism. Didn't Cilkchess show linear speedup for a much larger range of processors?

Rein

michiguel · Post by **michiguel** » Thu Sep 16, 2010 12:20 am

Rein Halbersma wrote:
michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel
Could it be that your example of Amdahl's law reflects the YBW paradigm for the average chess positions's branching factor of 35? If this relationship for up to N=16 can be extrapolated, the upper bound on the speedup is equal to S=35 for N=infinity. That would be very poor scaling behavior for the DTS algorithm for large N, considering that chess search trees have ample parallelism. Didn't Cilkchess show linear speedup for a much larger range of processors?

Rein

If Amdahl's law applies and the formula is correct, 35 means that 1/35 is the fraction of the program that cannot be made parallel. In this case, 1/35 ~ 0.028 or 2.8%. In my humble experience of making Gaviota parallel (which is not efficient BTW), I observed that non-parallel fraction of the program is bigger at low depths. This is because a big chunk of non-parallelism comes from the fact that the PV and its last x plies never split. The deeper you search, the less significant this becomes. I measured it and it could go from 15% at the beginning to 1% or less, but It never goes to zero. So, how well these algorithms scale depend on the depth reached. If I am not mistaken, The DTS experiments may have been done at a reasonably "low" depths by today's standards. Maybe it would be even better today.

To make things more complicated, and maybe the cause of a non-Amdahl behavior is that, as Bob pointed out, each processor added makes the tree bigger (this may end up looking like it is a non parallel fraction of the tree, but the concept is different). OTOH, there is a positive effect because there is a chance to find the solution faster by peaking to successful late moves earlier, like making a "fast-forward" of the process.

That is why the problem may look like Amdahl, but the k=35 may not mean what we think it means but an average of all the issues at stake.

Miguel

bob · Post by **bob** » Thu Sep 16, 2010 2:19 am

michiguel wrote: For DTS (S = speedup) I suggest

S = 35 * n / (34 + n)

is fairly simple.

This can also be expressed as the reciprocal of the speed up is linear to the reciprocal of the CPUs.
1/S = a * 1/CPUs + b

This is not an approximation, it is what it is supposed to be if Amdahl's apply. The good thing is that it has also "some" predicting power over numbers you did not test.

Parallel programing is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (when you have resistors connected in parallel).

Miguel

OK, a few quick tests:

Code: Select all

Ncpus      S/U
    1             1
    2             1.95
    4             3,68
    8             6.67
  16           11.2

Not bad, assuming you are talking about the DTS algorithm and data from C90 Cray Blitz. It is too high early for current YBW in Crafty, although it drops into line around 16.

however, for simplicity, my original formula is easier to calculate, and works better for the above numbers... However, this is a non-linear function being approximated by (now) a second degree polynomial, so it should be able to produce more accuracy with tweaking for the current implementation. And a 3rd degree would be better. Etc...

SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up

Re: SMP speed up