Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
Laskos wrote:
bob wrote: For the record, Amdahl's law applies to ALL parallel programs. Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
One cannot rely on time-to-depth alone. Besides Komodo and Rybka, I found another program, Hiarcs 14, which widens and has a poor time-to-depth factor, although Elo-wise it appears to scale very well from 1 to 4 cores in the CEGT 40/4' list.
It appears that there are two factors: time-to-depth and widening from 1 to 4 cores. If the widening gives N Elo points at the same depth, and a doubling in time is worth 100 Elo points at fast controls, then the formula would be:
(time-to-depth factor) * 2^(N/100).

I agree that playing games to find the gain is practically impossible for normal users. One needs thousands of multi-threaded games at not very short control, say 40/4', a task which is too much even for developers with large machines.
Applying the above formula, I find the [time(1CPU)/time(4CPU)] ratio needed to reach equal _strength_, which gives the quality of the SMP implementation:

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05
Stockfish 3         2.91            0             2.91
Rybka 4.1           2.32           24             2.74
Komodo 5.1          1.68           75             2.83
Hiarcs 14           2.10           42             2.81
Gaviota 0.86        2.74            0             2.74
Komodo is the most extreme in deviating from the time-to-depth rule.
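The time-to-strength column can be reproduced from the formula directly; a minimal Python check, with the time-to-depth factors and widening Elo numbers copied from the table above:

```python
# time-to-strength = (time-to-depth factor) * 2^(N/100),
# where N is the widening in Elo at fixed depth (values from the table above).
engines = {
    "Houdini 3":    (3.05,  0),
    "Stockfish 3":  (2.91,  0),
    "Rybka 4.1":    (2.32, 24),
    "Komodo 5.1":   (1.68, 75),
    "Hiarcs 14":    (2.10, 42),
    "Gaviota 0.86": (2.74,  0),
}
for name, (ttd, elo) in engines.items():
    print(f"{name:14s} {ttd * 2 ** (elo / 100):.2f}")
```

This reproduces the last column of the table (e.g. Komodo: 1.68 * 2^0.75 ≈ 2.83).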
How is widening measured???

After all, EVERY parallel search widens to some extent due to overhead.
As Elo gain at a fixed depth. The depth is chosen such that the games are not extremely fast, but fast enough to measure Elo with reasonable precision in a reasonable amount of time. Many engines do not show widening (as Elo gain) at all; I just checked Shredder 12, and it does not.
Time-to-depth is measured on 150 positions (30 positions repeated 5 times), running for an hour or two on a single core (and less than that on 4 cores).
I'm not particularly a fan of using Elo to predict widening, then widening to predict Elo. Somewhat circular. Your time to depth is similar to what I do except that I generally use more sample runs for more cores since the variance is pretty significant at times.

For the "widening" I would be tempted to actually compare node counts to see how much larger the tree actually gets, as opposed to just measuring the Elo gain.

Right now, numbers (nodes, time, etc) are most helpful to the discussion since they can be used to derive a mathematical model that explains what is going on. If I could run Komodo on our cluster, I'd be generating enough data to analyze this to great precision...

BTW, this "Elo" measurement still leaves a lot to be desired. It only measures the Elo gain produced by looking at extra stuff; it doesn't address the Elo loss that extra stuff costs in terms of time. That skews this Elo measurement significantly. In your tests, does EVERY 4-cpu move to fixed depth take less time than the equivalent single-cpu move? If not, this really is a problem.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than NPS alone, but it is only accurate if the algorithm traverses the same search tree as the serial one. Modern chess engines' SMP algorithms don't search the same tree as the serial search; some engines prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) the time spent
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
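Given two measured timings, W and C follow from two linear equations; a minimal Python sketch, where the 100 s / 32.5 s timings are hypothetical numbers for illustration:

```python
def fit_amdahl(n1, t1, n2, t2):
    """Solve t(n) = W/n + C for W (parallelizable work) and C (serial part)."""
    W = (t1 - t2) / (1.0 / n1 - 1.0 / n2)
    C = t1 - W / n1
    return W, C

# Hypothetical measurements: 100 s on 1 thread, 32.5 s on 4 threads.
W, C = fit_amdahl(1, 100.0, 4, 32.5)
print(W, C)          # 90.0 10.0
print((W + C) / C)   # 10.0 -> theoretical upper limit of the speedup
```

With more than two measurements, a least-squares fit over all t(n) values would be more robust than the two-point solve.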
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same work is done by the parallel and single-threaded programs. We do not have that: the parallel program explores different branches, often increasing the work through extra inefficiency, and at other times finding "short cuts".

That is the reason why the data does not fit Amdahl's formula well.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
For EVERY problem, this size can remain constant. In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
Laszlo Gaspar
Posts: 64
Joined: Thu Mar 09, 2006 11:07 am
Location: Budapest, Hungary

Re: Measure of SMP scalability

Post by Laszlo Gaspar »

In my first tries at SMP programming I didn't synchronize the threads correctly, and the search simply fell apart. The different threads were working on different depths.

I sense something similar here: it seems that the search doesn't stop when the FIRST thread finishes the last iteration, but goes on until another thread - the MAIN thread, which is tied to the termination condition - finishes. The extra search done in between can find a new best move, which makes the search look like a better search.
Regards,
László
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:
I'm not particularly a fan of using Elo to predict widening, then widening to predict Elo. Somewhat circular. Your time to depth is similar to what I do except that I generally use more sample runs for more cores since the variance is pretty significant at times.

For the "widening" I would be tempted to actually compare node counts to see how much larger the tree actually gets, as opposed to just measuring the Elo gain.

Right now, numbers (nodes, time, etc) are most helpful to the discussion since they can be used to derive a mathematical model that explains what is going on. If I could run Komodo on our cluster, I'd be generating enough data to analyze this to great precision...

BTW, this "Elo" measurement still leaves a lot to be desired. It only measures the Elo gain produced by looking at extra stuff; it doesn't address the Elo loss that extra stuff costs in terms of time. That skews this Elo measurement significantly. In your tests, does EVERY 4-cpu move to fixed depth take less time than the equivalent single-cpu move? If not, this really is a problem.
I express the gain as [(time 1CPU)/(time 4CPU)] to the same _strength_. The _time-to-depth_ ratio, i.e. [(time 1CPU)/(time 4CPU)] to the same _depth_, is pretty straightforward to measure for an engine in an hour or two. What remains is the Elo deviation at the same _depth_, from which I derive the _time_ handicap needed to equalize strength. The Elo difference at the same depth has to be measured directly; then I infer the time handicap from the approximate formula 2^(N/100).

If you want to see the nodes to the same depth, I already posted some results in another thread, but you seemed not very interested. Here they are:

Komodo 5.1 1thread:

Code: Select all

1 thread 
  n/s: 1.329.027  
  TotTime: 71:30m    SolTime: 71:30m 
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00 
  Ply: 1   Positions: 30   Avg Nodes:     208   Branching = 0.00 
  Ply: 2   Positions: 30   Avg Nodes:     475   Branching = 2.28 
  Ply: 3   Positions: 30   Avg Nodes:    1162   Branching = 2.45 
  Ply: 4   Positions: 30   Avg Nodes:    2369   Branching = 2.04 
  Ply: 5   Positions: 30   Avg Nodes:    4022   Branching = 1.70 
  Ply: 6   Positions: 30   Avg Nodes:    7420   Branching = 1.84 
  Ply: 7   Positions: 30   Avg Nodes:   15120   Branching = 2.04 
  Ply: 8   Positions: 30   Avg Nodes:   31341   Branching = 2.07 
  Ply: 9   Positions: 30   Avg Nodes:   65919   Branching = 2.10 
  Ply:10   Positions: 30   Avg Nodes:  138817   Branching = 2.11 
  Ply:11   Positions: 30   Avg Nodes:  249815   Branching = 1.80 
  Ply:12   Positions: 30   Avg Nodes:  486700   Branching = 1.95 
  Ply:13   Positions: 30   Avg Nodes: 1025584   Branching = 2.11 
  Ply:14   Positions: 30   Avg Nodes: 1761740   Branching = 1.72 
  Ply:15   Positions: 30   Avg Nodes: 3037256   Branching = 1.72 
  Ply:16   Positions: 30   Avg Nodes: 5514478   Branching = 1.82 
  Ply:17   Positions: 30   Avg Nodes: 9323714   Branching = 1.69 
  Ply:18   Positions: 30   Avg Nodes:18533095   Branching = 1.99 
  Ply:19   Positions: 30   Avg Nodes:33487106   Branching = 1.81 
  Ply:20   Positions: 30   Avg Nodes:53449385   Branching = 1.60 
  Ply:21   Positions: 30   Avg Nodes:94696000   Branching = 1.77 
Engine: Komodo 5.1 64-bit (512 MB) 
by Don Dailey, Larry Kaufman 
Komodo 5.1 4 threads:

Code: Select all

4 threads 
  n/s: 4.426.797  
  TotTime: 51:18m    SolTime: 51:18m 
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00 
  Ply: 1   Positions: 30   Avg Nodes:     658   Branching = 0.00 
  Ply: 2   Positions: 30   Avg Nodes:    1454   Branching = 2.21 
  Ply: 3   Positions: 30   Avg Nodes:    3023   Branching = 2.08 
  Ply: 4   Positions: 30   Avg Nodes:    5522   Branching = 1.83 
  Ply: 5   Positions: 30   Avg Nodes:   10040   Branching = 1.82 
  Ply: 6   Positions: 30   Avg Nodes:   19588   Branching = 1.95 
  Ply: 7   Positions: 30   Avg Nodes:   37536   Branching = 1.92 
  Ply: 8   Positions: 30   Avg Nodes:   70835   Branching = 1.89 
  Ply: 9   Positions: 30   Avg Nodes:  150130   Branching = 2.12 
  Ply:10   Positions: 30   Avg Nodes:  308354   Branching = 2.05 
  Ply:11   Positions: 30   Avg Nodes:  569434   Branching = 1.85 
  Ply:12   Positions: 30   Avg Nodes: 1048122   Branching = 1.84 
  Ply:13   Positions: 30   Avg Nodes: 2170028   Branching = 2.07 
  Ply:14   Positions: 30   Avg Nodes: 3712354   Branching = 1.71 
  Ply:15   Positions: 30   Avg Nodes: 6877139   Branching = 1.85 
  Ply:16   Positions: 30   Avg Nodes:13106835   Branching = 1.91 
  Ply:17   Positions: 30   Avg Nodes:23709744   Branching = 1.81 
  Ply:18   Positions: 30   Avg Nodes:39712826   Branching = 1.67 
  Ply:19   Positions: 30   Avg Nodes:65730096   Branching = 1.66 
  Ply:20   Positions: 30   Avg Nodes:118838030   Branching = 1.81 
  Ply:21   Positions: 30   Avg Nodes:207054782   Branching = 1.74 
Engine: Komodo 5.1 64-bit (512 MB) 
by Don Dailey, Larry Kaufman 
To depth 21, Komodo searches 2.19 times as many nodes going from 1 to 4 cores. If the overhead from 1 to 4 cores is 20%, then Komodo searches ~1.8 times as many nodes as a non-widening engine. Compare that to Houdini:

Houdini 3 1 thread:

Code: Select all

1 thread  
  n/s: 2.439.155  
  TotTime: 76:12m    SolTime: 76:12m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 4   Positions: 30   Avg Nodes:    3039   Branching = 0.00
  Ply: 5   Positions: 30   Avg Nodes:    4795   Branching = 1.58
  Ply: 6   Positions: 30   Avg Nodes:    7922   Branching = 1.65
  Ply: 7   Positions: 30   Avg Nodes:   12506   Branching = 1.58
  Ply: 8   Positions: 30   Avg Nodes:   24685   Branching = 1.97
  Ply: 9   Positions: 30   Avg Nodes:   59627   Branching = 2.42
  Ply:10   Positions: 30   Avg Nodes:  124644   Branching = 2.09
  Ply:11   Positions: 30   Avg Nodes:  234924   Branching = 1.88
  Ply:12   Positions: 30   Avg Nodes:  546004   Branching = 2.32
  Ply:13   Positions: 30   Avg Nodes: 1018643   Branching = 1.87
  Ply:14   Positions: 30   Avg Nodes: 2079456   Branching = 2.04
  Ply:15   Positions: 30   Avg Nodes: 4242175   Branching = 2.04
  Ply:16   Positions: 30   Avg Nodes: 7301397   Branching = 1.72
  Ply:17   Positions: 30   Avg Nodes:12755277   Branching = 1.75
  Ply:18   Positions: 30   Avg Nodes:23244203   Branching = 1.82
  Ply:19   Positions: 31   Avg Nodes:62940490   Branching = 2.71
  Ply:20   Positions: 31   Avg Nodes:105879068   Branching = 1.68
  Ply:21   Positions: 31   Avg Nodes:175381990   Branching = 1.66
Engine: Houdini 3 Pro x64 1 (512 MB)
by Robert Houdart
Houdini 3 4 threads:

Code: Select all

4 threads
  n/s: 8.242.875  
  TotTime: 25:51m    SolTime: 25:51m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 4   Positions: 30   Avg Nodes:    3039   Branching = 0.00
  Ply: 5   Positions: 30   Avg Nodes:    4795   Branching = 1.58
  Ply: 6   Positions: 30   Avg Nodes:    7845   Branching = 1.64
  Ply: 7   Positions: 30   Avg Nodes:   14987   Branching = 1.91
  Ply: 8   Positions: 30   Avg Nodes:   30712   Branching = 2.05
  Ply: 9   Positions: 30   Avg Nodes:   80234   Branching = 2.61
  Ply:10   Positions: 30   Avg Nodes:  180169   Branching = 2.25
  Ply:11   Positions: 31   Avg Nodes:  540269   Branching = 3.00
  Ply:12   Positions: 31   Avg Nodes:  908020   Branching = 1.68
  Ply:13   Positions: 31   Avg Nodes: 1409296   Branching = 1.55
  Ply:14   Positions: 31   Avg Nodes: 2515494   Branching = 1.78
  Ply:15   Positions: 30   Avg Nodes: 4145935   Branching = 1.65
  Ply:16   Positions: 30   Avg Nodes: 9216814   Branching = 2.22
  Ply:17   Positions: 30   Avg Nodes:15233970   Branching = 1.65
  Ply:18   Positions: 30   Avg Nodes:26090160   Branching = 1.71
  Ply:19   Positions: 30   Avg Nodes:53620822   Branching = 2.06
  Ply:20   Positions: 30   Avg Nodes:126277017   Branching = 2.35
  Ply:21   Positions: 30   Avg Nodes:216850121   Branching = 1.72
Engine: Houdini 3 Pro x64 1 (512 MB)
by Robert Houdart
To depth 21, Houdini searches only 23% more nodes going from 1 to 4 cores, which can be attributed to overhead; quite different from Komodo's ~1.8x widening beyond overhead. But the efficiency of that 1.8x effective widening still has to be measured in Elo points, to get [(time 1CPU)/(time 4CPU)] to the same _strength_. One has to play fast games at fixed depth, which I did previously.
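The widening numbers above follow directly from the depth-21 averages in the tables; a quick Python check, with the node counts copied from the tables and the 20% overhead figure taken as the estimate from the text:

```python
# Depth-21 average node counts from the tables above (1 thread vs 4 threads).
komodo_1, komodo_4 = 94_696_000, 207_054_782
houdini_1, houdini_4 = 175_381_990, 216_850_121

k = komodo_4 / komodo_1      # ~2.19x as many nodes to the same depth
h = houdini_4 / houdini_1    # ~1.24x, i.e. only ~23% more: plain overhead

# Assuming ~20% of the 4-core tree is parallel overhead, the rest is widening:
print(round(k, 2), round(h, 2), round(k / 1.2, 2))   # 2.19 1.24 1.82
```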
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laszlo Gaspar wrote:In my first tries at SMP programming I didn't synchronize the threads correctly, and the search simply fell apart. The different threads were working on different depths.

I sense something similar here: it seems that the search doesn't stop when the FIRST thread finishes the last iteration, but goes on until another thread - the MAIN thread, which is tied to the termination condition - finishes. The extra search done in between can find a new best move, which makes the search look like a better search.
Nobody "synchronizes threads" today. Threads are working all over the tree, at different depths, etc...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
bob wrote:
I'm not particularly a fan of using Elo to predict widening, then widening to predict Elo. Somewhat circular. Your time to depth is similar to what I do except that I generally use more sample runs for more cores since the variance is pretty significant at times.

For the "widening" I would be tempted to actually compare node counts to see how much larger the tree actually gets, as opposed to just measuring the Elo gain.

Right now, numbers (nodes, time, etc) are most helpful to the discussion since they can be used to derive a mathematical model that explains what is going on. If I could run Komodo on our cluster, I'd be generating enough data to analyze this to great precision...

BTW, this "Elo" measurement still leaves a lot to be desired. It only measures the Elo gain produced by looking at extra stuff; it doesn't address the Elo loss that extra stuff costs in terms of time. That skews this Elo measurement significantly. In your tests, does EVERY 4-cpu move to fixed depth take less time than the equivalent single-cpu move? If not, this really is a problem.
I express the gain as [(time 1CPU)/(time 4CPU)] to the same _strength_. The _time-to-depth_ ratio, i.e. [(time 1CPU)/(time 4CPU)] to the same _depth_, is pretty straightforward to measure for an engine in an hour or two. What remains is the Elo deviation at the same _depth_, from which I derive the _time_ handicap needed to equalize strength. The Elo difference at the same depth has to be measured directly; then I infer the time handicap from the approximate formula 2^(N/100).

If you want to see the nodes to the same depth, I already posted some results in another thread, but you seemed not very interested. Here they are:

Komodo 5.1 1thread:

Code: Select all

1 thread 
  n/s: 1.329.027  
  TotTime: 71:30m    SolTime: 71:30m 
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00 
  Ply: 1   Positions: 30   Avg Nodes:     208   Branching = 0.00 
  Ply: 2   Positions: 30   Avg Nodes:     475   Branching = 2.28 
  Ply: 3   Positions: 30   Avg Nodes:    1162   Branching = 2.45 
  Ply: 4   Positions: 30   Avg Nodes:    2369   Branching = 2.04 
  Ply: 5   Positions: 30   Avg Nodes:    4022   Branching = 1.70 
  Ply: 6   Positions: 30   Avg Nodes:    7420   Branching = 1.84 
  Ply: 7   Positions: 30   Avg Nodes:   15120   Branching = 2.04 
  Ply: 8   Positions: 30   Avg Nodes:   31341   Branching = 2.07 
  Ply: 9   Positions: 30   Avg Nodes:   65919   Branching = 2.10 
  Ply:10   Positions: 30   Avg Nodes:  138817   Branching = 2.11 
  Ply:11   Positions: 30   Avg Nodes:  249815   Branching = 1.80 
  Ply:12   Positions: 30   Avg Nodes:  486700   Branching = 1.95 
  Ply:13   Positions: 30   Avg Nodes: 1025584   Branching = 2.11 
  Ply:14   Positions: 30   Avg Nodes: 1761740   Branching = 1.72 
  Ply:15   Positions: 30   Avg Nodes: 3037256   Branching = 1.72 
  Ply:16   Positions: 30   Avg Nodes: 5514478   Branching = 1.82 
  Ply:17   Positions: 30   Avg Nodes: 9323714   Branching = 1.69 
  Ply:18   Positions: 30   Avg Nodes:18533095   Branching = 1.99 
  Ply:19   Positions: 30   Avg Nodes:33487106   Branching = 1.81 
  Ply:20   Positions: 30   Avg Nodes:53449385   Branching = 1.60 
  Ply:21   Positions: 30   Avg Nodes:94696000   Branching = 1.77 
Engine: Komodo 5.1 64-bit (512 MB) 
by Don Dailey, Larry Kaufman 
Komodo 5.1 4 threads:

Code: Select all

4 threads 
  n/s: 4.426.797  
  TotTime: 51:18m    SolTime: 51:18m 
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00 
  Ply: 1   Positions: 30   Avg Nodes:     658   Branching = 0.00 
  Ply: 2   Positions: 30   Avg Nodes:    1454   Branching = 2.21 
  Ply: 3   Positions: 30   Avg Nodes:    3023   Branching = 2.08 
  Ply: 4   Positions: 30   Avg Nodes:    5522   Branching = 1.83 
  Ply: 5   Positions: 30   Avg Nodes:   10040   Branching = 1.82 
  Ply: 6   Positions: 30   Avg Nodes:   19588   Branching = 1.95 
  Ply: 7   Positions: 30   Avg Nodes:   37536   Branching = 1.92 
  Ply: 8   Positions: 30   Avg Nodes:   70835   Branching = 1.89 
  Ply: 9   Positions: 30   Avg Nodes:  150130   Branching = 2.12 
  Ply:10   Positions: 30   Avg Nodes:  308354   Branching = 2.05 
  Ply:11   Positions: 30   Avg Nodes:  569434   Branching = 1.85 
  Ply:12   Positions: 30   Avg Nodes: 1048122   Branching = 1.84 
  Ply:13   Positions: 30   Avg Nodes: 2170028   Branching = 2.07 
  Ply:14   Positions: 30   Avg Nodes: 3712354   Branching = 1.71 
  Ply:15   Positions: 30   Avg Nodes: 6877139   Branching = 1.85 
  Ply:16   Positions: 30   Avg Nodes:13106835   Branching = 1.91 
  Ply:17   Positions: 30   Avg Nodes:23709744   Branching = 1.81 
  Ply:18   Positions: 30   Avg Nodes:39712826   Branching = 1.67 
  Ply:19   Positions: 30   Avg Nodes:65730096   Branching = 1.66 
  Ply:20   Positions: 30   Avg Nodes:118838030   Branching = 1.81 
  Ply:21   Positions: 30   Avg Nodes:207054782   Branching = 1.74 
Engine: Komodo 5.1 64-bit (512 MB) 
by Don Dailey, Larry Kaufman 
To depth 21, Komodo searches 2.19 times as many nodes going from 1 to 4 cores. If the overhead from 1 to 4 cores is 20%, then Komodo searches ~1.8 times as many nodes as a non-widening engine. Compare that to Houdini:

Houdini 3 1 thread:

Code: Select all

1 thread  
  n/s: 2.439.155  
  TotTime: 76:12m    SolTime: 76:12m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 4   Positions: 30   Avg Nodes:    3039   Branching = 0.00
  Ply: 5   Positions: 30   Avg Nodes:    4795   Branching = 1.58
  Ply: 6   Positions: 30   Avg Nodes:    7922   Branching = 1.65
  Ply: 7   Positions: 30   Avg Nodes:   12506   Branching = 1.58
  Ply: 8   Positions: 30   Avg Nodes:   24685   Branching = 1.97
  Ply: 9   Positions: 30   Avg Nodes:   59627   Branching = 2.42
  Ply:10   Positions: 30   Avg Nodes:  124644   Branching = 2.09
  Ply:11   Positions: 30   Avg Nodes:  234924   Branching = 1.88
  Ply:12   Positions: 30   Avg Nodes:  546004   Branching = 2.32
  Ply:13   Positions: 30   Avg Nodes: 1018643   Branching = 1.87
  Ply:14   Positions: 30   Avg Nodes: 2079456   Branching = 2.04
  Ply:15   Positions: 30   Avg Nodes: 4242175   Branching = 2.04
  Ply:16   Positions: 30   Avg Nodes: 7301397   Branching = 1.72
  Ply:17   Positions: 30   Avg Nodes:12755277   Branching = 1.75
  Ply:18   Positions: 30   Avg Nodes:23244203   Branching = 1.82
  Ply:19   Positions: 31   Avg Nodes:62940490   Branching = 2.71
  Ply:20   Positions: 31   Avg Nodes:105879068   Branching = 1.68
  Ply:21   Positions: 31   Avg Nodes:175381990   Branching = 1.66
Engine: Houdini 3 Pro x64 1 (512 MB)
by Robert Houdart
Houdini 3 4 threads:

Code: Select all

4 threads
  n/s: 8.242.875  
  TotTime: 25:51m    SolTime: 25:51m
  Ply: 0   Positions: 30   Avg Nodes:       0   Branching = 0.00
  Ply: 4   Positions: 30   Avg Nodes:    3039   Branching = 0.00
  Ply: 5   Positions: 30   Avg Nodes:    4795   Branching = 1.58
  Ply: 6   Positions: 30   Avg Nodes:    7845   Branching = 1.64
  Ply: 7   Positions: 30   Avg Nodes:   14987   Branching = 1.91
  Ply: 8   Positions: 30   Avg Nodes:   30712   Branching = 2.05
  Ply: 9   Positions: 30   Avg Nodes:   80234   Branching = 2.61
  Ply:10   Positions: 30   Avg Nodes:  180169   Branching = 2.25
  Ply:11   Positions: 31   Avg Nodes:  540269   Branching = 3.00
  Ply:12   Positions: 31   Avg Nodes:  908020   Branching = 1.68
  Ply:13   Positions: 31   Avg Nodes: 1409296   Branching = 1.55
  Ply:14   Positions: 31   Avg Nodes: 2515494   Branching = 1.78
  Ply:15   Positions: 30   Avg Nodes: 4145935   Branching = 1.65
  Ply:16   Positions: 30   Avg Nodes: 9216814   Branching = 2.22
  Ply:17   Positions: 30   Avg Nodes:15233970   Branching = 1.65
  Ply:18   Positions: 30   Avg Nodes:26090160   Branching = 1.71
  Ply:19   Positions: 30   Avg Nodes:53620822   Branching = 2.06
  Ply:20   Positions: 30   Avg Nodes:126277017   Branching = 2.35
  Ply:21   Positions: 30   Avg Nodes:216850121   Branching = 1.72
Engine: Houdini 3 Pro x64 1 (512 MB)
by Robert Houdart
To depth 21, Houdini searches only 23% more nodes going from 1 to 4 cores, which can be attributed to overhead; quite different from Komodo's ~1.8x widening beyond overhead. But the efficiency of that 1.8x effective widening still has to be measured in Elo points, to get [(time 1CPU)/(time 4CPU)] to the same _strength_. One has to play fast games at fixed depth, which I did previously.
Here are the expected results. Assume 100M nodes with one cpu. I would expect 2 cpus to search 115M, 3 to search 130M, and 4 to search 145M, based solely on the SMP overhead I have measured many times in Crafty. For the 2-cpu case, that 100M-node tree will be searched in 2 parts, each cpu getting 50M. But one CPU will add about 30% overhead to its part, which grows the tree to 115M. Add another cpu and you add another 15M.

If you do the math, you will see that those speedups are pretty well spot on with the numbers I have posted here many times: 2 cores = 1.7x (estimated by taking the 115M total / 2 and dividing that into 100M to get the speedup), 3 cores = 2.4x and 4 cores = 3.1x...

So Komodo has increased the tree by about double what the SMP overhead alone would add, i.e. roughly another 50M (BTW, this is a tiny number of nodes for an estimate, 3-4 seconds on decent hardware, 2-3 on very fast hardware, which adds a lot of jitter to the calculations).

So, to get a speedup of 3.1x on 4 cores, we want to see a tree size no larger than 145M.

We are looking at an extra 50M or so nodes on top of that. Those extra nodes take the speedup from 3.1 down to whatever number you measured, losing close to a ply. The extra width apparently gains that back. And as I said, that suggests to me that perhaps the serial search could be tuned a bit wider by default, recovering that 1.7x that was lost (comparing the 1.4x number someone posted with the 3.1x+ number that can be reached).
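The linear-overhead model sketched above can be written out in a few lines of Python (assumptions: the 100M-node serial tree and the ~15M of extra nodes per added core are the estimates from the post):

```python
def tree_nodes(n, serial=100, per_core=15):
    """Total tree size in millions of nodes: serial tree plus overhead per added cpu."""
    return serial + per_core * (n - 1)

def speedup(n):
    """Serial nodes divided by each core's share of the parallel tree."""
    return 100 / (tree_nodes(n) / n)

print(tree_nodes(2), tree_nodes(3), tree_nodes(4))   # 115 130 145
print(round(speedup(2), 1))                          # 1.7
```

The 2-core figure reproduces the 1.7x estimate worked through in the text.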

Now that I have a number to think about, let me work out some options to see what this might mean.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Measure of SMP scalability

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than just NPS, but it is only accurate if the algorithm traverses the same search tree as the serial one. In modern chess engines, the SMP algorithm doesn't search the same tree as the serial one: some engines prune more, and some prune less, when in SMP for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) is the time spent.
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
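The W and C in this formula can be recovered from a handful of measured times by ordinary least squares in the variable x = 1/n. A minimal self-contained Python sketch; the timings here are made-up illustrative numbers, not real engine measurements, and the maximum achievable acceleration is t(1)/t(infinity) = (W+C)/C:

```python
# Fit the model t(n) = W/n + C by least squares over x = 1/n.

def fit_wc(samples):
    """samples: list of (n_threads, time) pairs. Returns (W, C)."""
    xs = [1.0 / n for n, _ in samples]
    ts = [t for _, t in samples]
    mx = sum(xs) / len(xs)
    mt = sum(ts) / len(ts)
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (t - mt) for x, t in zip(xs, ts))
    w = cov / var      # slope of t vs 1/n -> parallelizable work W
    c = mt - w * mx    # intercept -> serial term C
    return w, c

# Synthetic timings generated from W=90, C=10, so the fit recovers them:
data = [(n, 90.0 / n + 10.0) for n in (1, 2, 4, 8)]
W, C = fit_wc(data)
print(W, C)          # roughly 90 and 10
print((W + C) / C)   # max achievable acceleration, roughly 10x
```

With real measurements the points will not lie exactly on the line, and (as discussed below) a widening parallel search violates the fixed-work assumption behind the model, so the fitted C should be read with caution.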
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same amount of work is done by the parallel and single-threaded programs. We do not have that: the parallel program explores different branches, often increasing the work with extra inefficiency, and other times finding "short cuts".

That is the reason the data does not fit Amdahl's formula well.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
Not for EVERY problem does this size remain constant.
It does not for chess engines, and you know it.

Miguel

In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. This is a good measure, better than just NPS, but it is only accurate if the algorithm traverses the same search tree as the serial one. In modern chess engines, the SMP algorithm doesn't search the same tree as the serial one: some engines prune more, and some prune less, when in SMP for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) is the time spent.
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same amount of work is done by the parallel and single-threaded programs. We do not have that: the parallel program explores different branches, often increasing the work with extra inefficiency, and other times finding "short cuts".

That is the reason the data does not fit Amdahl's formula well.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
Not for EVERY problem does this size remain constant.
It does not for chess engines, and you know it.
You are mangling the word "size". In chess, in a parallel search, we search a larger tree due to overhead. THAT is why we don't get a perfectly linear speedup. Who cares? Those extra nodes, for almost every program on the planet, do not give any additional information; they are simply searched due to alpha/beta considerations and imperfect move ordering / splitting.

Can we return to the regularly scheduled discussion now? Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input. Chess certainly does that at fixed depth, measuring time to depth. Since Elo has ALWAYS been directly correlated with speed, one could probably measure Elo, extrapolate to a speedup, and Amdahl's law would still apply just fine...
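The Elo-to-speedup extrapolation can be written down directly. A hedged sketch, assuming (per Laskos earlier in the thread) that a doubling of speed is worth about 100 Elo at fast time controls; that constant is an assumption and varies with engine and time control:

```python
# Convert a measured Elo gain into an "effective speedup", assuming a
# fixed Elo value per doubling of speed (100 Elo here, per the thread;
# the true constant depends on the engine and time control).

def elo_to_speedup(elo_gain, elo_per_doubling=100.0):
    return 2.0 ** (elo_gain / elo_per_doubling)

def effective_speedup(time_to_depth_factor, widening_elo,
                      elo_per_doubling=100.0):
    """Laskos' combined measure: (time-to-depth factor) * 2^(N/100),
    where N is the Elo gained by widening at the same depth."""
    return time_to_depth_factor * elo_to_speedup(widening_elo,
                                                 elo_per_doubling)

print(elo_to_speedup(100))         # 2.0: +100 Elo ~ one doubling
print(effective_speedup(1.4, 50))  # ~1.98 for the widening example
```

This makes the trade-off discussed above explicit: a program with a poor 1.4x time-to-depth factor but 50 Elo of widening at the same depth behaves like one with a ~2.0x plain speedup.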




Miguel

In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs excepting just one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
Laszlo Gaspar
Posts: 64
Joined: Thu Mar 09, 2006 11:07 am
Location: Budapest, Hungary

Re: Measure of SMP scalability

Post by Laszlo Gaspar »

I meant different 'iteration' depths and synchronization between iterations...
Regards,
László