Measure of SMP scalability

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Measure of SMP scalability

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than the NPS-wise one, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. In the case of modern chess engines, the SMP algorithm doesn't search the same tree as the serial one; some engines prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) the time spent
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
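
To make that concrete, W and C can be estimated from a handful of measured t(n) with an ordinary least-squares fit, since t is linear in x = 1/n. A minimal sketch in C++ (the timings below are invented, just to show the arithmetic):

Code: Select all

// Least-squares fit of t(n) = W/n + C, i.e. t = W*x + C with x = 1/n.
// The timings below are illustrative only, not real measurements.
#include <cstdio>
#include <vector>

int main() {
    std::vector<int>    threads = {1, 2, 4, 8};
    std::vector<double> seconds = {100.0, 55.0, 32.0, 21.0};  // example data

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const int m = (int)threads.size();
    for (int i = 0; i < m; ++i) {
        double x = 1.0 / threads[i];
        double y = seconds[i];
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double W = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // slope
    double C = (sy - W * sx) / m;                          // intercept

    printf("W = %.2f s, C = %.2f s\n", W, C);
    printf("t(1) = W + C = %.2f s\n", W + C);
    printf("upper bound on acceleration = (W + C) / C = %.2f\n", (W + C) / C);
    return 0;
}
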
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same work is done by the parallel and single-threaded programs. We do not have that. The parallel program explores different branches, often increasing the work and adding extra inefficiency, and at other times finding "short cuts".

That is the reason why the data does not fit Amdahl's formula well.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
For EVERY problem, this size can remain constant.
It does not for chess engines, and you know it.
You are mangling the word "size". In chess, in a parallel search, we search a larger tree due to overhead. THAT is why we don't get perfectly linear speedup.
That is why it deviates from Amdahl.
Who cares? Those extra nodes, for almost every program on the planet, do not give any additional information,
Not actually true, because they fill the hash table with information that is sometimes useful, and once in a while that generates bigger speedups.

they are simply searched due to alpha/beta considerations and imperfect move ordering / splitting.

Can we return to the regularly scheduled discussion now?
No, we were discussing this in this sub-branch.

Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input.
No, the work done needs to be the same.

Chess certainly does that when measuring time to a fixed depth. Since Elo has ALWAYS been directly correlated with speed, one could probably measure Elo, extrapolate to a speedup, and Amdahl's law would still apply just fine...
You can measure the speed and plot it vs. the number of processors. Does it fit an equation y = a * x / (b + x)? With the data we have seen over the years, it doesn't. Does it fit today?

Miguel




Miguel

In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs except one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

Edsel Apostol wrote:There seems to be no agreement yet regarding how to measure SMP scalability in computer chess. I can think of 3 things that could be used to measure scalability.

NPS-wise
This seems to be the simplest and most obvious one. It simply measures the CPU utilization of the algorithm used. The problem with this is that it doesn't take into account whether the work being measured is just redundant, as in the case of a shared hash table implementation.

Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than the NPS-wise one, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. In the case of modern chess engines, the SMP algorithm doesn't search the same tree as the serial one; some engines prune more and some prune less in SMP, for example.

In some cases, in order to gain better performance, one can widen the search a bit when using SMP to compensate for excessive pruning/reduction. This improves performance because at higher depths there is little extra benefit from searching deeper; it is better to search wider and not miss critical variations.

Performance/Elo based
This seems to be the most logical measurement to me, since the SMP algorithm in modern chess engines is too complex to be measured by just "Time to depth" or "NPS-wise".
Edsel, returning to your original question applied to Hannibal 1.3: no, you can't use time-to-depth for Hannibal; it widens its search considerably, and in fact it seems to be one of the most extreme engines in this respect. I included it in the list of engines I tested 1->4 cores, and extrapolated the result with the empirical formula [(time-to-depth factor) * 2^(N/100)], where N is the widening 1->4 cores in Elo at fast time controls, to get time-to-strength, which is in fact the effective speedup.

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05 
Stockfish 3         2.91            0             2.91 
Rybka 4.1           2.32           24             2.74 
Komodo 5.1          1.68           75             2.83 
Hiarcs 14           2.10           42             2.81 
Gaviota 0.86        2.74            0             2.74 
Shredder 12         2.37            0             2.37 
Hannibal 1.3        1.33           77             2.27
The numbers in the last column are still pretty volatile, but they give a picture.
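
For anyone who wants to reproduce the last column, it follows mechanically from the first two via the formula above (it treats the Elo widening N as worth a factor of 2^(N/100)); a tiny sketch:

Code: Select all

// Reproduce the time-to-strength column: ttd * 2^(N/100),
// where N is the Elo widening measured at fixed depth (1 -> 4 cores).
#include <cmath>
#include <cstdio>

struct Row { const char* engine; double ttd; double widening; };

int main() {
    const Row rows[] = {
        {"Houdini 3",    3.05,  0}, {"Stockfish 3",  2.91,  0},
        {"Rybka 4.1",    2.32, 24}, {"Komodo 5.1",   1.68, 75},
        {"Hiarcs 14",    2.10, 42}, {"Gaviota 0.86", 2.74,  0},
        {"Shredder 12",  2.37,  0}, {"Hannibal 1.3", 1.33, 77},
    };
    for (const Row& r : rows) {
        double tts = r.ttd * std::pow(2.0, r.widening / 100.0);
        printf("%-13s  ttd=%.2f  widening=%3.0f  time-to-strength=%.2f\n",
               r.engine, r.ttd, r.widening, tts);
    }
    return 0;
}
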
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Measure of SMP scalability

Post by Edsel Apostol »

Laskos wrote:
Edsel Apostol wrote:There seems to be no agreement yet regarding how to measure SMP scalability in computer chess. I can think of 3 things that could be used to measure scalability.

NPS-wise
This seems to be the simplest and most obvious one. It simply measures the CPU utilization of the algorithm used. The problem with this is that it doesn't take into account whether the work being measured is just redundant, as in the case of a shared hash table implementation.

Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than the NPS-wise one, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. In the case of modern chess engines, the SMP algorithm doesn't search the same tree as the serial one; some engines prune more and some prune less in SMP, for example.

In some cases, in order to gain better performance, one can widen the search a bit when using SMP to compensate for excessive pruning/reduction. This improves performance because at higher depths there is little extra benefit from searching deeper; it is better to search wider and not miss critical variations.

Performance/Elo based
This seems to be the most logical measurement to me, since the SMP algorithm in modern chess engines is too complex to be measured by just "Time to depth" or "NPS-wise".
Edsel, returning to your original question applied to Hannibal 1.3: no, you can't use time-to-depth for Hannibal; it widens its search considerably, and in fact it seems to be one of the most extreme engines in this respect. I included it in the list of engines I tested 1->4 cores, and extrapolated the result with the empirical formula [(time-to-depth factor) * 2^(N/100)], where N is the widening 1->4 cores in Elo at fast time controls, to get time-to-strength, which is in fact the effective speedup.

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05 
Stockfish 3         2.91            0             2.91 
Rybka 4.1           2.32           24             2.74 
Komodo 5.1          1.68           75             2.83 
Hiarcs 14           2.10           42             2.81 
Gaviota 0.86        2.74            0             2.74 
Shredder 12         2.37            0             2.37 
Hannibal 1.3        1.33           77             2.27
I expected that result. We have this stupid bug in the code where, due to a rewrite of the search (from a specific-to-node-type search to a generic one), I forgot to change one critical parameter to the split function. So instead of passing the boolean "inPV" variable, it always passes a value of "true" instead. :oops:

Since we have different pruning/reduction/search criteria for PV and non-PV nodes, at a split point it always applies the PV-node criteria even if the node is non-PV, thereby widening the search to a greater degree in SMP. That explains the extreme result in your test.
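
Purely for illustration, the shape of that kind of bug looks roughly like the sketch below. All names here (Position, okToSplit, splitPoint) are invented stand-ins, not Hannibal's actual code; only the always-true "inPV" argument mirrors what I described.

Code: Select all

// Hypothetical illustration only; every identifier is an invented stand-in.
#include <cstdio>

struct Position {};

static bool okToSplit(const Position&, int depth) { return depth >= 4; }

static void splitPoint(Position&, int depth, int, int, bool inPV) {
    // Helper threads pick their pruning/reduction rules from 'inPV'.
    printf("split at depth %d, helpers use %s-node rules\n",
           depth, inPV ? "PV" : "non-PV");
}

static void search(Position& pos, int depth, int alpha, int beta, bool inPV) {
    if (okToSplit(pos, depth)) {
        splitPoint(pos, depth, alpha, beta, /*inPV=*/true);  // bug: always PV
        // intended: splitPoint(pos, depth, alpha, beta, inPV);
    }
}

int main() {
    Position pos;
    search(pos, 8, -100, 100, /*inPV=*/false);  // prints "PV" despite a non-PV node
    return 0;
}
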

We were expecting to release a new version back then, so we didn't release a bug fix for that.

As for Komodo, I think it is not doing the same widening as what is done inadvertently in Hannibal. My guess is that the cause of the widening in Komodo is its use of a non-conventional SMP implementation. It may be using some sort of Lazy SMP (which has been discussed here before, and which Don may also have tried, IIRC, based on what I've read in the posts), a kind of shared hash table implementation with tricks.

In that kind of implementation, time-to-depth is not an accurate measure of SMP effectiveness. I could be wrong though.
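
For reference, a generic Lazy-SMP-style skeleton looks something like the sketch below. This is not Komodo's code or any particular engine's; the point is only that threads search the same root independently and communicate almost exclusively through a shared transposition table, which is why time-to-depth alone can undersell it.

Code: Select all

// Generic Lazy-SMP-style skeleton: every thread searches the same root
// position independently; communication happens only through a shared,
// lock-light transposition table. Not any particular engine's code.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct TTEntry { std::atomic<uint64_t> key{0}; std::atomic<int> score{0}; };

struct SharedTT {
    std::vector<TTEntry> table;
    explicit SharedTT(size_t n) : table(n) {}
    void store(uint64_t key, int score) {
        TTEntry& e = table[key % table.size()];
        e.key.store(key, std::memory_order_relaxed);
        e.score.store(score, std::memory_order_relaxed);
    }
    bool probe(uint64_t key, int& score) const {
        const TTEntry& e = table[key % table.size()];
        if (e.key.load(std::memory_order_relaxed) != key) return false;
        score = e.score.load(std::memory_order_relaxed);
        return true;
    }
};

// Stub search: a real engine would run iterative deepening here, with
// helper threads often started at slightly different depths.
static void searchRoot(SharedTT& tt, int threadId, int maxDepth) {
    for (int depth = 1 + (threadId & 1); depth <= maxDepth; ++depth) {
        uint64_t fakeKey = 0x9E3779B97F4A7C15ULL * (uint64_t)(depth + 1);
        int score;
        if (!tt.probe(fakeKey, score)) {   // another thread may have done it already
            score = depth;                 // placeholder "search result"
            tt.store(fakeKey, score);
        }
    }
}

int main() {
    SharedTT tt(1 << 20);
    std::vector<std::thread> pool;
    for (int id = 0; id < 4; ++id)
        pool.emplace_back(searchRoot, std::ref(tt), id, 12);
    for (auto& t : pool) t.join();
    printf("all helper threads finished\n");
    return 0;
}
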
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laszlo Gaspar wrote:I meant different 'iteration' depths and synchronization between iterations...
Same point still applies. In my search, for example, idle processors don't wait for an iteration to end. They help others finish their work to get the iteration completed.
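
The idea can be sketched like this: rather than blocking at an iteration barrier, an idle thread keeps pulling unfinished work from a shared pool. The queue below is just a stand-in for real split points; none of this is Crafty's actual code.

Code: Select all

// Sketch of "idle processors help others finish the iteration":
// instead of waiting at an iteration barrier, an idle thread grabs
// outstanding work (standing in for split points) from a shared queue.
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex qlock;
std::queue<int> pending;   // unfinished subtrees of the current iteration

static bool grabWork(int& subtree) {
    std::lock_guard<std::mutex> g(qlock);
    if (pending.empty()) return false;
    subtree = pending.front();
    pending.pop();
    return true;
}

static void worker(int id) {
    int subtree;
    while (grabWork(subtree)) {
        // "Search" the subtree; a thread that finishes its own work keeps
        // looping here, helping complete the iteration sooner.
        printf("thread %d searched subtree %d\n", id, subtree);
    }
}

int main() {
    for (int i = 0; i < 16; ++i) pending.push(i);
    std::vector<std::thread> pool;
    for (int id = 0; id < 4; ++id) pool.emplace_back(worker, id);
    for (auto& t : pool) t.join();
    printf("iteration complete\n");
    return 0;
}
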
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
abulmo wrote:
Edsel Apostol wrote: Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than the NPS-wise one, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. In the case of modern chess engines, the SMP algorithm doesn't search the same tree as the serial one; some engines prune more and some prune less in SMP, for example.
What I like here is to use the following formula:
t(n) = W/n + C.
n is the number of threads;
t(n) the time spent
Once you have measured a few t(n) for various number of n (threads), you can compute W and C, and get an idea of the scalability of your program. Indeed:
W+C = t(1)
C = t(infinity)
(W+C)/C = theoretical upper limit of achievable acceleration.
Unfortunately, it is not that simple, because Amdahl's law (which is what you imply) does not apply to chess engines. It assumes the same work is done by the parallel and single-threaded programs. We do not have that. The parallel program explores different branches, often increasing the work and adding extra inefficiency, and at other times finding "short cuts".

That is the reason why the data does not fit Amdahl's formula well.

Miguel
Edsel Apostol wrote: Performance/Elo based
Elo performance is related to speed increase, so that doubling the speed is about a 50 Elo improvement. When dealing with speed improvement, which is the only thing SMP provides, Elo improvement is not a necessary tool, IMHO.
For the record, Amdahl's law applies to ALL parallel programs.
No, it doesn't!

It applies to the ones in which the size of the problem remains the same.
For EVERY problem, this size can remain constant.
It does not for chess engines, and you know it.
You are mangling the word "size". In chess, in a parallel search, we search a larger tree due to overhead. THAT is why we don't get perfectly linear speedup.
That is why it deviates from Amdahl.
That is NOT a "deviation". The goal is to get to depth N as quickly as possible. That the nature of alpha/beta impedes that is simply a PART of the algorithm definition. It is why chess programs don't get a perfectly linear speedup, and it is why Amdahl's law still fits perfectly.
Who cares? Those extra nodes, for almost every program on the planet, do not give any additional information,
Not actually true, because they fill the hash table with information that is sometimes useful, and once in a while that generates bigger speedups.

"sometimes". And sometimes it hurts. And the net effect is "no gain". As verified by the 1 vs 4 cpu tests to fixed depth. Have you read all the previous discussions? If there is no gain then there is no gain.



they are simply searched due to alpha/beta considerations and imperfect move ordering / splitting.

Can we return to the regularly scheduled discussion now?
No, we were discussing this in this sub-branch.

Amdahl's law can be applied to ANY algorithm that has a specific output for a specific input.
No, the work done needs to be the same.
work done = search to fixed depth. Seems clear enough to me. Doesn't matter whether the serial program uses alpha/beta and the parallel program uses pure minimax. Time to depth is "the same task"




Chess certainly does that when measuring time to a fixed depth. Since Elo has ALWAYS been directly correlated with speed, one could probably measure Elo, extrapolate to a speedup, and Amdahl's law would still apply just fine...
You can measure the speed and plot it vs. the number of processors. Does it fit an equation y = a * x / (b + x)? With the data we have seen over the years, it doesn't. Does it fit today?

Miguel
Which program? We are discussing Komodo, which seems to be significantly different from what I see with Crafty, Houdini, Stockfish, or the rest of the "traditional parallel searchers".
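
For what it's worth, checking whether a set of measured speedups fits y = a * x / (b + x) is a small exercise: linearize it as 1/y = (b/a)*(1/x) + 1/a, fit by least squares, and compare the fitted curve with the measurements. A sketch with invented numbers:

Code: Select all

// Check whether measured speedups fit y = a*x/(b+x) by fitting
// 1/y = (b/a)*(1/x) + 1/a with ordinary least squares.
// The speedup numbers below are invented, purely for illustration.
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> cpus    = {1, 2, 4, 8, 16};
    std::vector<double> speedup = {1.0, 1.8, 3.1, 4.9, 6.8};  // example data

    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    const int m = (int)cpus.size();
    for (int i = 0; i < m; ++i) {
        double X = 1.0 / cpus[i];     // 1/x
        double Y = 1.0 / speedup[i];  // 1/y
        sx += X; sy += Y; sxx += X * X; sxy += X * Y;
    }
    double slope     = (m * sxy - sx * sy) / (m * sxx - sx * sx);  // b/a
    double intercept = (sy - slope * sx) / m;                      // 1/a
    double a = 1.0 / intercept;
    double b = slope * a;

    printf("fit: y = %.2f * x / (%.2f + x)\n", a, b);
    for (int i = 0; i < m; ++i) {
        double pred = a * cpus[i] / (b + cpus[i]);
        printf("x=%2.0f  measured=%.2f  fitted=%.2f\n", cpus[i], speedup[i], pred);
    }
    return 0;
}
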










Miguel

In a weather model, one would not compare two different programs that examine a different number of data points. Chess certainly fits this description. Changing pruning is certainly unusual, to say the least...



Miguel

Or anything else done related to efficiency, such as internal microprocessor design decisions and such.

The primary issue is "which is easier to measure, the parallel speedup / SMP efficiency using the standard time-to-depth that has been in use forever, or the Elo gain when adding CPUs, which takes a huge amount of computational resources to measure?" For all programs except one, it appears that the traditional time-to-depth is by far the better measure. At least until the point where one throws up his hands and says "I can't get any deeper by adding more processors, so what else can I use 'em for?"

Since for all programs except 1, the easier-to-measure metric works just as well as the Elo measurement, who would want to go to that extreme? I'm able to do it frequently due to having access to large clusters. But even for me, it is unpleasant. On the 70 node cluster with 8 cores per node, I can either play 560 games at a time, or 70 using 8 threads. That's a factor of 8x. I then need to run a BUNCH of games to get acceptable accuracy, further slowing testing. With a traditional time-to-depth measurement, I can run a thorough test in under an hour...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

Laskos wrote:
Edsel Apostol wrote:There seems to be no agreement yet regarding how to measure SMP scalability in computer chess. I can think of 3 things that could be used to measure scalability.

NPS-wise
This seems to be the simplest and most obvious one. It simply measures the CPU utilization of the algorithm used. The problem with this is that it doesn't take into account whether the work being measured is just redundant, as in the case of a shared hash table implementation.

Time to depth
This measures the capability of the SMP algorithm to search efficiently. It is a good measure, better than the NPS-wise one, but it is only fully accurate if the algorithm traverses the same search tree as the serial one. In the case of modern chess engines, the SMP algorithm doesn't search the same tree as the serial one; some engines prune more and some prune less in SMP, for example.

In some cases, in order to gain better performance, one can widen the search a bit when using SMP to compensate for excessive pruning/reduction. This improves performance because at higher depths there is little extra benefit from searching deeper; it is better to search wider and not miss critical variations.

Performance/Elo based
This seems to be the most logical measurement to me, since the SMP algorithm in modern chess engines is too complex to be measured by just "Time to depth" or "NPS-wise".
Edsel, returning to your original question applied to Hannibal 1.3: no, you can't use time-to-depth for Hannibal; it widens its search considerably, and in fact it seems to be one of the most extreme engines in this respect. I included it in the list of engines I tested 1->4 cores, and extrapolated the result with the empirical formula [(time-to-depth factor) * 2^(N/100)], where N is the widening 1->4 cores in Elo at fast time controls, to get time-to-strength, which is in fact the effective speedup.

Code: Select all

Engine          time-to-depth    widening      time-to-strength      

Houdini 3           3.05            0             3.05 
Stockfish 3         2.91            0             2.91 
Rybka 4.1           2.32           24             2.74 
Komodo 5.1          1.68           75             2.83 
Hiarcs 14           2.10           42             2.81 
Gaviota 0.86        2.74            0             2.74 
Shredder 12         2.37            0             2.37 
Hannibal 1.3        1.33           77             2.27
The numbers in the last column are still pretty volatile, but they give a picture.
How does it widen? Is it because it is a fairly new version/program? Early parallel search efforts always blew up the number of nodes searched greatly, by introducing overhead unintentionally rather than by increasing width intentionally...

If you look at each of the parallel search threads here over the past 10 years, from simple YBW to DTS, the goal of the discussions is to reduce overhead by choosing better split points. DTS went to great extremes trying to do this. And, as expected, early versions just blew the tree up in terms of size...
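
In practice, "choosing better split points" usually comes down to a cheap eligibility test before every split. A hedged sketch of the typical conditions (the thresholds and field names are illustrative, not Crafty's or DTS's actual ones):

Code: Select all

// Typical YBW-style split-point eligibility test; every threshold and field
// name here is illustrative, not taken from any particular engine.
#include <cstdio>

struct NodeState {
    int remainingDepth;   // plies left below this node
    int movesSearched;    // moves already completed at this node
    int movesRemaining;   // moves still to be searched here
    int idleThreads;      // threads currently looking for work
};

static bool okToSplit(const NodeState& n) {
    if (n.idleThreads == 0)   return false;  // nobody available to help
    if (n.remainingDepth < 4) return false;  // subtree too small to amortize the split cost
    if (n.movesSearched == 0) return false;  // YBW: first move must finish, so the bound is likely good
    if (n.movesRemaining < 2) return false;  // not enough work left to be worth sharing
    return true;
}

int main() {
    NodeState n{ /*remainingDepth=*/8, /*movesSearched=*/1,
                 /*movesRemaining=*/20, /*idleThreads=*/3 };
    printf("split here? %s\n", okToSplit(n) ? "yes" : "no");
    return 0;
}
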
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Measure of SMP scalability

Post by bob »

There you go! Answers the question I posed before I saw your reply. Seems like it is quite easy to jump to conclusions here. And then reality crashes in. :)

Bugs happen. Regularly when one is dealing with parallel search.
Laszlo Gaspar
Posts: 64
Joined: Thu Mar 09, 2006 11:07 am
Location: Budapest, Hungary

Re: Measure of SMP scalability

Post by Laszlo Gaspar »

I can accept that it should work that way, and it might explain why my SMP is not as good as the SMP in Crafty. But the question is what algorithm is implemented in Komodo, and that was my idea. It looks like it is not DTS, and I agree with Edsel's opinion.
Regards,
László
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Measure of SMP scalability

Post by Laskos »

bob wrote:There you go! Answers the question I posed before I saw your reply. Seems like it is quite easy to jump to conclusions here. And then reality crashes in. :)

Bugs happen. Regularly when one is dealing with parallel search.
What "there you go"? The gain from width is real even in buggy Hannibal, it's not simply overhead, it's 77 Elo points gain at fixed depth. The problem is that Hannibal almost doesn't go deeper 1->4 cores, time-to-depth being only 1.33. The overall effective speedup also suffers at 2.27, compared to 2.7-3.1 of most other engines. Yes, it is a bug, but not "there you go". Hannibal still has 2.27 effective speedup, coming mostly from non-trivial (non-overhead) widening, which gains points.

And as Edsel points out, Komodo's implementation is probably different and not buggy.