Scaling with threads based on Andreas Strangmüller tests


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Scaling with threads based on Andreas Strangmüller tests

Post by Laskos »

Here I will use ESU for Effective Speed-Up (time-to-strength).

Andreas Strangmüller, using a dual AMD Opteron 6376 (2x 16 cores) at 60''+0.05'' time control, showed in this post
http://www.talkchess.com/forum/viewtopi ... 95&start=0
how five well-known engines (several of them top engines) scale in Elo in N-thread vs. 1-thread matches, N being 2, 4, 8, 16, i.e. doubling the number of threads. With 3,000 games at each data point, the error margins are not large.
Then he included Crafty 24.1, in this post
http://www.talkchess.com/forum/viewtopi ... 59&start=0
Then Adam Hair presented strength-adjusted "Wilo" ratings of the improvement with each doubling of cores up to 16
http://www.talkchess.com/forum/viewtopi ... =&start=73

Robert Hyatt, on the other hand, posted "The only 64 core data I have is from an Itanium. Eugene Nalimov ran Crafty on a 64 core Itanium box quite a few years back. The speedup was around 32, which was not so good. That was the point where we both started looking at NUMA issues because that was the first NUMA box either of us had ever used to run Crafty... So 32x faster on 64 cores. Pretty poor."
http://www.talkchess.com/forum/viewtopi ... =&start=33

The same NUMA issues as on AMD.
A 32x ESU on 64 cores means about 1.78 ESU per doubling over 6 doublings (32^(1/6) ≈ 1.78).

Several (quite a few) years ago I read V. Rajlich on the Rybka forum saying that the ESU per doubling shows diminishing returns, something like 1.8, 1.7, 1.6, 1.5 for successive doublings up to 16 cores. So the basic question is to compare R. Hyatt's linear, exceptionally high-quality 32x ESU on 64 cores, i.e. a constant 1.78 ESU per doubling, with the apparently logarithmic or Amdahl-like claims of V. Rajlich. I made my logarithmic fit, about which R. Hyatt said it was wrong, and waited for Andreas Strangmüller's 32-core test. The ESU from 16 to 32 threads is 1.78 in R. Hyatt's model; from my logarithmic model it is 1.34.
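
To make the two pictures concrete, here is a minimal sketch of the arithmetic in Python, nothing more than the per-doubling factor implied by 32x on 64 cores next to the diminishing per-doubling factors quoted above:

# A minimal sketch of the arithmetic behind the two competing pictures.
# The constant per-doubling factor implied by a claimed 32x ESU on 64 cores:
linear_factor = 32 ** (1 / 6)            # 6 doublings from 1 to 64 cores, ~1.78

# The diminishing per-doubling factors quoted from V. Rajlich, up to 16 cores:
rajlich_factors = [1.8, 1.7, 1.6, 1.5]   # doublings 1->2, 2->4, 4->8, 8->16 threads

esu_linear = esu_diminishing = 1.0
for i, f in enumerate(rajlich_factors, start=1):
    esu_linear *= linear_factor
    esu_diminishing *= f
    print(f"{2 ** i:>2} threads: linear ESU ~{esu_linear:.2f}, diminishing ESU ~{esu_diminishing:.2f}")
# Already at 16 threads the two models differ noticeably (~10.0 vs ~7.3).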

Andreas Strangmüller, after four days of testing (a 3,000-game match of 32-thread Komodo 8 against 16-thread Komodo 8), came up with 1.31. He measured a difference of 27 Elo points for 32 versus 16 threads of Komodo 8, an engine which seems to scale even better than Crafty up to 16 threads.

http://www.talkchess.com/forum/viewtopi ... 9&start=21

At the 60''+0.05'' time control, on AMD cores and with 16 threads, a doubling in time is worth about 70 Elo points with small error margins, so these 27 Elo points translate into an ESU of 2^(27/70) ≈ 1.31. I predicted a gain of 30 Elo points, or an ESU of 1.34, for that doubling; R. Hyatt's 1.78 ESU predicted 58 Elo points, well off.
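
The Elo-to-ESU conversion used throughout is just that exponentiation; a small helper, with the roughly 70 Elo per doubling of time measured at this time control as the only input:

import math

# Convert between Elo gain and effective speed-up (ESU), given the Elo value
# of one doubling in thinking time at this time control (~70 Elo here).
ELO_PER_TIME_DOUBLING = 70.0

def elo_to_esu(elo_gain, elo_per_doubling=ELO_PER_TIME_DOUBLING):
    return 2 ** (elo_gain / elo_per_doubling)

def esu_to_elo(esu, elo_per_doubling=ELO_PER_TIME_DOUBLING):
    return elo_per_doubling * math.log2(esu)

print(elo_to_esu(27))     # ~1.31, the measured 32-vs-16-thread Komodo 8 result
print(esu_to_elo(1.78))   # ~58 Elo, what a constant 1.78 per doubling predicts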

That confirms V. Rajlich's empirical diminishing-returns numbers and my logarithmic model, and disconfirms:

1/ The linear model
2/ R. Hyatt's claim that 32x scaling on 64 cores is "bad"

I plot here R. Hyatt's model for Crafty (a constant 1.78 ESU per doubling) in red, which is "bad scaling" in his own words, Komodo 8's measured scaling in green, and my logarithmic model in black, which predicts that Komodo 8 will scale to an ESU of about 13 on 64 cores of an AMD Opteron or Intel Xeon architecture. My model is fitted only up to 16 cores, so the 32-core behavior is a _prediction_. And it seems to fit the empirical data for Komodo on 32 cores well.

[Plot: ESU versus number of cores: R. Hyatt's linear model for Crafty (red), Komodo 8 data (green), logarithmic model (black)]
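
The fit-then-extrapolate procedure itself is easy to reproduce. The sketch below is only an illustration: the ESU data points are placeholders rather than Andreas's exact measurements, and the quadratic-in-log2(N) form is just one convenient diminishing-returns family, not necessarily the exact model behind the black curve.

# Illustration only: fit a simple diminishing-returns model to measured ESU
# points and extrapolate one doubling beyond the data. The data points are
# placeholders, not Andreas Strangmueller's exact measurements, and the
# quadratic-in-log form is just one convenient choice of model.
import numpy as np
from scipy.optimize import curve_fit

threads = np.array([1, 2, 4, 8, 16, 32])
esu     = np.array([1.0, 1.8, 3.1, 5.0, 7.3, 9.6])   # hypothetical ESU values

def model(n, a, b):
    x = np.log2(n)
    return 1.0 + a * x + b * x * x       # forces ESU(1) = 1

(a, b), _ = curve_fit(model, threads, esu)
print(f"fit: a = {a:.3f}, b = {b:.3f}")
print(f"extrapolated ESU at 64 threads: {model(64, a, b):.1f}")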

So what is going on with the 32x ESU for Crafty on a 64-core box is a bit mysterious, as is his linear model. He declined a bet between my fit and his fit applied to Andreas's test on 32 cores, citing the "unknown architecture of the machine and memory bottlenecks". So be it. He probably has no clue how an alpha/beta search scales on 32, 64 or more cores, at least on Opterons and Xeons.

Miguel Ballicora proposed another, modified Amdahl-like behavior, which diverges even more from such excellent scaling as R. Hyatt's "bad scaling" (his own words) of Crafty: an ESU of 32 on a 64-core Itanium box.
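
For reference, the classical Amdahl form mentioned above is speedup(N) = 1 / ((1 - p) + p/N), with p the perfectly parallel fraction of the work. A quick check of how demanding a 32x speedup on 64 cores actually is:

# Classical Amdahl's law: speedup(N) = 1 / ((1 - p) + p / N),
# where p is the fraction of the work that parallelizes perfectly.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.95, 0.98, 0.984, 0.99):
    print(f"p = {p:.3f}: speedup on 64 cores = {amdahl(p, 64):.1f}x")
# Reaching 32x on 64 cores already requires p of about 0.984, i.e. less than
# 2% effectively serial work.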
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling with threads based on Andreas Strangmüller tests

Post by Laskos »

The logarithmic fit, now using Andreas Strangmüller's new 32-thread result for Komodo 8.

The red line is R. Hyatt's mysterious 32 ESU for Crafty on a 64-core Itanium box, extended as linear scaling (which he calls "poor scaling").

The green line is the actual ESU of Komodo 8 measured by Andreas Strangmüller, with 3,000 games at each data point.

The black line is the logarithmic model fitted to Andreas Strangmüller's results up to 32 cores. It is practically indistinguishable from the green data line for Komodo 8 up to 32 cores. It predicts an ESU of 12.7 on 64 AMD Opteron cores for Komodo 8, compared to the 32 ESU of R. Hyatt's "badly scaling" Crafty. He even cited 1,000 ESU on 1,024 cores for other, non-chess applications as a good example for chess (alpha/beta search scaling).

[Plot: the logarithmic model fitted up to 32 cores (black) against the Komodo 8 data (green) and the linear 32x-on-64-cores line (red), extrapolated to 64 cores]
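
In Elo terms the gap between the two curves at 64 cores is large; using the roughly 70 Elo per doubling of time measured at this time control:

import math

elo_per_doubling = 70.0                 # measured at 60''+0.05'' in these tests
esu_log_model, esu_linear = 12.7, 32.0  # the two 64-core predictions above

gap_elo = elo_per_doubling * math.log2(esu_linear / esu_log_model)
print(f"gap between the predictions at 64 cores: ~{gap_elo:.0f} Elo")   # ~93 Elo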
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Scaling with threads based on Andreas Strangmüller tests

Post by bob »

Laskos wrote:Here I will use ESU for Effective Speed-Up (time-to-strength).

Andreas Strangmüller, using a dual AMD Opteron 6376 (2x 16 cores) at 60''+0.05'' time control, showed in this post
http://www.talkchess.com/forum/viewtopi ... 95&start=0
how five well-known engines (several of them top engines) scale in Elo in N-thread vs. 1-thread matches, N being 2, 4, 8, 16, i.e. doubling the number of threads. With 3,000 games at each data point, the error margins are not large.
Then he included Crafty 24.1, in this post
http://www.talkchess.com/forum/viewtopi ... 59&start=0
Then Adam Hair presented strength-adjusted "Wilo" ratings of the improvement with each doubling of cores up to 16
http://www.talkchess.com/forum/viewtopi ... =&start=73

Robert Hyatt, on the other hand, posted "The only 64 core data I have is from an Itanium. Eugene Nalimov ran Crafty on a 64 core Itanium box quite a few years back. The speedup was around 32, which was not so good. That was the point where we both started looking at NUMA issues because that was the first NUMA box either of us had ever used to run Crafty... So 32x faster on 64 cores. Pretty poor."
http://www.talkchess.com/forum/viewtopi ... =&start=33

The same NUMA issues as on AMD.
A 32x ESU on 64 cores means about 1.78 ESU per doubling over 6 doublings (32^(1/6) ≈ 1.78).

Several (quite a few) years ago I read V. Rajlich on the Rybka forum saying that the ESU per doubling shows diminishing returns, something like 1.8, 1.7, 1.6, 1.5 for successive doublings up to 16 cores. So the basic question is to compare R. Hyatt's linear, exceptionally high-quality 32x ESU on 64 cores, i.e. a constant 1.78 ESU per doubling, with the apparently logarithmic or Amdahl-like claims of V. Rajlich. I made my logarithmic fit, about which R. Hyatt said it was wrong, and waited for Andreas Strangmüller's 32-core test. The ESU from 16 to 32 threads is 1.78 in R. Hyatt's model; from my logarithmic model it is 1.34.

Andreas Strangmüller, after four days of testing (a 3,000-game match of 32-thread Komodo 8 against 16-thread Komodo 8), came up with 1.31. He measured a difference of 27 Elo points for 32 versus 16 threads of Komodo 8, an engine which seems to scale even better than Crafty up to 16 threads.

http://www.talkchess.com/forum/viewtopi ... 9&start=21

At the 60''+0.05'' time control, on AMD cores and with 16 threads, a doubling in time is worth about 70 Elo points with small error margins, so these 27 Elo points translate into an ESU of 2^(27/70) ≈ 1.31. I predicted a gain of 30 Elo points, or an ESU of 1.34, for that doubling; R. Hyatt's 1.78 ESU predicted 58 Elo points, well off.

That confirms V. Rajlich's empirical diminishing-returns numbers and my logarithmic model, and disconfirms:

1/ The linear model
2/ R. Hyatt's claim that 32x scaling on 64 cores is "bad"

I plot here R. Hyatt's model for Crafty (a constant 1.78 ESU per doubling) in red, which is "bad scaling" in his own words, Komodo 8's measured scaling in green, and my logarithmic model in black, which predicts that Komodo 8 will scale to an ESU of about 13 on 64 cores of an AMD Opteron or Intel Xeon architecture. My model is fitted only up to 16 cores, so the 32-core behavior is a _prediction_. And it seems to fit the empirical data for Komodo on 32 cores well.

[Plot: ESU versus number of cores: R. Hyatt's linear model for Crafty (red), Komodo 8 data (green), logarithmic model (black)]

So what is going on with the 32x ESU for Crafty on a 64-core box is a bit mysterious, as is his linear model. He declined a bet between my fit and his fit applied to Andreas's test on 32 cores, citing the "unknown architecture of the machine and memory bottlenecks". So be it. He probably has no clue how an alpha/beta search scales on 32, 64 or more cores, at least on Opterons and Xeons.

Miguel Ballicora proposed another, modified Amdahl-like behavior, which diverges even more from such excellent scaling as R. Hyatt's "bad scaling" (his own words) of Crafty: an ESU of 32 on a 64-core Itanium box.
Quite simply, there is something wrong with the NPS numbers he provided. What, I have no idea. I just posted some 12-core NPS numbers from a 4-year-old 2x6-core computer, from a game played this past week:

time=1:25(89%) n=4283303045(4.3B) fh1=81% nps=50.1M 50=0
chks=199.2M qchks=492.2M sing=413.1K/104.5K fut=1.2B pred=35
LMReductions: 1/52.4M 2/23.3M 3/8.0M 4/759.3K 5/4.4K
null-move (R): 3/83.8M 4/7.1M 5/284.7K 6/7.6K
splits=511.3K aborts=102.9K data=25% probes=0 hits=0


50M is what I have been seeing. The theoretical peak on that box ought to be closer to 60M, which I have been working on. But even in the 1990's, the speedup for CB and Crafty was > 10 for 16 cores. What is happening I do not know. I know that gcc has been flaky; whether it has some internal thread-safe locking it is doing behind the scenes or something, I do not know. My numbers come from the Intel compiler, which has always been rock solid. We have one cluster with no Intel CC, and there gcc is flaky, to be kind. So what causes Crafty to search SLOWER at 32 than at 16 is completely unknown, and suggests something is broken. I can only test through 24 at the moment (2x12) and I see no such fall-off. It would be interesting to figure out what it is. I might give Intel a call when I have time and set up another machine in their developer lab to run some of these tests. In the meantime, something is certainly fishy. It would be interesting to see if anyone else can run some NPS numbers. I have easy access to 2x4 and 2x6 cluster nodes, but 2x12 is much harder since it is a machine that doesn't belong to me.
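
In case anyone wants to collect such NPS numbers automatically, here is a throwaway helper for pulling the nps= field out of a search summary line like the one above (the K/M/B suffix handling is an assumption; adjust it to whatever units your logs actually use):

# Throwaway helper: pull the nps= field out of a Crafty-style search summary
# line and convert it to nodes per second. Suffix handling (K/M/B) is assumed.
import re

def parse_nps(line):
    m = re.search(r"nps=([\d.]+)([KMB]?)", line)
    if not m:
        return None
    value, suffix = float(m.group(1)), m.group(2)
    return value * {"": 1, "K": 1e3, "M": 1e6, "B": 1e9}[suffix]

line = "time=1:25(89%) n=4283303045(4.3B) fh1=81% nps=50.1M 50=0"
print(parse_nps(line))   # 50100000.0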
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Scaling with threads based on Andreas Strangmüller tests

Post by bob »

Laskos wrote:The logarithmic fit, now using Andreas Strangmüller's new 32-thread result for Komodo 8.

The red line is R. Hyatt's mysterious 32 ESU for Crafty on a 64-core Itanium box, extended as linear scaling (which he calls "poor scaling").

The green line is the actual ESU of Komodo 8 measured by Andreas Strangmüller, with 3,000 games at each data point.

The black line is the logarithmic model fitted to Andreas Strangmüller's results up to 32 cores. It is practically indistinguishable from the green data line for Komodo 8 up to 32 cores. It predicts an ESU of 12.7 on 64 AMD Opteron cores for Komodo 8, compared to the 32 ESU of R. Hyatt's "badly scaling" Crafty. He even cited 1,000 ESU on 1,024 cores for other, non-chess applications as a good example for chess (alpha/beta search scaling).

[Plot: the logarithmic model fitted up to 32 cores (black) against the Komodo 8 data (green) and the linear 32x-on-64-cores line (red), extrapolated to 64 cores]
I do not understand your obsession with those Itanium numbers. Every architecture is different. Every configuration of each architecture is different. Every program implements parallel search differently. You are trying to explain something you simply do not understand very well, and not doing a very good job of it. My linear formula was NEVER intended to go beyond 16. I have told you that repeatedly, yet you continually take it to 32 and even extrapolate to beyond 60. You don't see the problem there? Of course not, because you don't understand the topic.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Scaling with threads based on Andreas Strangmüller tests

Post by mjlef »

Bob, were your results with a Windows compile? Perhaps this issue is related to Windows compiles?

Mark
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Scaling with threads based on Andreas Strangmüller tests

Post by bob »

mjlef wrote:Bob, were your results with a Windows compile? Perhaps this issue is related to Windows compiles?

Mark
No, I don't run Windows anywhere. Almost every box I have runs Linux except for a couple of OS X Mac machines. My primary compiler is Intel, but on one cluster we could not get it to install cleanly due to some other issues, leaving me with gcc or nothing. gcc has been acting quite quirky. I posted a couple of threads here a couple of months back about a crazy NPS slowdown that made absolutely no sense: the 12-CPU NPS would be less than 1/3 of the normal value if I compiled one way, but change something and it would go back to normal, until something else was changed. Sometimes profiling ran normally, sometimes at 1/3 speed. Etc.

I have never had a case with Crafty where NPS drops with more cores, so long as NOTHING else is running on that machine (I use spin locks everywhere and they do NOT work well if there are more threads than cores, or if there are other things running.)
Modern Times
Posts: 3546
Joined: Thu Jun 07, 2012 11:02 pm

Re: Scaling with threads based on Andreas Strangmüller tests

Post by Modern Times »

bob wrote: (I use spin locks everywhere and they do NOT work well if there are more threads than cores, or if there are other things running.)
Then Crafty probably has a problem with the modular design of these AMD CPUs. Other engines are fairly happy with it.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Scaling with threads based on Andreas Strangmüller tests

Post by Evert »

Laskos wrote:That confirms V. Rajlich's empirical diminishing-returns numbers and my logarithmic model, and disconfirms:

1/ The linear model
2/ R. Hyatt's claim that 32x scaling on 64 cores is "bad"

I plot here R. Hyatt's model for Crafty (a constant 1.78 ESU per doubling) in red, which is "bad scaling" in his own words, Komodo 8's measured scaling in green, and my logarithmic model in black, which predicts that Komodo 8 will scale to an ESU of about 13 on 64 cores of an AMD Opteron or Intel Xeon architecture. My model is fitted only up to 16 cores, so the 32-core behavior is a _prediction_. And it seems to fit the empirical data for Komodo on 32 cores well.
I'm starting to get pretty annoyed at this type of argument.

1. How much accuracy do you expect to get from a simple linear fit to anything outside the range of parameters for which the fit was obtained?
All you can ever use it for is an educated first guess, in the absence of other data. That comes with the caveat that it is just that: an educated guess.
2. A 32x speedup for 64x more resources is awful. If I submitted a job on a cluster with that type of scaling, I would be told to free up some cores for other people and not waste everybody's time.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Scaling with threads based on Andreas Strangmüller tests

Post by Laskos »

Evert wrote:
Laskos wrote:That confirms V. Rajlich's empirical diminishing-returns numbers and my logarithmic model, and disconfirms:

1/ The linear model
2/ R. Hyatt's claim that 32x scaling on 64 cores is "bad"

I plot here R. Hyatt's model for Crafty (a constant 1.78 ESU per doubling) in red, which is "bad scaling" in his own words, Komodo 8's measured scaling in green, and my logarithmic model in black, which predicts that Komodo 8 will scale to an ESU of about 13 on 64 cores of an AMD Opteron or Intel Xeon architecture. My model is fitted only up to 16 cores, so the 32-core behavior is a _prediction_. And it seems to fit the empirical data for Komodo on 32 cores well.
I'm starting to get pretty annoyed at this type of argument.

1. How much accuracy do you expect to get from a simple linear fit to anything outside the range of parameters for which the fit was obtained?
All you can ever use it for is an educated first guess, in the absence of other data. That comes with the caveat that it is just that: an educated guess.
2. A 32x speedup for 64x more resources is awful. If I submitted a job on a cluster with that type of scaling, I would be told to free up some cores for other people and not waste everybody's time.
1/ In chess, with alpha/beta search and YBWC parallelization, one shouldn't spread hints that the scaling with threads is linear. It is logarithmic or even Amdahl-like.
2/ If the educated guess is valid for only 3 doublings, then one doesn't invent a linear formula for those 3 data points; one just remembers a few numbers like 1.8, 3.1, 5.5 (see the sketch after this list).
3/ With YBWC, 32 ESU on 64 cores is VERY HIGH, contrary to what you and R. Hyatt are stating. Was V. Rajlich completely incompetent when he talked about an ESU of 7 on 16 cores and his diminishing progression?
4/ Some applications are more parallelizable than others; in many games, even parallelizing to 2-3 cores is problematic. If 32 ESU on 64 cores is "awful" for chess, then just show up with a crappy engine on 256 cores with an ESU of 128 and win the WCCC.
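
Regarding 2/: that handful of numbers really is all the bookkeeping needed; converting between cumulative ESU values and per-doubling factors is one line each way (the 1.8, 3.1, 5.5 values below are the ones quoted above, taken as given):

# Convert between cumulative ESU values (here at 2, 4, 8 threads) and the
# per-doubling factors they imply -- the few numbers worth remembering.
from itertools import accumulate
from operator import mul

cumulative = [1.8, 3.1, 5.5]                       # ESU at 2, 4, 8 threads
per_doubling = [b / a for a, b in zip([1.0] + cumulative[:-1], cumulative)]
print(per_doubling)                                # [1.8, ~1.72, ~1.77]

# ...and back again:
print(list(accumulate(per_doubling, mul)))         # [1.8, 3.1, 5.5] (up to rounding)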
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Scaling with threads based on Andreas Strangmüller tests

Post by syzygy »

Evert wrote:2. A 32x speedup for 64x more resources is awful. If I submitted a job on a cluster with that type of scaling, I would be told to free up some cores for other people and not waste everybody's time.
Not at all. 32x speedup for 64x more resources will usually be an excellent deal. In chess it is likely far more than one can hope for.

Whether 64x more resources for 32x speedup is "worth it" will depend on the value of being 32x faster. If it means finding the cure to some deadly disease in 1 month instead of 32 months, it should certainly be considered. The gain in time can be worth far more than the investment in more resources.

In 1996/1997, Deep Blue got only a relatively modest speedup from the huge amount of resources thrown at it. That was still worth it to IBM.