Real Speedup due to core doubling etc

Werewolf · Post by **Werewolf** » Tue Jul 15, 2014 1:07 pm

I have two questions, both of which seem to be disputed:

1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:

number of cores^0.76

To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.

Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.

2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.

This is what I've heard so far:

Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)

Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)

My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:

2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS

which means you need a co-efficient of 0.85 to make this happen, which is a loss of 15%

Can anyone comment on this, please?

I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!
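
A minimal sketch of the arithmetic in the post above, assuming the power-law rule of thumb speedup(N) = N^0.76 taken from the question. The function names are illustrative only and do not come from any engine.

```python
def power_law_speedup(cores: int, exponent: float = 0.76) -> float:
    """Time-to-depth speedup predicted by the N**exponent rule of thumb."""
    return cores ** exponent


def per_core_efficiency(cores: int, exponent: float = 0.76) -> float:
    """Fraction of ideal linear scaling that the rule of thumb retains."""
    return power_law_speedup(cores, exponent) / cores


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} cores: speedup {power_law_speedup(n):5.2f}, "
              f"efficiency {per_core_efficiency(n):.3f}")
```

With the default exponent, two cores come out at roughly 1.69x with an efficiency of about 0.85, which is where the 15% figure comes from. Under this model the same factor of 2^0.76 / 2 applies again at every further doubling.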

Vinvin · Post by **Vinvin** » Tue Jul 15, 2014 1:59 pm

I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two config only in 32 bit and one in 64 and 32 bit.

Code:

         CPU               Mode       Freq       Nbr cores   Kn/s  2C/(2*1C)  2C/1C
Intel Core2Duo T-7300      w32       2.00GHz         2       1082   0,958   1,915
Intel Core2Duo T-7300      w32       2.00GHz         1        565 

AMD Athlon 64 X2 4200+     w32       2.21GHz         2       1019   0,974   1,948
AMD Athlon 64 X2 4200+     w32       2.21GHz         1        523

AMD Athlon 64 X2 4200+     x64       2.21GHz         2       1243   0,932   1,864
AMD Athlon 64 X2 4200+     x64       2.21GHz         1        667

Intel Corei3-2100          w32       3.10GHz         2       1804   0,953   1,907
Intel Corei3-2100          w32       3.10GHz         1        946

It's clear that these numbers are well above 0.85 (or x1.7), a figure that may be 10 years old (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads

Werewolf · Post by **Werewolf** » Tue Jul 15, 2014 2:16 pm

Vinvin wrote:I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two config only in 32 bit and one in 64 and 32 bit.
Code:
         CPU               Mode       Freq       Nbr cores   Kn/s  2C/(2*1C)  2C/1C
Intel Core2Duo T-7300      w32       2.00GHz         2       1082   0,958   1,915
Intel Core2Duo T-7300      w32       2.00GHz         1        565 

AMD Athlon 64 X2 4200+     w32       2.21GHz         2       1019   0,974   1,948
AMD Athlon 64 X2 4200+     w32       2.21GHz         1        523

AMD Athlon 64 X2 4200+     x64       2.21GHz         2       1243   0,932   1,864
AMD Athlon 64 X2 4200+     x64       2.21GHz         1        667

Intel Corei3-2100          w32       3.10GHz         2       1804   0,953   1,907
Intel Corei3-2100          w32       3.10GHz         1        946
It's clear that these numbers are well above 0.85 (or x1.7), a figure that may be 10 years old (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads

No, I think you're missing the point, sorry.

I'm not talking about the NPS on the screen. I made it clear above I'm talking about search efficiency loss and TRUE speed derived from it.

zullil · Post by **zullil** » Tue Jul 15, 2014 2:16 pm

Werewolf wrote:
Can anyone comment on this, please?

I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!

Code:

Latest Stockfish
Hash = 1024 MB
Threads = 1
position = starting position
nps recorded at depth 30 = 1,260,000

Latest Stockfish
Hash = 1024 MB
Threads = 16
position = starting position
nps recorded at depth 30 = 15,100,000

ratio is 12.0

16 ^ (0.76) = 8.2

With Threads = 16, nps will continue to increase with depth, at least for a while. For example,

Code:

Latest Stockfish
Hash = 1024 MB
Threads = 16
position = starting position
nps recorded at depth 35 = 16,800,000

[EDIT] Apparently I don't understand your post either. Maybe you should try again.

Do you mean speedup measured in some reasonable way, such as time to a fixed depth?

In that case, with one thread depth 30 was finished in 795561 ms, while with 16 threads completion of depth 30 took 67785 ms (at least for this single run). That's a ratio of about 11.7, so again the formula doesn't look so good.
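
As a rough sketch, the exponent implied by a measured time-to-depth speedup can be backed out as ln(speedup) / ln(threads). The two timings below are the ones quoted in this post; the function name is illustrative, and a single run is not a reliable measurement.

```python
import math

# Timings quoted in the post: time to complete depth 30, in milliseconds.
TIME_1_THREAD_MS = 795561
TIME_16_THREADS_MS = 67785


def implied_exponent(speedup: float, threads: int) -> float:
    """Solve threads**x == speedup for x."""
    if threads < 2:
        raise ValueError("need at least 2 threads to define an exponent")
    return math.log(speedup) / math.log(threads)


if __name__ == "__main__":
    speedup = TIME_1_THREAD_MS / TIME_16_THREADS_MS
    print(f"time-to-depth speedup: {speedup:.2f}")
    print(f"implied exponent:      {implied_exponent(speedup, 16):.3f}")
```

For this one run the implied exponent is about 0.89 rather than 0.76.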

Vinvin · Post by **Vinvin** » Tue Jul 15, 2014 3:00 pm

Werewolf wrote:
Vinvin wrote:I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two config only in 32 bit and one in 64 and 32 bit.
Code:
         CPU               Mode       Freq       Nbr cores   Kn/s  2C/(2*1C)  2C/1C
Intel Core2Duo T-7300      w32       2.00GHz         2       1082   0,958   1,915
Intel Core2Duo T-7300      w32       2.00GHz         1        565 

AMD Athlon 64 X2 4200+     w32       2.21GHz         2       1019   0,974   1,948
AMD Athlon 64 X2 4200+     w32       2.21GHz         1        523

AMD Athlon 64 X2 4200+     x64       2.21GHz         2       1243   0,932   1,864
AMD Athlon 64 X2 4200+     x64       2.21GHz         1        667

Intel Corei3-2100          w32       3.10GHz         2       1804   0,953   1,907
Intel Corei3-2100          w32       3.10GHz         1        946
It's clear that these numbers are well above 0.85 (or x1.7), a figure that may be 10 years old (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
No, I think you're missing the point, sorry.

I'm not talking about the NPS on the screen. I made it clear above I'm talking about search efficiency loss and TRUE speed derived from it.

The point is to get a realistic formula. I have been trying to get recent, real numbers for some years, without success.
http://www.talkchess.com/forum/viewtopi ... 855#551855

bob · Post by **bob** » Tue Jul 15, 2014 7:15 pm

Werewolf wrote:I have two questions, both of which seem to be disputed:

1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:

number of cores^0.76

To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.

Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.

2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.

This is what I've heard so far:

Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)

Based on empirical testing more than anything. I have not done this test/analysis in a few years, so it MIGHT have changed although I personally doubt it.

For Crafty, here's some current data. Not nearly enough, but I ran this just now on my office iMac, which is a quad-core i7 running at 3.1 GHz, with 8 MB of L3 cache and 16 GB of RAM. I made 4 runs with 4 threads, 4 runs with 8 threads (which uses hyper-threading, obviously) and then 1 run with one thread. Test position was Kopec #22, searched to a depth of 30 plies. Here's a small table with speedup (time(1) / time(n)) and NPS increase (nps(n)/nps(1)). I would normally run 16 of each to smooth things out, but this gives the idea:

Code:


#cpu    time    speedup

   4    49.2      4.0
   4    58.8      3.4
   4    47.7      3.7
   4    52.7      3.7
 avg    52.1      3.8

   8    40.8      4.8
   8    60.0      3.3
   8    43.0      4.6
   8    41.6      4.7
 avg    46.4      4.3

So on my iMac, hyper threading actually seems to help, but I am not certain, because I don't particularly trust the process scheduler to run 4 threads on 4 physical cores correctly.
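
A small sketch of how the run times in the table above can be summarised, using only the eight timings listed there. The helper names are illustrative; the one-thread time is not given in the post, so only the 4-thread versus 8-thread comparison is computed.

```python
from statistics import mean

# Run times copied from the table above (same units as the table).
TIMES_4_THREADS = [49.2, 58.8, 47.7, 52.7]
TIMES_8_THREADS = [40.8, 60.0, 43.0, 41.6]


def relative_gain(baseline_times, candidate_times) -> float:
    """Ratio of mean times; values above 1 mean the candidate is faster."""
    return mean(baseline_times) / mean(candidate_times)


def spread(times) -> float:
    """Slowest run divided by fastest run, a crude measure of noise."""
    return max(times) / min(times)


if __name__ == "__main__":
    print(f"mean time, 4 threads: {mean(TIMES_4_THREADS):.1f}")
    print(f"mean time, 8 threads: {mean(TIMES_8_THREADS):.1f}")
    print(f"gain from 8 threads:  {relative_gain(TIMES_4_THREADS, TIMES_8_THREADS):.3f}")
    print(f"spread, 8 threads:    {spread(TIMES_8_THREADS):.3f}")
```

On these eight runs the average gain from 8 threads is about 1.12x, while the slowest 8-thread run took about 1.47x as long as the fastest, so the noise is larger than the effect being measured.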

For my estimated speedup formula, the last time I measured this carefully, each additional thread added about 30% EXTRA nodes (30% of the nodes 1 cpu searched) as search overhead. It is not perfectly linear, but a simple linear fit goes

speedup = 1 + (N - 1) * 0.7

where N = number of CPUs. That is a fairly pessimistic formula, but it works. 0.8 might be a better number today. I need to do a bunch of test runs to confirm, which I will do later this week.

For 4 CPUs, that gives 3.1x as the predicted speedup. The 4^0.76 formula predicts 2.9x. Pretty close. (I never saw any super-performance from Rybka's parallel search in the few numbers that were ever posted.)
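
The two rules of thumb can be put side by side with a short sketch. Both formulas are the ones quoted in this thread; the function names and the default parameters (0.7 and 0.76) are just labels for them, not anything taken from Crafty or Rybka.

```python
def linear_speedup(cpus: int, per_cpu_gain: float = 0.7) -> float:
    """Linear fit from this post: speedup = 1 + (N - 1) * per_cpu_gain."""
    return 1 + (cpus - 1) * per_cpu_gain


def power_law_speedup(cpus: int, exponent: float = 0.76) -> float:
    """Power-law rule from the opening post: speedup = N ** exponent."""
    return cpus ** exponent


if __name__ == "__main__":
    print(" N   linear   power")
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d}   {linear_speedup(n):6.2f}  {power_law_speedup(n):6.2f}")
```

The two agree closely at small N (3.1 versus about 2.87 at N = 4) and drift apart as N grows, since one is linear in N and the other is not.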

Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)

My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:

2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS

which means you need a co-efficient of 0.85 to make this happen, which is a loss of 15%

Can anyone comment on this, please?

I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!

There is NO "principle" here. Every parallel approach, combined with a specific search implementation, will produce different speedup numbers. This is a "personalized" number that will fit one specific implementation of one program. My numbers have changed a bit because I spent some time over the past year trying to clean up the parallel search to make it faster and more efficient. And my speedup numbers have changed. A "one size fits all" approach REALLY turns into a "one size fits none" formula for this specific performance measure.

Note that I don't trust my speedup numbers from the iMac because of Turbo Boost, which can't be turned off. I'll run this on my cluster later this week, where I can have Turbo Boost and hyper-threading both disabled.

bob · Post by **bob** » Tue Jul 15, 2014 7:16 pm

Vinvin wrote:I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two config only in 32 bit and one in 64 and 32 bit.
Code:
         CPU               Mode       Freq       Nbr cores   Kn/s  2C/(2*1C)  2C/1C
Intel Core2Duo T-7300      w32       2.00GHz         2       1082   0,958   1,915
Intel Core2Duo T-7300      w32       2.00GHz         1        565 

AMD Athlon 64 X2 4200+     w32       2.21GHz         2       1019   0,974   1,948
AMD Athlon 64 X2 4200+     w32       2.21GHz         1        523

AMD Athlon 64 X2 4200+     x64       2.21GHz         2       1243   0,932   1,864
AMD Athlon 64 X2 4200+     x64       2.21GHz         1        667

Intel Corei3-2100          w32       3.10GHz         2       1804   0,953   1,907
Intel Corei3-2100          w32       3.10GHz         1        946
It's clear that these numbers are well above 0.85 (or x1.7), a figure that may be 10 years old (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads

You are comparing apples and oranges. The speedup formula(s) you gave above are NOT measuring the improvement in NPS, but instead predict how much faster 2N processors will search to the same depth when compared to just N.
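
The distinction can be sketched with one line of arithmetic. Since time to a given depth equals nodes searched divided by NPS, the time-to-depth speedup is the NPS ratio divided by the ratio of nodes searched. The example figures below combine numbers from different posts and different engines (an NPS ratio of 1.915 from the table quoted above, and a hypothetical 30% node overhead), so treat them purely as an illustration.

```python
def time_to_depth_speedup(nps_ratio: float, node_ratio: float) -> float:
    """Speedup in time to reach the same depth.

    time = nodes / nps, so
    time(1) / time(N) = (nps(N) / nps(1)) / (nodes(N) / nodes(1)).
    """
    if node_ratio <= 0:
        raise ValueError("node_ratio must be positive")
    return nps_ratio / node_ratio


if __name__ == "__main__":
    # Illustration only: NPS nearly doubles, but the parallel search is
    # assumed to visit 30% more nodes to reach the same depth.
    print(f"{time_to_depth_speedup(1.915, 1.3):.3f}")
```

Under those assumed inputs, an NPS ratio of 1.915 shrinks to a time-to-depth speedup of about 1.47.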

Vinvin · Post by **Vinvin** » Tue Jul 15, 2014 8:25 pm

bob wrote:
Vinvin wrote:I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two config only in 32 bit and one in 64 and 32 bit.
Code:
         CPU               Mode       Freq       Nbr cores   Kn/s  2C/(2*1C)  2C/1C
Intel Core2Duo T-7300      w32       2.00GHz         2       1082   0,958   1,915
Intel Core2Duo T-7300      w32       2.00GHz         1        565 

AMD Athlon 64 X2 4200+     w32       2.21GHz         2       1019   0,974   1,948
AMD Athlon 64 X2 4200+     w32       2.21GHz         1        523

AMD Athlon 64 X2 4200+     x64       2.21GHz         2       1243   0,932   1,864
AMD Athlon 64 X2 4200+     x64       2.21GHz         1        667

Intel Corei3-2100          w32       3.10GHz         2       1804   0,953   1,907
Intel Corei3-2100          w32       3.10GHz         1        946
It's clear that these numbers are well above 0.85 (or x1.7), a figure that may be 10 years old (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
You are comparing apples and oranges. The speedup formula(s) you gave above are NOT measuring the improvement in NPS, but instead predict how much faster 2N processors will search to the same depth when compared to just N.

Oops, yes sorry. Too hot, too tired = too much bills**t from me

Ajedrecista · Post by **Ajedrecista** » Tue Jul 15, 2014 8:38 pm

Hello Carl:

Werewolf wrote:I have two questions, both of which seem to be disputed:

1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:

number of cores^0.76

To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.

Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.

2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.

This is what I've heard so far:

Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)

Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)

My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:

2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS

which means you need a co-efficient of 0.85 to make this happen, which is a loss of 15%

Can anyone comment on this, please?

I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!

I agree with Bob when he says that each engine has its own SMP implementation, which performs differently from those of other engines.

I paste here a link of speeds versus cores:

Threads factor: Komodo, Houdini, Stockfish and Zappa

I realize that you do not want 'NPS on the screen' (quoting you) but probably you can toy with these numbers. I would propose something like cores N in the x-axis and y1 = ln[v(N)/v(1)]/ln(N) in the y-axis (where v(N) is the speed with N cores); or N in the x-axis and y2 = v(N)/[N*v(1)] in the y-axis, and see what happens.

I did this with Houdini (the data is easier to write in Excel because speeds are rounded to kN/s) and I got:

Code:

y1 = ln[v(N)/v(1)]/ln(N)
y2 = v(N)/[N*v(1)]

 N          y1              y2

 2      0.919977917     0.946043165
 3      0.903368825     0.899280576
 4      0.904418445     0.875899281
 5      0.904750354     0.857873701
 6      0.899937132     0.835864642
 7      0.910408467     0.840013703
 8      0.907977677     0.825839329
 9      0.904157106     0.810107470
10      0.892325980     0.780415667
11      0.896788218     0.780757212
12      0.884085220     0.749733546
13      0.884913365     0.744389104
14      0.880238100     0.729016787
15      0.874521116     0.711910472
16      0.875828540     0.708733014
17      0.854114429     0.661447313
18      0.830518752     0.612709832
19      0.826192471     0.599436240
20      0.823265285     0.588928857
21      0.800436038     0.544669027
22      0.789134766     0.521110384
23      0.785680293     0.510687102
24      0.782670290     0.501232347
25      0.778856370     0.490743405
26      0.776088215     0.482137367
27      0.767819800     0.465227818
28      0.765152614     0.457234213
29      0.762926564     0.450095096
30      0.758901320     0.440420997
31      0.754006147     0.429669168
32      0.750575725     0.421287970

Excel calculated these trend lines by least squares (linear approximation):

Code:

(y1)* ~ 0.9492 - 0.0064*N; R² ~ 0.9441
(y2)* ~ 0.9612 - 0.0181*N; R² ~ 0.9832

I do not know whether this makes sense. For example, for a fit (y2)* = a*N + b, I would say that a smaller |a| is good (less efficiency loss per added core). The same applies to (y1)*.
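
The trend lines can be reproduced outside Excel with a short sketch. It uses the y2 column from the table above, recovers y1 through the identity y1 = 1 + ln(y2)/ln(N) (which follows from the two definitions), and fits a straight line by ordinary least squares. The helper names are illustrative, and the raw speeds v(N) are not needed.

```python
import math

# y2 = v(N) / (N * v(1)) for N = 2..32, copied from the table above.
CORES = list(range(2, 33))
Y2 = [
    0.946043165, 0.899280576, 0.875899281, 0.857873701, 0.835864642,
    0.840013703, 0.825839329, 0.810107470, 0.780415667, 0.780757212,
    0.749733546, 0.744389104, 0.729016787, 0.711910472, 0.708733014,
    0.661447313, 0.612709832, 0.599436240, 0.588928857, 0.544669027,
    0.521110384, 0.510687102, 0.501232347, 0.490743405, 0.482137367,
    0.465227818, 0.457234213, 0.450095096, 0.440420997, 0.429669168,
    0.421287970,
]


def y1_from_y2(n: int, y2: float) -> float:
    """v(N)/v(1) = N * y2, hence y1 = ln(N * y2) / ln(N) = 1 + ln(y2) / ln(N)."""
    return 1.0 + math.log(y2) / math.log(n)


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    y1 = [y1_from_y2(n, y) for n, y in zip(CORES, Y2)]
    for label, ys in (("y1", y1), ("y2", Y2)):
        slope, intercept = least_squares_line(CORES, ys)
        print(f"({label})* ~ {intercept:.4f} + ({slope:.4f})*N")
```

The y2 fit should come out close to the Excel trend line quoted above, with a slope near -0.018 and an intercept near 0.961.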

------------

The counter says 1000 posts. It was unthinkable for me when I registered here almost three years ago!

Regards from Spain.

Ajedrecista.

CRoberson · Post by **CRoberson** » Wed Jul 16, 2014 6:23 am

IIRC, the Rybka team knew of the equation NPS speedup = 1 + (N-1)*0.7, but they saw many customers getting confused when the TTP (Time To Ply) speedup didn't equal the NPS speedup due to the workload gain. So, they adjusted the equation to be a TTP equation.
