I have two questions, both of which seem to be disputed:
1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:
number of cores^0.76
To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.
Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.
2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.
This is what I've heard so far:
Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)
Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)
My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:
2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS
which means you need a coefficient of 0.85 to make this happen, i.e. a loss of 15%
Can anyone comment on this, please?
I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!
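To make the arithmetic in 1) and 2) concrete, here is a small sketch of the power-law model as I understand it (my own illustration of the N^0.76 rule, not anyone's official formula):

```python
# Effective speedup from the N^0.76 rule, and the implied per-doubling efficiency.

def effective_speedup(cores: int, exponent: float = 0.76) -> float:
    """Predicted speedup over 1 core under the power-law model."""
    return cores ** exponent

def per_doubling_efficiency(exponent: float = 0.76) -> float:
    """Fraction of a perfect 2x retained per doubling: 2^e / 2."""
    return 2 ** exponent / 2

print(effective_speedup(2))                  # ~1.69 (the "1.7x per doubling")
print(per_doubling_efficiency())             # ~0.85 (i.e. ~15% loss per doubling)
print(1000 * 2 * per_doubling_efficiency())  # 2000 raw NPS -> ~1693 effective NPS
```

Note that under this model the loss compounds: two doublings retain about 0.85^2 ~ 0.72 of the raw NPS, which is why the formula gets pessimistic quickly at high core counts.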
Real Speedup due to core doubling etc
Moderators: hgm, Rebel, chrisw
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Real Speedup due to core doubling etc
I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two configs in 32-bit only, and one in both 64-bit and 32-bit.
It's clear that the numbers are way higher than the 0.85 (or *1.7) figure from 10 years ago (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
Code: Select all
CPU                     Mode  Freq     Nbr cores  Kn/s  1C*2/2C  2C/1C
Intel Core2Duo T-7300   w32   2.00GHz  2          1082  0,958    1,915
Intel Core2Duo T-7300   w32   2.00GHz  1           565
AMD Athlon 64 X2 4200+  w32   2.21GHz  2          1019  0,974    1,948
AMD Athlon 64 X2 4200+  w32   2.21GHz  1           523
AMD Athlon 64 X2 4200+  x64   2.21GHz  2          1243  0,932    1,864
AMD Athlon 64 X2 4200+  x64   2.21GHz  1           667
Intel Core i3-2100      w32   3.10GHz  2          1804  0,953    1,907
Intel Core i3-2100      w32   3.10GHz  1           946
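If I understand Vinvin's columns correctly, 2C/1C is just Kn/s(2 cores) / Kn/s(1 core), and 1C*2/2C is that ratio divided by 2. A quick sketch that recomputes both columns from the raw figures in the table above (the machine labels are my shorthand):

```python
# Recompute the ratio columns of the table above from the raw Kn/s figures:
# (1-core Kn/s, 2-core Kn/s) per configuration.
machines = {
    "Core2Duo T-7300 (w32)":    (565, 1082),
    "Athlon 64 X2 4200+ (w32)": (523, 1019),
    "Athlon 64 X2 4200+ (x64)": (667, 1243),
    "Core i3-2100 (w32)":       (946, 1804),
}

for name, (one_core, two_cores) in machines.items():
    nps_ratio = two_cores / one_core  # the "2C/1C" column
    efficiency = nps_ratio / 2        # the "1C*2/2C" column
    print(f"{name}: 2C/1C = {nps_ratio:.3f}, efficiency = {efficiency:.3f}")
```

All four efficiencies come out in the 0.93 to 0.97 range, matching the 1C*2/2C column, which is the point being made: raw NPS scaling on these dual cores is much better than 0.85.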
-
- Posts: 1796
- Joined: Thu Sep 18, 2008 10:24 pm
Re: Real Speedup due to core doubling etc
No, I think you're missing the point, sorry.
Vinvin wrote:
I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two configs in 32-bit only, and one in both 64-bit and 32-bit. It's clear that the numbers are way higher than the 0.85 (or *1.7) figure from 10 years ago (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
I'm not talking about the NPS on the screen. I made it clear above I'm talking about search efficiency loss and TRUE speed derived from it.
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Real Speedup due to core doubling etc
Werewolf wrote:
Can anyone comment on this, please?
I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!
Code: Select all
Latest Stockfish
Hash = 1024 MB
Threads = 1
position = starting position
nps recorded at depth 30 = 1,260,000
Latest Stockfish
Hash = 1024 MB
Threads = 16
position = starting position
nps recorded at depth 30 = 15,100,000
16 ^ (0.76) = 8.2
With Threads = 16, nps will continue to increase with depth, at least for a while. For example,
Code: Select all
Latest Stockfish
Hash = 1024 MB
Threads = 16
position = starting position
nps recorded at depth 35 = 16,800,000
Do you mean speedup measured in some reasonable way, such as time to a fixed depth?
In that case, with one thread depth 30 was finished in 795561 ms, while with 16 threads completion of depth 30 took 67785 ms (at least for this single run). That's a ratio of about 11.7, so again the formula doesn't look so good.
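To make the two different measures explicit, here is how the figures from this post can be compared (the numbers are copied from above; only the arithmetic is mine):

```python
# Two different "speedups" from the same run: raw NPS scaling vs time-to-depth.
nps_1t, nps_16t = 1_260_000, 15_100_000  # NPS at depth 30, 1 thread vs 16 threads
ms_1t, ms_16t = 795_561, 67_785          # wall time (ms) to finish depth 30

nps_scaling = nps_16t / nps_1t           # raw node-rate increase, ~12.0x
ttd_speedup = ms_1t / ms_16t             # time-to-depth speedup, ~11.7x
predicted = 16 ** 0.76                   # the power-law prediction, ~8.2x

print(f"NPS scaling:        {nps_scaling:.1f}x")
print(f"Time-to-depth:      {ttd_speedup:.1f}x")
print(f"16^0.76 prediction: {predicted:.1f}x")
```

Both measured numbers come out well above the 8.2x the formula predicts, which is the poster's point: for this single run the N^0.76 rule looks too pessimistic for a modern Stockfish.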
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Real Speedup due to core doubling etc
The point is to get a realistic formula. I have tried to get recent, real numbers for some years, without success.
Werewolf wrote:
No, I think you're missing the point, sorry.
Vinvin wrote:
I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two configs in 32-bit only, and one in both 64-bit and 32-bit. It's clear that the numbers are way higher than the 0.85 (or *1.7) figure from 10 years ago (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
I'm not talking about the NPS on the screen. I made it clear above I'm talking about search efficiency loss and TRUE speed derived from it.
http://www.talkchess.com/forum/viewtopi ... 855#551855
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Real Speedup due to core doubling etc
Werewolf wrote:I have two questions, both of which seem to be disputed:
1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:
number of cores^0.76
To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.
Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.
2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.
This is what I've heard so far:
Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)
Based on empirical testing more than anything. I have not done this test/analysis in a few years, so it MIGHT have changed although I personally doubt it.
For Crafty, here's some current data. Not nearly enough, but I ran this just now on my office iMac, which is a quad-core i7 running at 3.1 GHz with 8 MB of L3 cache and 16 GB of RAM. I made 4 runs with 4 threads, 4 runs with 8 threads (which uses hyper-threading, obviously) and then 1 run with one thread. The test position was Kopec #22, searched to a depth of 30 plies. Here's a small table with the speedup (time(1) / time(n)) and NPS increase (nps(n) / nps(1)). I would normally run 16 of each to smooth things out, but this gives the idea:
Code: Select all
#cpu  time  speedup
 4    49.2  4.0
 4    58.8  3.4
 4    47.7  3.7
 4    52.7  3.7
avg   52.1  3.8
 8    40.8  4.8
 8    60.0  3.3
 8    43.0  4.6
 8    41.6  4.7
avg   46.4  4.3
For my estimated speedup formula: the last time I measured this carefully, each additional thread added about 30% EXTRA nodes (30% of the nodes 1 CPU searched) as search overhead. It is not perfectly linear, but a simple linear fit gives
speedup = 1 + (N - 1) * 0.7
where N = number of CPUs. That is a fairly pessimistic formula, but it works. 0.8 might be a better number today; I need to do a bunch of test runs to confirm, which I will do later this week.
For 4 CPUs, that gives 3.1x as the predicted speedup. The 4^0.76 formula predicts 2.9x. Pretty close (I never saw any super-performance from Rybka's parallel search in the few numbers that were ever posted).
There is NO "principle" here. Every parallel approach, combined with a specific search implementation, will produce different speedup numbers. This is a "personalized" number that will fit one specific implementation of one program. My numbers have changed a bit because I spent some time over the past year trying to clean up the parallel search to make it faster and more efficient. And my speedup numbers have changed. A "one size fits all" approach REALLY turns into a "one size fits none" formula for this specific performance measure.
Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)
My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:
2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS
which means you need a co-efficient of 0.85 to make this happen, which is a loss of 15%
Can anyone comment on this, please?
I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!
Note that I don't trust my speedup numbers from the iMac because of turbo boost, which can't be turned off. I'll run this on my cluster later this week, where I can have turbo boost and hyper-threading both disabled.
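The two formulas floated in this thread can be tabulated side by side; both equations are quoted from the posts above, and the code is just my illustration:

```python
# Bob's linear fit vs the N^0.76 power law, side by side.

def linear_speedup(n: int, per_thread: float = 0.7) -> float:
    """speedup = 1 + (N - 1) * 0.7  (the pessimistic linear fit)."""
    return 1 + (n - 1) * per_thread

def power_speedup(n: int, exponent: float = 0.76) -> float:
    """speedup = N^0.76  (the formula favoured by the Rybka team)."""
    return n ** exponent

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cores: linear {linear_speedup(n):4.1f}x, power {power_speedup(n):4.1f}x")
```

At 4 cores the two agree closely (3.1x vs ~2.9x, as noted above), but they diverge at higher core counts: the linear fit keeps adding 0.7 per core while the power law flattens.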
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Real Speedup due to core doubling etc
You are comparing apples and oranges. The speedup formula(s) you gave above are NOT measuring the improvement in NPS; they predict how much faster 2N processors will search to the same depth compared to just N.
Vinvin wrote:
I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two configs in 32-bit only, and one in both 64-bit and 32-bit. It's clear that the numbers are way higher than the 0.85 (or *1.7) figure from 10 years ago (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Real Speedup due to core doubling etc
Oops, yes, sorry. Too hot, too tired = too much bulls**t from me.
bob wrote:
You are comparing apples and oranges. The speedup formula(s) you gave above are NOT measuring the improvement in NPS; they predict how much faster 2N processors will search to the same depth compared to just N.
Vinvin wrote:
I give the numbers for my 3 computers (dual core).
see here : http://www.sedatcanbaz.com/chess/?page_id=19
Two configs in 32-bit only, and one in both 64-bit and 32-bit. It's clear that the numbers are way higher than the 0.85 (or *1.7) figure from 10 years ago (or more?).
This could be explained by :
- dual CPU more efficient (hardware)
- algo+code more efficient (programming)
- OS more efficient to manage 2 CPUs or organize threads
-
- Posts: 1969
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Real speedup due to core doubling, etc.
Hello Carl:
I agree with Bob when he says that each engine has its own SMP implementation that performs differently from the others.
Werewolf wrote:
I have two questions, both of which seem to be disputed:
1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:
number of cores^0.76
To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.
Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.
2) Linked to the above question:
Can someone deduce the search efficiency loss when a doubling of cores occurs?
The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.
This is what I've heard so far:
Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.
(Bob if you're reading this, how did you work that out?)
Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)
My own workings, based on the top formula with ^0.76, are that search efficiency drops by 15% each doubling. e.g:
2^0.76 = 1.7 (i.e. for a doubling of cores a 1.7x speedup is achieved)
1 core = 1000 NPS (say)
2 cores = 2000 NPS = 1700 effective NPS
which means you need a co-efficient of 0.85 to make this happen, which is a loss of 15%
Can anyone comment on this, please?
I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!
I paste here a link of speeds versus cores:
Threads factor: Komodo, Houdini, Stockfish and Zappa
I realize that you do not want 'NPS on the screen' (quoting you) but probably you can toy with these numbers. I would propose something like cores N in the x-axis and y1 = ln[v(N)/v(1)]/ln(N) in the y-axis (where v(N) is the speed with N cores); or N in the x-axis and y2 = v(N)/[N*v(1)] in the y-axis, and see what happens.
I did this with Houdini (the data is easier to write in Excel because speeds are rounded to kN/s) and I got:
Code: Select all
y1 = ln[v(N)/v(1)]/ln(N)
y2 = v(N)/[N*v(1)]
N y1 y2
2 0.919977917 0.946043165
3 0.903368825 0.899280576
4 0.904418445 0.875899281
5 0.904750354 0.857873701
6 0.899937132 0.835864642
7 0.910408467 0.840013703
8 0.907977677 0.825839329
9 0.904157106 0.810107470
10 0.892325980 0.780415667
11 0.896788218 0.780757212
12 0.884085220 0.749733546
13 0.884913365 0.744389104
14 0.880238100 0.729016787
15 0.874521116 0.711910472
16 0.875828540 0.708733014
17 0.854114429 0.661447313
18 0.830518752 0.612709832
19 0.826192471 0.599436240
20 0.823265285 0.588928857
21 0.800436038 0.544669027
22 0.789134766 0.521110384
23 0.785680293 0.510687102
24 0.782670290 0.501232347
25 0.778856370 0.490743405
26 0.776088215 0.482137367
27 0.767819800 0.465227818
28 0.765152614 0.457234213
29 0.762926564 0.450095096
30 0.758901320 0.440420997
31 0.754006147 0.429669168
32 0.750575725 0.421287970
Excel calculated these trend lines by least squares (linear approximation):
Code: Select all
(y1)* ~ 0.9492 - 0.0064*N; R² ~ 0.9441
(y2)* ~ 0.9612 - 0.0181*N; R² ~ 0.9832
I do not know if it makes sense. For example, for (y2)* ~ aN + b, I would say that a smaller |a| is good (less efficiency loss per extra core). The same applies to (y1)*.
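As a sketch of the procedure (not Ajedrecista's exact spreadsheet), y1 and y2 can be computed and fitted without Excel. The speeds below are hypothetical, just to show the pipeline end to end:

```python
import math

# y1 = ln[v(N)/v(1)] / ln(N), y2 = v(N) / (N * v(1)), then a least-squares line.

def y1(v_n: float, v_1: float, n: int) -> float:
    """Effective exponent: v(N)/v(1) = N^y1."""
    return math.log(v_n / v_1) / math.log(n)

def y2(v_n: float, v_1: float, n: int) -> float:
    """Parallel efficiency: fraction of perfect N-fold scaling achieved."""
    return v_n / (n * v_1)

def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Hypothetical speeds v(N) in Kn/s, roughly following N^0.92 scaling:
v = {1: 1000, 2: 1893, 4: 3580, 8: 6774}
ns = [n for n in v if n > 1]
ys = [y2(v[n], v[1], n) for n in ns]
a, b = least_squares(ns, ys)
print(f"(y2)* ~ {b:.4f} + {a:.4f}*N")  # slope a comes out negative, as in the table
```

As in the Houdini table, y2 falls with N, so the fitted slope is negative; the size of |a| is then a one-number summary of how fast efficiency is lost per extra core.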
------------
The counter says 1000 posts. It was unthinkable for me when I registered here almost three years ago!
Regards from Spain.
Ajedrecista.
-
- Posts: 2055
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Real Speedup due to core doubling etc
IIRC, the Rybka team knew of the equation NPS speedup = 1 + (N-1)*0.7, but they saw many customers getting confused when the TTP (Time To Ply) speedup didn't equal the NPS speedup, because of the extra workload of the parallel search. So they adjusted the equation to be a TTP equation.
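If that account is right, the adjustment amounts to discounting an NPS speedup down to a time-to-ply one. A rough sketch of how such a conversion could look (the overhead model here is my own assumption, not Rybka's published method):

```python
import math

def nps_speedup(n: int) -> float:
    """The NPS equation mentioned above: 1 + (N - 1) * 0.7."""
    return 1 + (n - 1) * 0.7

def ttp_speedup(n: int, waste_per_doubling: float = 0.15) -> float:
    """Hypothetical TTP conversion: discount the NPS speedup for extra
    nodes searched, assuming each doubling of cores wastes
    `waste_per_doubling` of the raw node rate."""
    return nps_speedup(n) * (1 - waste_per_doubling) ** math.log2(n)

# TTP is always at or below the raw NPS figure, which is the confusion
# the Rybka team reportedly wanted to avoid.
print(nps_speedup(4), ttp_speedup(4))
```

At 4 cores this gives roughly 3.1x NPS but only about 2.2x TTP under the assumed 15% waste per doubling; the gap between the two numbers is exactly what a customer comparing the two equations would have seen.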