Stockfish 8 - Double time control vs. 2 threads

Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Werewolf »

Laskos wrote:
Milos wrote:
Laskos wrote: So,

1 --> 2 threads: 1.91
1 --> 4 threads: 3.52
What I find amazing is that if you calculate efficiency according to Amdahl's Law it is exactly 95.5% both when going from 1->2 and 1->4 cores. Matching is on the second decimal!!! Amazing.
Very interesting observation. And in line to Amdahl's fit I did for Komodo 9.3 using Andreas' results:
http://www.talkchess.com/forum/viewtopi ... 4&start=45
Andreas posts very important, hardcore results which use huge CPU time.

From your observation, the predictions would be:

1 --> 8 threads: 6.1
1 --> 16 threads: 9.55
1 --> 32 threads: 13.4

That is somewhat higher than the old YBW numbers.
Please could you clarify a few things for me?

In one of the posts above, calculating Lazy SMP speedup from 1->4 cores was proposed as:

4^0.9072 = 3.52

Why are your numbers for 1 to 8, 16 and 32 threads so different to this formula?

32^0.9072 = 23.2 which is vastly different to your 13.4

Sorry I'm probably missing something basic.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Laskos »

Werewolf wrote:
Laskos wrote:
Milos wrote:
Laskos wrote: So,

1 --> 2 threads: 1.91
1 --> 4 threads: 3.52
What I find amazing is that if you calculate efficiency according to Amdahl's Law it is exactly 95.5% both when going from 1->2 and 1->4 cores. Matching is on the second decimal!!! Amazing.
Very interesting observation. And in line to Amdahl's fit I did for Komodo 9.3 using Andreas' results:
http://www.talkchess.com/forum/viewtopi ... 4&start=45
Andreas posts very important, hardcore results which use huge CPU time.

From your observation, the predictions would be:

1 --> 8 threads: 6.1
1 --> 16 threads: 9.55
1 --> 32 threads: 13.4

That is somewhat higher than the old YBW numbers.
Please could you clarify a few things for me?

In one of the posts above, calculating Lazy SMP speedup from 1->4 cores was proposed as:

4^0.9072 = 3.52

Why are your numbers for 1 to 8, 16 and 32 threads so different to this formula?

32^0.9072 = 23.2 which is vastly different to your 13.4

Sorry I'm probably missing something basic.
That 4^0.9072 was only meant to reproduce the 1 --> 4 threads speed-up from Andreas' experimental data. The experimental values, 1.91 for 2 threads and 3.52 for 4 threads, fit Amdahl's law almost perfectly. Amdahl's law describes the speed-up of a task at fixed workload; here the parallel fraction (the "efficiency") is 95.5%, as Milos observed. In our case Amdahl's law is:

speed-up = 1 / (1 - 0.955 + 0.955/n_cores)

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52

These numbers fit to two digits the experimental data of Andreas' test.

Predictions are, according to the same formula:

n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36
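
The formula above can be checked numerically. This is a minimal sketch, assuming the parallel fraction p = 0.955 quoted in the post; the function names `amdahl_speedup` and `power_law_speedup` are illustrative and not from the thread. It also evaluates the power-law form from the question (4^0.9072) so the two extrapolations can be compared side by side.

```python
import math


def amdahl_speedup(n_cores, p=0.955):
    """Amdahl's law: speed-up on n_cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n_cores)


def power_law_speedup(n_cores, exponent):
    """Power-law model n_cores ** exponent (the 4^0.9072 form from the question)."""
    return n_cores ** exponent


# Exponent that reproduces the measured speed-up of 3.52 at 4 threads.
exponent = math.log(3.52) / math.log(4)

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} cores: Amdahl {amdahl_speedup(n):5.2f}, "
          f"power law {power_law_speedup(n, exponent):5.2f}")
```

Both models are pinned to the same measurement at 4 threads, so they agree there and only diverge beyond it; that divergence is the gap between 13.36 and roughly 23 at 32 threads.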
Werewolf
Posts: 1797
Joined: Thu Sep 18, 2008 10:24 pm

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Werewolf »

Laskos wrote:
Werewolf wrote:
Laskos wrote:
Milos wrote:
Laskos wrote: So,

1 --> 2 threads: 1.91
1 --> 4 threads: 3.52
What I find amazing is that if you calculate efficiency according to Amdahl's Law it is exactly 95.5% both when going from 1->2 and 1->4 cores. Matching is on the second decimal!!! Amazing.
Very interesting observation. And in line to Amdahl's fit I did for Komodo 9.3 using Andreas' results:
http://www.talkchess.com/forum/viewtopi ... 4&start=45
Andreas posts very important, hardcore results which use huge CPU time.

From your observation, the predictions would be:

1 --> 8 threads: 6.1
1 --> 16 threads: 9.55
1 --> 32 threads: 13.4

That is somewhat higher than the old YBW numbers.
Please could you clarify a few things for me?

In one of the posts above, calculating Lazy SMP speedup from 1->4 cores was proposed as:

4^0.9072 = 3.52

Why are your numbers for 1 to 8, 16 and 32 threads so different to this formula?

32^0.9072 = 23.2 which is vastly different to your 13.4

Sorry I'm probably missing something basic.
That 4^0.9072 was only meant to reproduce the 1 --> 4 threads speed-up from Andreas' experimental data. The experimental values, 1.91 for 2 threads and 3.52 for 4 threads, fit Amdahl's law almost perfectly. Amdahl's law describes the speed-up of a task at fixed workload; here the parallel fraction (the "efficiency") is 95.5%, as Milos observed. In our case Amdahl's law is:

speed-up = 1 / (1 - 0.955 + 0.955/n_cores)

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52

These numbers fit to two digits the experimental data of Andreas' test.

Predictions are, according to the same formula:

n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36
Ah I see.

So, we can now estimate for Lazy SMP 32 cores are 13.36x faster than one core at equal clock speed?

There used to be a simple formula which originated somewhere on the Rybka forum which estimated speedup (for Rybka) as

n*cores^0.76

Interestingly, Lazy SMP offers a greater gain on 2-16 cores, but less at 32. Of course back in those days no one had 32 cores so the formula probably only worked up to 16 cores. I suspect Rybka gets very little from 16-->32 cores not having Lazy SMP.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by IWB »

Laskos wrote:...

speed-up = 1 / (1 - 0.955 + 0.955/n_cores)

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52

These numbers fit to two digits the experimental data of Andreas' test.

Predictions are, according to the same formula:

n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36
Does this mean that games on many cores turn out "better" with ponder on, as soon as the time gained through ponder hits is bigger than the speed-up from the extra cores? Time is always "real", while the speed-up per added core keeps getting lower.

My guess is that, at the latest from 8 to 16 cores (a gain of 3.47, or 43% of the 8 added cores), the processing power is better used with ponder ON than with more cores. As soon as your time increases by more than 43% due to ponder hits, you produce better games with 8 cores ponder ON than with 16 cores ponder OFF. (6 cores = 4.90, 7 cores = 5.51; 6 or 7 cores seems to be the cut-off for ponder ON.)

If so, that's bad for "big hardware, ponder OFF" tourneys, as they waste resources :-)

To me it is questionable whether more than 32 cores make any sense (for chess). With 64 cores (16.69) the gain over 32 would be only +3.33; even 16 to 32 seems very doubtful with its limited gain of 3.81 :-)

Ingo
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Laskos »

Werewolf wrote:
Laskos wrote:
Werewolf wrote:
Laskos wrote:
Milos wrote:
Laskos wrote: So,

1 --> 2 threads: 1.91
1 --> 4 threads: 3.52
What I find amazing is that if you calculate efficiency according to Amdahl's Law it is exactly 95.5% both when going from 1->2 and 1->4 cores. Matching is on the second decimal!!! Amazing.
Very interesting observation. And in line to Amdahl's fit I did for Komodo 9.3 using Andreas' results:
http://www.talkchess.com/forum/viewtopi ... 4&start=45
Andreas posts very important, hardcore results which use huge CPU time.

From your observation, the predictions would be:

1 --> 8 threads: 6.1
1 --> 16 threads: 9.55
1 --> 32 threads: 13.4

That is somewhat higher than the old YBW numbers.
Please could you clarify a few things for me?

In one of the posts above, calculating Lazy SMP speedup from 1->4 cores was proposed as:

4^0.9072 = 3.52

Why are your numbers for 1 to 8, 16 and 32 threads so different to this formula?

32^0.9072 = 23.2 which is vastly different to your 13.4

Sorry I'm probably missing something basic.
That 4^0.9072 was only meant to reproduce the 1 --> 4 threads speed-up from Andreas' experimental data. The experimental values, 1.91 for 2 threads and 3.52 for 4 threads, fit Amdahl's law almost perfectly. Amdahl's law describes the speed-up of a task at fixed workload; here the parallel fraction (the "efficiency") is 95.5%, as Milos observed. In our case Amdahl's law is:

speed-up = 1 / (1 - 0.955 + 0.955/n_cores)

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52

These numbers fit to two digits the experimental data of Andreas' test.

Predictions are, according to the same formula:

n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36
Ah I see.

So, we can now estimate for Lazy SMP 32 cores are 13.36x faster than one core at equal clock speed?

There used to be a simple formula which originated somewhere on the Rybka forum which estimated speedup (for Rybka) as

n*cores^0.76

Interestingly, Lazy SMP offers a greater gain on 2-16 cores, but less at 32. Of course back in those days no one had 32 cores so the formula probably only worked up to 16 cores. I suspect Rybka gets very little from 16-->32 cores not having Lazy SMP.
That 13.36 for 32 cores is the effective speed-up. It means that 1 core at 13.36 minutes per game is equal in _strength_ to 32 cores (of the same kind as that 1 core) at 1 minute per game.

That formula from the Rybka forum is naive. It assumes that each doubling always gives a 2^0.76 ~ 1.70 speed-up, since (2*n)^0.76 / (1*n)^0.76 = 2^0.76 for every n. But we know that doubling at a low number of cores is much more efficient than at a high number of cores. So 1 --> 2 cores is almost perfect, close to 1.9, but 16 --> 32 cores is mediocre, maybe 1.2 for YBW (older engines like Rybka, old Houdini) and 1.4 for Lazy SMP. This is in accordance with Amdahl's law.
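
The difference between the two models shows up in the gain per doubling. This is a minimal sketch, assuming the Rybka-forum formula is read as n_cores^0.76 (the reading used in this reply) and taking p = 0.955 for Amdahl's law; the function names are illustrative.

```python
def amdahl_speedup(n_cores, p=0.955):
    """Amdahl's law: speed-up on n_cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n_cores)


def doubling_gain_power(n_cores, exponent=0.76):
    """Gain from n_cores -> 2 * n_cores under the n_cores ** exponent model."""
    return (2 * n_cores) ** exponent / n_cores ** exponent


def doubling_gain_amdahl(n_cores, p=0.955):
    """Gain from n_cores -> 2 * n_cores under Amdahl's law."""
    return amdahl_speedup(2 * n_cores, p) / amdahl_speedup(n_cores, p)


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} -> {2 * n:2d} cores: power law {doubling_gain_power(n):.2f}, "
          f"Amdahl {doubling_gain_amdahl(n):.2f}")
```

The power-law column stays constant at 2^0.76 for every doubling, while the Amdahl column starts near 1.9 and falls to about 1.4 for 16 --> 32 cores.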
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Laskos »

IWB wrote:
Laskos wrote:...

speed-up = 1 / (1 - 0.955 + 0.955/n_cores)

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52

These numbers fit to two digits the experimental data of Andreas' test.

Predictions are, according to the same formula:

n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36
Does this mean that games on many cores turn out "better" with ponder on, as soon as the time gained through ponder hits is bigger than the speed-up from the extra cores? Time is always "real", while the speed-up per added core keeps getting lower.

My guess is that, at the latest from 8 to 16 cores (a gain of 3.47, or 43% of the 8 added cores), the processing power is better used with ponder ON than with more cores. As soon as your time increases by more than 43% due to ponder hits, you produce better games with 8 cores ponder ON than with 16 cores ponder OFF. (6 cores = 4.90, 7 cores = 5.51; 6 or 7 cores seems to be the cut-off for ponder ON.)

If so, that's bad for "big hardware, ponder OFF" tourneys, as they waste resources :-)

To me it is questionable whether more than 32 cores make any sense (for chess). With 64 cores (16.69) the gain over 32 would be only +3.33; even 16 to 32 seems very doubtful with its limited gain of 3.81 :-)

Ingo
Yes, pretty much so. I guess ponder on is worth a factor of ~ 1.40-1.50 in time used (or effective speed-up), depending also on ponder hits. The cut-off seems to be 16 --> 32 cores: 13.36/9.55 ~ 1.40. So it is plausible that using ponder on with 16 cores is pretty much equivalent to, or even more efficient than, using all 32 cores with ponder off. For higher numbers of cores ponder on is clearly favored. Also, parallel search above 32 cores might be distributed in some non-orthodox way, like Jonny does with its cluster.

The same is valid for Komodo; in an older thread on Andreas' experiment with Komodo, I posted this:

[Image not preserved]

You see that after 4-5 doublings (2^4 = 16 cores, 2^5 = 32 cores), one would do better to use ponder on instead of doubling the cores with ponder off, or to use a different parallel search allocation.
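
The cut-off argument can be sketched as a comparison between the gain from doubling the cores and an assumed ponder-on factor. The 1.40 to 1.50 range is the estimate given in this post, not a measurement, p = 0.955 is the Stockfish fit from earlier in the thread, and `first_doubling_below` is a hypothetical helper name.

```python
def amdahl_speedup(n_cores, p=0.955):
    """Amdahl's law: speed-up on n_cores when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n_cores)


def first_doubling_below(ponder_factor, p=0.955, max_cores=1024):
    """Smallest power-of-two n for which doubling n -> 2n gains less than ponder_factor.

    Returns None if no such n is found up to max_cores.
    """
    n = 1
    while n <= max_cores:
        gain = amdahl_speedup(2 * n, p) / amdahl_speedup(n, p)
        if gain < ponder_factor:
            return n
        n *= 2
    return None


for factor in (1.40, 1.50):
    print(f"ponder factor {factor:.2f}: cut-off at {first_doubling_below(factor)} cores")
```

Under these assumptions both ends of the range point at the 16 --> 32 doubling as the first one that gains less than pondering is assumed to give.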
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Milos »

Werewolf wrote: There used to be a simple formula which originated somewhere on the Rybka forum which estimated speedup (for Rybka) as

n*cores^0.76

Interestingly, Lazy SMP offers a greater gain on 2-16 cores, but less at 32. Of course back in those days no one had 32 cores so the formula probably only worked up to 16 cores. I suspect Rybka gets very little from 16-->32 cores not having Lazy SMP.
As Kai pointed out, this formula from the Rybka forum is a bit naive and outdated. Lazy SMP is quite efficient, almost on par with DTS up to 16 cores, and certainly better than suboptimal YBWC implementations (such as Rybka's) or even the optimal ones (like in Crafty).
On the other hand, performance on 32 cores would be even worse than what was pointed out, since there is no single CPU with 32 cores, meaning you'd have to use at least 2 CPUs communicating over a NUMA bus, which will be a source of further significant performance degradation.
Last edited by Milos on Mon Nov 28, 2016 8:57 pm, edited 1 time in total.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by IWB »

Laskos wrote: Yes, pretty much so. I guess ponder on is worth a factor of ~ 1.40-1.50 in time used (or effective speed-up), depending also on ponder hits. The cut-off seems to be 16 --> 32 cores: 13.36/9.55 ~ 1.40. So it is plausible that using ponder on with 16 cores is pretty much equivalent to, or even more efficient than, using all 32 cores with ponder off. For higher numbers of cores ponder on is clearly favored. Also, parallel search above 32 cores might be distributed in some non-orthodox way, like Jonny does with its cluster.
From 8 to 16 cores that is 1.57. With long time controls and very drawish games (which we usually have with long time controls) even 8 cores might be enough. I go with 12 cores as an average for the time being, especially because that is much more fun to watch :-)

Ingo
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Laskos »

Werewolf wrote:
There used to be a simple formula which originated somewhere on the Rybka forum which estimated speedup (for Rybka) as

n*cores^0.76

Interestingly, Lazy SMP offers a greater gain on 2-16 cores, but less at 32. Of course back in those days no one had 32 cores so the formula probably only worked up to 16 cores. I suspect Rybka gets very little from 16-->32 cores not having Lazy SMP.
By the way, I remembered and found a thread on Rybka forum with more realistic speed-up numbers:
http://www.rybkaforum.net/cgi-bin/rybka ... l?tid=3836
There is a decrease of speedup, but I can't follow your column of numbers.
Rybka 2.3.2 has a speedup of 1.7 (2 cores), 2.8 (1.7*1.65; 4 cores), 4.4 (1.7*1.65*1.57; 8 cores) and 6.2 (estimated) (1.7*1.65*1.57*1.41; 16 cores).
If Vas is right with his estimation from 8 to 16 cores (>=1.7), we will see maybe the following speedups for Rybka 3 mp:
1.8 (2 cores), 3.2 (1.8*1.78; 4 cores), 5.6 (1.8*1.78*1.74; 8 cores) and 9.5 (1.8*1.78*1.74*1.7; 16 cores).
Not bad, I think :-)! But let us see ...
Those more optimistic numbers (the second part of the post) were never realized by Rybka.

So, Lazy SMP Stockfish:

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52
n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36


YBWC Rybka:

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.7
n_cores = 4: speed-up = 2.8
n_cores = 8: speed-up = 4.4
n_cores = 16: speed-up = 6.2

Also, Rybka numbers suggest Amdahl's law as well, with efficiency 89%, compared to SF efficiency of 95.5%.
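
One way to compare the two data sets is to invert Amdahl's law and read off the parallel fraction implied by each measurement. This is a minimal sketch using only the speed-up figures listed above; `implied_parallel_fraction` is a hypothetical helper name, and a point-by-point inversion is an assumption here, not necessarily how the 89% figure was obtained.

```python
def implied_parallel_fraction(n_cores, speedup):
    """Solve speedup = 1 / ((1 - p) + p / n_cores) for the parallel fraction p."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores)


# Speed-up figures as listed in the post (n_cores -> speed-up).
stockfish = {2: 1.91, 4: 3.52}
rybka = {2: 1.7, 4: 2.8, 8: 4.4, 16: 6.2}

for name, data in (("Stockfish", stockfish), ("Rybka", rybka)):
    for n, s in data.items():
        p = implied_parallel_fraction(n, s)
        print(f"{name:9s} {n:2d} cores: speed-up {s:4.2f} -> p = {p:.3f}")
```

With these inputs the Stockfish points land at about 0.953 and 0.955, while the Rybka points rise from about 0.82 at 2 cores to about 0.89 at 16 cores.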
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish 8 - Quadruple time control vs. 4 threads

Post by Laskos »

Laskos wrote:
Werewolf wrote:
There used to be a simple formula which originated somewhere on the Rybka forum which estimated speedup (for Rybka) as

n*cores^0.76

Interestingly, Lazy SMP offers a greater gain on 2-16 cores, but less at 32. Of course back in those days no one had 32 cores so the formula probably only worked up to 16 cores. I suspect Rybka gets very little from 16-->32 cores not having Lazy SMP.
By the way, I remembered and found a thread on Rybka forum with more realistic speed-up numbers:
http://www.rybkaforum.net/cgi-bin/rybka ... l?tid=3836
There is a decrease of speedup, but I can't follow your column of numbers.
Rybka 2.3.2 has a speedup of 1.7 (2 cores), 2.8 (1.7*1.65; 4 cores), 4.4 (1.7*1.65*1.57; 8 cores) and 6.2 (estimated) (1.7*1.65*1.57*1.41; 16 cores).
If Vas is right with his estimation from 8 to 16 cores (>=1.7), we will see maybe the following speedups for Rybka 3 mp:
1.8 (2 cores), 3.2 (1.8*1.78; 4 cores), 5.6 (1.8*1.78*1.74; 8 cores) and 9.5 (1.8*1.78*1.74*1.7; 16 cores).
Not bad, I think :-)! But let us see ...
Those more optimistic numbers (the second part of the post) were never realized by Rybka.

So, Lazy SMP Stockfish:

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.91
n_cores = 4: speed-up = 3.52
n_cores = 8: speed-up = 6.08
n_cores = 16: speed-up = 9.55
n_cores = 32: speed-up = 13.36


YBWC Rybka:

n_cores = 1: speed-up = 1
n_cores = 2: speed-up = 1.7
n_cores = 4: speed-up = 2.8
n_cores = 8: speed-up = 4.4
n_cores = 16: speed-up = 6.2

Also, Rybka numbers suggest Amdahl's law as well, with efficiency 89%, compared to SF efficiency of 95.5%.
I remember Bob Hyatt had a formula for speed-up of the form: 1 + 0.7 * (n_cores - 1). Imagine that:

1 --> 2 cores: 1.7
16 --> 32 cores: 1.97

LOL
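
The two ratios above follow directly from the linear form. This is a minimal sketch of the formula as recalled in the post, 1 + 0.7 * (n_cores - 1); the function names are illustrative.

```python
def linear_speedup(n_cores, slope=0.7):
    """Linear speed-up model as recalled in the post: 1 + slope * (n_cores - 1)."""
    return 1.0 + slope * (n_cores - 1)


def doubling_gain_linear(n_cores, slope=0.7):
    """Gain from n_cores -> 2 * n_cores under the linear model."""
    return linear_speedup(2 * n_cores, slope) / linear_speedup(n_cores, slope)


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} -> {2 * n:2d} cores: {doubling_gain_linear(n):.2f}")
```

Under this model the gain per doubling grows with the core count and approaches 2, which is the opposite trend to Amdahl's law.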