Nodes/sec. with last new CPU's!

Houdini · Post by **Houdini** » Tue Aug 29, 2017 1:54 pm

From all my recent experience with Houdini 5 and 6 I'm quite confident in saying that 24 hyper-threads are slightly stronger than 12 threads.
To make this behavior fit Amdahl's formula, the coefficient would need to be about 0.975. In other words [speed-up = 1 / (1 - 0.975 + 0.975/n_cores)]:
- 12 cores give a speed-up of 9.4
- 24 cores give a speed-up of 15.2; multiplied by 0.65 this gives 9.9
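
The arithmetic above can be checked with a short sketch. It assumes that the 0.65 factor is the per-thread node speed of a hyper-thread relative to a thread that has a core to itself, which the post does not spell out. The names `amdahl_speedup` and `HT_FACTOR` are hypothetical.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl speed-up for coefficient (parallel fraction) p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)

P = 0.975         # coefficient proposed in the post
HT_FACTOR = 0.65  # assumed relative per-thread speed under hyper-threading

s12 = amdahl_speedup(P, 12)                  # 12 threads on 12 cores
s24_ht = amdahl_speedup(P, 24) * HT_FACTOR   # 24 hyper-threads on 12 cores

print(f"12 threads:       {s12:.1f}")     # 9.4
print(f"24 hyper-threads: {s24_ht:.1f}")  # 9.9
```

With these inputs the 24 hyper-threads come out only slightly ahead of the 12 threads, which is the behavior the coefficient was fitted to.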

These numbers demonstrate that your 0.955 coefficient in the Amdahl formula is too inaccurate (it is based on too few threads) for you or Milos to be making any quantitative claims about thread counts higher than 16.

Alternatively, if you're convinced that the 0.955 is correct up to 16 threads, it suggests that Amdahl's law is not a good description of the SMP behavior of chess engines.

Only well-controlled tests with a high number of threads will be able to move this topic forward.

Milos · Post by **Milos** » Tue Aug 29, 2017 2:31 pm

0.955 is for SF's LazySMP implementation, as I have stressed a few times in my comments. The testbench from the OP uses the latest SF with its standard LazySMP implementation, for which we have confirmed a parallel efficiency of 0.955 up to 16 cores.
It is also known that SF's LazySMP is not the most efficient LazySMP, or SMP implementation overall. There are recent threads in the programming section demonstrating better efficiency from even a simple ABDADA.
It is quite possible that LazySMP in Houdini is more efficient. You are certainly the one who has the proper numbers from which the correct coefficient can be derived.

P.S. It is also known that Komodo has slightly better SMP scaling than SF because it has some more clever tricks implemented. An example of even better scaling for an extreme number of cores is Jonny, which uses speculative pondering up to a depth of 6 or even more.

Houdini · Post by **Houdini** » Tue Aug 29, 2017 3:00 pm

Pretending that the value of 0.955 is determined with accuracy for Stockfish is quite misleading. The data on which the 0.955 coefficient is based is limited (small number of threads) and has some methodological issues (e.g. did the test maintain constant draw rates across the different matches?).

At the risk of repeating myself, only well-controlled tests with a high number of threads will be able to move this topic forward.

Laskos · Post by **Laskos** » Tue Aug 29, 2017 4:04 pm

Even if 0.955 is quite imprecise and varies across engines, the function itself is tame. Milos's result was 2.4 with 0.955 efficiency, comparing the Ryzen to that 224-threaded Xeon. With the much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from the factor of 6 you calculated. And I think that Amdahl's law is the one to be applied to parallel search with alpha-beta.
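
One way to arrive at factors close to 2.4 and 3.4 is sketched below. It is a reconstruction, not necessarily the calculation Milos or Laskos used. It assumes 275 MN/s on 112 cores for the Xeon and 25 MN/s on 8 cores for the Ryzen 1700 (node speeds quoted elsewhere in this thread), and it defines effective speed as per-thread node speed times the Amdahl speed-up. The function names are hypothetical.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl speed-up for coefficient p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)

def effective_speed(total_mnps: float, n: int, p: float) -> float:
    """Per-thread node speed scaled by the Amdahl speed-up (assumed model)."""
    return (total_mnps / n) * amdahl_speedup(p, n)

def xeon_vs_ryzen(p: float) -> float:
    # Assumed inputs: 275 MN/s on 112 cores vs 25 MN/s on 8 cores.
    return effective_speed(275.0, 112, p) / effective_speed(25.0, 8, p)

ratio_955 = xeon_vs_ryzen(0.955)
ratio_975 = xeon_vs_ryzen(0.975)
print(f"p=0.955: {ratio_955:.1f}")  # 2.4
print(f"p=0.975: {ratio_975:.1f}")  # 3.4
```

Under this model the factor moves only from about 2.4 to about 3.4 when the coefficient rises from 0.955 to 0.975, which illustrates the "tame" behavior of the function.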

corres · Post by **corres** » Tue Aug 29, 2017 5:41 pm

[quote="Houdini"]

Generally speaking, experience with current engines suggests an effective speed-up of about 1.75 for each doubling of the number of threads (at constant node speed per thread).
Using this formula, the effective strength of 275 MN/s with 112 cores will be close to 160 MN/s with 8 cores.
Inasmuch as the Ryzen 1700 produces about 25 MN/s, it requires 6x more time to achieve this strength.

[/quote]

I am afraid 6x more time is on the small side, because the 1.75 doubling factor decreases as the number of doublings increases; moreover, the effectiveness of lazy SMP also decreases as the number of threads increases. It is also questionable whether the doubling factor really is 1.75 in the case of lazy SMP. I think it is smaller than 1.75.
In my opinion the real time factor can be determined only by experiment.
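
The quoted estimate can be reproduced with a small sketch. It assumes that each halving of the core count halves the raw node speed but costs only a factor of 1.75 in effective speed, so the equivalent node speed shrinks by 1.75/2 per doubling. The function name `equivalent_nps` is hypothetical.

```python
import math

def equivalent_nps(nps: float, cores_from: int, cores_to: int,
                   per_doubling: float = 1.75) -> float:
    """Node speed on cores_to that matches nps on cores_from in effective strength."""
    doublings = math.log2(cores_from / cores_to)
    return nps * (per_doubling / 2.0) ** doublings

eq = equivalent_nps(275.0, 112, 8)  # 275 MN/s on 112 cores, expressed for 8 cores
time_factor = eq / 25.0             # relative to 25 MN/s on the Ryzen 1700

print(f"equivalent speed on 8 cores: {eq:.0f} MN/s")
print(f"time factor: {time_factor:.1f}x")
```

This gives roughly 165 MN/s and a time factor between 6 and 7, in line with the "close to 160 MN/s" and "6x" in the quote. The `per_doubling` argument can be changed to see how sensitive the result is to that factor.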

Houdini · Post by **Houdini** » Tue Aug 29, 2017 6:33 pm

Laskos wrote: Even if 0.955 is quite imprecise and varies across engines, the function itself is tame. Milos's result was 2.4 with 0.955 efficiency, comparing the Ryzen to that 224-threaded Xeon. With the much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from the factor of 6 you calculated. And I think that Amdahl's law is the one to be applied to parallel search with alpha-beta.

Chess engine alpha-beta is a lot more complex than the tasks for which Amdahl's law usually is applied.
I see no fundamental reason why the coefficient to use in Amdahl's formula could not depend on the number of threads.
For example:
- 0.955 for 8 threads.
- 0.975 for 24 threads
- 0.990 for 80 threads
This, of course, would imply that Amdahl's law is not a very good model of multi-threaded chess engines.

I will end my contribution to this thread by once again saying that only well-controlled tests with a high number of threads will allow us to move beyond the current, rather idle speculations.

Laskos · Post by **Laskos** » Tue Aug 29, 2017 8:50 pm

Well, it's quite possible. Peter Österlund got crazy results with his Lazy Texel.
http://www.talkchess.com/forum/viewtopic.php?t=64824

I put the table with the Lazy Texel doublings into a more readable form, and added the effective speedup per doubling of the number of cores, plus the Amdahl efficiency that results if Amdahl's law is assumed to hold here. The numbers I added are not perfect, because the doubling time line is not equal in base strength to the doubling cores line (only the first row is), but they are quite close.

ELO gain
Errors are about 10 ELO points (2 SD).

                                                       NUMA=2
Config                 X1     X2     X4     X8    X16     X32
-------------------------------------------------------------
Doubling time           -    112    100     85     72      63
Doubling cores          -     72     76     73     66      44
=============================================================
Effective speedup
per doubling            -   1.56   1.69   1.81   1.89    1.62

Amdahl's
efficiency              -  0.719  0.900  0.972  0.992   0.981

It's not Amdahl by a long shot, and it's not even your formula. As Peter showed, the smaller speedup of 1.62 for 32 vs 16 cores is due mainly to the NUMA=2 configuration. The scaling seems to improve with the number of cores, which is pretty crazy, and close to Robert Hyatt's crazy formula for effective speedup of 1 + 0.7 * (n_cores-1).
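
The two derived rows of the table can be reproduced with the sketch below. The post does not state how they were computed, so the formulas are assumptions that happen to match the printed values: the effective speedup per doubling is taken as 2 raised to (ELO from doubling cores / ELO from doubling time), and the Amdahl coefficient is solved from the ratio S(2n)/S(n) with S(k) = 1 / (1 - p + p/k). The last column prints the per-doubling ratio of the 1 + 0.7 * (n_cores-1) formula for comparison. All function names are hypothetical.

```python
# ELO gains from the table, keyed by the core count reached after the doubling.
ELO_PER_TIME_DOUBLING = {2: 112, 4: 100, 8: 85, 16: 72, 32: 63}
ELO_PER_CORE_DOUBLING = {2: 72, 4: 76, 8: 73, 16: 66, 32: 44}

def speedup_per_doubling(cores: int) -> float:
    """Time-equivalent speedup of going from cores/2 to cores (assumed formula)."""
    return 2.0 ** (ELO_PER_CORE_DOUBLING[cores] / ELO_PER_TIME_DOUBLING[cores])

def amdahl_coefficient(r: float, n: int) -> float:
    """Solve S(2n)/S(n) = r for p, where S(k) = 1 / (1 - p + p/k)."""
    return (r - 1.0) / (r * (1.0 - 1.0 / (2 * n)) - (1.0 - 1.0 / n))

def hyatt_speedup(n: int) -> float:
    """Linear formula quoted in the post: 1 + 0.7 * (n_cores - 1)."""
    return 1.0 + 0.7 * (n - 1)

rows = []
for cores in (2, 4, 8, 16, 32):
    r = speedup_per_doubling(cores)
    p = amdahl_coefficient(r, cores // 2)
    h = hyatt_speedup(cores) / hyatt_speedup(cores // 2)
    rows.append((cores, r, p, h))
    print(f"X{cores:<3} speedup={r:.2f}  amdahl_p={p:.3f}  hyatt_ratio={h:.2f}")
```

The Hyatt ratio climbs from 1.70 at X2 toward 2 as the core count grows, the same upward trend the measured speedups show up to X16.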

corres · Post by **corres** » Tue Aug 29, 2017 9:52 pm

[quote="Houdini"]

If one plays a match on a 12-core Xeon with 12 threads running against 24 threads, the 24 (hyper-)threads are running at .....

[/quote]

Gentlemen,
In the heat of debate you forgot that Amdahl's formula applies to physical cores, not virtual cores. At the same clock frequency, a PC with 12 physical cores + 12 virtual cores (hyper-threading) is weaker than a PC with 24 physical cores. This is a fact.
Generally, I agree with Robert's findings, which can be read on his web page.

Cardoso · Post by **Cardoso** » Wed Aug 30, 2017 7:35 pm

Vael,
could you do me a favor?
Download and install to the default folder (c:\profound) the following:
https://sites.google.com/view/deep-profound
Start the program; you may have to switch to the English language.
Go to the cpus button and select 16 cpus, then click ok.
Then click "Meditate"; it will stop at about 150 million nodes since it's a demo version. Could you send me the KNs after the analysis is finished?
Or even better do Alt + Print Screen and send me the main window picture to:
deep.profound@gmail.com
I'm really curious about those KNs.

best regards,
Alvaro

Leto · Post by **Leto** » Wed Aug 30, 2017 10:41 pm

If anyone's interested here are the results I got with my 12 core Xeon and my 4 thread HP laptop, using the 2017-5-22 asmfishw:

Intel Xeon X5650:
Total time: 169889 ms
Nodes Searched: 3593687704
Nodes/Second: 21153151

HP Pavilion Intel i3-3130M:
Total time: 507486 ms
Nodes Searched: 195599135
Nodes/Second: 3854291
