From all my recent experience with Houdini 5 and 6 I'm quite confident in saying that 24 hyper-threads are slightly stronger than 12 threads.
To make this behavior fit Amdahl's formula, the coefficient would need to be about 0.975. In other words [speed-up = 1 / (1 - 0.975 + 0.975/n_cores)]:
- 12 cores gives speed-up of 9.4
- 24 cores gives speed-up of 15.2; multiplied by 0.65x gives 9.9
These numbers demonstrate that your 0.955 coefficient in the Amdahl formula is too inaccurate (based on too low thread numbers) for you or Milos to be making any quantitative claims about thread counts higher than 16.
Alternatively, if you're convinced that the 0.955 is correct up to 16 threads, it suggests that Amdahl's law is not a good description of the SMP behavior of chess engines.
Only well-controlled tests with high number of threads will be able to move this topic forward.
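The arithmetic in the post above can be checked with a short sketch. The function name `amdahl_speedup` and the constant `HT_THREAD_SPEED` are illustrative names, not taken from any engine; the 0.65 per-thread speed under hyper-threading is the figure used in the post.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl speed-up for a parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)


# Per-thread speed when 24 hyper-threads share 12 physical cores.
# This 0.65 factor is the value quoted in the post, not a measurement made here.
HT_THREAD_SPEED = 0.65

s12 = amdahl_speedup(0.975, 12)      # 12 threads on 12 cores
s24 = amdahl_speedup(0.975, 24)      # 24 threads at full per-thread speed
s24_ht = s24 * HT_THREAD_SPEED       # 24 hyper-threads at reduced speed

print(f"12 threads:       {s12:.1f}")
print(f"24 threads:       {s24:.1f}")
print(f"24 hyper-threads: {s24_ht:.1f}")
```

With a coefficient of 0.975 the 24 hyper-threads come out slightly ahead of the 12 threads (about 9.9 against 9.4), which is the behavior the post describes.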
Nodes/sec. with last new CPU's!
Moderators: hgm, Rebel, chrisw
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: Nodes/sec. with last new CPU's!
[quote="Houdini"]
From all my recent experience with Houdini 5 and 6 I'm quite confident in saying that 24 hyper-threads are slightly stronger than 12 threads.
To make this behavior fit Amdahl's formula, the coefficient would need to be about 0.975. In other words [speed-up = 1 / (1 - 0.975 + 0.975/n_cores)]:
- 12 cores gives speed-up of 9.4
- 24 cores gives speed-up of 15.2; multiplied by 0.65x gives 9.9
These numbers demonstrate that your 0.955 coefficient in the Amdahl formula is too inaccurate (based on too low thread numbers) for you or Milos to be making any quantitative claims about thread counts higher than 16.
Alternatively, if you're convinced that the 0.955 is correct up to 16 threads, it suggests that Amdahl's law is not a good description of the SMP behavior of chess engines.
Only well-controlled tests with high number of threads will be able to move this topic forward.
[/quote]
0.955 is for SF's LazySMP implementation, as I have stressed a few times in my comments. The testbench from the OP uses the latest SF with its standard LazySMP implementation, which we have confirmed has a parallel efficiency of 0.955 up to 16 cores.
It is also known that SF's LazySMP is not the most efficient LazySMP, or SMP implementation overall. There are recent threads in the programming section demonstrating better efficiency from even simple ABDADA.
It is quite possible that LazySMP in Houdini is more efficient. You are certainly the one who has the proper numbers from which a correct coefficient can be derived.
P.S. It is also known that Komodo has slightly better SMP scaling than SF because it implements some cleverer tricks. An example of even better scaling for an extreme number of cores is Johny, which uses speculative pondering up to a depth of 6 or even more.
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Nodes/sec. with last new CPU's!
Pretending that the value of 0.955 is determined with accuracy for Stockfish is quite misleading. The data on which the 0.955 coefficient is based is limited (small number of threads) and has some methodological issues (e.g. did the test maintain constant draw rates across the different matches?).
At the risk of repeating myself, only well-controlled tests with high number of threads will be able to move this topic forward.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Nodes/sec. with last new CPU's!
[quote="Houdini"]
From all my recent experience with Houdini 5 and 6 I'm quite confident in saying that 24 hyper-threads are slightly stronger than 12 threads.
To make this behavior fit Amdahl's formula, the coefficient would need to be about 0.975. In other words [speed-up = 1 / (1 - 0.975 + 0.975/n_cores)]:
- 12 cores gives speed-up of 9.4
- 24 cores gives speed-up of 15.2; multiplied by 0.65x gives 9.9
These numbers demonstrate that your 0.955 coefficient in the Amdahl formula is too inaccurate (based on too low thread numbers) for you or Milos to be making any quantitative claims about thread counts higher than 16.
Alternatively, if you're convinced that the 0.955 is correct up to 16 threads, it suggests that Amdahl's law is not a good description of the SMP behavior of chess engines.
Only well-controlled tests with high number of threads will be able to move this topic forward.
[/quote]
Even if 0.955 is quite imprecise and varies across engines, the function itself is tame. Milos's result was a factor of 2.4 with 0.955 efficiency when comparing the Ryzen to that 224-threaded Xeon. With the much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, still pretty far from the factor of 6 you calculated. And I think that Amdahl's law is the one to apply to parallel search with alpha-beta.
-
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: Nodes/sec. with last new CPU's!
[quote="Houdini"]
Generally speaking, experience with current engines suggest an effective speed-up of about 1.75 for each doubling of the number of threads (at constant node speed per thread).
Using this formula, the effective strength of 275 MN/s with 112 cores will be close to 160 MN/s with 8 cores.
Inasmuch as the Ryzen 1700 produces about 25 MN/s, it requires 6x more time to achieve this strength.
[/quote]
I am afraid 6x more time is an underestimate, because the 1.75 doubling factor decreases as the number of doublings grows; moreover, the effectiveness of lazy SMP also decreases with an increasing number of threads. It is also an open question whether the doubling factor really is 1.75 in the case of lazy SMP. I think it is smaller than 1.75.
In my opinion you can determine the real time factor only by experiment.
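The quoted estimate can be reproduced approximately with the sketch below. It assumes that each doubling of threads is worth a factor of 1.75 instead of 2, so the raw node speed is discounted by 1.75/2 per doubling, and that the Ryzen 1700 counts as 8 cores at 25 MN/s. The function name `equivalent_nps` is illustrative, and this reading of the quote is an assumption rather than the quoted author's stated method.

```python
import math


def equivalent_nps(nps: float, cores: int, ref_cores: int, factor: float = 1.75) -> float:
    """Node speed on ref_cores that matches the strength of nps on cores.

    Assumes every doubling of the core count yields an effective speed-up
    of `factor` while the raw node speed doubles.
    """
    doublings = math.log2(cores / ref_cores)
    return nps * (factor / 2.0) ** doublings


xeon_equiv = equivalent_nps(275.0, cores=112, ref_cores=8)  # MN/s on 8 cores
time_factor = xeon_equiv / 25.0                             # versus 25 MN/s

print(f"8-core equivalent: {xeon_equiv:.0f} MN/s")
print(f"time factor:       {time_factor:.1f}x")
```

Under these assumptions the result lands in the neighborhood of the quoted 160 MN/s and 6x. A smaller doubling factor, as the reply suggests, can be tried by passing a different `factor`.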
-
- Posts: 1471
- Joined: Tue Mar 16, 2010 12:00 am
Re: Nodes/sec. with last new CPU's!
[quote="Laskos"]
Even if 0.955 is quite imprecise and varying across the engines, the function itself is tame. Milos result was 2.4 with 0.955 efficiency comparing Ryzen to that 224-threaded Xeon. With much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from your factor of 6 you calculated. And I think that Amdahl's law is the one to be applied to parallel search with alpha-beta.
[/quote]
Chess engine alpha-beta is a lot more complex than the tasks to which Amdahl's law is usually applied.
I see no fundamental reason why the coefficient to use in Amdahl's formula could not depend on the number of threads.
For example:
- 0.955 for 8 threads.
- 0.975 for 24 threads
- 0.990 for 80 threads
This, of course, would imply that Amdahl's law is not a very good model of multi-threaded chess engines.
I will end my contribution to this thread by once again saying that only well-controlled tests with a high number of threads will allow us to move beyond the current, rather idle speculation.
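To see how much a thread-dependent coefficient changes the picture, the sketch below compares the example coefficients from the post with a fixed 0.955 at the same thread counts. The coefficient values are the illustrative ones listed above, not measurements, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl speed-up for a parallel fraction p on n threads."""
    return 1.0 / ((1.0 - p) + p / n)


# (threads, coefficient) pairs taken from the example list in the post.
EXAMPLES = [(8, 0.955), (24, 0.975), (80, 0.990)]
FIXED_P = 0.955

results = {}
for threads, p in EXAMPLES:
    fixed = amdahl_speedup(FIXED_P, threads)
    varying = amdahl_speedup(p, threads)
    results[threads] = (fixed, varying)
    print(f"{threads:3d} threads: fixed 0.955 -> {fixed:5.1f}, p={p:.3f} -> {varying:5.1f}")
```

At 80 threads the two assumptions differ by more than a factor of two in predicted speed-up, which is why the choice of coefficient dominates any extrapolation to high thread counts.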
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Nodes/sec. with last new CPU's!
[quote="Houdini"]
[quote="Laskos"]
Even if 0.955 is quite imprecise and varying across the engines, the function itself is tame. Milos result was 2.4 with 0.955 efficiency comparing Ryzen to that 224-threaded Xeon. With much higher 0.975 efficiency, fitted to what you say you got with Houdini, this factor becomes 3.4, pretty far from your factor of 6 you calculated. And I think that Amdahl's law is the one to be applied to parallel search with alpha-beta.
[/quote]
Chess engine alpha-beta is a lot more complex than the tasks for which Amdahl's law usually is applied.
I see no fundamental reason why the coefficient to use in Amdahl's formula could not depend on the number of threads.
For example:
- 0.955 for 8 threads.
- 0.975 for 24 threads
- 0.990 for 80 threads
This, of course, would imply that Amdahl's law is not a very good model of multi-threaded chess engines.
I will end my contribution to this thread by once again saying that only well-controlled tests with high number of threads will allow us to move beyond the current, rather idle speculations .
[/quote]
Well, it's quite possible. Peter Österlund got crazy results with his Lazy Texel.
http://www.talkchess.com/forum/viewtopic.php?t=64824
I put the table with the Lazy Texel doublings in a more readable form here, and added the effective speedup per doubling of the number of cores and Amdahl's efficiency, if Amdahl's formula is assumed to be a law. The numbers I added are not perfect, because the doubling-time line is not equal in base strength to the doubling-cores line (only the first row), but they are quite close.
ELO gain
Errors: about 10 ELO points (2 SD).
NUMA=2
Config                    X1     X2     X4     X8     X16    X32
------------------------------------------------------------------
Doubling time             -      112    100    85     72     63
Doubling cores            -      72     76     73     66     44
==================================================================
Effective speedup
per doubling              -      1.56   1.69   1.81   1.89   1.62
Amdahl's efficiency       -      0.719  0.900  0.972  0.992  0.981
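The two derived rows appear to be consistent with the following reconstruction, which is an assumption and not a method stated in the post: the effective speedup per doubling is 2 raised to the ratio of the two ELO gains, and Amdahl's efficiency p is solved from the ratio of Amdahl speed-ups at 2n and n cores. The names below are illustrative.

```python
# ELO gains copied from the table, keyed by the core count after the doubling.
ELO_DOUBLING_TIME = {2: 112, 4: 100, 8: 85, 16: 72, 32: 63}
ELO_DOUBLING_CORES = {2: 72, 4: 76, 8: 73, 16: 66, 32: 44}


def speedup_per_doubling(elo_cores: float, elo_time: float) -> float:
    """Effective speed-up of doubling cores, if doubling time is worth exactly 2x."""
    return 2.0 ** (elo_cores / elo_time)


def amdahl_fraction(s: float, n_before: int) -> float:
    """Solve S(2n) / S(n) = s for p, where S(n) = 1 / (1 - p + p / n)."""
    return (s - 1.0) / (s * (1.0 - 1.0 / (2 * n_before)) - (1.0 - 1.0 / n_before))


rows = {}
for cores in sorted(ELO_DOUBLING_CORES):
    s = speedup_per_doubling(ELO_DOUBLING_CORES[cores], ELO_DOUBLING_TIME[cores])
    p = amdahl_fraction(s, cores // 2)
    rows[cores] = (s, p)
    print(f"X{cores:<3d} speedup per doubling {s:.2f}, Amdahl efficiency {p:.3f}")
```

The output agrees with the table to within rounding, which supports this reading but does not prove it is how the numbers were produced.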
-
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: Nodes/sec. with last new CPU's!
[quote="Houdini"]
If one plays a match on a 12-core Xeon with 12 threads running against 24 threads, the 24 (hyper-)threads are running at .....
[/quote]
Gentlemen,
In the heat of debate you forgot that Amdahl's formula applies to physical cores, not virtual cores. At the same clock frequency, a PC with 12 physical cores + 12 virtual cores (hyper-threading) is weaker than a PC with 24 physical cores. This is a fact.
Generally, I agree with Robert's conclusions, which can be read on his web page.
-
- Posts: 362
- Joined: Thu Mar 16, 2006 7:39 pm
- Location: Portugal
- Full name: Alvaro Cardoso
Re: Nodes/sec. with last new CPU's!
Vael,
could you do me a favor?
Download and install to the default folder (c:\profound) the following:
https://sites.google.com/view/deep-profound
Start the program. You may have to switch to the English language.
Go to the cpus button and select 16 cpus, click ok.
Then click "Meditate"; it will stop at about 150 million nodes since it's a demo version. Could you send me the KNs after the analysis is finished?
Or even better do Alt + Print Screen and send me the main window picture to:
deep.profound@gmail.com
I'm really curious about those KNs.
best regards,
Alvaro
-
- Posts: 2071
- Joined: Thu May 04, 2006 3:40 am
- Location: Dune
Re: Nodes/sec. with last new CPU's!
If anyone's interested here are the results I got with my 12 core Xeon and my 4 thread HP laptop, using the 2017-5-22 asmfishw:
Intel Xeon X5650:
Total time: 169889 ms
Nodes Searched: 3593687704
Nodes/Second: 21153151
HP Pavilion Intel i3-3130M:
Total time: 507486 ms
Nodes Searched: 195599135
Nodes/Second: 3854291
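For comparing such results, the reported figures can be cross-checked with a few lines. This is a minimal sketch with illustrative names; it assumes the benchmark reports nodes/second as total nodes divided by total time, rounded down. The Xeon totals reproduce its reported rate exactly. The laptop's node total as posted gives a quotient roughly ten times smaller than its reported rate, so the comparison below uses the reported rates directly.

```python
def nodes_per_second(nodes: int, time_ms: int) -> int:
    """Nodes per second from a node count and a time in milliseconds (rounded down)."""
    return nodes * 1000 // time_ms


# Figures copied from the post.
XEON = {"time_ms": 169889, "nodes": 3593687704, "nps": 21153151}
LAPTOP = {"time_ms": 507486, "nodes": 195599135, "nps": 3854291}

xeon_check = nodes_per_second(XEON["nodes"], XEON["time_ms"])
laptop_check = nodes_per_second(LAPTOP["nodes"], LAPTOP["time_ms"])
nps_ratio = XEON["nps"] / LAPTOP["nps"]

print(f"Xeon recomputed:   {xeon_check} (reported {XEON['nps']})")
print(f"Laptop recomputed: {laptop_check} (reported {LAPTOP['nps']})")
print(f"Xeon / laptop node speed: {nps_ratio:.2f}x")
```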