Hi,
tpoppins from CCRL noticed that Embla slows down when the number of threads increases (when using lazy SMP).
I thought I had seen that this only happened on Windows, but that is not correct: it also happens on Linux.
[Graph: number of threads versus nps on a Threadripper 1950X]
The dramatic slowdown is probably because other things were running on it (e.g. the Chrome browser).
Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
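For readers who have not met the xor-trick: the sketch below shows the general idea of a lockless table entry, not Embla's actual code. The names `Entry`, `LocklessTable`, `store` and `probe` are made up for illustration, and the 64-bit packed `data` word is an assumption. The key is stored XORed with the data, so an entry whose two words came from different writers fails the check on probe and is simply treated as a miss.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry layout: two 64-bit words that are written and read
// independently, so a reader may see words from two different writers.
struct Entry {
    std::atomic<std::uint64_t> key_xor_data{0};  // holds key ^ data
    std::atomic<std::uint64_t> data{0};          // packed move/score/depth (assumed)
};

class LocklessTable {
public:
    explicit LocklessTable(std::size_t size) : entries_(size) { assert(size > 0); }

    // No lock: both words are written with plain relaxed stores.
    void store(std::uint64_t key, std::uint64_t data) {
        Entry& e = entries_[key % entries_.size()];
        e.key_xor_data.store(key ^ data, std::memory_order_relaxed);
        e.data.store(data, std::memory_order_relaxed);
    }

    // Returns true and fills `out` only if the two words are consistent with
    // `key`. A torn entry (words from different stores) fails the XOR check.
    // Note: a never-written slot (0, 0) would match key 0 in this sketch.
    bool probe(std::uint64_t key, std::uint64_t& out) const {
        const Entry& e = entries_[key % entries_.size()];
        const std::uint64_t kx = e.key_xor_data.load(std::memory_order_relaxed);
        const std::uint64_t d = e.data.load(std::memory_order_relaxed);
        if ((kx ^ d) != key) return false;
        out = d;
        return true;
    }

private:
    std::vector<Entry> entries_;
};
```

Note that the absence of locks only removes waiting. Every store still has to move the cache line holding the entry between cores, which is one of the costs worth measuring.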
strategies for finding slowdowns in lazy smp
-
- Posts: 481
- Joined: Tue Jul 03, 2018 10:19 am
- Full name: Folkert van Heusden
-
- Posts: 819
- Joined: Fri Dec 01, 2006 10:46 pm
- Location: Mountain View, CA, USA
- Full name: Aart Bik
Re: strategies for finding slowdowns in lazy smp
This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.
I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.
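One way to take this elimination idea a step further is a baseline with no shared data at all. The sketch below is an assumption-laden stand-in, not engine code: `private_work` and `nodes_per_thread` are hypothetical names, and the "search" is just arithmetic on thread-private state. If the per-thread counts from this harness also drop as threads are added, the cause is more likely the machine (throttling, SMT siblings, other processes) than the engine's data structures.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for a search: xorshift arithmetic on thread-private state only.
// Returns the number of "nodes" (iterations) done before `stop` was seen.
static std::uint64_t private_work(const std::atomic<bool>& stop, std::uint64_t seed) {
    std::uint64_t x = seed | 1, nodes = 0;
    do {
        for (int i = 0; i < 1024; ++i) { x ^= x << 13; x ^= x >> 7; x ^= x << 17; }
        nodes += 1024;
    } while (!stop.load(std::memory_order_relaxed));
    return nodes + (x == 0);  // use x so the loop cannot be optimised away
}

// Runs n_threads workers for roughly run_time and returns each worker's count.
std::vector<std::uint64_t> nodes_per_thread(unsigned n_threads,
                                            std::chrono::milliseconds run_time) {
    assert(n_threads > 0);
    std::atomic<bool> stop{false};
    std::vector<std::uint64_t> counts(n_threads, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n_threads; ++t)
        workers.emplace_back([&stop, &counts, t] {
            counts[t] = private_work(stop, t + 1);  // written once, at the end
        });
    std::this_thread::sleep_for(run_time);
    stop.store(true);
    for (auto& w : workers) w.join();
    return counts;
}
```

Calling `nodes_per_thread(n, std::chrono::milliseconds(1000))` for n = 1, 2, 4, ... and comparing the per-thread counts gives a curve to hold next to the engine's own.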
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: strategies for finding slowdowns in lazy smp
A profiler might also be helpful. You could try OProfile for Linux (http://oprofile.sourceforge.net/news/).
--Jon
-
- Posts: 12542
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: strategies for finding slowdowns in lazy smp
flok wrote: ↑Tue Jun 04, 2019 9:06 pm
> Hi,
> tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
> I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.
> Number of threads versus nps on a threadripper 1950x
> The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).
> Now my question is: what are strategies for finding what causes this slow down?
> The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
Interesting that there are 16 cores and 32 active threads for that CPU.
There is a huge nosedive at 33 cores.
I think that the graph is exactly what we would expect.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 12542
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: strategies for finding slowdowns in lazy smp
Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: strategies for finding slowdowns in lazy smp
Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.
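To make the false-sharing point concrete, here is a small sketch of the usual countermeasure for per-thread data such as node counters. It assumes 64-byte cache lines (true for most current x86 parts, but an assumption all the same), and the names `PackedCounter`, `PaddedCounter` and `counters_per_line` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Packed: eight of these fit in one cache line, so eight threads bumping
// "their own" counter still fight over the same line (false sharing).
struct PackedCounter {
    std::uint64_t nodes = 0;
};

// Padded: alignas rounds size and alignment up to a full line, so each
// thread's counter lives in a cache line of its own.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};

static_assert(sizeof(PackedCounter) == 8, "packed counter is one 64-bit word");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded counter fills a line");

// How many objects of type T can end up sharing a single cache line.
template <typename T>
constexpr std::size_t counters_per_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}
```

The same layout question applies to any per-thread state that is written often: history tables, killer moves, statistics. If such arrays for different threads are allocated back to back, their edges can share a line even when nothing is logically shared.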
-
- Posts: 12542
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: strategies for finding slowdowns in lazy smp
I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?
-
- Posts: 481
- Joined: Tue Jul 03, 2018 10:19 am
- Full name: Folkert van Heusden
Re: strategies for finding slowdowns in lazy smp
Hi Aart,
abik wrote: ↑Tue Jun 04, 2019 9:51 pm
> This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.
This happens on many systems. My laptop (core i7) and desktop (threadripper 1950x) but also the dell i5 at work. For the first two (which are linux) the scheduler has been set to "performance".
> I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.
Good point!
Will try that.
I'll also pin each thread to a core/thread.
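On Linux, pinning can be sketched with `sched_getaffinity` and `sched_setaffinity`. With a pid of 0 both calls act on the calling thread, so each search thread would pin itself when it starts. This is Linux/glibc specific; the helper names `allowed_cpus` and `pin_current_thread` are hypothetical, and Windows would need a different API.

```cpp
#include <cassert>
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux/glibc)
#include <vector>

// CPUs the calling thread is currently allowed to run on.
// Returns an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
    return cpus;
}

// Pins the calling thread to a single logical CPU. Returns true on success.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

A search thread with index i could call `pin_current_thread(cpus[i % cpus.size()])`, where `cpus` was obtained from `allowed_cpus()` before any pinning. On an SMT CPU two logical CPUs share one physical core, and how they are numbered differs between systems, so it is worth checking the topology before deciding which logical CPUs to use for the first 16 threads.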
-
- Posts: 481
- Joined: Tue Jul 03, 2018 10:19 am
- Full name: Folkert van Heusden
Re: strategies for finding slowdowns in lazy smp
Dann Corbit wrote: ↑Wed Jun 05, 2019 2:25 am
> flok wrote: ↑Tue Jun 04, 2019 9:06 pm
> > Hi,
> > tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
> > I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.
> > Number of threads versus nps on a threadripper 1950x
> > The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).
> > Now my question is: what are strategies for finding what causes this slow down?
> > The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
> Interesting that there are 16 cores and 32 active threads for that CPU.
Yes, this cpu has 2 threads per core.
> There is a huge nosedive at 33 cores.
That's at 32 actually.
> I think that the graph is exactly what we would expect.
Is it? Because stockfish for example shows noise in the nps but no nose-dive (well a tiny one but it had to share that laptop with a browser and other mess):
# threads nps nps/thread
1: 1924447 1924447
2: 3516980 1758490
3: 5726175 1908725
4: 7295187 1823796
5: 9950080 1990016
6: 9704585 1617430
Dann Corbit wrote: ↑Wed Jun 05, 2019 2:27 am
> Err, I have a question.
> I assumed that the graph was NPS per core. Is that correct?
> If that is NPS for the program, then something is totally broken.
bob wrote: ↑Wed Jun 05, 2019 5:27 am
> Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.
That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:
Dann Corbit wrote: ↑Wed Jun 05, 2019 7:54 am
> I went to the github site:
> https://github.com/flok99/Embla
> And expected to see threads.h or something of that nature for the threading code.
> Where is your SMP stuff at?
The version on GitHub is a new rewrite, not the one I am currently working on. I'm going to drop that rewrite as its movegen is slower than the previous version.
Anyway, this is the code: https://vanheusden.com/Embla/files/embla-2.0.8.tgz
Brain.cpp contains search, eval and threading.
There's a "thread()" function and a "calculateMove" method which do the searching. calculateMove invokes search() and starts n - 1 threads via the thread() function.
Tpt.cpp is the hashtable. As you can see, the transposition table has been disabled for the tests.
-
- Posts: 12542
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: strategies for finding slowdowns in lazy smp
> That graph showed the nps for 1 thread.
> This new graph shows the average nps for all threads:
Something is very wrong with the calculation.
The aggregate NPS is the sum of the NPS for all threads.
How can it be less than the NPS for one thread?
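The arithmetic behind this objection can be written down in a few lines. The helper names below are hypothetical; the only real numbers used are the Stockfish figures quoted earlier in the thread (1924447 nps with 1 thread, 9704585 nps with 6 threads).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Aggregate NPS: nodes searched by all threads together, per second.
double aggregate_nps(const std::vector<std::uint64_t>& nodes_per_thread, double seconds) {
    assert(seconds > 0.0);
    std::uint64_t total = 0;
    for (std::uint64_t n : nodes_per_thread) total += n;
    return static_cast<double>(total) / seconds;
}

// Average NPS of a single thread, given the aggregate.
double per_thread_nps(double aggregate, unsigned threads) {
    assert(threads > 0);
    return aggregate / threads;
}

// 1.0 means perfect scaling: n threads deliver n times the 1-thread NPS.
double scaling_efficiency(double aggregate_n, unsigned threads, double aggregate_1) {
    assert(threads > 0 && aggregate_1 > 0.0);
    return aggregate_n / (threads * aggregate_1);
}
```

With the quoted Stockfish numbers, `scaling_efficiency(9704585.0, 6, 1924447.0)` comes out at roughly 0.84. A per-thread average that falls as threads are added is therefore normal; an aggregate that falls below the 1-thread figure would point to a counting or measurement problem.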