strategies for finding slowdowns in lazy SMP

flok
Posts: 481
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

strategies for finding slowdowns in lazy SMP

Post by flok »

Hi,

tpoppins from CCRL noticed that Embla slows down when the number of threads increases (when using lazy SMP).
I thought I had seen this only on Windows, but that is not correct: it also happens on Linux.

Number of threads versus nps on a Threadripper 1950X:
[Image: nps versus number of threads]

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding the cause of this slowdown?
The threads share no common variables apart from the transposition table. That TT has no locks; it uses the XOR trick.
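For readers unfamiliar with the XOR trick mentioned above, here is a minimal sketch of a lockless table slot. It is not taken from Embla's Tpt.cpp; the names `TTSlot` and `LocklessTT` and the 64-bit payload are assumptions made for illustration. The stored key is `hash ^ data`, so a slot that was half-written by two threads fails the consistency check and is treated as a miss.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One table slot. The key word holds (hash ^ data), so a torn write
// (key from one thread, data from another) fails the check in probe().
struct TTSlot {
    std::atomic<std::uint64_t> key_xor_data{0};
    std::atomic<std::uint64_t> data{0};
};

// Hypothetical lockless table; the payload layout is an assumption.
class LocklessTT {
public:
    explicit LocklessTT(std::size_t slots) : table_(slots) { assert(slots > 0); }

    void store(std::uint64_t hash, std::uint64_t data) {
        TTSlot &s = table_[hash % table_.size()];
        s.key_xor_data.store(hash ^ data, std::memory_order_relaxed);
        s.data.store(data, std::memory_order_relaxed);
    }

    // Returns true and fills `out` only if the slot is consistent for `hash`.
    bool probe(std::uint64_t hash, std::uint64_t &out) const {
        const TTSlot &s = table_[hash % table_.size()];
        const std::uint64_t k = s.key_xor_data.load(std::memory_order_relaxed);
        const std::uint64_t d = s.data.load(std::memory_order_relaxed);
        if ((k ^ d) != hash) return false;  // empty, other position, or torn
        out = d;
        return true;
    }

private:
    std::vector<TTSlot> table_;
};
```

Note that avoiding locks does not avoid sharing: every store still writes a cache line that other cores may hold, so the table can still cost time as the thread count grows.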
abik
Posts: 819
Joined: Fri Dec 01, 2006 10:46 pm
Location: Mountain View, CA, USA
Full name: Aart Bik

Re: strategies for finding slowdowns in lazy SMP

Post by abik »

flok wrote: Tue Jun 04, 2019 9:06 pm
Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.

This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (making the access lockless does not mean there is no penalty for moving data between processors). Or CPU throttling may kick in if the system runs hot.

I would start eliminating the issues one by one. For example, run all threads with no transposition table, or with a private one per thread (yes, I know, that is awful for actual chess performance), just to see whether you observe the same slowdown.
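A minimal sketch of that elimination experiment, under stated assumptions: `fake_search` is a hypothetical stand-in for a real search thread that only hammers a table, and `measure_nps` runs the workers against either one shared table or one private table per thread. Comparing the two numbers at several thread counts shows whether sharing the table is what hurts.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

using Table = std::vector<std::atomic<std::uint64_t>>;

// Hypothetical stand-in for a search thread: performs `iters` pseudo-random
// read-modify-writes on `table` and returns the number of "nodes" visited.
std::uint64_t fake_search(Table &table, std::uint64_t seed, std::uint64_t iters) {
    std::uint64_t x = seed | 1;  // xorshift state must be non-zero
    for (std::uint64_t i = 0; i < iters; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  // xorshift64
        table[x % table.size()].fetch_add(1, std::memory_order_relaxed);
    }
    return iters;
}

// Runs n_threads workers against one shared table (shared == true) or one
// private table each, and returns total nodes per wall-clock second.
double measure_nps(unsigned n_threads, bool shared, std::size_t slots,
                   std::uint64_t iters) {
    assert(n_threads > 0 && slots > 0);
    std::vector<std::unique_ptr<Table>> tables;
    for (unsigned i = 0; i < (shared ? 1u : n_threads); ++i)
        tables.push_back(std::make_unique<Table>(slots));

    std::vector<std::uint64_t> nodes(n_threads, 0);
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < n_threads; ++i) {
        Table &t = *tables[shared ? 0 : i];
        workers.emplace_back([&t, &nodes, i, iters] {
            nodes[i] = fake_search(t, i + 1, iters);
        });
    }
    for (auto &w : workers) w.join();
    const double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - start).count();

    std::uint64_t total = 0;
    for (std::uint64_t n : nodes) total += n;
    return total / secs;
}
```

Calling `measure_nps(n, true, slots, iters)` and `measure_nps(n, false, slots, iters)` for n = 1, 2, 4, ... gives two curves to compare. The absolute numbers depend entirely on the machine, so none are quoted here.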
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: strategies for finding slowdowns in lazy SMP

Post by jdart »

A profiler might also be helpful. You could try OProfile for Linux (http://oprofile.sourceforge.net/news/).

--Jon
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: strategies for finding slowdowns in lazy SMP

Post by Dann Corbit »

flok wrote: Tue Jun 04, 2019 9:06 pm
Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x
Image

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.

Interesting that there are 16 cores and 32 active threads for that CPU.
There is a huge nosedive at 33 cores.
I think that the graph is exactly what we would expect.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: strategies for finding slowdowns in lazy SMP

Post by Dann Corbit »

Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: strategies for finding slowdowns in lazy SMP

Post by bob »

Good observation. Either this is (a) total NPS divided by the number of threads, or (b) it is broken. About the only thing NPS is useful for is detecting architectural issues, such as cache thrashing / false sharing, bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it sometimes becomes interesting to figure out what is going on.
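False sharing is easy to introduce by accident, for example with per-thread node counters packed next to each other in one array. A minimal sketch of the usual fix, assuming a 64-byte cache line (common on x86-64, but an assumption) and C++17 so that `std::vector` honours the alignment; `PaddedCounter` and `count_nodes` are illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64 bytes is the cache-line size of the target CPU.
constexpr std::size_t kCacheLine = 64;

// Per-thread node counter padded to a full cache line, so two threads never
// increment counters that live on the same line (false sharing).
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each worker increments only its own counter; returns the summed total.
std::uint64_t count_nodes(unsigned n_threads, std::uint64_t per_thread) {
    assert(n_threads > 0);
    std::vector<PaddedCounter> counters(n_threads);
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i)
        workers.emplace_back([&counters, i, per_thread] {
            for (std::uint64_t n = 0; n < per_thread; ++n)
                ++counters[i].nodes;
        });
    for (auto &w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto &c : counters) total += c.nodes;
    return total;
}
```

Removing the `alignas` (and the `static_assert`) packs eight counters into one line. On a real search that layout can make every node-count increment contend with the neighbouring threads.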
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: strategies for finding slowdowns in lazy SMP

Post by Dann Corbit »

I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?
flok
Posts: 481
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: strategies for finding slowdowns in lazy SMP

Post by flok »

Hi Aart,

abik wrote: Tue Jun 04, 2019 9:51 pm
This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.

This happens on many systems: my laptop (Core i7) and desktop (Threadripper 1950X), but also the Dell i5 at work. On the first two (which run Linux) the CPU frequency governor has been set to "performance".

abik wrote:
I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.

Good point!
Will try that.

I'll also pin each thread to a core/thread.
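On Linux, pinning can be sketched with `pthread_setaffinity_np`, a GNU extension that is not available on Windows. This is a sketch under assumptions, not Embla's code: the helper names are hypothetical and the worker body is left empty. Each worker pins itself to one of the logical CPUs the process is allowed to use.

```cpp
#include <pthread.h>
#include <sched.h>

#include <cassert>
#include <thread>
#include <vector>

// Returns the logical CPU ids this process is currently allowed to run on.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) cpus.push_back(cpu);
    return cpus;
}

// Pins the calling thread to a single logical CPU; returns true on success.
bool pin_current_thread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Starts n_threads workers, each pinning itself to one allowed CPU
// (round-robin). Returns how many workers were pinned successfully.
unsigned run_pinned(unsigned n_threads) {
    const std::vector<int> cpus = allowed_cpus();
    assert(!cpus.empty());
    std::vector<int> pinned(n_threads, 0);  // one flag per worker
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n_threads; ++i) {
        const int cpu = cpus[i % cpus.size()];
        workers.emplace_back([&pinned, i, cpu] {
            pinned[i] = pin_current_thread(cpu) ? 1 : 0;
            // ... the search for this thread would run here ...
        });
    }
    for (auto &w : workers) w.join();

    unsigned ok = 0;
    for (int p : pinned) ok += p;
    return ok;
}
```

Which logical CPU ids are hyperthread siblings of the same physical core depends on the system, so the round-robin order used here is only a starting point for experiments.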
flok
Posts: 481
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: strategies for finding slowdowns in lazy SMP

Post by flok »

Dann Corbit wrote: Wed Jun 05, 2019 2:25 am
Interesting that there are 16 cores and 32 active threads for that CPU.

Yes, this CPU has 2 threads per core.

Dann Corbit wrote:
There is a huge nosedive at 33 cores.

That's at 32, actually.

Dann Corbit wrote:
I think that the graph is exactly what we would expect.

Is it? Stockfish, for example, shows noise in the nps but no nosedive (well, a tiny one, but it had to share that laptop with a browser and other mess):

  threads   total nps   nps/thread
        1     1924447      1924447
        2     3516980      1758490
        3     5726175      1908725
        4     7295187      1823796
        5     9950080      1990016
        6     9704585      1617430
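The nps/thread column is simply the total divided by the thread count, rounded down. A small sketch of that arithmetic, plus a scaling-efficiency figure relative to the single-thread run; the `Sample` struct and function names are illustrative, and the numbers are the Stockfish figures from the table:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Sample {
    unsigned threads;
    std::uint64_t total_nps;
};

// Total nps divided by thread count (integer division, as in the table).
std::uint64_t nps_per_thread(const Sample &s) {
    assert(s.threads > 0);
    return s.total_nps / s.threads;
}

// Per-thread nps relative to the single-thread run; 1.0 means perfect scaling.
double efficiency(const Sample &s, const Sample &single) {
    return static_cast<double>(nps_per_thread(s)) /
           static_cast<double>(nps_per_thread(single));
}

// The Stockfish measurements quoted in the table.
std::vector<Sample> stockfish_samples() {
    return {{1, 1924447}, {2, 3516980}, {3, 5726175},
            {4, 7295187}, {5, 9950080}, {6, 9704585}};
}
```

For the 6-thread row this gives 1617430 nps per thread, about 84% of the single-thread figure. That is noisy scaling rather than a collapse.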
Dann Corbit wrote: Wed Jun 05, 2019 2:27 am
Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.

bob wrote: Wed Jun 05, 2019 5:27 am
Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.

That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:

[Image: average nps for all threads]

Dann Corbit wrote: Wed Jun 05, 2019 7:54 am
I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?

The version on GitHub is a new rewrite, not the one I am currently working on. I'm going to drop that rewrite as its movegen is slower than the previous version.

Anyway, this is the code: https://vanheusden.com/Embla/files/embla-2.0.8.tgz
Brain.cpp contains search, eval and threading.
There's a "thread()" function and a "calculateMove" method which do the searching. calculateMove invokes search() and starts n - 1 threads via the thread() function.
Tpt.cpp is the hash table. As you can see, the transposition table has been disabled for the tests.
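The launch pattern described here (the calling thread searches as well, and n - 1 helpers are started alongside it) can be sketched as follows. This is not the code from Brain.cpp: `search` is a hypothetical stand-in that only counts nodes, and `calculate_move` and `stop_flag` are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<bool> stop_flag{false};

// Hypothetical stand-in for search(): counts "nodes" until the budget is
// reached or another thread asks it to stop.
std::uint64_t search(std::uint64_t budget) {
    std::uint64_t nodes = 0;
    while (nodes < budget && !stop_flag.load(std::memory_order_relaxed))
        ++nodes;
    return nodes;
}

// Lazy-SMP style launch: n_threads - 1 helpers plus the calling thread all
// run the same search. Returns the node count of every thread (index 0 is
// the calling thread).
std::vector<std::uint64_t> calculate_move(unsigned n_threads,
                                          std::uint64_t budget) {
    assert(n_threads >= 1);
    stop_flag = false;
    std::vector<std::uint64_t> nodes(n_threads, 0);
    std::vector<std::thread> helpers;
    for (unsigned i = 1; i < n_threads; ++i)
        helpers.emplace_back([&nodes, i, budget] { nodes[i] = search(budget); });

    nodes[0] = search(budget);  // the main thread searches too
    stop_flag = true;           // main thread is done: stop the helpers
    for (auto &h : helpers) h.join();
    return nodes;
}
```

Keeping each thread's node count separate until the end, as done here, makes it possible to report both the aggregate and the per-thread rate, and to spot a single thread that is starved.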
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: strategies for finding slowdowns in lazy SMP

Post by Dann Corbit »

flok wrote:
That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:

Something is very wrong with the calculation.
The aggregate NPS is the sum of the NPS for all threads.
How can it be less than the NPS for one thread?
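The relationship between the two figures can be made explicit with a small sketch. The names `NpsSummary` and `summarize` are illustrative, and the per-thread node counts are assumed to be collected over the same wall-clock interval.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

struct NpsSummary {
    double total;       // aggregate nps: all nodes / wall-clock seconds
    double per_thread;  // average nps of a single thread
};

// Aggregate nps is the sum over all threads; the per-thread average is that
// sum divided by the thread count, so it can never exceed the aggregate.
NpsSummary summarize(const std::vector<std::uint64_t> &nodes_per_thread,
                     double seconds) {
    assert(!nodes_per_thread.empty() && seconds > 0.0);
    const std::uint64_t all = std::accumulate(
        nodes_per_thread.begin(), nodes_per_thread.end(), std::uint64_t{0});
    const double total = static_cast<double>(all) / seconds;
    return {total, total / static_cast<double>(nodes_per_thread.size())};
}
```

With this definition a graph of `per_thread` is expected to fall somewhat as threads are added, while a graph of `total` should keep rising until the hardware threads run out. Labelling which of the two a graph shows removes the ambiguity.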