strategies for finding slowdowns in lazy smp

flok · Post by **flok** » Tue Jun 04, 2019 9:06 pm

Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

[Graph: number of threads versus nps on a Threadripper 1950X]

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
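For readers unfamiliar with the xor-trick, here is a minimal sketch of the idea. It is not Embla's actual Tpt.cpp; the names `TTEntry` and `LocklessTT` and the two-word entry layout are assumptions made for illustration. The key is stored XORed with the data word, so an entry torn by two concurrent writers fails the check on probe and is treated as a miss.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical two-word entry; a real engine packs move/score/depth into data.
struct TTEntry {
    std::atomic<std::uint64_t> key_xor_data{0};  // holds key ^ data
    std::atomic<std::uint64_t> data{0};          // packed payload
};

class LocklessTT {
public:
    explicit LocklessTT(std::size_t entries) : table_(entries) {
        assert(entries > 0);
    }

    // The two words are written independently; no lock is taken.
    void store(std::uint64_t key, std::uint64_t data) {
        TTEntry& e = table_[key % table_.size()];
        e.key_xor_data.store(key ^ data, std::memory_order_relaxed);
        e.data.store(data, std::memory_order_relaxed);
    }

    // Returns the payload only if both words belong to the same store for
    // this key. Note: an empty slot (all zeros) matches key 0 in this sketch.
    std::optional<std::uint64_t> probe(std::uint64_t key) const {
        const TTEntry& e = table_[key % table_.size()];
        const std::uint64_t data = e.data.load(std::memory_order_relaxed);
        const std::uint64_t check = e.key_xor_data.load(std::memory_order_relaxed);
        if ((check ^ data) != key) return std::nullopt;  // other key or torn entry
        return data;
    }

private:
    std::vector<TTEntry> table_;
};
```

The trick removes the lock, but every store still writes to a cache line that other cores may hold, so the table remains shared, contended memory.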

abik · Post by **abik** » Tue Jun 04, 2019 9:51 pm

flok wrote: ↑Tue Jun 04, 2019 9:06 pm Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.

This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is no penalty for moving data between processors). Or CPU throttling may kick in if the system runs hot.

I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.
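In the same spirit of eliminating causes one by one, false sharing of per-thread bookkeeping (node counters, for instance) can be ruled out by giving each thread its own cache line. The following is only a sketch: `PaddedCounter`, `run_workers` and the 64-byte line size are assumptions, and the inner loop is a stand-in for the real search.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size

// One counter per thread, each on its own cache line, so that incrementing
// one counter never invalidates the line holding a neighbour's counter.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "counter must fill one line");

// Runs `threads` workers that each count `nodes_per_thread` nodes in their
// private counter, then returns the aggregate node count.
std::uint64_t run_workers(unsigned threads, std::uint64_t nodes_per_thread) {
    assert(threads > 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, nodes_per_thread] {
            for (std::uint64_t i = 0; i < nodes_per_thread; ++i)
                ++counters[t].nodes;  // stand-in for searching one node
        });
    }
    for (auto& th : pool) th.join();

    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.nodes;  // sum only after join
    return total;
}
```

If the slowdown persists with private counters and a private (or disabled) transposition table, shared data is probably not the cause, and throttling or memory bandwidth become the more likely suspects.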

jdart · Post by **jdart** » Wed Jun 05, 2019 12:20 am

A profiler might also be helpful. You could try OProfile for Linux (http://oprofile.sourceforge.net/news/).

--Jon

Dann Corbit · Post by **Dann Corbit** » Wed Jun 05, 2019 2:25 am

flok wrote: ↑Tue Jun 04, 2019 9:06 pm Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.

Interesting that there are 16 cores and 32 active threads for that CPU.
There is a huge nosedive at 33 cores.
I think that the graph is exactly what we would expect.

Dann Corbit · Post by **Dann Corbit** » Wed Jun 05, 2019 2:27 am

Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.

bob · Post by **bob** » Wed Jun 05, 2019 5:27 am

Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.

Dann Corbit · Post by **Dann Corbit** » Wed Jun 05, 2019 7:54 am

I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?

flok · Post by **flok** » Wed Jun 05, 2019 9:42 am

Hi Aart,

abik wrote: ↑Tue Jun 04, 2019 9:51 pm
flok wrote: ↑Tue Jun 04, 2019 9:06 pm Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.

This happens on many systems: my laptop (Core i7) and desktop (Threadripper 1950X), but also the Dell i5 at work. For the first two (which run Linux) the CPU frequency governor has been set to "performance".

abik wrote: I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.

Good point!
Will try that.

I'll also pin each thread to a core/thread.
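On Linux with glibc, pinning could look roughly like the sketch below. The helper names `first_allowed_cpu` and `pin_current_thread_to_cpu` are made up for this example; `sched_getaffinity` and `pthread_setaffinity_np` are the real (non-portable, GNU-specific) calls. Each search thread would call the pin function once at startup with its own CPU number.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Returns the lowest-numbered CPU this process is allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single logical CPU.
// Returns true on success.
bool pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

On a CPU with 2 threads per core it is worth checking which logical CPU numbers are siblings of the same physical core, since two search threads pinned to siblings share that core's execution resources.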

flok · Post by **flok** » Wed Jun 05, 2019 10:06 am

Dann Corbit wrote: ↑Wed Jun 05, 2019 2:25 am
flok wrote: ↑Tue Jun 04, 2019 9:06 pm Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slow down?
The threads share no common variables apart from the transposition table. That tt has no locks, it uses the xor-trick.
Interesting that there are 16 cores and 32 active threads for that CPU.

Yes, this cpu has 2 threads per core.

Dann Corbit wrote: There is a huge nosedive at 33 cores.

That's at 32 actually.

Dann Corbit wrote: I think that the graph is exactly what we would expect.

Is it? Because stockfish for example shows noise in the nps but no nose-dive (well a tiny one but it had to share that laptop with a browser and other mess):

  # threads nps    nps/thread
        1: 1924447 1924447
        2: 3516980 1758490
        3: 5726175 1908725
        4: 7295187 1823796
        5: 9950080 1990016
        6: 9704585 1617430
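The nps/thread column above is simply the total nps divided by the thread count, rounded down. A small sketch of that arithmetic, plus a scaling-efficiency figure relative to the single-thread run, is shown here; the function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Total nps divided by the thread count (integer division, rounds down).
std::uint64_t nps_per_thread(std::uint64_t total_nps, unsigned threads) {
    assert(threads > 0);
    return total_nps / threads;
}

// Fraction of ideal linear scaling: 1.0 means every thread runs as fast
// as the single-thread baseline.
double scaling_efficiency(std::uint64_t total_nps, unsigned threads,
                          std::uint64_t single_thread_nps) {
    assert(threads > 0 && single_thread_nps > 0);
    return static_cast<double>(total_nps) /
           (static_cast<double>(threads) * static_cast<double>(single_thread_nps));
}
```

For the 6-thread row, `nps_per_thread(9704585, 6)` gives 1617430, and `scaling_efficiency(9704585, 6, 1924447)` comes out at roughly 0.84. An engine whose efficiency drops far below that as threads are added is the case this thread is about.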

Dann Corbit wrote: ↑Wed Jun 05, 2019 2:27 am Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.

bob wrote: ↑Wed Jun 05, 2019 5:27 am Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.

That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:

Dann Corbit wrote: ↑Wed Jun 05, 2019 7:54 am I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?

The version on GitHub is a new rewrite, not the one I'm currently working on. I'm going to drop that rewrite as its movegen is slower than the previous version.

Anyway, this is the code: https://vanheusden.com/Embla/files/embla-2.0.8.tgz
Brain.cpp contains search and eval and threading.
There's a "thread()" function and a "calculateMove" method which do the searching. calculateMove invokes search() and starts n - 1 threads via the thread() function.
Tpt.cpp is the hash table. As you can see, the transposition table has been disabled for the tests.

Dann Corbit · Post by **Dann Corbit** » Wed Jun 05, 2019 10:12 am

flok wrote: That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:

Something is very wrong with the calculation.
The aggregate NPS is the sum of the NPS for all threads.
How can it be less than the NPS for one thread?
