Shared cache lines. Use per-thread node counters, etc.
Are you using locked instructions or heavily accessed atomic variables?
On Linux, use "perf c2c" to find the cache lines that are shared, either because of a real shared variable or because of falsely shared variables (thread-specific ones that happen to sit in the same cache line).
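A minimal sketch of the per-thread counter idea, assuming C++17 and a 64-byte cache line (typical on x86-64, but not universal). The names `PaddedCounter` and `count_nodes` are made up for illustration and do not come from any particular engine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Adjust for other hardware.
constexpr std::size_t kCacheLine = 64;

// One counter per thread, aligned so that no two counters share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Hypothetical stand-in for a search: each thread increments only its own counter.
inline std::uint64_t count_nodes(unsigned num_threads, std::uint64_t nodes_per_thread) {
    assert(num_threads > 0);
    std::vector<PaddedCounter> counters(num_threads);  // C++17 honours the alignment here
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, nodes_per_thread] {
            for (std::uint64_t i = 0; i < nodes_per_thread; ++i)
                ++counters[t].nodes;  // plain increment: no atomic, no shared line
        });
    }
    for (auto& w : workers) w.join();

    // The total is summed only when it is needed, e.g. for an NPS report.
    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.nodes;
    return total;
}
```

Compared with a single shared atomic counter, no cache line is written by more than one thread here, so there should be nothing for "perf c2c" to flag on the counters themselves.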
Lazy SMP >4 Thread Slowdown
Re: Lazy SMP >4 Thread Slowdown
nitrocan wrote: I have recently implemented lazy smp and it works remarkably well with up to 4 cores. One weird thing that I did notice though is that as soon as I start using 8/16 cores, the nps actually starts decreasing instead of increasing. Same with the depth that it searches per unit time. Any ideas on why this might be happening? Could it be that too many threads are contending for the transposition table? All ideas are very welcome, thanks!
Perhaps there is a shared variable that is a bottleneck.
E.g. an eval hash (if global) might be better in thread-local storage, so that each thread gets its own.
Besides the main hash table, what else is a public object in your program?
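A rough sketch of what a thread-local eval hash could look like in C++. The entry layout, table size, and function names (`EvalEntry`, `probe_eval`, `store_eval`) are assumptions for illustration, not taken from any engine:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical entry: the full key for verification plus the cached score.
struct EvalEntry {
    std::uint64_t key = 0;
    int score = 0;
    bool valid = false;
};

constexpr std::size_t kEvalCacheSize = std::size_t{1} << 14;  // assumed size
static_assert((kEvalCacheSize & (kEvalCacheSize - 1)) == 0, "size must be a power of two");

// Each thread lazily creates its own table on first use, so no entry is
// ever read or written by two threads.
inline std::vector<EvalEntry>& eval_cache() {
    thread_local std::vector<EvalEntry> cache(kEvalCacheSize);
    return cache;
}

// Map a position key to a slot in the current thread's table.
inline std::size_t slot_for(std::uint64_t key) {
    const std::size_t slot = static_cast<std::size_t>(key & (kEvalCacheSize - 1));
    assert(slot < kEvalCacheSize);
    return slot;
}

// Returns true and fills `score` if this thread has cached the position.
inline bool probe_eval(std::uint64_t key, int& score) {
    const EvalEntry& e = eval_cache()[slot_for(key)];
    if (e.valid && e.key == key) {
        score = e.score;
        return true;
    }
    return false;
}

// Always-replace store into the current thread's table.
inline void store_eval(std::uint64_t key, int score) {
    EvalEntry& e = eval_cache()[slot_for(key)];
    e.key = key;
    e.score = score;
    e.valid = true;
}
```

The trade-off is that threads no longer benefit from each other's cached evaluations, and memory use grows with the thread count.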
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: Lazy SMP >4 Thread Slowdown
Dann Corbit wrote: Perhaps there is a shared variable that is a bottleneck. E.g. an eval hash (if global) might be better as a thread local storage so that each thread gets its own. Besides the main hash table, what else is a public object in your program?
That was pretty much exactly it. After making:
-Node counter
-Killer moves
-Counter moves
-History
thread-specific, the engine now scales with the number of threads as intended. I'm sure there are other things I should look into as well, but so far so good! Thanks, everyone!
If anyone's interested, here's the pull request that addresses this issue:
https://github.com/nitrocan/sctr/pull/24/files
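For readers who do not want to open the pull request, the general shape of such a change might look like the C++ sketch below. It is not taken from the PR. The struct name `ThreadData`, the `Move` encoding, and the table dimensions are all assumptions made for illustration:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Move = std::uint16_t;   // hypothetical packed move encoding
constexpr int kMaxPly = 128;  // assumed maximum search ply
constexpr int kSquares = 64;
constexpr int kPieces = 12;   // 6 piece types x 2 colours

// Everything a search thread writes frequently lives in its own instance,
// instead of in globals shared by all threads.
struct ThreadData {
    std::uint64_t nodes = 0;
    std::array<std::array<Move, 2>, kMaxPly> killers{};
    std::array<std::array<Move, kSquares>, kPieces> counter_moves{};
    std::array<std::array<int, kSquares>, kPieces> history{};

    // Keep the two most recent distinct killer moves for a ply.
    void record_killer(int ply, Move m) {
        assert(ply >= 0 && ply < kMaxPly);
        if (killers[ply][0] != m) {
            killers[ply][1] = killers[ply][0];
            killers[ply][0] = m;
        }
    }
};

// The node total is computed on demand by summing the per-thread counters.
inline std::uint64_t total_nodes(const std::vector<ThreadData>& threads) {
    std::uint64_t sum = 0;
    for (const auto& td : threads) sum += td.nodes;
    return sum;
}
```

Each search thread would receive a reference to its own `ThreadData`. Neighbouring instances in the vector can still share a cache line at their boundaries, so aligning the struct to the cache-line size may be worth trying if "perf c2c" still reports contention.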