Thread count limits and core counts

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Thread count limits and core counts

Post by sje »

Thread count limits and core counts

A common approach to handling thread allocation is to dynamically set the maximum number of worker threads to the number of processor cores available. The core count can be either the number of physical cores or the number of hyperthreading virtual cores. The primary motivation for such an assignment is to avoid overhead due to unnecessary context switching.

But is a one-to-one mapping strongly required? How much do context switches cost? Back in the Old Days, I did much of my undergraduate work using a mainframe which had a basic time slice frequency of 4 Hz. At the time, this relatively slow slicing was justified by the time consumed by saving/restoring registers and the time used by the kernel to see if there was something to do other than resuming the user's job.

But that was then, and this is now. Basic time slice frequencies on modern hardware are in the 100 Hz to 1 kHz range, making for smoother operation. A context switch does take time, but not nearly as much as it did in days long ago.

With this in mind, I ran a set of experiments with Symbolic, implementing a run-time option which, when set, told Symbolic that it had 64 cores instead of whatever the real number was.

The basic result: even with many threads instead of just two or four running multi-threaded perft(), the total time overhead was less than three percent. I suspect this could be reduced a bit if I tried some tricky optimizations. It might even be possible to have a negative overhead with search algorithms where the threads can share partial results to reduce the total amount of calculations.

So I would suggest to my fellow authors that they might try a similar experiment with their programs. Just implement an option where the program's FetchCoreCount() function returns 64 or some other number larger than the number of available moves in most reasonable positions. Run some tests and see what happens. If the observed overhead is small or even negative most of the time, then it might be worthwhile to use an inflated core count if doing so simplifies the program.
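
In outline, such an override might look something like this (a minimal sketch, not Symbolic's actual code; CoreCountOverride is just a stand-in for however the option gets stored):

Code: Select all

#include <algorithm>
#include <thread>

// Minimal sketch of a core-count override, not Symbolic's actual code.
// CoreCountOverride is a hypothetical option: 0 means "use the real value",
// anything else (e.g., 64) is reported instead of the hardware count.
static unsigned CoreCountOverride = 0;

unsigned FetchCoreCount()
{
    if (CoreCountOverride != 0)
        return CoreCountOverride;            // pretend we have this many cores

    const unsigned hw = std::thread::hardware_concurrency();
    return std::max(hw, 1u);                 // hardware_concurrency() may report 0
}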
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Thread count limits and core counts

Post by bob »

sje wrote:Thread count limits and core counts

A common approach to handling thread allocation is to dynamically set the maximum number of worker threads to the number of processor cores available. The core count can be either the number of physical cores or the number of hyperthreading virtual cores. The primary motivation for such an assignment is to avoid overhead due to unnecessary context switching.

But is a one-to-one mapping strongly required? How much do context switches cost? Back in the Old Days, I did much of my undergraduate work using a mainframe which had a basic time slice frequency of 4 Hz. At the time, this relatively slow slicing was justified by the time consumed by saving/restoring registers and the time used by the kernel to see if there was something to do other than resuming the user's job.

But that was then, and this is now. Basic time slice frequencies on modern hardware are in the 100 Hz to 1 kHz range, making for smoother operation. A context switch does take time, but not nearly as much as it did in days long ago.

With this in mind, I ran a set of experiments with Symbolic, implementing a run-time option which, when set, told Symbolic that it had 64 cores instead of whatever the real number was.

The basic result: even with many threads instead of just two or four running multi-threaded perft(), the total time overhead was less than three percent. I suspect this could be reduced a bit if I tried some tricky optimizations. It might even be possible to have a negative overhead with search algorithms where the threads can share partial results to reduce the total amount of calculations.

So I would suggest to my fellow authors that they might try a similar experiment with their programs. Just implement an option where the program's FetchCoreCount() function returns 64 or some other number larger than the number of available moves in most reasonable positions. Run some tests and see what happens. If the observed overhead is small or even negative most of the time, then it might be worthwhile to use an inflated core count if doing so simplifies the program.
Context switching is not the issue. Locks are. If you use spin locks, which are far faster than pthreads mutex locks, you have a problem. You can get a thread that is trying to acquire a lock to spin in a tight loop until it exhausts its time quantum and gets moved to the expired queue (this is Linux terminology). The thread that actually holds the lock, and simply needs to execute a few instructions and then clear the lock, has to wait for some other process to "expire" before it can release the lock.

The operating system is not smart enough to realize that one thread is doing useful work in a critical section while the other thread is spinning, waiting to gain access to that critical section; if it chooses the wrong one, you have just added a couple hundred milliseconds of delay to the wait-for-critical-section time...

The typical Linux time slice is on the order of 100-200 ms. Several things play a role in the exact number, including process priority, nice value (if any), "sleep time" (which I would tend to call I/O wait time), and such.
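
For reference, the kind of spin lock in question is just a test-and-set loop like this (a bare-bones sketch, not Crafty's actual lock code):

Code: Select all

#include <atomic>

// Bare-bones test-and-set spin lock (a sketch, not Crafty's actual code).
// With more threads than cores, the thread holding the lock can be
// descheduled while a waiter burns its entire time quantum in this loop.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()
    {
        while (flag.test_and_set(std::memory_order_acquire))
            ;                                // spin until the holder clears the flag
    }
    void unlock()
    {
        flag.clear(std::memory_order_release);
    }
};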
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Thread count limits and core counts

Post by sje »

bob wrote:Context switching is not the issue. Locks are. If you use spin locks, which are far faster than pthreads mutex locks, you have a problem. You can get a thread that is trying to acquire a lock to spin in a tight loop until it exhausts its time quantum and gets moved to the expired queue (this is Linux terminology). The thread that actually holds the lock, and simply needs to execute a few instructions and then clear the lock, has to wait for some other process to "expire" before it can release the lock.

The operating system is not smart enough to realize that one thread is doing useful work in a critical section while the other thread is spinning, waiting to gain access to that critical section; if it chooses the wrong one, you have just added a couple hundred milliseconds of delay to the wait-for-critical-section time...

The typical Linux time slice is on the order of 100-200 ms. Several things play a role in the exact number, including process priority, nice value (if any), "sleep time" (which I would tend to call I/O wait time), and such.
Interestingly, the 3% figure I reported came from a multi-threaded perft() with 44 threads (BusyFEN, split at root) sharing the same transposition table using spinlocks (one spinlock for each of 256 equally partitioned table regions).
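
In rough outline, the locking scheme looks like this (a sketch only, not Symbolic's actual table code; TTEntry and the table size here are made up for illustration):

Code: Select all

#include <atomic>
#include <cstdint>
#include <vector>

// Sketch of a region-partitioned ("striped") transposition table lock,
// not Symbolic's actual code.  Each of kRegions spinlocks guards an equal
// slice of the table, so a probe only contends with probes that land in
// the same region.
constexpr std::size_t kRegions = 256;

struct TTEntry { std::uint64_t key; std::uint64_t data; };   // hypothetical layout

std::vector<TTEntry> table(1u << 20);            // size chosen only for illustration
std::atomic_flag regionLock[kRegions] = {};      // one spin lock per region (clear; assumes C++20)

void ttStore(std::uint64_t hash, std::uint64_t data)
{
    const std::size_t slot   = hash % table.size();
    const std::size_t region = slot * kRegions / table.size();   // equal partitions

    while (regionLock[region].test_and_set(std::memory_order_acquire))
        ;                                        // spin
    table[slot] = TTEntry{ hash, data };
    regionLock[region].clear(std::memory_order_release);
}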

On my quad core AMD Linux box, from /boot/config-3.2.0-4-amd64

Code: Select all

CONFIG_HZ_250=y
CONFIG_HZ=250
Which is 4 msec.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Thread count limits and core counts

Post by bob »

sje wrote:
bob wrote:Context switching is not the issue. Locks are. If you use spin locks, which are far faster than pthreads mutex locks, you have a problem. You can get a thread that is trying to acquire a lock to spin in a tight loop until it exhausts its time quantum and gets moved to the expired queue (this is Linux terminology). The thread that actually holds the lock, and simply needs to execute a few instructions and then clear the lock, has to wait for some other process to "expire" before it can release the lock.

The operating system is not smart enough to realize that one thread is doing useful work in a critical section while the other thread is spinning, waiting to gain access to that critical section; if it chooses the wrong one, you have just added a couple hundred milliseconds of delay to the wait-for-critical-section time...

The typical Linux time slice is on the order of 100-200 ms. Several things play a role in the exact number, including process priority, nice value (if any), "sleep time" (which I would tend to call I/O wait time), and such.
Interestingly, the 3% figure I reported came from a multi-threaded perft() with 44 threads (BusyFEN, split at root) sharing the same transposition table using spinlocks (one spinlock for each of 256 equally partitioned table regions).

On my quad core AMD Linux box, from /boot/config-3.2.0-4-amd64

Code: Select all

CONFIG_HZ_250=y
CONFIG_HZ=250
Which is 4 msec.
That is the timer interrupt frequency, NOT the quantum a process gets before it expires and the scheduler selects the next process to run. Or at least it was, the last time I looked at that specific code.

I don't think it has changed since Ingo's O(1) scheduler was added.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Thread count limits and core counts

Post by sje »

bob wrote:
sje wrote:Interestingly, the 3% figure I reported came from a multi-threaded perft() with 44 threads (BusyFEN, split at root) sharing the same transposition table using spinlocks (one spinlock for each of 256 equally partitioned table regions).

On my quad core AMD Linux box, from /boot/config-3.2.0-4-amd64

Code: Select all

CONFIG_HZ_250=y
CONFIG_HZ=250
Which is 4 msec.
That is the timer interrupt frequency, NOT the quantum a process gets before it expires and the scheduler selects the next process to run. Or at least it was, the last time I looked at that specific code.
Yes. Or maybe, as the details of the scheduler are somewhat opaque.

Every interrupt causes a context switch, and any timer interrupt also has the possibility of running any periodic task the kernel may have, including user thread reactivation.

What I'm saying is that choosing a one-to-one map of worker threads to cores based on the fear of context switch overhead is not justified with modern hardware. The question becomes how much extra overhead, if any, is incurred by suspending a thread which holds a spinlock while some other, non-suspended thread spins waiting for that lock.

The answer here is that it depends on various factors, mostly the probability of spinlocked resource contention. In my tests, the overhead was under 3%. I could probably reduce this by a factor of four simply by shrinking each transposition table spinlock region to a quarter of its size, that is, by increasing the number of spinlocks per table from 256 to 1,024.

Perhaps we need more data points. Could you try a simple test with Crafty? Build a twin version which is identical except that it thinks it's running on a 64-core processor, and run a few benchmarks which exercise multi-threaded operation. I'll guess that the observed overhead for search will be less than my 3% figure for perft().
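
For a quick-and-dirty version of that experiment, something along these lines would do (a sketch, not Crafty's or Symbolic's code; doWork() here is just a stand-in for one thread's share of a perft()-style job):

Code: Select all

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

// Stand-in for one thread's share of a perft()-style workload.
void doWork(unsigned /*threadId*/, unsigned threadCount)
{
    volatile std::uint64_t x = 0;
    for (std::uint64_t k = 0; k < 400000000ULL / threadCount; ++k)
        x += k;                                  // busy arithmetic, nothing chess-specific
}

// Run the same total workload split across threadCount threads and time it.
double timedRun(unsigned threadCount)
{
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < threadCount; ++i)
        pool.emplace_back(doWork, i, threadCount);
    for (auto& t : pool)
        t.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

int main()
{
    const unsigned real = std::max(std::thread::hardware_concurrency(), 1u);
    std::printf("real count (%2u threads): %.2f s\n", real, timedRun(real));
    std::printf("inflated   (64 threads): %.2f s\n", timedRun(64));
}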
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Thread count limits and core counts

Post by bob »

sje wrote:
bob wrote:
sje wrote:Interestingly, the 3% figure I reported came from a multi-threaded perft() with 44 threads (BusyFEN, split at root) sharing the same transposition table using spinlocks (one spinlock for each of 256 equally partitioned table regions).

On my quad core AMD Linux box, from /boot/config-3.2.0-4-amd64

Code: Select all

CONFIG_HZ_250=y
CONFIG_HZ=250
Which is 4 msec.
That is the timer interrupt frequency, NOT the quantum a process gets before it expires and the scheduler selects the next process to run. Or at least it was, the last time I looked at that specific code.
Yes. Or maybe, as the details of the scheduler are somewhat opaque.

Every interrupt causes a context switch, and any timer interrupt also has the possibility of running any periodic task the kernel may have, including user thread reactivation.

What I'm saying is that choosing a one-to-one map of worker threads to cores based on the fear of context switch overhead is not justified with modern hardware. The question becomes how much extra overhead, if any, is incurred by suspending a thread which holds a spinlock while some other, non-suspended thread spins waiting for that lock.

The answer here is that it depends on various factors, mostly the probability of spinlocked resource contention. In my tests, the overhead was under 3%. I could probably reduce this by a factor of four simply by shrinking each transposition table spinlock region to a quarter of its size, that is, by increasing the number of spinlocks per table from 256 to 1,024.

Perhaps we need more data points. Could you try a simple test with Crafty? Build a twin version which is identical except that it thinks it's running on a 64-core processor, and run a few benchmarks which exercise multi-threaded operation. I'll guess that the observed overhead for search will be less than my 3% figure for perft().
The O(1) scheduler is pretty simple. It uses a basic quantum of 100 ms, the last time I looked (maybe a month ago, for my O/S class). This can be ramped up or down via nice: a negative nice value increases it to give you a bigger fair share, a positive nice value decreases it to give you a smaller fair share. I/O activity also increases it: when all the processes in the current queue are blocked, leaving nothing to run because every runnable process has burned its quantum to zero and been moved to expired, the expired processes get swapped back over to current, and the quantum of the blocked (I/O-waiting) processes grows. That doesn't really matter much since they are blocking to do I/O anyway, but it does keep them from expiring, which is when a new priority calculation for a process is done.
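
For what it's worth, the nice lever is easy to pull from inside a program, too (Linux/POSIX only; this is just an illustration, not anything from Crafty):

Code: Select all

#include <cstdio>
#include <sys/resource.h>

// Raise the nice value of the whole process to +10, shrinking its fair
// share; negative values (a bigger share) normally require root.
// Illustration only, not Crafty's code.
int main()
{
    if (setpriority(PRIO_PROCESS, 0, 10) != 0)   // who == 0 means "this process"
        std::perror("setpriority");
    std::printf("nice value is now %d\n", getpriority(PRIO_PROCESS, 0));
    // ... start the compute-bound work here ...
    return 0;
}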

I can't imagine any process scheduler wanting to do a context switch every 4 ms or so when you are choosing between compute-bound processes anyway. That wipes out the TLB, part of the cache, branch target buffer data, and such. The machine would really not perform very well.

BTW there are lightweight kernels around that dispense with this process scheduling stuff. You just run one process with no quantum, etc, for ultra-fast execution.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Thread count limits and core counts

Post by sje »

Consider a traditional multi-threaded, core-count-limited A/B searcher. The program hits a probable CUT or PV node. It could send out a worker thread for each of, say, the first four moves because there are four cores available. But a modified version might send out 32 threads, one for each move. Which version will get the cutoff move first? Or maybe raise alpha first? There are many factors at play; only some testing will really help determine the winner.
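
In pseudo-real code, the two policies differ only in how many workers get launched at the split point (a sketch only; Move, Position, and searchMove() are hypothetical stand-ins, not any engine's actual search):

Code: Select all

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-ins; no real engine code here.
struct Move {};
struct Position {};
int searchMove(const Position&, const Move&, int alpha, int beta,
               std::atomic<bool>& abort)
{
    (void)beta; (void)abort;
    return alpha;                                // stub so the sketch links
}

// Split at a probable CUT/PV node: launch one worker per move, up to
// threadLimit.  The core-count-limited version passes the core count as
// threadLimit; the "inflated" version passes moves.size().
void splitNode(const Position& pos, const std::vector<Move>& moves,
               int alpha, int beta, unsigned threadLimit)
{
    std::atomic<bool> abort{false};
    std::vector<std::thread> workers;
    const unsigned n =
        std::min(threadLimit, static_cast<unsigned>(moves.size()));

    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&, i] {
            const int score = searchMove(pos, moves[i], alpha, beta, abort);
            if (score >= beta)
                abort = true;                    // first cutoff found stops the others
        });

    for (auto& w : workers)
        w.join();
}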
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Thread count limits and core counts

Post by bob »

sje wrote:Consider a traditional multi-threaded, core-count-limited A/B searcher. The program hits a probable CUT or PV node. It could send out a worker thread for each of, say, the first four moves because there are four cores available. But a modified version might send out 32 threads, one for each move. Which version will get the cutoff move first? Or maybe raise alpha first? There are many factors at play; only some testing will really help determine the winner.
Since 90%+ of the cutoffs occur on the first move, and most of what is left occurs on the second, I don't want 32 threads competing for 4 cores, spinning on locks while the threads that hold those locks are waiting for CPU access, etc... Not to mention the extra search nodes this would add.
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Experiments

Post by sje »

Experiments are what's needed, so I added some commands to Symbolic which allow interactive adjustment of the number of threads available for distribution to perft() and other tasks.

This seems to test okay, and in fact I'm using it now to limit the CPU power given to Symbolic while a couple of unrelated processes are running.
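
The mechanism is nothing fancy; roughly this shape (a sketch, not Symbolic's actual command handling; the command and helper names are made up):

Code: Select all

#include <algorithm>
#include <atomic>
#include <string>
#include <thread>

// Sketch of a runtime-adjustable thread cap, not Symbolic's actual code.
// A single atomic limit is read wherever work is distributed, so it can be
// changed from the command loop while the program is running.
std::atomic<unsigned> threadLimit{
    std::max(std::thread::hardware_concurrency(), 1u) };

// Hypothetical handler for a command like "threads 2".
void handleThreadsCommand(const std::string& arg)
{
    const unsigned n = static_cast<unsigned>(std::stoul(arg));
    if (n >= 1)
        threadLimit = n;
}

// Distribution code reads the cap at the moment it launches workers.
unsigned workersForJob(unsigned jobCount)
{
    return std::min(jobCount, threadLimit.load());
}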