Smp concepts

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Smp concepts

Post by Desperado »

Hi everybody,

A long time has passed, and now I want to spend some time on chess programming again.
Not sure yet where to take the time from, but I guess with less sleep there will be a way ;-).

It is high time to solve my SMP puzzle, so I have some questions about SMP program flow concerning
the different approaches I got to work. In this thread I don't want to ask too many questions about implementation
details, but rather about the basic concepts I figured out by myself. I am not sure which way to go because
of my limited experience in this area. Here I go; maybe you would like to comment on the ideas or even add some others.

Some parameters I would like to know more about in the context of the following approaches:

* SPLIT_DEPTH
* MAX_THREADS_PER_SPLITPOINT
* MAX_SPLITPOINTS
* FORWARD_SPLITTING
* SHARED_DATA ( TRANS_TABLE )
* SPLITPOINT CHAINS
* SPLITPOINT PRIORITIZATION
* SYNCHRONOUS vs. ASYNCHRONOUS MULTITHREADING

Approach 1: Polling Threads ( Ants :-) )
=========================================

This was the most intuitive approach to me. All helpers loop while the engine
is in search mode (burning cycles) and keep checking some kind of shared split point stack.
If the conditions to join a split point are fulfilled, a thread works as long as there is work
to do on this split point. The eye catchers have been:

- a thread can join anytime, anywhere, as long as there are open split points (no idle time waiting to be signaled)

- Even a helping master can join not only its own slaves but also elsewhere in the tree. Joining a slave of the split point is optional.

Well, here the question is whether to turn this option into a requirement, so that the current split point is finished as soon as possible.

- MAX_THREADS_PER_SPLITPOINT is harder to control, because if a split point is already finished by fewer than MAX_THREADS_PER_SPLITPOINT threads,
it has to be ensured that no further thread joins even though the synchronisation condition has already occurred.
(E.g.: two threads solved the split point. The master waits for its slave. Now the split point is being left while another thread would like to join.
This is not a big problem, but it differs from other approaches where the master books its slaves.)

- the controls are done via the split point objects. Threads don't care about the states of other threads.

Now, if someone has experience with this approach, negative or positive, I would like to hear about it, especially regarding
the points listed above.
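
To make the polling idea a bit more concrete, here is a minimal C++ sketch of how such helper loops could look. All names (SplitPoint, tryJoin, searchSplitPoint, ...) are only illustrative placeholders, not code from a real engine, and the bookkeeping is reduced to the bare minimum:

Code: Select all

// Sketch of Approach 1 ("Ants"): helper threads poll a shared split point list.
// All names are illustrative; a real engine stores much more per split point.
#include <atomic>
#include <mutex>
#include <vector>

struct SplitPoint {
    std::mutex lock;
    std::atomic<int> joined{0};      // threads currently working here
    std::atomic<int> movesLeft{0};   // moves still to be handed out
    int maxThreads = 8;              // MAX_THREADS_PER_SPLITPOINT

    bool tryJoin() {                 // join only while there is work and room
        std::lock_guard<std::mutex> g(lock);
        if (movesLeft <= 0 || joined >= maxThreads) return false;
        ++joined;
        return true;
    }
    void leave() { --joined; }
};

std::vector<SplitPoint*> openSplitPoints;   // the shared "split point stack"
std::mutex stackLock;
std::atomic<bool> searching{true};          // engine is in search mode

SplitPoint* pollForSplitPoint() {           // burn cycles until work shows up
    std::lock_guard<std::mutex> g(stackLock);
    for (SplitPoint* sp : openSplitPoints)
        if (sp->tryJoin()) return sp;
    return nullptr;
}

void searchSplitPoint(SplitPoint* sp) {     // placeholder for the real search
    while (sp->movesLeft.fetch_sub(1) > 0) { /* make move, search it, undo */ }
    sp->leave();
}

void helperLoop() {                         // each helper thread runs this loop
    while (searching) {
        if (SplitPoint* sp = pollForSplitPoint())
            searchSplitPoint(sp);           // work as long as there is work
        // otherwise just keep polling; no idle/signal handshake in this approach
    }
}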

Approach 2: Staking Master Concept (SMC)
=========================================

I guess this is the most commonly used approach. When doing a split, the master checks which threads are available;
these are then staked by the master to the split point. The eye catchers have been:

- the helpers have to be set up and signaled / state-changed by the master.
- Splits have to be avoided when no thread is available. Here the idea of late joins may improve the situation,
meaning that after a thread returns from a sub-task, there is another check for free threads that can join.
- Threads can stay in an idle state until they are signaled.
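
Reduced to the essentials, a staking master with a late-join check could look something like the sketch below. Again, all names (Helper, stakeIdleHelpers, pickNextMove, ...) are only illustrative and the real bookkeeping is left out:

Code: Select all

// Sketch of Approach 2 (SMC): the master books idle helpers onto its split point
// and re-checks for newly idle helpers after every searched move ("late join").
// All names and structures here are illustrative only.
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <vector>

struct SplitPoint { /* alpha, beta, depth, shared move list, slave count, ... */ };

struct Helper {
    std::mutex m;
    std::condition_variable cv;
    SplitPoint* work = nullptr;         // nullptr means the helper has nothing to do
    std::atomic<bool> idle{true};
};

std::vector<Helper*> helpers;           // the global thread pool

bool pickNextMove(SplitPoint*) { return false; }  // stub: hand out the next move
void searchMove(SplitPoint*)   {}                 // stub: search that move

// Master side: stake every currently idle helper to this split point.
void stakeIdleHelpers(SplitPoint* sp) {
    for (Helper* h : helpers) {
        bool expected = true;
        if (h->idle.compare_exchange_strong(expected, false)) {
            std::lock_guard<std::mutex> g(h->m);
            h->work = sp;
            h->cv.notify_one();         // wake the helper; it searches sp's moves
        }
    }
}

// The master searches moves itself, but after each move it checks again for
// helpers that became idle in the meantime (the "late join").
void splitSearch(SplitPoint* sp) {
    stakeIdleHelpers(sp);               // initial staking at split time
    while (pickNextMove(sp)) {
        searchMove(sp);
        stakeIdleHelpers(sp);           // late join check
    }
    // ... then wait for (or help) the staked slaves before returning the result
}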

Approach 3: Asynchronous Anytime Anywhere (AAA)
==============================================

I have not tested this approach yet. It is a modification of Ants, where the idea is
that 'anytime anywhere' means that a master can exit the split point immediately. The only thing that
would have to change is that the path to a split point must be replayed by (at least) one slave, so
the recursion stack is rebuilt. In that case any thread can exit anytime and join anytime anywhere, or do whatever it likes.
It is like handing over the current tree state so that the thread dependencies are dissolved.

- copy the complete tree state for any split point.
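
As a rough illustration of what "copy the complete tree state" would mean, the data carried per split point could look like this (purely illustrative names, nothing implemented yet):

Code: Select all

// Sketch of the state a split point would have to carry in Approach 3 (AAA):
// enough of the tree state that any thread can replay the path from the root
// and take over the responsibility for every ply on it. Illustrative only.
#include <cstdint>
#include <vector>

struct PlyState {
    int alpha, beta, depth;
    int bestScore;
    uint16_t move;                  // move played at this ply on the path
    int movesTried;                 // where the move loop of this ply left off
};

struct TreeState {
    std::vector<PlyState> path;     // root ... split point, one entry per ply
};

// Publishing a split point under AAA means snapshotting the whole path, not
// just the split node, so the master may leave the split point immediately.
TreeState snapshotTreeState(const std::vector<PlyState>& searchStack) {
    return TreeState{searchStack};  // deep copy of the recursion state
}

// A thread taking over replays path[i].move on its own board and then continues
// every move loop from path[i].movesTried upward; the original master no longer
// owes anybody a return value.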


Stressing SMP controls
========================

No matter where the bottlenecks of any approach are, I guess that the wider the tree is spanned, the less important
the stress factors become, simply because with a limited number of split points the fraction of time a thread spends really working gets higher.
So communication and organization overhead shrinks.

Finally, the catchwords mentioned above are not explained in detail, because I guess each of them will be
discussed within this thread anyway; if not, there is no need to explain them either :-).

Happy to be back again and curious about your replies :D

Michael
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Smp concepts

Post by syzygy »

Desperado wrote:- Even a helping master can join not only its own slaves but also elsewhere in the tree. Joining a slave of the split point is optional.
Joining anywhere else is problematic and not really necessary, I think.

As long as you keep a recursive search, responsibility for the split node remains with the master. Only that thread can return the result of the search on that node to the parent node.

Once all the slaves of the master have finished their work, it seems to make sense that the master should continue the search on its branch of the tree as soon as possible. If the master is busy somewhere else in the tree, that means it would have to be aborted. Very messy. The alternative is to wait until it has finished helping the other master. Intuitively that doesn't seem optimal.

If you restrict a master to helping its slaves, it is guaranteed that the master is finished and ready to return the search result to the parent node once all its slaves are finished.

My "seems to" and "intuitively" might be wrong here. It is certainly possible to experiment. But I don't think there is any fundamental difference here with the more common non-polling approach. The master having to help its slaves has to do with the recursive nature of the search.

Btw, ignoring context switching overhead another solution would be to let an idle master sleep and in its place spawn (or wake up) a reserve thread. The hardware core that was running the master thread will switch to running that reserve thread and that one can join wherever it likes. You would then have to make sure that the last slave to finish searching the split node will go to sleep (in the reserve thread queue) and the master is woken up so that the hardware core running that slave thread can switch to running the master thread. Effectively this overcomes the single limitation that a recursive search has.

Instead of letting threads sleep and wake up (and hoping the OS does the right thing) it might be possible to use lower level context switching primitives.
Well, here the question is whether to turn this option into a requirement, so that the current split point is finished as soon as possible.
OK, you had already in mind what I just wrote ;-).
- MAX_THREADS_PER_SPLITPOINT is harder to control, because if a split point is already finished by fewer than MAX_THREADS_PER_SPLITPOINT threads,
it has to be ensured that no further thread joins even though the synchronisation condition has already occurred.
(E.g.: two threads solved the split point. The master waits for its slave. Now the split point is being left while another thread would like to join.
This is not a big problem, but it differs from other approaches where the master books its slaves.)
I don't see a need for limiting the number of threads per splitpoint. The whole point of the "polling" approach is to join splitpoints that are as close to the root as possible. Each move that you are searching there represents a decent amount of work. There is no problem in letting each of these moves be searched by a different thread.
Now, if someone has experience with this approach, negative or positive, I would like to hear about it, especially regarding
the points listed above.
In my engine, each position (at each ply of the search) has a splitpoint data structure. Once the first move has been searched (for non-PV nodes: after captures and killers), a flag is set and other threads can join it. Each thread has a 64-bit mask that shows at which plies potential split points can be found. Once a split point runs out of moves, the corresponding bit is cleared.
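
In rough pseudo-C++ (the names here are made up for illustration, it is not my actual code) the mask handling looks more or less like this:

Code: Select all

// Per-thread 64-bit mask of plies with joinable split points (illustrative names).
#include <cstdint>

constexpr int MAXPLY = 64;             // one mask bit per ply

struct Thread {
    uint64_t splitMask = 0;            // bit p set: ply p has a joinable split point
    // SplitPoint stack[MAXPLY];       // one split point structure per ply
};

// After the first move (for non-PV nodes: after captures and killers) is done:
inline void openSplitPoint(Thread& t, int ply)  { t.splitMask |= 1ULL << ply; }

// When the split point at that ply runs out of moves:
inline void closeSplitPoint(Thread& t, int ply) { t.splitMask &= ~(1ULL << ply); }

// A joining thread prefers split points close to the root, i.e. the lowest set
// bit of another thread's mask (__builtin_ctzll is the GCC/Clang intrinsic).
inline int lowestOpenPly(const Thread& t) {
    return t.splitMask ? __builtin_ctzll(t.splitMask) : -1;
}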

It seems to work fine for my engine, but I cannot compare to other approaches. I am currently trying to hack my approach into Stockfish to see how it works there.
Graham Banks
Posts: 41415
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Smp concepts

Post by Graham Banks »

Hi Michael,

great to see you back! :)

Graham.
gbanksnz at gmail.com
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Smp concepts

Post by Desperado »

... responsibility for the split node remains with the master ...
This is why I already mentioned approach 3. Although the responsibility could be switched to another thread, which gives more flexibility, I am not sure what to do with that extra flexibility. :shock: :lol:
Btw, ignoring context switching overhead another solution would be to let an idle master sleep and in its place spawn (or wake up) a reserve thread. The hardware core that was running the master thread will switch to running that reserve thread and that one can join wherever it likes. You would then have to make sure that the last slave to finish searching the split node will go to sleep (in the reserve thread queue) and the master is woken up so that the hardware core running that slave thread can switch to running the master thread. Effectively this overcomes the single limitation that a recursive search has.
I don't understand what this would be good for, or whether I understand you correctly at all. The master can simply...

Code: Select all


LoopThread(ExitCondition)
{
   // Helper threads polling for open split points
   while (true)
   {
      // Join as a slave at any open split point.
      // Alternatively: GetAvailableSplitpointFromSlave()
      Splitpoint *join = GetAvailableSplitpoint();
      if (join)
      {
         SearchSplitpoint(join);   // work here until the split point has no moves left
      }

      if (ExitCondition)
      { return; }
   }
}

SplitSearch()
{
  ...

  // Waiting / helping master: instead of idling, keep polling for split points
  // until all slaves of this split point are done
  while (waitingForSlaves)
  {
    LoopThread(WithExitCondition);
  }
}
poll for a split point like any other thread. No context switch is necessary. Am I missing something?
I don't see a need for limiting the number of threads per splitpoint. The whole point of the "polling" approach is to join splitpoints that are as close to the root as possible. Each move that you are searching there represents a decent amount of work. There is no problem in letting each of these moves be searched by a different thread.
Hmm, I don't see why this is exclusive to the polling approach.
In my engine, each position (at each ply of the search) has a splitpoint data structure. Once the first move has been searched (for non-PV nodes: after captures and killers), a flag is set and other threads can join it. Each thread has a 64-bit mask that shows at which plies potential split points can be found. Once a split point runs out of moves, the corresponding bit is cleared.
Does that mean you have something like splitpoint[THREADS][MAXPLY]? :?
Or do you simply organize it over the search stack?

Code: Select all

void Split(void)
{
   Splitpoint *s = new Splitpoint(); // maybe one per thread ??
}
Thank you for your first thoughts!
flok

Re: Smp concepts

Post by flok »

I wonder what category my approach fits in:

When going through the move list, see if any of the threads is idle. If one is, ask it to search the current move; if not, search the move in the current thread context. Note that in this method threads can also start threads. So you want to organize the threads in groups to prevent lock contention on the thread administration.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Smp concepts

Post by Desperado »

flok wrote:I wonder what category my approach fits in:

When going through the move list, see if any of the threads is idle. If one is, ask it to search the current move; if not, search the move in the current thread context. Note that in this method threads can also start threads. So you want to organize the threads in groups to prevent lock contention on the thread administration.
Hi, isn't this similar to the late-join optimization I described above? In the end, the threads are signaled by the master to search after the work has been assigned. So, something like

Code: Select all


SplitSearch()
{
  // Master checks for available (idle) threads and stakes them to the split point
  CheckAvailableThreadAndStake();

  while( SmpGetMoves() )
  {
    search();

    // late joining: stake further helpers that became available in the meantime
    CheckAvailableThreadAndStake();
  }
}

So, the category would be SMC (staking master concept), I would say. It is not important whether it is done before working on the move list or additionally within the move list. Are there details that differ?

In the other category the threads ignore the states of all other threads and simply organize their work themselves.

Thx
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Smp concepts

Post by syzygy »

Desperado wrote:
Btw, ignoring context switching overhead another solution would be to let an idle master sleep and in its place spawn (or wake up) a reserve thread. The hardware core that was running the master thread will switch to running that reserve thread and that one can join wherever it likes. You would then have to make sure that the last slave to finish searching the split node will go to sleep (in the reserve thread queue) and the master is woken up so that the hardware core running that slave thread can switch to running the master thread. Effectively this overcomes the single limitation that a recursive search has.
I don't understand what this would be good for, or whether I understand you correctly at all.
What I describe is an alternative solution to the "problem" of an idle master thread.

My main point is this: one can, but does not have to, equate hardware cores (or hardware threads when using HT) to search threads. It is not so important that search threads are always busy. It is important that hardware cores are always busy.

The "helpful master" concept tries to keep the master thread busy while it waits for its slaves to finish. Instead of trying to keep the master thread busy, it would make sense to instead let the hardware core running the master thread do something useful, by switching it to another search thread. Conceptually and from a programming point of view this might be much simpler.

So conceptually this is what would be happening: the master thread simply waits until its slaves are done. (But the hardware core previously running the master thread does not sit still. It just runs another search thread. No hardware resources are wasted.)

This could be implemented by letting threads sleep and waking up other threads, hoping the OS will distribute running threads wisely over the available hardware cores. More efficient would probably be to implement it using some form of user-level threading / coroutines.
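
Just to illustrate the simpler sleep/wake variant (illustrative names, the reserve thread queue itself is not shown):

Code: Select all

// Sketch of the sleep/wake variant: an idle master wakes a reserve search thread
// and sleeps; the last slave to finish wakes the master and parks itself.
#include <condition_variable>
#include <mutex>

struct SplitPoint {
    std::mutex m;
    std::condition_variable cv;
    int activeSlaves = 0;
    bool masterSleeping = false;
};

void wakeOneReserveThread() { /* pop a thread from the reserve queue (not shown) */ }
void parkAsReserveThread()  { /* push the caller onto the reserve queue (not shown) */ }

// Master: instead of helping, hand the hardware core to a reserve thread and wait.
void masterWaitsForSlaves(SplitPoint& sp) {
    std::unique_lock<std::mutex> lk(sp.m);
    if (sp.activeSlaves > 0) {
        sp.masterSleeping = true;
        wakeOneReserveThread();                       // keep the hardware core busy
        sp.cv.wait(lk, [&] { return sp.activeSlaves == 0; });
        sp.masterSleeping = false;
    }
    // the master now returns the split node's result to the parent node
}

// Slave: called when it finishes its last move at this split point.
void slaveFinished(SplitPoint& sp) {
    std::unique_lock<std::mutex> lk(sp.m);
    if (--sp.activeSlaves == 0 && sp.masterSleeping) {
        sp.cv.notify_one();                           // hand the core back ...
        lk.unlock();
        parkAsReserveThread();                        // ... and park this thread
    }
}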

It seems setjmp/longjmp does not work across threads, but suppose it did.

If the master thread becomes idle, it could setjmp() its state and longjmp() to a reserve search thread state that can then join anywhere in the tree.

When the last slave of the master thread becomes idle, it setjmp()s its idle state and longjmp()s to the state of the master thread. It is now the master thread and continues the search of the parent node.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Smp concepts

Post by syzygy »

Desperado wrote:
I don't see a need for limiting the number of threads per splitpoint. The whole point of the "polling" approach is to join splitpoints that are as close to the root as possible. Each move that you are searching there represents a decent amount of work. There is no problem in letting each of these moves be searched by a different thread.
Hmm, I don't see why this is exclusive to the polling approach.
Imo the "polling approach" makes it even less necessary to have a restriction on the number of threads per splitpoint. But it seems to me that such a restriction is almost always bad anyway, polling or non-polling. (It could make sense at splitpoints that are at the minimum split depth. But increasing minimum split depth by 1 would more or less do the same thing.)
In my engine, each position (at each ply of the search) has a splitpoint data structure. Once the first move has been searched (for non-PV nodes: after captures and killers), a flag is set and other threads can join it. Each thread has a 64-bit mask that shows at which plies potential split points can be found. Once a split point runs out of moves, the corresponding bit is cleared.
Does that mean you have something like splitpoint[THREADS][MAXPLY]? :?
Or do you simply organize it over the search stack?
It's part of the search stack, namely of the "copy/make" array of the position structure. But it is equivalent to splitpoint[THREADS][MAXPLY], indeed.

Code: Select all

void Split(void)
{
   Splitpoint *s = new Splitpoint(); // maybe one per thread ??
}
My split() just stores alpha, beta, depth, etc. and sets the flag.
My join() locks the splitpoint, copies the move list from root to split node, unlocks. The joining thread then makes these moves on its own board and starts to search.
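
Stripped down to pseudo-C++ it amounts to something like this (illustrative names; in reality the path comes from the copy/make array rather than being stored in the split point):

Code: Select all

// Simplified split()/join(): split stores the bounds and sets the flag, join
// copies the move path under the lock so the joiner can replay it on its board.
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

struct SplitPoint {
    std::mutex m;
    std::atomic<bool> joinable{false};
    int alpha = 0, beta = 0, depth = 0;
    std::vector<uint16_t> pathFromRoot;   // moves leading to the split node
};

void split(SplitPoint& sp, int alpha, int beta, int depth,
           const std::vector<uint16_t>& path) {
    sp.alpha = alpha;                     // store the search bounds etc. ...
    sp.beta  = beta;
    sp.depth = depth;
    sp.pathFromRoot = path;
    sp.joinable = true;                   // ... and set the flag
}

// The joining thread copies the path under the lock, then (outside the lock)
// makes these moves on its own board and starts searching at the split node.
std::vector<uint16_t> join(SplitPoint& sp) {
    std::lock_guard<std::mutex> g(sp.m);
    return sp.pathFromRoot;
}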

Btw, I haven't really tried to read and understand your third approach, yet.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Smp concepts

Post by syzygy »

Desperado wrote:Approach 3: Asynchronous Anytime Anywhere (AAA)
==============================================

I have not tested this approach yet. It is a modification of Ants, where the idea is
that 'anytime anywhere' means that a master can exit the split point immediately. The only thing that
would have to change is that the path to a split point must be replayed by (at least) one slave, so
the recursion stack is rebuilt. In that case any thread can exit anytime and join anytime anywhere, or do whatever it likes.
It is like handing over the current tree state so that the thread dependencies are dissolved.

- copy the complete tree state for any split point.
I think this is a bit like my "alternative solution" (which I have never implemented), except that my alternative solution focuses more on the hardware core/thread that was running the master thread.

Rewinding a recursive search and having it rebuilt by another thread might be possible, but is certainly much more complicated than using a context switch mechanism to let another hardware core/thread take over the master (software) thread.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Smp concepts

Post by bob »

syzygy wrote:
Desperado wrote:- Even a helping master can join not only its own slaves but also elsewhere in the tree. Joining a slave of the split point is optional.
Joining anywhere else is problematic and not really necessary, I think.

As long as you keep a recursive search, responsibility for the split node remains with the master. Only that thread can return the result of the search on that node to the parent node.

Once all the slaves of the master have finished their work, it seems to make sense that the master should continue the search on its branch of the tree as soon as possible. If the master is busy somewhere else in the tree, that means it would have to be aborted. Very messy. The alternative is to wait until it has finished helping the other master. Intuitively that doesn't seem optimal.

If you restrict a master to helping its slaves, it is guaranteed that the master is finished and ready to return the search result to the parent node once all its slaves are finished.

My "seems to" and "intuitively" might be wrong here. It is certainly possible to experiment. But I don't think there is any fundamental difference here with the more common non-polling approach. The master having to help its slaves has to do with the recursive nature of the search.

Btw, ignoring context switching overhead another solution would be to let an idle master sleep and in its place spawn (or wake up) a reserve thread. The hardware core that was running the master thread will switch to running that reserve thread and that one can join wherever it likes. You would then have to make sure that the last slave to finish searching the split node will go to sleep (in the reserve thread queue) and the master is woken up so that the hardware core running that slave thread can switch to running the master thread. Effectively this overcomes the single limitation that a recursive search has.

Instead of letting threads sleep and wake up (and hoping the OS does the right thing) it might be possible to use lower level context switching primitives.
Well, here the question is whether to turn this option into a requirement, so that the current split point is finished as soon as possible.
OK, you had already in mind what I just wrote ;-).
- MAX_THREADS_PER_SPLITPOINT is harder to control, because if a split point is already finished by fewer than MAX_THREADS_PER_SPLITPOINT threads,
it has to be ensured that no further thread joins even though the synchronisation condition has already occurred.
(E.g.: two threads solved the split point. The master waits for its slave. Now the split point is being left while another thread would like to join.
This is not a big problem, but it differs from other approaches where the master books its slaves.)
I don't see a need for limiting the number of threads per splitpoint. The whole point of the "polling" approach is to join splitpoints that are as close to the root as possible. Each move that you are searching there represents a decent amount of work. There is no problem in letting each of these moves be searched by a different thread.
Now, if someone has experience with this approach, negative or positive, I would like to hear about it, especially regarding
the points listed above.
In my engine, each position (at each ply of the search) has a splitpoint data structure. Once the first move has been searched (for non-PV nodes: after captures and killers), a flag is set and other threads can join it. Each thread has a 64-bit mask that shows at which plies potential split points can be found. Once a split point runs out of moves, the corresponding bit is cleared.

It seems to work fine for my engine, but I cannot compare to other approaches. I am currently trying to hack my approach into Stockfish to see how it works there.
There are multiple issues that make limiting the number of threads at one split point beneficial.

1. NUMA machines, which all multi-socket machines today seem to be. Splitting across NUMA nodes incurs overhead.

2. 24-core machines are not that uncommon. In endgames, many positions do not even have 24 legal moves. Asking for more threads than there are legal moves simply ramps up overhead.

3. Limiting the number of threads limits contention for shared-memory resources, which reduces overhead.

I do allow everyone to split near the root with no limit, but nearer to the tips I impose a limit.
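
Purely as an illustration (the depth cutoff and the cap are made-up numbers, not my actual values), such a policy boils down to something like:

Code: Select all

// Illustrative only: "no limit near the root, a cap near the tips" expressed as
// a per-split-point thread limit. The cutoff and the cap are made-up numbers.
int maxThreadsAtSplitPoint(int remainingDepth, int totalThreads) {
    if (remainingDepth >= 12)   // near the root: each move is a lot of work
        return totalThreads;    // no limit
    return 4;                   // nearer the tips: cap the number of helpers
}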