Smp concepts

bob · Post by **bob** » Mon Jun 02, 2014 3:53 am

Desperado wrote:Can someone please clarify the term late join.

For me it is simply a thread that joins the party after a group or single
thread is already working on the splitpoint, and it was not considered when
the splitpoint was opened.

Imo there is even no need that the master is the first thread that joins the
party. After the splitpoint has been setup it is flagged as open and does welcome any thread. If the work is done ( no moves left, stopped ) the splitpoint is closed,
which means the worst thing that can happen that the master is running through the SmpSearch() and frees the splitpoint ressource on the splitpoint stack without doing "real" work.
That means further that the splitpoint owner
does not poll for its own splitpoint. The polling is used only if a thread joins
somewhere as slave which includes a waiting master and is solved by a recursive call from the SmpSearch() to the polling function ( commonly named as idleLoop() ) while it is "waiting" ( == polling ).

I don't quite use "polling" as you define it. IE in Crafty you will never find a thread just sitting and spinning waiting for its siblings to finish whatever they are doing. That "spinner" will quickly be invited into one of its sibling's sub-trees as a result of a split operation.

However, in recursive search, there is exactly one thread that can back up through a split point, that being the thread that originally decided to split at that specific point..

The term "late join" as I use it simply means A splits with B and C at some point, and later D or whatever jumps in at the same split point as if it were in the original split operation. This is not that hard to write. I rewrote the code during the 24.0 development, but I found zero improvement (measurable). Absolutely none. I chose to remove it to keep the code simpler and the bugs more infrequent. I can dig the code up if you'd like to see my take, although it might take a little hunting to find the latest...

michiguel · Post by **michiguel** » Mon Jun 02, 2014 5:59 am

syzygy wrote:
bob wrote:He specifically said "idle threads find a split point." That's a problem because a thread really can't choose a split point in a recursive search, that is up to the thread that is actually searching...
My recursive search has been doing it since October 2012 or so.

I figure I have been doing something similar since Gaviota became smp. A have an array of "move queues" (a.k.a. split points). Each element (queue) is created by the master thread, which immediately wakes up "n" working threads. Then, the working threads start looking for jobs (scanning the array of queues). The master goes to sleep waiting its whole queue to be finished. Each of the working threads could become another master at a different level if there are any sleeping working threads, repeating the concept. At any given point, there will be up to n working thread, but there could be more than one master waiting (NOT spinning). I am not claiming this is better or worse than anything else, but it certainly possible that with a given design a thread could choose where to "join".

Miguel

Edsel Apostol · Post by **Edsel Apostol** » Mon Jun 02, 2014 8:31 am

bob wrote:
syzygy wrote:
bob wrote:He specifically said "idle threads find a split point." That's a problem because a thread really can't choose a split point in a recursive search, that is up to the thread that is actually searching...
My recursive search has been doing it since October 2012 or so.
How does an idle thread force a split with a thread that is busy? Unless you use what I consider an inefficient approach where you create arbitrary split points in the busy thread hoping that one or more might actually be used?

In a DTS-like approach it is trivial since there is no call stack to deal with and all threads can see/modify everything in the search.

Perhaps we are talking about two different things, as I don't see how thread B can split with thread A whenever B chooses. A's call stack is already established to take it back ply by ply using a non-sharable block of memory to maintain the position... Being able to do that would essentially result in a recursive DTS algorithm, which seems difficult unless the extra split blocks are dribbled around hoping to catch a thread later.

No way I want to incur that kind of overhead.

In the latest Hannibal using C++11, we are using this approach. It was inspired by Ronald's previous posts, of course.

What we did in Hannibal is this. When a potential splitpoint is encountered, we do a split even if there are no other idle threads to help. The split point data is stored as usual per thread. Later on when other threads are free, they will loop all over the other threads and search for the best split point to join.

This is like the "late join" idea in SF except that it is more advanced. It is more like DTS except that the master is limited in helping worker threads still working for it. In my tests it is better than the SMP in Hannibal 1.4b which is just plain YBWC.

Of course the devil is in the details. Doing split in Hannibal is not that expensive (maybe) as we only have to copy around 2Kb size of position data structure, as well as some initialization stuff like values of alpha, beta, etc.

mar · Post by **mar** » Mon Jun 02, 2014 12:58 pm

Hi Michael, you forgot Lazy SMP

(you can search for some older posts by Dan Homan)
(=basically shared hash table)

Advantages:
- straightforward to implement
- no need for minimum split depth hack
- virtually zero idle times (except for sync at the end of each ID iteration)

Disadvantages:
- considered inferior by many (I disagree, look at ExChess in CCRL for example - there are SMP engines that scale nowhere near that)

syzygy · Post by **syzygy** » Mon Jun 02, 2014 9:06 pm

bob wrote:A's call stack is already established to take it back ply by ply using a non-sharable block of memory to maintain the position...

So you don't take it back:

syzygy wrote:My split() just stores alpha, beta, depth, etc. and sets the flag.
My join() locks the splitpoint, copies the move list from root to split node, unlocks. The joining thread then makes these moves on its own board and starts to search.

syzygy · Post by **syzygy** » Mon Jun 02, 2014 9:17 pm

michiguel wrote:
syzygy wrote:
bob wrote:He specifically said "idle threads find a split point." That's a problem because a thread really can't choose a split point in a recursive search, that is up to the thread that is actually searching...
My recursive search has been doing it since October 2012 or so.
I figure I have been doing something similar since Gaviota became smp. A have an array of "move queues" (a.k.a. split points). Each element (queue) is created by the master thread, which immediately wakes up "n" working threads. Then, the working threads start looking for jobs (scanning the array of queues). The master goes to sleep waiting its whole queue to be finished. Each of the working threads could become another master at a different level if there are any sleeping working threads, repeating the concept. At any given point, there will be up to n working thread, but there could be more than one master waiting (NOT spinning). I am not claiming this is better or worse than anything else, but it certainly possible that with a given design a thread could choose where to "join".

Do I understand correctly that when your master thread goes to sleep it makes sure to wake up a sufficient number of worker threads that the hardware core previously running the master thread will find a worker thread to run?

That would not be the same as what I do, but it is similar to what I was thinking of here. It is natural to equate hardware cores (or hardware threads) to search threads, but doing that means that a waiting master thread "must" be given some work to do or else hardware resources go wasted. Decoupling hardware cores from search threads solves this problem nicely.

My engine does not decouple them, so has to do the helpful master thing.

michiguel · Post by **michiguel** » Mon Jun 02, 2014 9:49 pm

syzygy wrote:
michiguel wrote:
syzygy wrote:
bob wrote:He specifically said "idle threads find a split point." That's a problem because a thread really can't choose a split point in a recursive search, that is up to the thread that is actually searching...
My recursive search has been doing it since October 2012 or so.
I figure I have been doing something similar since Gaviota became smp. A have an array of "move queues" (a.k.a. split points). Each element (queue) is created by the master thread, which immediately wakes up "n" working threads. Then, the working threads start looking for jobs (scanning the array of queues). The master goes to sleep waiting its whole queue to be finished. Each of the working threads could become another master at a different level if there are any sleeping working threads, repeating the concept. At any given point, there will be up to n working thread, but there could be more than one master waiting (NOT spinning). I am not claiming this is better or worse than anything else, but it certainly possible that with a given design a thread could choose where to "join".
Do I understand correctly that when your master thread goes to sleep it makes sure to wake up a sufficient number of worker threads that the hardware core previously running the master thread will find a worker thread to run?

Yes, it wakes up 4 threads in a quad, and then the master (which "is in charge" of the split point) goes to sleep (i.e. wait for the threads to finish the job). So, for a quad, there will be always 4 threads working/searching at different levels and a given number of masters waiting for their split point to finish. I always have more threads than cores, but the excess are always sleeping.

I have done this with semaphores (my own version) because they fit conceptually well, but I am not sure it is the best.

Miguel

That would not be the same as what I do, but it is similar to what I was thinking of here. It is natural to equate hardware cores (or hardware threads) to search threads, but doing that means that a waiting master thread "must" be given some work to do or else hardware resources go wasted. Decoupling hardware cores from search threads solves this problem nicely.

My engine does not decouple them, so has to do the helpful master thing.

Desperado · Post by **Desperado** » Tue Jun 03, 2014 9:11 am

Hi Martin, i will have a look at it,thx.
But finally i will follow the way which was most intiutive to me if it does not turn out to be unmanageable.

But you give the next keyword i am interested in.

Did anyone used local hash tables instead of a shared ones.
The point is "used". I already followed some ideas to make the hash
tables lockless and so on. But beside some natural ideas why using
a shared hash table is a good idea, i am not sure if local hash tables
might be equally efficient or even more efficient. Trees today are very
deep and branchy, so maybe a shared ressource does more handicap
than help.

Beside, thank you guys so far, giving me some useful ideas, details to think about. ( Unfortunately i can post very rarely during the week )

jdart · Post by **jdart** » Tue Jun 03, 2014 1:53 pm

I have some small local tables (pawn scoring cache for example). But the main hash table is shared.

Local caches do not necessarily help with NUMA. Your thread can go idle and later when it is awakened its code may be assigned to a different NUMA node. But most OSs do not migrate its memory to the new node. So that thread's code is now doing non-local memory access, even if initially it was not.

--Jon

bob · Post by **bob** » Tue Jun 03, 2014 7:27 pm

syzygy wrote:
bob wrote:A's call stack is already established to take it back ply by ply using a non-sharable block of memory to maintain the position...
So you don't take it back:
syzygy wrote:My split() just stores alpha, beta, depth, etc. and sets the flag.
My join() locks the splitpoint, copies the move list from root to split node, unlocks. The joining thread then makes these moves on its own board and starts to search.

Then when I return to that point in the tree, I will be using a pointer to the WRONG split block.

I don't have this "master waiting" problem in Crafty. ANY thread can split with ANY thread, no matter where the split is done. At an earlier ply, at a later ply, etc.

The only thing I don't like about the current implementation is that at times there is a little wait time lost waiting on one (or more) threads to satisfy the YBW condition at the current ply they are on. However, the alternative is still problematic (as in Cray Blitz) because if you split before YBW is satisfied, you risk a bad split which is wasted effort. I've compared Crafty and Cray Blitz in that regard and the CB approach looks better, but it is hard to say how MUCH better it would be in practice. But I suppose I am going to find out.

In Crafty, thread N has an associated split block. In order to do a split, it copies the relevant parts of the current split block to new split blocks, one per added thread plus a new one for itself. Now all threads have their own split block, except when doing something at the split point, where they use the parent split block (which is obviously shared).

This makes it easy for me to figure out who to "stop" when I discover a fail-high that wasn't expected, the split block "tree" shows me who is helping me directly (all have to be stopped) as well as who is helping anyone that is helping me (who also have to be stopped, recursively).

Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts

Re: Smp concepts