SMP thread goes here

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:The same story all over again. So operating systems are crappy... It doesn't seem so difficult to program decent signals. Oh well, let's not go into that...

Yes, your polling is ultra-simple code. But of course you need lots of code lines to set up the linked lists of split blocks. In my scheme that part is nonexistent. And later you have to walk that tree to send all the signals. That is not trivial either.
Here's the code:

Code: Select all

void ThreadStop(TREE * RESTRICT tree)
{
  int proc;

  Lock(tree->lock);
  tree->stop = 1;                       /* tell the thread searching this split block to stop */
  for (proc = 0; proc < shared->max_threads; proc++)
    if (tree->siblings[proc])           /* recursively stop anyone helping at this block      */
      ThreadStop(tree->siblings[proc]);
  Unlock(tree->lock);
#if defined(DEBUGSMP)
  Lock(shared->lock_io);
  Print(128, "thread %d (block %d) being stopped by beta cutoff.\n",
      tree->thread_id, FindBlockID(tree));
  Unlock(shared->lock_io);
#endif
}
So actually, that is pretty trivial. The main point is that the above is not done very often. Ditto for the code to thread the split blocks together so that the above works. It adds 2 lines of code where the split is done and the cost is zero...

Of course you can argue that these lines are almost never executed, but that doesn't make the code any simpler. A bug there would still bring the program to a grinding halt, even if it is executed once per game...
See above. It _really_ is trivial code. For each sibling, I call ThreadStop(), which sets that thread's stop flag for the split block in question and then recursively calls ThreadStop() for any siblings that are helping it...
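
For illustration only, a sketch of what those two split-time lines could look like (this is not the actual Crafty split code; it just reuses the TREE fields that appear in ThreadStop() above):

Code: Select all

  /* at split time: link the helper's split block under the block it helps,
     so that ThreadStop() above can find it and set its stop flag            */
  child->parent = parent;
  parent->siblings[child->thread_id] = child;

  /* ... and when the helper detaches again: */
  parent->siblings[child->thread_id] = 0;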

My probe line would be something like

Code: Select all

if(!count--)
{ if(flag[nr]) { count=0; return(0); }   /* an abort flag is set: unwind      */
   count = PROBE_INTERVAL;               /* otherwise re-arm the countdown    */
   if(!nr--) nr = MaxSplitNode;          /* and rotate to the next flag       */
}
Not so horribly complex either, and most of the time it is only a decrement and a branch. And it would save lots of complexity elsewhere.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

Well, so it is all approximately equally trivial, and if there are performance differences they are on the order of 0.1% or so.

I don't see that as a reason to abandon the first ideas that come into my mind and copy others instead. For that, my own ideas would have to be wildly inefficient (say, result in an overall slowdown of 10% or more) or much more complicated (require five times as much code). In this case my code does not even seem longer. And even if it is not much shorter, the fact that it is somewhat shorter, and that it is the way of solving it that comes naturally to me, seems more than enough compensation for the tiny performance difference it might cause.

The most important thing for me to know was whether polling is preferable to signaling, and that question seems resolved. How exactly to poll seems more a matter of taste, and not worth very deep philosophical considerations.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:Well, so it is all approximately equally trivial, and if there are performance differences they are on the order of 0.1% or so.

I don't see that as a reason to abandon the first ideas that come into my mind and copy others instead. For that, my own ideas would have to be wildly inefficient (say, result in an overall slowdown of 10% or more) or much more complicated (require five times as much code). In this case my code does not even seem longer. And even if it is not much shorter, the fact that it is somewhat shorter, and that it is the way of solving it that comes naturally to me, seems more than enough compensation for the tiny performance difference it might cause.

The most important thing for me to know was whether polling is preferable to signaling, and that question seems resolved. How exactly to poll seems more a matter of taste, and not worth very deep philosophical considerations.
My point was that anything done at _every_ node needs careful examination. Polling/checking a single value carries the risk of a cache miss; checking two doubles that risk, however small it might be. On 8-way and 16-way boxes I have seen a huge number of active split points (I have exhausted 256 split blocks on an 8-way system, for example). That requires a lot of tests in your approach, done at every single node. Those are the kinds of things that cause an unexpected speed decrease when something unrelated is changed, because one of those frequently executed operations suddenly becomes a repeated cache miss where it wasn't before.

I continually think about the case where I have 16 or more processors, which causes me to do things differently than if I were planning only for 2 or 4 CPUs. On bigger machines, small inefficiencies turn into significant penalties.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

But is it really necessary to poll at every node? If you have only a thousand or so aborts per search, and each CPU does 1M nps, checking once every 1000 nodes makes you discover aborts on average 0.5 ms late, so a CPU that participated in all 1000 aborts still loses only 0.5 sec (out of 30?).
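
Spelling out the arithmetic behind those numbers (the 30 s is just an assumed total search time):

Code: Select all

/* back-of-envelope check of the figures above */
double nps      = 1e6;                     /* nodes per second per CPU             */
double interval = 1000;                    /* nodes between two polls              */
double delay    = (interval / 2) / nps;    /* average lateness per abort = 0.5 ms  */
double aborts   = 1000;                    /* aborts one CPU participates in       */
double lost     = aborts * delay;          /* = 0.5 s, i.e. ~1.7% of a 30 s search */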
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:But is it really necessary to poll at every node? If you have only a thousand or so aborts per search, and each CPU does 1M nps, checking once every 1000 nodes makes you discover aborts on average 0.5 ms late, so a CPU that participated in all 1000 aborts still loses only 0.5 sec (out of 30?).
You could do that. But the counter takes just as long to decrement as it does to simply fetch the abort flag and test for != zero. :)
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: SMP thread goes here

Post by hgm »

True, but you could put it in the same cache line as the other (private) variables that are used every node, so you can be sure it is always an L1 hit. (Or, if it were a miss due to an interrupt or task switch, you would have had to take that miss anyway on behalf of the other variables.) For a flag somewhere in shared memory that would not be the case.

You could also put the polling in a piece of code that is executed only at nodes with a larger remaining depth. E.g. if you have a conditional statement that decides whether LMR should be done, one of its conditions would be something like if(d>5). And of course you could refrain from polling in QS, putting the code where the non-captures are generated rather than at the entry of the node.
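
As a rough sketch of the idea (illustrative only; ThreadData, POLL_INTERVAL, TimeToPoll and the field names are made up, not taken from any engine):

Code: Select all

#define POLL_INTERVAL 1000

/* keep the countdown in the same cache line as per-thread data that is
   touched every node anyway, so polling it never adds a cache miss of its own */
struct ThreadData {
  int pollCount;                  /* countdown until the next abort check      */
  int ply;                        /* ...other hot, thread-private state...     */
  int nodes;
  int killer[2];
} __attribute__((aligned(64)));   /* one 64-byte cache line (GCC/Clang)        */

/* poll only at nodes with enough remaining depth, e.g. next to the LMR test */
int TimeToPoll(struct ThreadData *td, int depth)
{
  if (depth > 5 && --td->pollCount <= 0) {
    td->pollCount = POLL_INTERVAL;
    return 1;                     /* caller now does the real shared-memory check */
  }
  return 0;
}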
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

hgm wrote:True, but you could put it in the same cache line as the other (private) variables that are used every node, so you can be sure it is always an L1 hit. (Or, if it were a miss due to an interrupt or task switch, you would have had to take that miss anyway on behalf of the other variables.) For a flag somewhere in shared memory that would not be the case.

You could also put the polling in a piece of code that is executed only at nodes with a larger remaining depth. E.g. if you have a conditional statement that decides whether LMR should be done, one of its conditions would be something like if(d>5). And of course you could refrain from polling in QS, putting the code where the non-captures are generated rather than at the entry of the node.
If you don't poll often enough, you end up with wasted time. It is difficult to predict when a sub-tree will go quickly versus when one will explode and take much longer to search, causing the wait time to build up.

I'm not sure I follow your comment about a flag in shared memory. All memory is cached, and just because something sits in shared memory doesn't mean it is actually accessed by other processors. I keep all my "split blocks" in shared memory, although on NUMA machines I try to allocate memory so that the split blocks are scattered across each processor's local (shared) memory. That way a process accesses its own local memory for these critical values, even though all processors can still read or write them when needed (as when setting the abort flag).
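
Not Crafty's actual allocation code, but a sketch of that kind of NUMA-aware placement using libnuma (SplitBlock and AllocSplitBlocks are made-up names for illustration):

Code: Select all

#include <stdlib.h>
#include <numa.h>                        /* libnuma; link with -lnuma               */

typedef struct {                         /* stand-in for an engine's split block    */
  volatile int stop;
  /* ... bounds, move list, sibling pointers, ... */
} SplitBlock;

/* Put one processor's split blocks into that processor's local memory.  They
   remain visible to every CPU (anyone may set the stop flag), but the thread
   that works on them constantly gets cheap local accesses. */
SplitBlock *AllocSplitBlocks(int node, int count)
{
  if (numa_available() < 0)              /* no NUMA support: plain malloc           */
    return malloc(count * sizeof(SplitBlock));
  return numa_alloc_onnode(count * sizeof(SplitBlock), node);
}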
mridul

Re: SMP thread goes here

Post by mridul »

I had not one but two implementations of DTS working... and I don't remember asking you any questions either :-)

But yeah, it was more for the sake of doing it, and less for the sake of competitive computer chess.

Mridul
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP thread goes here

Post by bob »

mridul wrote:I had not one but two implementations of DTS working... and I don't remember asking you any questions either :-)

But yeah, it was more for the sake of doing it, and less for the sake of competitive computer chess.

Mridul
I don't think he was talking about you, either. :)
Mangar

Re: SMP thread goes here

Post by Mangar »

Hi,

As I am currently optimizing Spike's parallel algorithm, I want to join the discussion. I would like to describe the way I implemented it, since it seems to differ a bit from most other approaches.

Spike uses objects (though it is not object-oriented). All the data Spike needs for the search is stored in a "search" object; global data is declared as static.
Every thread gets its own instance of this object.

Spike has a recursive search, but most search-related data is stored in an array of positions rather than on the system stack.
The data stored in a position includes, for example, alpha, beta, the current value, the move list, the number of moves, and the current position value.

The data needed for a split position is also stored in the stack array; there is no special data block for split points. It consists of (a rough sketch in code follows this list):
1. A pointer to a stack position
2. A "volatile" thread count
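
A rough sketch of what such a stack entry could look like (all names and types here are mine, not Spike's):

Code: Select all

/* one entry of a thread's stack array; a split point is just an ordinary
   entry plus the two extra fields at the end                               */
typedef struct Position {
  int alpha, beta;
  int currentValue;
  int moveCount;
  int moveList[256];
  int positionValue;

  struct Position *owner;        /* 1. pointer to a stack position (set when helping) */
  volatile int threadCount;      /* 2. thread count (> 0: helpers may join here)      */
} Position;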

There are some additions I made for speed, but they are not necessary for the algorithm, so I will not mention them here.

Once a thread begins to search a new position, I call it the "position owner". As position owner it will search the first move without any parallelism. After searching the first move, it checks whether the position it owns is a good enough split position.
If so, it sets the position's thread count to 1, signaling that any thread may help there.

Any thread that is not working polls for thread counts > 0 at every position of every thread. If one or more positions with a thread count > 0 are found, the best one is joined.
Joining is done by the following steps (a sketch in code follows the list):
1. Increase the thread count of the owner position
2. Store a pointer to the owner position data
3. Make all moves necessary to reach the position
4. Store the split positions of the original thread up to this position (all of these are indirectly helped positions)
5. Start searching moves.
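
A sketch of those joining steps in C, reusing the hypothetical Position struct above; Worker, splitLock and the helper functions are made up, not Spike's real code:

Code: Select all

void Join(Worker *self, Position *ownerPos)
{
  Lock(&splitLock);
  ownerPos->threadCount++;                    /* 1. increase the owner's thread count   */
  Unlock(&splitLock);

  self->helped[self->numHelped++] = ownerPos; /* 2. store a pointer to the owner data   */
  MakeMovesUpTo(self, ownerPos);              /* 3. make the moves leading there        */
  RecordIndirectSplits(self, ownerPos);       /* 4. also record the owner's own helped
                                                    positions (indirect helps)          */
  SearchMovesAt(self, ownerPos);              /* 5. start searching moves               */
}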

Leaving is done by the following steps (again sketched in code after the list):
1. Undo all moves made when joining
2. Set the position pointer to zero
3. Decrease the thread count
4. Only for the position owner: "wait" until the thread count is zero. (It can still help any thread that is helping somewhere at this position, but it cannot leave the position.)
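
And a matching sketch of leaving (same made-up names as above):

Code: Select all

void Leave(Worker *self, Position *ownerPos)
{
  UnmakeMovesDownTo(self, ownerPos);          /* 1. undo the moves made when joining    */
  self->helped[--self->numHelped] = 0;        /* 2. set the position pointer to zero    */

  Lock(&splitLock);
  ownerPos->threadCount--;                    /* 3. decrease the thread count           */
  Unlock(&splitLock);
}

/* 4. the owner cannot leave until every helper is gone; meanwhile it may help
      at split points below this one, but it may not leave the position        */
void OwnerWait(Worker *self, Position *pos)
{
  while (pos->threadCount > 0)
    HelpAtSomeSplitBelow(self, pos);
}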

As many threads are able to change the position owner's data, incrementing/decrementing the thread count, fetching moves, changing values, etc. are all protected by a critical section.

Cutoffs are handled by polling: helping threads poll every helped position for alpha >= beta.

In this implementation a thread has no need to know which threads are helping it at a position, but the other way around is necessary: a thread has to know where it is helping, either directly or indirectly, i.e. by helping a thread that is itself helping another thread.
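
For example, the cutoff poll described above could boil down to something like this (again with the made-up helped[] bookkeeping from the join sketch):

Code: Select all

/* a helper checks every position it is directly or indirectly helping
   for a fail high that makes its current work pointless                 */
static int MustStop(const Worker *self)
{
  int i;
  for (i = 0; i < self->numHelped; i++)
    if (self->helped[i]->alpha >= self->helped[i]->beta)
      return 1;
  return 0;
}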

Currently I get a speedup of about 3.05 with 4 CPUs at a depth-14 search. It is measured by running 100 positions to depth 14, 30 times over, to reduce statistical noise.
There are additional possibilities to increase the speedup. These are:
1. Splitting at the root (which I don't do yet)
2. Stopping sibling threads on an alpha increase without a beta cutoff
3. A heuristic for when it is better to help siblings than to search a whole new move

Greetings, Volker