SMP and questions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: SMP and questions

Post by zamar »

This is all very theoretical, but I can give several reasons why I find it very difficult to believe that stopping/restarting in the case of a fail high can give any measurable benefit at PV nodes:

1) This is a rare case. Last time I tested, the probability of a best-move change at a PV node close to the root in late iterations was around 10%.

2) It only affects a few moves. In the case of a quad core, we are searching at most three other moves at the same time when the fail high occurs. There are on average ~30 moves in a position, so this affects at most 10% of the moves.

3) Because of the small aspiration window that all modern programs use, typically around 8-16 cp, the change in the window size is relatively small, so the net gain is also relatively small: let's say 20% at most (and that is an overestimate).

So based on the three assumptions above I'd expect we could get a 0.1 * 0.1 * 0.2 = 0.2% speedup.
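Spelled out with the three factors labelled:

\text{expected gain} \approx \underbrace{0.1}_{\text{fail high at PV node}} \times \underbrace{0.1}_{\text{fraction of moves affected}} \times \underbrace{0.2}_{\text{gain from narrower window}} = 0.002 = 0.2\%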

Of course it could be a bit more, but even if it's five times more, it's still too small to be measurable.
Joona Kiiski
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

Houdini wrote:1) An illegal move from hash should be very rare, it doesn't really matter what you do.
2) 3) 6) Per thread is best. The less writable memory you share between threads, the better.
4) Killers are very volatile, it doesn't really matter what you do.
7) Don't split good captures or killers - the probability of a cut-off is too high.
8) Houdini doesn't split at the root, mostly for practical reasons (for example time control).
9) Have a counter per thread. Same reason as 2) 3) 6).

Robert
Thanks for sharing this. Number 7 didn't work for my implementation. The threads just waste too much time in the idle loop waiting for work. Maybe it will work if used in conjunction with node types (All, Cut); I'll try that idea later.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

mcostalba wrote:
Edsel Apostol wrote: What are the other tricks used to minimize the idle time spent by the threads? I have read about "Active reparenting" from the SF team, but I have not studied SF yet to see how this is implemented, or whether it was implemented at all.
The idea is very simple. Implementation, as usual with SMP, is simple only after you have made it work :-).

Once a slave has finished searching, instead of returning to the idle loop to wait to be picked up by a new master, it looks for an active split point, registers itself as that split point's slave, and starts searching on it. The idea is that the average idle time is greatly reduced in this case.
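Roughly, the slave's side of the idea might look like this (an illustrative sketch with made-up names, not the actual Stockfish code):

Code:

#include <atomic>
#include <cstdint>

// Illustrative types only; a real engine keeps much more state per split point.
struct SplitPoint {
    std::atomic<uint64_t> slavesMask{0};     // bit i set => thread i is searching this split point
    std::atomic<bool>     allMovesTaken{false};
    int                   depth = 0;
};

constexpr int MaxThreads = 8;
SplitPoint* activeSplitPoint[MaxThreads] = {};   // current split point of each thread, or nullptr

// Called by a slave that has just run out of work. Instead of parking in the
// idle loop, it tries to attach itself to some other thread's active split point.
// A real implementation must take the split point's lock before registering,
// since the split point can finish at any moment.
SplitPoint* tryReparent(int myThreadId) {
    SplitPoint* best = nullptr;
    for (int t = 0; t < MaxThreads; ++t) {
        SplitPoint* sp = activeSplitPoint[t];
        if (sp && !sp->allMovesTaken.load(std::memory_order_acquire)
               && (!best || sp->depth > best->depth))     // prefer the deepest remaining work
            best = sp;
    }
    if (best)
        best->slavesMask.fetch_or(1ULL << myThreadId, std::memory_order_acq_rel);
    return best;    // caller starts picking moves from *best, or goes idle if nullptr
}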

Unfortunately the patch failed to score an increase and so has been retired:

https://github.com/mcostalba/Stockfish/ ... 95083a3ab9

My guess is that the idle time is already small on average anyhow, so the patch does not change things a lot. One possibility that I have not tested is to enable active reparenting together with increasing the minimum split depth. Increasing the split depth reduces the SMP overhead but at the same time increases the average slave's idle time, so at first glance it could make sense to increase the former and enable the latter.
SMP seemed simple to me at first in theory, until I started implementing it; only then did I realize the complexity. Though now that I'm done with it, it seems simple again. :)

Increasing the minimum split depth would mean less splitting and therefore fewer split points for active reparenting to attach to, so the gain would probably only be about break-even.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell, what is the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

syzygy wrote:
Edsel Apostol wrote:NPS speed-up in the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively. This is in an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that has only recently started working.
Any suggestions as to where the potential bottleneck is? What is the speed-up in your engine at the initial position?
I just did a 75-80 second run on the initial position. The NPS speed-up I get with 6 cores is 5.84 compared to my SMP engine running 1 thread, and 5.81 compared to my engine with SMP not compiled in.
#9 Should the node count be counted per thread and just summed, or is this sufficient?
For sure it is much better to do this per thread, especially if you also want the node count to be accurate. With a single counter you would have to use atomic increment (either through inline assembly or a compiler builtin), which you can avoid at no cost with a counter per thread. (Atomic instructions aren't terribly expensive on modern cpus when uncontested, but a single node counter would be one of the most contested memory locations (or rather, cache lines) in your engine.)
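For illustration, the two options look roughly like this (made-up names; the point is just that the per-thread counters need no synchronization at all):

Code:

#include <atomic>
#include <cstdint>

constexpr int MaxThreads = 8;

// Option A: one shared counter. Every node increments the same cache line,
// so it must be atomic and it becomes heavily contested with many threads.
std::atomic<uint64_t> sharedNodes{0};
inline void countNodeShared() { sharedNodes.fetch_add(1, std::memory_order_relaxed); }

// Option B: one counter per thread, padded to its own cache line to avoid
// false sharing. Increments are plain stores; sum them only when needed.
struct alignas(64) ThreadCounter { uint64_t nodes = 0; };
ThreadCounter perThread[MaxThreads];

inline void countNode(int threadId) { perThread[threadId].nodes++; }

uint64_t totalNodes() {
    uint64_t sum = 0;
    for (int t = 0; t < MaxThreads; ++t)
        sum += perThread[t].nodes;   // slightly racy snapshot, fine for reporting NPS
    return sum;
}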
What are the other tricks used to minimize the idle time spent by the threads? I have read about "Active reparenting" from the SF team, but I have not studied SF yet to see how this is implemented, or whether it was implemented at all.
As far as I understand, SF and many other engines let a thread that is searching a node check if there are idle threads when it is ready to split. So an idle thread will have to wait until it is picked up by an active thread. This way you have idle time, and you have no control over how close to the root the idle thread joins the search (except by increasing the minimum split depth).

If I'm not mistaken, the active reparenting idea is to let a thread that has become idle look at the nodes at which the active threads have joined the search, and presumably join the one that is closest to the root. I guess this should in theory help somewhat if you have a large number of threads. If you have just two threads, it will do nothing because there are no "active split points" if one thread is idle (so only one thread is active).

In my engine threads do not try to pick up idle threads. Instead, when a node is ready to be split (after captures and killers have been searched), it sets a flag for that node to mark it as a potential split point (and copies alpha, beta, depth to a place where other threads can find them). When a thread becomes idle, it basically loops over all other threads to find the potential split point with the highest depth. (If a thread is idle because it must wait for slave threads to finish, it only looks at potential split points in the subtree below the node at which it is waiting.) The flags are bits in one 64-bit word per thread (for plies 0-63, although I don't currently split at the root / ply 0), so looking for a split point is just a matter of taking the LSB of this word for each thread and looking up and comparing the corresponding depths.
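In rough code, that lookup might look something like this (illustrative names and layout, not the engine's actual code):

Code:

#include <bit>       // std::countr_zero is C++20; older compilers can use __builtin_ctzll
#include <cstdint>

constexpr int MaxThreads = 8;
constexpr int MaxPly     = 64;

// Illustrative per-thread data: one bit per ply marking a potential split
// point, plus the depth stored at each ply so candidates can be compared.
// In a real engine the mask is read and written atomically.
struct ThreadData {
    uint64_t potentialSplitMask = 0;   // bit p set => node at ply p is ready to be split
    int      depthAtPly[MaxPly] = {};
};
ThreadData threads[MaxThreads];

// For each thread take the lowest set bit (the candidate closest to the root
// in that thread) and pick the one with the highest depth over all threads.
// A thread waiting at its own split point would additionally restrict the
// scan to its own subtree. Returns {-1, -1} if nothing is available.
struct Candidate { int thread, ply; };

Candidate findSplitCandidate() {
    Candidate best = {-1, -1};
    int bestDepth = -1;
    for (int t = 0; t < MaxThreads; ++t) {
        uint64_t mask = threads[t].potentialSplitMask;
        if (!mask)
            continue;
        int ply = std::countr_zero(mask);          // index of the lowest set bit (LSB)
        if (threads[t].depthAtPly[ply] > bestDepth) {
            bestDepth = threads[t].depthAtPly[ply];
            best = {t, ply};
        }
    }
    return best;
}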

I don't know if my approach is measurably better than the approach taken by SF, but it seems to work quite well. Initially I feared that the overhead for "potential split points" that never become split points would be high (because those need to take a node-specific spinlock when selecting each next move), but it turns out that it works fine with a minimum split depth of 2 (i.e. I only don't split in the last ply of full width search).

To keep the cost of splitting low, I think it is (always) best to just copy the moves leading from the root to the split node.
Your speed-up is great and your approach is interesting. It should minimize idle time for the threads as long as you have enough split points, which you do since you are splitting at a minimum split depth of 2.

When the flag is set for a "potential split point", are you already using locks when selecting the moves, even without other threads yet helping to search this split point?

I may experiment with this approach.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

jdart wrote:Intel Parallel Studio is free for non-commercial use on Linux and can give you an idea of how efficient your SMP is and where the bottlenecks are.

If you are doing it right, you should get nearly 100% usage on all threads, except for the little bit of startup time before you are hitting the minimum split depth (I do not split if the depth is too low or if in check, because if in check you will typically only have a few moves to search).

With > 2 threads you need to implement the "helpful master" concept to achieve this .. i.e. a master thread should not be waiting for its slave threads to complete, but should help them complete.

--Jon
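The "helpful master" concept might be sketched like this (illustrative names and types, not his actual code; the key point is that a waiting master only joins split points inside its own subtree, so its own split point can still complete):

Code:

#include <atomic>
#include <cstdint>

// Illustrative split point: just enough to express "is sp somewhere below mySp?".
struct SplitPoint {
    SplitPoint*           parent = nullptr;       // enclosing split point, or nullptr
    std::atomic<uint64_t> slavesMask{0};          // bit i set => thread i still searching here
};

// Engine plumbing assumed to exist elsewhere (declarations only).
SplitPoint* findCandidateSplitPoint();            // same search as in the slave idle loop
void        joinAndSearch(int threadId, SplitPoint* sp);

// True if sp lies in the subtree rooted at mySp (walk up the chain of parents).
bool isBelow(const SplitPoint* sp, const SplitPoint* mySp) {
    for (; sp; sp = sp->parent)
        if (sp == mySp)
            return true;
    return false;
}

// Helpful master: while waiting for its own slaves, the master keeps working,
// but only on split points inside its own subtree, so results still funnel
// back into mySp and the master cannot get stuck in an unrelated subtree.
void waitForSlaves(int myThreadId, SplitPoint* mySp) {
    while (mySp->slavesMask.load(std::memory_order_acquire) != 0) {
        SplitPoint* sp = findCandidateSplitPoint();
        if (sp && sp != mySp && isBelow(sp, mySp))
            joinAndSearch(myThreadId, sp);
        // otherwise just loop/yield until the slaves are done
    }
}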
Unfortunately, my SMP is Windows-only at the moment (CriticalSection only).

Running some analysis on Windows, though, showed my implementation at around 10% thread synchronization overhead, mostly spent waiting for locks when selecting the moves. I don't have any ideas how to avoid this at the moment, and the overhead tends to increase as the number of threads increases. It might be a good idea to limit the number of threads working on a split point to a minimum, let's say 2, but I have yet to experiment with that.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

Edsel Apostol wrote:I have not studied DTS yet. In a nutshell, what is the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
From what I understand of DTS, it allows the master of a split node (i.e. the thread that originally started searching it) to transfer responsibility for the node to one of the slave threads when the master has run out of work but some of the slave threads are still searching their moves. The newly responsible slave then has to back up the result to the parent node. With a recursive search you cannot really do this, because only the stack of the master allows a simple backing up to the parent node.

With an iterative search you don't use the stack to keep track of where you are in the tree, but some sort of global data structure that the master can pass to the slave.
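The kind of structure meant here is roughly the following: with an explicit stack of frames in shared memory, another thread can in principle take over the node and keep backing up scores (an illustrative layout only):

Code:

#include <cstdint>

constexpr int MaxPly = 64;
using Move = uint32_t;          // placeholder move type

// One frame per node on the path from the root. In a recursive search this
// state lives on the C stack of one particular thread; in an iterative search
// it lives in an explicit array that can be handed to another thread.
struct Frame {
    int  alpha, beta, depth;
    int  bestScore;
    Move bestMove;
    int  moveIndex;             // how far we are through this node's move list
};

struct SearchState {
    Frame frames[MaxPly];       // the whole path root..current node, in shared memory
    int   ply;                  // index of the current node
};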

Without DTS, a master that has run out of work can only make itself useful as a "helpful master", i.e. as a slave to one of its own slaves. But as long as those have work to do, that doesn't seem so bad.

I'm not completely sure where the supposed advantage of DTS lies. The possibilities are:
- reduced idle time and/or splitting overhead, or
- reduced search overhead.

Reduced idle time can only be a significant advantage if there is considerable idle time when using a recursive search. I haven't tried to measure idle time in my implementation, but from my nps measurements I can tell that it must be negligible. My splitting overhead seems negligible for the same reason.

That leaves reduced search overhead. It might be that splitting closer to the root in general reduces search overhead, and in that case allowing idle masters to help closer to the root instead of restricting them to the subtrees of their slaves would indeed help. But would it help much? I kind of doubt it.

From Hyatt's paper I understand that in DTS, idle threads actively search for a suitable split point "candidate" in one of the trees being searched by the active threads. But I do that as well in my recursive YBW implementation.

Hmm, here is something I definitely don't do (this is about going from the PV node at ply=N to the PV node at ply=N-1):
Since we are using the best-known move ordering heuristics in the alpha/ beta tree, it is likely that by the time the first few moves at ply=N have been searched, we know the actual score for this node, as moves further down the list should be worse and not improve the score further. Then, allowing idle processors to back up to ply=N-1 and start searching there before ply=N is completed is probably safe, and this is what DTS does. On the rare occasions when the ply=N search completes, and the last branch produces an even better score, the processors that have backed up to ply=N-1 already are searching with a less efficient bound. DTS notices this and each processor already searching at N-1 is given the new (correct) bound as soon as it is known. This sounds easy and clean, but it can cause some interesting problems, because this new bound might mean most of what a processor has been busy doing can be thrown away since the new bound would have caused a cutoff much sooner than the original bound used.
This seems contrary to YBW, but this is about PV nodes and it might be OK for them. This is an interesting approach to reducing idle time and/or splitting overhead (to avoid threads splitting close to the leaves near the end of the search of the PV node at ply=N), but I'm not convinced that there is much to gain there, and it can increase search overhead. In any case, I think it is possible to do this in a recursive search as well... (tricky, but doing this in an iterative search might be as tricky).

I think it should be possible to implement DTS without most of the copying and messaging between processors that is described in the paper. That would be necessary to make it competitive on today's processors.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

Edsel Apostol wrote:When the flag is set for a "potential split point", are you already using locks when selecting the moves, even without other threads yet helping to search this split point?
Yes, because the thread basically does not know whether other threads have joined at the node. (It can figure this out by looking at a node-specific "slaves" field which is a bit mask of all threads registered as slaves for that node, but it would have to take the lock anyway because threads can join at any time as long as the node's bit is set in the potential split points bitmask.)

The locks are spinlocks, which only take an atomic increment or decrement, which is cheap when uncontested (which is always the case if it is not an actual split node). Still, I had expected a higher speed penalty.

Bits in the thread-specific "potential split points" bit mask I set and clear using atomic bit-set and bit-clear instructions, so that I don't have to take thread-specific locks.
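The bit operations themselves can be as simple as this (a sketch with C++ atomics; the engine's actual implementation may use inline assembly or compiler builtins instead):

Code:

#include <atomic>
#include <cstdint>

// One "potential split points" mask per thread; other threads read it, so the
// owning thread must publish updates atomically, but it never needs a lock.
std::atomic<uint64_t> potentialSplitMask[8];

inline void markSplittable(int threadId, int ply) {
    potentialSplitMask[threadId].fetch_or(1ULL << ply, std::memory_order_release);
}

inline void unmarkSplittable(int threadId, int ply) {
    potentialSplitMask[threadId].fetch_and(~(1ULL << ply), std::memory_order_release);
}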
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and questions

Post by bob »

syzygy wrote:
bob wrote:And one believes that doing this on just a very small number of nodes overall is going to have some big effect?
That the number of PV nodes is very small is not a very good argument. All other nodes are searched starting from a PV node. With tighter bounds, those subtrees will be smaller.

I suppose you wouldn't be in favour of replacing "alpha" by "o_alpha" (original alpha) in each recursive invocation of Search() inside crafty's Search(), even though that only makes a difference in the very small number of nodes that are PV nodes. If updating alpha within a thread is a good idea, it should also be a good idea across threads.

That the number of PV nodes is very small means that updating alpha in those nodes is essentially free.

Anyway, it is just a suggestion. If it really makes no measurable difference, this all doesn't matter. It would surprise me if that is the case, but I have not tried to measure it myself (yet).
I did try it. In fact, I used to do all the classic "update alpha" tricks including circumstances where you have a hash hit that is not good enough to stop the search, but which does narrow the window. After a ton of testing a few years ago, I took that out. Slightly simpler code for no cost in performance.

Properly "updating" alpha in a recursive negamax search is difficult, because it has to be replicated downward through the tree, which is not so easy in a recursive implementation. But I tested this in the Cray Blitz code, since it was non-recursive, and I found exactly no measurable benefit... In such a case I tend to avoid such code, because it introduces potential race conditions that can be problematic.

Not saying it is good or bad, or right or wrong, or my way is the only way to go. I simply reported my test results which said "not worth anything"...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and questions

Post by bob »

Edsel Apostol wrote:Thanks for the input. My NPS speed-up is now somewhat decent though not as good as Crafty or other top programs.

I have not studied DTS yet. In a nutshell, what is the difference compared to YBW? Why is it not easy to implement in a recursive search framework?
With DTS, you make split decisions anywhere in the tree, as opposed to only at the current node as in my current YBW implementation. It is not easy, if you are at ply=16, to split at ply=8: the data structures for ply=8 are already allocated locally somewhere but need to become global. So it simply gets VERY complicated. An iterative search, as opposed to a recursive search, solves this. The code is much uglier, however, and that's the reason I chose not to do DTS a second time when I wrote Crafty; I liked the cleanliness of the recursive implementation...