An interesting parallel search non-bug

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

An interesting parallel search non-bug

Post by bob »

The current rewrite of Crafty's parallel search is performing amazingly well. But while tweaking and tuning, I inserted a test in the "Join()" code to sanity-check things, and one of the tests was "am I joining with myself?" I thought that should never happen, but I was wrong, majorly wrong.

An explanation:

Assume a 30 ply search and ignore extensions and reductions. At ply=10 thread 0 decides to do a "gratuitous split". That is a split that is done preemptively so that when other threads become idle, they can join here without me doing any additional work. Now we have a split point at ply=10 (owned by thread 0) that is flagged as joinable.

Assume we get to ply 20 and a thread becomes idle. First thing it does is try to join the best current joinable split point. The above split point is not the best so it isn't selected, another one is. But by the time we discover which one is best, it has evaporated because other threads have already completed it. Before setting the split flag, which tells other threads to do a split a.s.a.p., I loop back through the split blocks again, to try a second pass to find a good split point. I find it but again it evaporates before that thread can join. I try one more time, and the same thing happens again. So now, in thread 0, I do a real split at ply=20 and thread 1 joins with me there. We both search and I happen to finish first and have to wait for thread 1 to finish. So thread 0 goes to Join() and guess what it finds. Its own split point at ply 10. So it joins, searches, and all is well.

I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I later change something that having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...
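
The guard itself is conceptually tiny. A rough sketch in C, with made-up structure and field names (illustrative only, not Crafty's actual code): when a thread looks for a joinable split point, it simply skips any split point it owns itself.

```c
#include <stddef.h>

/* Hypothetical sketch -- SplitBlock, joinable, owner etc. are
   illustrative names, not Crafty's actual data structures. */
typedef struct SplitBlock {
    int in_use;        /* slot holds a live split point          */
    int joinable;      /* owner flagged it as accepting helpers  */
    int owner;         /* thread id that created the split point */
    int ply;           /* ply at which the split was made        */
} SplitBlock;

/* Pick a joinable split point for thread `tid`, skipping any split
   point that `tid` itself owns -- the simple test that prevents a
   thread from "joining with itself".  Returns an index, or -1. */
int pick_split_point(SplitBlock blocks[], int n, int tid) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!blocks[i].in_use || !blocks[i].joinable)
            continue;
        if (blocks[i].owner == tid)   /* the self-join guard */
            continue;
        /* prefer split points closer to the root (smaller ply) */
        if (best < 0 || blocks[i].ply < blocks[best].ply)
            best = i;
    }
    return best;
}
```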

It does show the outstanding flexibility of the old DTS algorithm compared to what I am doing now, because with DTS there was no need for any gratuitous splits at all: a thread can split with another thread no matter where it currently is in the tree, and it can choose to split at any ply it wants, because it has no recursive stack to preserve and unwind. The more I look, the more tempted I am to go back and write an iterated version to get the many advantages of DTS over the current YBW approach, even though the current code is certainly much easier to understand and modify.
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Re: An interesting parallel search non-bug

Post by MikeB »

bob wrote:The current rewrite of Crafty's parallel search is performing amazingly well. But while tweaking and tuning, I inserted a test in the "Join()" code to sanity-check things, and one of the tests was "am I joining with myself?" I thought that should never happen, but I was wrong, majorly wrong.

An explanation:

Assume a 30 ply search and ignore extensions and reductions. At ply=10 thread 0 decides to do a "gratuitous split". That is a split that is done preemptively so that when other threads become idle, they can join here without me doing any additional work. Now we have a split point at ply=10 (owned by thread 0) that is flagged as joinable.

Assume we get to ply 20 and a thread becomes idle. First thing it does is try to join the best current joinable split point. The above split point is not the best so it isn't selected, another one is. But by the time we discover which one is best, it has evaporated because other threads have already completed it. Before setting the split flag, which tells other threads to do a split a.s.a.p., I loop back through the split blocks again, to try a second pass to find a good split point. I find it but again it evaporates before that thread can join. I try one more time, and the same thing happens again. So now, in thread 0, I do a real split at ply=20 and thread 1 joins with me there. We both search and I happen to finish first and have to wait for thread 1 to finish. So thread 0 goes to Join() and guess what it finds. Its own split point at ply 10. So it joins, searches, and all is well.

I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I later change something that having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...

It does show the outstanding flexibility of the old DTS algorithm compared to what I am doing now, because with DTS there was no need for any gratuitous splits at all: a thread can split with another thread no matter where it currently is in the tree, and it can choose to split at any ply it wants, because it has no recursive stack to preserve and unwind. The more I look, the more tempted I am to go back and write an iterated version to get the many advantages of DTS over the current YBW approach, even though the current code is certainly much easier to understand and modify.
How long would that take? On the surface it sounds like a huge undertaking. Would it be Elo neutral? DTS is known for having the best scalability, and we all know the number of cores is going up. 8-)
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: An interesting parallel search non-bug

Post by sje »

bob wrote:The more I look, the more tempted I am to go back and write an iterated version to get the many advantages of DTS over the current YBW approach, even though the current code is certainly much easier to understand and modify.
An iterated, non-recursive approach is the way to go regardless of the parallelism algorithm chosen. Strict adherence to a state machine formalism makes it easier to define, debug, and expand the search. Properly written, a non-recursive search has no objects hidden on the call stack, and control can be single-stepped or easily paused and resumed. Also, the search state can be copied out and later restored by a separate process in a heterogeneous distributed environment.
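
As a toy illustration of the idea (not Symbolic's actual code), here is a minimal non-recursive negamax driven by an explicit state machine. The "game tree" is just a complete binary tree whose leaf values come from an array; everything that a recursive search would hide on the call stack lives in ply-indexed arrays.

```c
#include <limits.h>

/* Non-recursive negamax as a state machine.  All per-ply data lives
   in ply-indexed arrays -- nothing is hidden on the call stack, so
   the search could be paused, single-stepped, or copied out. */
enum State { START_PLY, NEXT_MOVE, RETURN_VALUE, DONE };

#define MAX_PLY 16

int negamax_fsm(const int *leaf, int depth) {
    int move[MAX_PLY];   /* next child index to try at each ply */
    int best[MAX_PLY];   /* best score found so far at each ply */
    int node[MAX_PLY];   /* node index within its tree level    */
    int ply = 0, result = 0;
    enum State state = START_PLY;

    node[0] = 0;
    while (state != DONE) {
        switch (state) {
        case START_PLY:
            if (ply == depth) {             /* leaf: evaluate */
                result = leaf[node[ply]];
                state = RETURN_VALUE;
            } else {
                move[ply] = 0;
                best[ply] = INT_MIN;
                state = NEXT_MOVE;
            }
            break;
        case NEXT_MOVE:
            if (move[ply] < 2) {            /* two "moves" per node */
                node[ply + 1] = node[ply] * 2 + move[ply];
                move[ply]++;
                ply++;                      /* "call": descend a ply */
                state = START_PLY;
            } else {
                result = best[ply];
                state = RETURN_VALUE;
            }
            break;
        case RETURN_VALUE:
            if (ply == 0) {
                state = DONE;
            } else {
                ply--;                      /* "return": ascend a ply */
                if (-result > best[ply])
                    best[ply] = -result;
                state = NEXT_MOVE;
            }
            break;
        case DONE:
            break;
        }
    }
    return result;
}
```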

Perhaps the reason we don't see more authors use a state machine search is that minimax and A/B are almost always represented using recursion, as taught in textbooks. But what appears in a textbook is not necessarily what works best in a real-world application.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: An interesting parallel search non-bug

Post by jdart »

That is really making my head hurt. YBWC was hard enough to debug. I am really reluctant to touch the parallel code I have, even though it is probably not optimal. I am sure it could do less locking for example. But it runs for hours or days without any hiccups and there is always the chance that changing it lets in some awful rare bug.

--Jon

Re: An interesting parallel search non-bug

Post by bob »

MikeB wrote:
bob wrote:The current rewrite of Crafty's parallel search is performing amazingly well. But while tweaking and tuning, I inserted a test in the "Join()" code to sanity-check things, and one of the tests was "am I joining with myself?" I thought that should never happen, but I was wrong, majorly wrong.

An explanation:

Assume a 30 ply search and ignore extensions and reductions. At ply=10 thread 0 decides to do a "gratuitous split". That is a split that is done preemptively so that when other threads become idle, they can join here without me doing any additional work. Now we have a split point at ply=10 (owned by thread 0) that is flagged as joinable.

Assume we get to ply 20 and a thread becomes idle. First thing it does is try to join the best current joinable split point. The above split point is not the best so it isn't selected, another one is. But by the time we discover which one is best, it has evaporated because other threads have already completed it. Before setting the split flag, which tells other threads to do a split a.s.a.p., I loop back through the split blocks again, to try a second pass to find a good split point. I find it but again it evaporates before that thread can join. I try one more time, and the same thing happens again. So now, in thread 0, I do a real split at ply=20 and thread 1 joins with me there. We both search and I happen to finish first and have to wait for thread 1 to finish. So thread 0 goes to Join() and guess what it finds. Its own split point at ply 10. So it joins, searches, and all is well.

I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I later change something that having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...

It does show the outstanding flexibility of the old DTS algorithm compared to what I am doing now, because with DTS there was no need for any gratuitous splits at all: a thread can split with another thread no matter where it currently is in the tree, and it can choose to split at any ply it wants, because it has no recursive stack to preserve and unwind. The more I look, the more tempted I am to go back and write an iterated version to get the many advantages of DTS over the current YBW approach, even though the current code is certainly much easier to understand and modify.
How long would that take? On the surface it sounds like a huge undertaking. Would it be Elo neutral? DTS is known for having the best scalability, and we all know the number of cores is going up. 8-)
It would be a complete rewrite of search.c... I don't think anything else would be affected, but that is enough. The only good thing is that I did it once, and I still have a basic model in the Cray Blitz source...

Re: An interesting parallel search non-bug

Post by bob »

jdart wrote:That is really making my head hurt. YBWC was hard enough to debug. I am really reluctant to touch the parallel code I have, even though it is probably not optimal. I am sure it could do less locking for example. But it runs for hours or days without any hiccups and there is always the chance that changing it lets in some awful rare bug.

--Jon
The current code is both extremely complex and extremely simple at the same time. But what makes it simple was extremely hard to implement. :)

But the advantage of NEVER having to wait, as YBWC requires, is significant, and I have some ideas to improve even on DTS, but the first step is an iterated search.

It is certainly nice to be sitting at ply=30 and, when a processor goes idle, have it split with you at ply=2 with no problems. And you won't even know. It's also nice to have no "master" at a node, where only one thread can "return": since there is no call stack, when a thread finishes at a node it just finishes and jumps somewhere else, leaving the last busy thread at a split point to clean up and back up to the previous ply.
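
The "last busy thread cleans up" idea can be sketched with a single atomic counter. A hedged sketch (the names and fields are illustrative, not Crafty's): nobody waits at the split point; whichever thread decrements the counter to zero does the backup.

```c
#include <stdatomic.h>

/* Illustrative sketch of a no-master split point: each finishing
   thread decrements `busy`; the thread that happens to be last
   (old value 1) backs the result up to the previous ply. */
typedef struct SplitPoint {
    atomic_int busy;   /* threads still searching here   */
    int backed_up;     /* set once by the cleanup thread */
} SplitPoint;

/* Called by each thread when it runs out of work at `sp`.
   Returns 1 if this thread was last and did the cleanup. */
int leave_split_point(SplitPoint *sp) {
    if (atomic_fetch_sub(&sp->busy, 1) == 1) {
        sp->backed_up = 1;  /* last one out backs up the score */
        return 1;
    }
    return 0;               /* not last: go find other work */
}
```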

Sounds nice, sounds easy. NOT. :) But I suppose I will never be satisfied until this is done. Next step is to get 25.0 out, then probably some minor tweaks while working on 26.0, which will likely be the iterative search version, assuming I don't change my mind... The real pain today is that we spend too much time out near the tips where splits are avoided. Now, when we are out there, we simply can't split. With DTS you can split at any time, since you can see the entire tree state, ply by ply, in what amounts to a stack indexed by ply... You can be at ply=30, split at ply=3, and then continue at ply=30, easily.

BTW, while I have not looked at your code, I did do some timing analysis on mine, and the time spent waiting on locks, in total, is almost unmeasurable, i.e. less than one second over a search that takes 10 minutes or so. The sole exception is extracting the next move at a split point, which has to be done serially. But DTS helps significantly here as well, because you can split near the root whenever possible, where move generation contention is almost nil...
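
That one serialized piece, handing out the next move at a shared split point, might look roughly like this (illustrative names only, not Crafty's actual NextMove()): the critical section is kept as small as possible, just the index bump.

```c
#include <pthread.h>

/* Illustrative split point with a shared move list.  Several
   threads pull moves from the same list, so the extraction must
   be serialized -- but only the extraction. */
typedef struct SplitPoint {
    pthread_mutex_t lock;
    int moves[256];   /* pre-generated moves at this node  */
    int count;        /* how many moves were generated     */
    int next;         /* index of the next unsearched move */
} SplitPoint;

/* Returns the next move to search, or -1 when the list is empty. */
int next_move(SplitPoint *sp) {
    int mv = -1;
    pthread_mutex_lock(&sp->lock);    /* tiny critical section */
    if (sp->next < sp->count)
        mv = sp->moves[sp->next++];
    pthread_mutex_unlock(&sp->lock);
    return mv;
}
```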

Seems that I am talking myself into this... I'm going to think about Steven's FSM approach. The idea of an FSM state per ply is interesting, and it actually might simplify the CB-style search approach. Going to be a Spring project, however; I'm now running our IT and teaching three courses, so Crafty is a late-night endeavor exclusively right now...

Re: An interesting parallel search non-bug

Post by bob »

sje wrote:
bob wrote:The more I look, the more tempted I am to go back and write an iterated version to get the many advantages of DTS over the current YBW approach, even though the current code is certainly much easier to understand and modify.
An iterated, non-recursive approach is the way to go regardless of the parallelism algorithm chosen. Strict adherence to a state machine formalism makes it easier to define, debug, and expand the search. Properly written, a non-recursive search has no objects hidden on the call stack, and control can be single-stepped or easily paused and resumed. Also, the search state can be copied out and later restored by a separate process in a heterogeneous distributed environment.

Perhaps the reason we don't see more authors use a state machine search is that minimax and A/B are almost always represented using recursion, as taught in textbooks. But what appears in a textbook is not necessarily what works best in a real-world application.
I'm going to give some thought to your FSM approach. I never thought of doing the search that way. I do my move selection logic with an FSM, which does eliminate spaghetti code. I did this in the CB move selection as well, just not in the search. The more I think about it, the better it sounds, however. More as I start to design this...
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: An interesting parallel search non-bug

Post by Edsel Apostol »

bob wrote: I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I change something later that having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...
It seems this behavior breaks the concept of the helpful master, where a thread that owns a splitpoint and is waiting for slave threads to finish should only help the slave threads working for it. Since the thread would not be working for itself, it should not help on the splitpoint it created earlier.
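
The restriction described here amounts to an ancestry test on the split-point tree. A rough sketch with made-up structures (neither Hannibal's nor Crafty's actual code): a waiting master only joins split points that sit below its own, i.e. created by threads working for it.

```c
#include <stddef.h>

/* Illustrative split-point tree node: each split point remembers
   the split point above it (NULL at the root of the tree). */
typedef struct SplitPoint {
    struct SplitPoint *parent;  /* split point above, or NULL */
    int owner;                  /* thread that created it     */
} SplitPoint;

/* Strict "helpful master" test: is `candidate` a descendant of
   (or the same as) `waiting_at` in the split-point tree?  A master
   waiting at `waiting_at` would only join if this returns 1. */
int helps_my_subtree(const SplitPoint *waiting_at,
                     const SplitPoint *candidate) {
    for (const SplitPoint *sp = candidate; sp != NULL; sp = sp->parent)
        if (sp == waiting_at)
            return 1;
    return 0;
}
```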

It is safe, I think, but inefficient. I made the same mistake in earlier versions of Hannibal. For example, at ply 10 thread 0 pre-splits at move 2, then at ply 20 splits again. When it finishes ahead of the other slave threads at ply 20 and joins its own earlier splitpoint at ply 10 at (let's say) move 3, it will need to finish all the other moves at that split before it becomes free again. The problem is that it has suspended work on the splitpoint it is waiting for at ply 20, and only when it finishes with the split at ply 10 does it resume work on the ply 20 splitpoint. The inefficiency is that if move 2 at ply 10 causes a cutoff, you have wasted time searching the other moves of that splitpoint at ply 10. NPS-wise it is efficient, but not in terms of TTD (time to depth).

Note that Hannibal already implements these pre-splits, where threads actively look for splits to join instead of being booked by the master thread. This was implemented more than a year ago, I think (it is in the 1.5 version).

Re: An interesting parallel search non-bug

Post by bob »

Edsel Apostol wrote:
bob wrote: I have added a simple test to avoid this, because even though it works perfectly at the present time, it seems like a bad idea and a potential bug waiting to happen if I change something later that having the same thread at the same split point twice might break. But it certainly made my head spin for a bit. Finding it was an interesting issue. Then discovering that it is perfectly safe in the current code was a second surprise. I decided that I didn't want a third, which would not occur until some future change breaks this...
It seems this behavior breaks the concept of the helpful master, where a thread that owns a splitpoint and is waiting for slave threads to finish should only help the slave threads working for it. Since the thread would not be working for itself, it should not help on the splitpoint it created earlier.
Why? All that is required is that ALL nodes that are potential split points be searched completely. Order doesn't matter. "Helpful master" doesn't get broken at all, as a thread is still the master at its split point. And when thread 0 joins itself at ply=10, it will eventually finish there, go to ThreadWait(), notice that ply=20 is now done, and start working its way back up the tree.

Not sure why you think this breaks "helpful master".

BTW my helpful master has NEVER had the requirement that it only help those that are helping it. I see no reason why it can't jump across the tree to any viable split point, and have always done it like that. But getting rid of the "master" concept is certainly a big win when moving to full-blown DTS. Now there are no masters anywhere.


It is safe, I think, but inefficient. I made the same mistake in earlier versions of Hannibal. For example, at ply 10 thread 0 pre-splits at move 2, then at ply 20 splits again. When it finishes ahead of the other slave threads at ply 20 and joins its own earlier splitpoint at ply 10 at (let's say) move 3, it will need to finish all the other moves at that split before it becomes free again. The problem is that it has suspended work on the splitpoint it is waiting for at ply 20, and only when it finishes with the split at ply 10 does it resume work on the ply 20 splitpoint. The inefficiency is that if move 2 at ply 10 causes a cutoff, you have wasted time searching the other moves of that splitpoint at ply 10. NPS-wise it is efficient, but not in terms of TTD (time to depth).

Note that Hannibal already implements these pre-splits, where threads actively look for splits to join instead of being booked by the master thread. This was implemented more than a year ago, I think (it is in the 1.5 version).
The challenge for me was "nearly lockless", which was critical with 20+ threads. The only lock that has measurable overhead is the NextMove() protection at split points. This took a lot of thought. I have also spent a ton of time making certain there are no two adjacent memory words that are modified by separate threads, even in places I did not think about... those are performance killers.
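
The adjacent-words problem is the classic false-sharing trap. A sketch of the usual cure, padding each thread's hot data out to its own cache line (the 64-byte line size and the field names are assumptions for illustration, not Crafty's layout):

```c
#include <stdalign.h>

/* Give each thread's frequently written counters their own cache
   line, so one thread's stores never invalidate the line holding
   another thread's counters.  64 bytes is the usual x86 line size. */
#define CACHE_LINE 64

typedef struct PerThread {
    alignas(CACHE_LINE) unsigned long long nodes;  /* nodes searched */
    unsigned long long splits;                     /* splits done    */
    /* pad so the struct occupies exactly one cache line */
    char pad[CACHE_LINE - 2 * sizeof(unsigned long long)];
} PerThread;

/* One slot per thread; each slot starts on its own cache line. */
PerThread counters[32];
```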

But then there is DTS, where pre-splits are a thing of the past. I've done a ton of measuring with "split with oneself" and "do not split with oneself" in my code; the difference is absolutely unmeasurable: NPS, time to depth, tree size, etc. But it looked like an accident waiting to happen when I change something next year without thinking about that quirk. My tree overhead has yet to exceed 30%, even with 24 cores, which is not bad. I.e., a 24-core search searches a tree maybe 20-25% larger than the serial tree to the same depth. The NextMove() synchronization is my biggest overhead. DTS can help by splitting close to the root rather than always having to either split where I am or gratuitously split closer to the root. Closer to the root -> less contention for the move list...

As far as your worry about going back to a "wrong split point" goes, there's nothing you can do, and nothing says a different split point is any better. In fact, positions near the root are less likely to have to abort, since their move ordering is better. I've done a lot of measuring on aborts by ply; no big surprise that as you get further from the root there are more of them. Disproportionately more than expected.

Since I added the "move ordering doesn't change, LMR always reduces split moves the same amount no matter what" code last year, tree variation was reduced quite a bit as well...

Have you measured TTD with any degree of accuracy on 20 cores? My current results are in the 13x-14x range, with the "split with myself" option. I'm trying to rerun the tests with the change, but for 20 cores I saw absolutely no improvement or degradation. Totally benign change, whether I do it or not...

Will have some 24-core results in a week or two; we have a new cluster with 12 x 2 x 2.5GHz Xeons, as opposed to the current 10 x 2 x 2.6GHz Xeons. It will be interesting to see how turbo boost works with 24 cores. My 20-core 2.6GHz box runs at 2.9GHz non-stop with Crafty on all 20 cores. Hope the 24-core box will run at around 2.8GHz or so...

Re: An interesting parallel search non-bug

Post by sje »

bob wrote:I'm going to give some thought to your FSM approach. I never thought of doing the search that way. I do my move selection logic with an FSM, which does eliminate spaghetti code. I did this in the CB move selection as well, just not in the search. The more I think about it, the better it sounds, however. More as I start to design this...
There are only two tricky parts with a state machine approach: the state where the graph calls itself to start at a new ply, and the state where it returns. Note that more than one state can call, and so more than one state can return. In Symbolic, the next-state variable is kept in the ply-indexed vector (and so is not a local variable). At a state where the graph calls itself, the next-state indicator for the current ply is set, then the ply is incremented, and finally the next-state indicator for the now-new ply is set. Nearly always, the first value of the next-state indicator at a new ply will be StartAtNewPly, while the next-state indicator giving the return state may vary.
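
The call/return mechanics described above can be sketched as two tiny helpers over a ply-indexed state vector (the state names and structure are illustrative, not Symbolic's actual code):

```c
/* Illustrative states; StartAtNewPly matches the convention
   described above, the others are made-up resume points. */
enum State { StartAtNewPly, ScanMoves, AfterChildSearch, Finished };

#define MAX_PLY 64

typedef struct SearchFSM {
    enum State next_state[MAX_PLY];  /* resume point for each ply */
    int ply;
} SearchFSM;

/* "Call": record where this ply resumes, then start a new ply. */
void fsm_call(SearchFSM *s, enum State resume_state) {
    s->next_state[s->ply] = resume_state;   /* set return state here */
    s->ply++;                               /* increment the ply     */
    s->next_state[s->ply] = StartAtNewPly;  /* new ply always starts here */
}

/* "Return": drop back a ply and resume its recorded state. */
enum State fsm_return(SearchFSM *s) {
    s->ply--;
    return s->next_state[s->ply];
}
```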

Just get a big piece of blank paper and draw a bunch of circles connected with directed arcs. Works for me. For the first attempt, keep it simple by having only one state which calls and only one state which returns. Things like a null move search, which will need its own call/return states, can go in later.

In Symbolic, I coded a constant vector of strings indexed by state with each string being the ASCII state name. I used this for single stepping and tracing. Very good for finding bugs.