Glaurung 2 and SMP

bob · Post by **bob** » Mon May 26, 2008 6:21 pm

hcyrano wrote:
Thread[slave_id].stop is a boolean variable, which is probably compiled to an int or a char in machine code. Aren't writes to ints and chars atomic operations?
hehe, here is the question

It isn't really a question. 8-byte writes are atomic, so long as they don't get split across cache lines. But with integral data types that are aligned in memory, this can not happen. If you write something longer than 8 bytes, say a hash entry, then care is needed.

Zach Wegner · Post by **Zach Wegner** » Mon May 26, 2008 7:42 pm

I have it even worse.

Instead of the possibility of idle threads being picked up in that little frame of time, I have several different ways splits can fail:

1) The idle processor looks for active split points. If it tries to attach to it, it can already be gone.
2) The idle processor then looks for active processors, and sends a message to them. The active processor could already be idle.
3) The active processor copies its search state into shared memory. When the idle processor reads it and selects a split point, the active processor could have left the node already, or even have become idle.
4) An active processor could have copied its search state into shared memory after the idle processor had already chosen a split point.
5) Once the idle processor sees that a split point has been created and tries to attach to it, the split point could already be gone, usually because of a fail high.

That's just the procedure for splitting. And you wonder why I'm crazy...

bob · Post by **bob** » Tue May 27, 2008 1:05 am

There are a ton of potential race conditions, and every last one has to be handled. I've told the story before, but for those that missed it, in 1984 I did a brand new version of my parallel search for Cray Blitz as we were getting to use our first 4-cpu Cray at the ACM event late that year. Cray gave us time to play in a labor-day human tournament in Mobile, so off we went with the new code that was a new approach in how we split.

In the second round, against a future GM player (not GM at the time) we were playing along solidly and doing OK and suddenly uncorked a PV that said "Bxc7#". I looked at it and was thinking "hmm, the bishop is undefended, so how is that mate. Opponent promptly took it and a long struggle ensued. Don't remember how the game ended, but when I looked at it that night, what I found was that after BxP there were exactly 3 legal moves. Due to a bizarre timing bug, the three legal moves were searched prior to the 4th CPU getting started and when it did its search it said "I am checkmated" and backed that score up on top of the scores from the other moves.

Was an easy to fix problem, but goes to show that there are lots of things to consider in a parallel search that is filled with potential race conditions... I have had plenty of others that were far harder to detect and fix...

Zach Wegner · Post by **Zach Wegner** » Tue May 27, 2008 10:01 pm

That's just the tip of the iceberg. Some of my other favorites:

A processor tries to unsplit a split point because it is searching the only move left, but before it starts, another processor has joined a split point below it. This requires locking the whole "split point stack" whenever a processor detaches, unsplits, or attaches.

One processor is searching a move at a split point, and another processor comes and finishes off the remaining moves rather quickly and detaches. The first processor unsplits, but it's own internal move list still has the moves left. It doesn't try to look for the move list in the split point, obviously, so it would just research all the moves again. That went undetected for a while.

Only updating alpha and beta at a split point, and forgetting about the best_score. Led to "mate in 0" scores.

In my previous design, when a processor would get a beta cutoff at a split point, it would send messages to the other processors at that split point telling them to stop, and then wait for them. This led to so many bizarre wait conditions that I had to put beta cutoffs in the same function as just updating alpha at a split point, which is asynchronous and doesn't wait for the other processors. This function itself is pretty complicated, because updating alpha at a split point can cause beta cutoffs at split points below it, so a processor needs to detach from each, and my head is hurting just thinking about it.

Plain and simple, DTS is a nightmare. If you look at my source (specifically: http://zct.cvs.sourceforge.net/zct/zct/smp2.c), I try to put comments whenever I run into subtle bugs like this. I hope it will encourage more people to try it out, and maybe succeed. Just looking at it now though, it does need some cleanup...

bob · Post by **bob** » Wed May 28, 2008 2:03 am

Zach Wegner wrote:That's just the tip of the iceberg. Some of my other favorites:

A processor tries to unsplit a split point because it is searching the only move left, but before it starts, another processor has joined a split point below it. This requires locking the whole "split point stack" whenever a processor detaches, unsplits, or attaches.

I don't "unsplit" until I am actually finished, to avoid this problem. That means I might have several active split points where only one move is left, it is being searched, but I just allocate enough split blocks so that any sub-tree can have a split block for each ply in a worst-case scenario.

One processor is searching a move at a split point, and another processor comes and finishes off the remaining moves rather quickly and detaches. The first processor unsplits, but it's own internal move list still has the moves left. It doesn't try to look for the move list in the split point, obviously, so it would just research all the moves again. That went undetected for a while.

Don't have this issue either. All processors at a split point use the shared move list kept in the _parent_ split block. They all use the lock in that parent block as well to make sure two don't grab the same move to search...

Only updating alpha and beta at a split point, and forgetting about the best_score. Led to "mate in 0" scores.

In my previous design, when a processor would get a beta cutoff at a split point, it would send messages to the other processors at that split point telling them to stop, and then wait for them. This led to so many bizarre wait conditions that I had to put beta cutoffs in the same function as just updating alpha at a split point, which is asynchronous and doesn't wait for the other processors. This function itself is pretty complicated, because updating alpha at a split point can cause beta cutoffs at split points below it, so a processor needs to detach from each, and my head is hurting just thinking about it.

Plain and simple, DTS is a nightmare. If you look at my source (specifically: http://zct.cvs.sourceforge.net/zct/zct/smp2.c), I try to put comments whenever I run into subtle bugs like this. I hope it will encourage more people to try it out, and maybe succeed. Just looking at it now though, it does need some cleanup...

I had lots of those kinds of comments in Cray Blitz as well.

Zach Wegner · Post by **Zach Wegner** » Wed May 28, 2008 5:28 am

bob wrote:I don't "unsplit" until I am actually finished, to avoid this problem. That means I might have several active split points where only one move is left, it is being searched, but I just allocate enough split blocks so that any sub-tree can have a split block for each ply in a worst-case scenario.

Well "unsplitting" anywhere else in a recursive framework would be pretty silly. It would be possible for me to keep the old split points around, but my data structures have processors recursively attach to split points. So all the processors below a split point are visible directly from that split point. It comes out simpler IMO because I don't need a recursive stop routine, I just send messages to each child of a processor.

Don't have this issue either. All processors at a split point use the shared move list kept in the _parent_ split block. They all use the lock in that parent block as well to make sure two don't grab the same move to search...

In my design the processors share a move list... when they are in a split point. The issue is that a processor didn't copy the (empty) move list into local memory on an unsplit, so it would return to that ply and find the move list as it was before the split.

The real issue is the iterative search. Where you "unsplit", you simply back up a value to the ply above the split point. You never try to even access the move list of that ply again.

Having a recursive search must be nice.

bob · Post by **bob** » Wed May 28, 2008 6:59 pm

Zach Wegner wrote:
bob wrote:I don't "unsplit" until I am actually finished, to avoid this problem. That means I might have several active split points where only one move is left, it is being searched, but I just allocate enough split blocks so that any sub-tree can have a split block for each ply in a worst-case scenario.
Well "unsplitting" anywhere else in a recursive framework would be pretty silly. It would be possible for me to keep the old split points around, but my data structures have processors recursively attach to split points. So all the processors below a split point are visible directly from that split point. It comes out simpler IMO because I don't need a recursive stop routine, I just send messages to each child of a processor.

Don't have this issue either. All processors at a split point use the shared move list kept in the _parent_ split block. They all use the lock in that parent block as well to make sure two don't grab the same move to search...
In my design the processors share a move list... when they are in a split point. The issue is that a processor didn't copy the (empty) move list into local memory on an unsplit, so it would return to that ply and find the move list as it was before the split.

I could not do that anyway. I don't generate all the moves at once. Whether at a split point or not, I search the first move I can get (which might be a hash move or a good capture, or one of two killers) before I have even generated all moves. I use the same approach of hash move (no movegen), then captures only, then killers (no movgen) then rest of move. So copying the move list would not work very well. But I did this in Cray Blitz as well, where all child processes (same parent position) used the parent split block to access the move list, and used their private data to search stuff below the split point.

The real issue is the iterative search. Where you "unsplit", you simply back up a value to the ply above the split point. You never try to even access the move list of that ply again.

That's more of a design issue rather than an iterative search issue. In CB each process, as it finishes, makes a decision on whether its data should be backed up or not, once it discovers "there is nothing left for me to do, although others might still be busy finishing up here..."

Having a recursive search must be nice.

jswaff · Post by **jswaff** » Thu May 29, 2008 8:12 pm

I actually have both. I already had a recursive search. I have a preprocessor DEBUG_ITERATIVE_SEARCH that, when activated, does both a recursive search and then an iterative search from the root and compares the node counts, pvs, etc, to make sure they are traversing the same tree. It has helped immensely in the development of the iterative search, which was not trivial at all, especially when the null search, pv search, and lmr was included.

--
James

Zach Wegner · Post by **Zach Wegner** » Thu May 29, 2008 9:52 pm

The iterative search isn't the tricky part, it's the iterative search combined with parallel search. Maybe that's just because I've used iterative exclusively for a few years now and am used to it.

hcyrano · Post by **hcyrano** » Fri May 30, 2008 12:15 pm

Tord,

have you statistics (nodes, time, ... ) on Glaurung's code with 1, 2, 4, 8... cores?

thx

Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP

Re: Glaurung 2 and SMP