Actual speedups from YBWC and ABDADA on 8+ core machines?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan »

Actually I currently clear the hash table between all searches anyway (even during games) since it makes debugging more straightforward.

I think a case could be made for testing SMP speedups based on time-to-solution for chess problems.

The scenario you might want to optimize for is quickly finding the best (but unintuitive) move, if one exists.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

TomKerrigan wrote:
bob wrote: ...
(1) very lightweight splitting code. When a processor goes idle, any active thread will try to split and give it something to do. In fact, this is pretty useful as with 20 cores, when a thread goes idle, it will likely see 6-8 potential split points at a minimum, and it gets to choose which one to accept. All a splitting thread does is grab a new split block, link it to the parent, copy the necessary data, and then continue searching.
...
I find this interesting.

In my code, every time the search function finishes searching a move (that didn't fail high), it checks to see if there are idle threads to help it search the rest of the moves. If so, it creates a split data structure and assigns the idle threads to it. In other words, the idle threads just work on whatever becomes available first. Maybe this isn't the best.

Can you explain a bit about how the idle threads choose which split node to work on?
Sure. Again, this is new and completely unoptimized, but I compute a simple "goodness" number for each available split point. The basic formula is 2 * depth - moves_already_searched, where depth is the remaining depth, not the ply. The idea is that favoring the split points with the deepest remaining search is better, since the split overhead is almost nothing compared to the cost of the search; when depths are close, I favor the split point with the fewest moves searched so far.

I made a note to refine this a bit: if the side to move is in check, there are usually very few moves, so the formula probably favors joining split blocks where the side to move is in check, since the moves-searched count will always be low there.

Again, caveat emptor: 2 * D - MS is just a first implementation. Probably not the very first, in which I simply chose the split point with the most remaining depth; this more DTS-ish version came after I started debugging. I don't really know how many moves there are at a ply, since I do staged move generation, and until I finally reach the point where I have to generate the rest of the moves, I have no idea about any sort of percentage searched. Cray Blitz used a vectorized move generator, which made it faster to generate everything at once and gave us a somewhat better "remaining effort" estimate. But what I am using is actually working pretty well. I'd be amazed if everything in the current version is optimal, but if it is not, the speedup can go even higher, which would be pretty remarkable...

In your program are you doing the usual recent search tricks like late move reductions, late move pruning, futility pruning? Those also tend to help the branching factor and the parallel search performance.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by TomKerrigan »

bob wrote: ...
In your program are you doing the usual recent search tricks like late move reductions, late move pruning, futility pruning? Those also tend to help the branching factor and the parallel search performance.
No, I am only vaguely aware of any of that stuff. All of my time has been spent adding features to my iOS app, and not on engine strength.

I'm surprised that reducing the branching factor would increase parallel efficiency. I would think the more branches to search at the split nodes, the better the parallel efficiency would be.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Actual speedups from YBWC and ABDADA on 8+ core machines

Post by bob »

TomKerrigan wrote:
bob wrote: ...
In your program are you doing the usual recent search tricks like late move reductions, late move pruning, futility pruning? Those also tend to help the branching factor and the parallel search performance.
No, I am only vaguely aware of any of that stuff. All of my time has been spent adding features to my iOS app, and not on engine strength.

I'm surprised that reducing the branching factor would increase parallel efficiency. I would think the more branches to search at the split nodes, the better the parallel efficiency would be.
There's another side to the issue. Given the way LMR works, the deeper you go, the more stable the PV becomes. As moves closer to the root stabilize, you get perfect (or nearly perfect) move ordering in at least the first half of the plies. In the positions I sent you, you probably noticed that the shallowest search was 24 plies and the deepest 34 plies. That seems to help significantly.