SMP and questions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

SMP and questions

Post by Edsel Apostol »

Over the past two weeks I've been working on SMP, which I promised Sam I'd implement years ago when we started the Hannibal project.
Since we work on the project on an on/off basis and the main focus before was to improve strength, we were often sidetracked from implementing SMP.
From the last release until 3 to 4 weeks ago we hadn't touched the Hannibal project, due to family/kids, travel/work, relocation, etc.

Lately I've finally mustered the motivation to work on it, which is long overdue; it has been on my ToDo list for years (along with Chess960 support).
I just recently got a working YBWC, which would have taken me months to bring from design to working code without open source engines like Viper as a reference (thanks, Tord, for that didactic open source engine).
I did have a bit of a hard time making it work (chasing bugs, crashes, deadlocks, etc.), but it was fun.

I implemented it this way in Hannibal:
1. Main hash is shared without locking (legality of the hash move is checked)
2. Pawn hash is per thread (2 MB per thread)
3. Eval cache is per thread (1 MB per thread)
4. Killer moves are per thread (copied from master to slave at the split point, from root to current ply)
5. Eval values per node, used in futility pruning, are per thread (copied from master to slave from root to current ply)
6. History tables are shared
7. Always split after having a score for the first move
8. Split at the root
9. Node count is shared (declared volatile)

NPS speed-up at the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively, on an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that has only just started working?
Any suggestions on where the potential bottleneck is? What is the speed-up in your engine at the initial position?

I know this has been discussed before, but I wanted to know whether there are new trends or whether things have changed. Questions pertaining to the corresponding items above:

#1 Do you discard the whole result from the hash table if the hash move didn't pass legality checking? Why or why not?
#2 Which would be faster, shared or per thread?
#3 Same question as #2.
#4 Should this be per split point instead of per thread? Why or why not?
#5 Same question as #4.
#7 What are the possible improvements for this? Do you have separate criteria for Cut/All nodes, move generation phase (hash move, killers, good captures, etc.), etc.?
#8 Is this good or bad? I think Vincent D. mentioned somewhere in the old threads that it is better to spend effort on the child nodes of the second root move than to search the other root moves in parallel, but I have read from the SF team that splitting at the root performs better in SF, and all the root moves will be searched anyway.
#9 Should the node count be kept per thread and just summed, or is a shared counter sufficient?

Other questions:

What other tricks are used to minimize the idle time spent by threads? I have read about "active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.

I think the idle time spent by threads can be minimized by lowering the minimum split depth - the lower, the better. I guess this will boost NPS, but I'm not sure it is efficient scalability-wise; I haven't tested it yet, but will soon. Any explanation of why there is a minimum split depth?

Another idea would be to keep track of the depths at which there are split points, and when the maximum split-point depth at any given time is not greater than the minimum split depth, allow splitting below the minimum split depth.

For example, suppose all threads are searching a split point at depth = minimum split depth and there are no other split points at higher depths.
Let's say we have 30 moves and 4 threads, and the threads are now searching the last 4 moves.
When a thread returns there will be no more moves to search, so it will go back into the idle loop.
What if in this case we allowed splitting below the minimum split depth, to help the other threads?
Has anybody tried this?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and questions

Post by Edsel Apostol »

Edsel Apostol wrote:Any explanation of why there is a minimum split depth?
I'll answer my own question. Thinking more about it, it is simply the cost of doing the split: one has to balance the number of splits against splitting at higher depths. The optimal minimum split depth is the one that gives the fastest speed-up; depths higher or lower than the optimum will be slower. If the cost of splitting is minimal, one can use a lower minimum split depth.
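In pseudo-C++ the whole mechanism boils down to a guard like the following sketch; MIN_SPLIT_DEPTH, idleThreadAvailable and splitPoint are illustrative names, not Hannibal's actual code:

```cpp
bool idleThreadAvailable();                      // stub: is any helper free?
void splitPoint(int alpha, int beta, int depth); // stub: set up a split point

const int MIN_SPLIT_DEPTH = 4;  // tune: the value giving the fastest speed-up

void maybeSplit(int alpha, int beta, int depth, int movesSearched)
{
    if (depth >= MIN_SPLIT_DEPTH   // subtree big enough to amortize the split cost
        && movesSearched >= 1      // YBW: the eldest brother already has a score
        && idleThreadAvailable())  // and someone is actually free to help
        splitPoint(alpha, beta, depth);
}
```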
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: SMP and questions

Post by Houdini »

1) An illegal move from the hash should be very rare; it doesn't really matter what you do.
2) 3) 6) Per thread is best. The less writable memory you share between threads, the better.
4) Killers are very volatile; it doesn't really matter what you do.
7) Don't split at good captures or killers - the probability of a cut-off is too high.
8) Houdini doesn't split at the root, mostly for practical reasons (for example, time control).
9) Have a counter per thread. Same reason as 2) 3) 6).

Robert
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP and questions

Post by diep »

Edsel Apostol wrote:Over the past two weeks I've been working on SMP, which I promised Sam I'd implement years ago when we started the Hannibal project.
Since we work on the project on an on/off basis and the main focus before was to improve strength, we were often sidetracked from implementing SMP.
From the last release until 3 to 4 weeks ago we hadn't touched the Hannibal project, due to family/kids, travel/work, relocation, etc.

Lately I've finally mustered the motivation to work on it, which is long overdue; it has been on my ToDo list for years (along with Chess960 support).
I just recently got a working YBWC, which would have taken me months to bring from design to working code without open source engines like Viper as a reference (thanks, Tord, for that didactic open source engine).
I did have a bit of a hard time making it work (chasing bugs, crashes, deadlocks, etc.), but it was fun.

I implemented it this way in Hannibal:
1. Main hash is shared without locking (legality of the hash move is checked)
2. Pawn hash is per thread (2 MB per thread)
3. Eval cache is per thread (1 MB per thread)
4. Killer moves are per thread (copied from master to slave at the split point, from root to current ply)
5. Eval values per node, used in futility pruning, are per thread (copied from master to slave from root to current ply)
6. History tables are shared
7. Always split after having a score for the first move
8. Split at the root
9. Node count is shared (declared volatile)

NPS speed-up at the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively, on an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that has only just started working?
Any suggestions on where the potential bottleneck is? What is the speed-up in your engine at the initial position?

I know this has been discussed before, but I wanted to know whether there are new trends or whether things have changed. Questions pertaining to the corresponding items above:

#1 Do you discard the whole result from the hash table if the hash move didn't pass legality checking? Why or why not?
I do not check the hash move for legality. I do have a loop that checks whether the move is in the move list, so an illegal move simply isn't found there and never gets the score that would make it be tried first. In that case the illegally stored best move is just ignored.

The check I'm doing is the one described by Robert Hyatt in an ICCA article. If I remember correctly, it was Tim Mann who came up with the idea of just doing an XOR.

So you want to know whether the position is correct.

In short, you can only get an illegal hash move in case of a collision.
Using 64 bits you should get a collision on average once every 200 billion nodes or so - a bit more often if the search space you cover is larger, and also a bit more if you use fewer bits. My advice is to use at least 64 bits. Internally Diep uses 128 bits; I'd have to check how many of those get used effectively - right now it will be between 70 and 80 bits or so.

Just ignore that move if it is not legal.
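For reference, the XOR trick in its minimal form - a sketch of the Hyatt/Mann idea, with an illustrative entry layout rather than Diep's actual one:

```cpp
#include <cstdint>

struct TTEntry {
    uint64_t key;   // stored as (zobrist ^ data)
    uint64_t data;  // packed move/score/depth/bound
};

void store(TTEntry& e, uint64_t zobrist, uint64_t data)
{
    // If another thread tears this pair of writes, the probe
    // check below simply fails and the entry gets ignored.
    e.data = data;
    e.key  = zobrist ^ data;
}

bool probe(const TTEntry& e, uint64_t zobrist, uint64_t& data)
{
    data = e.data;
    return (e.key ^ data) == zobrist;  // only a matching, untorn entry passes
}
```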
#2 Which would be faster, shared or per thread?
#3 Same question as #2.
Both depend on the architecture you run on. In Diep I've changed the evaluation cache and pawn hash from shared to per process and vice versa several times. I'll probably change them back to shared on each node soon, as the huge SSI machines no longer exist :)
I'm not sure anymore what I already share and what not - it might be that I share them by now. I'd need to look it up; good idea to check :)

So I vote shared for today's hardware.
#4 Should this be per split point instead of per thread? Why or why not?
Killer moves only work locally for Diep, so global killers hardly gain anything. Stuff like this I do 100% locally, not shared. YMMV here.
#5 Same question as #4.
I'm not using futility pruning in Diep, so I can't answer this one. Note that the maximum evaluation difference in Diep can easily be 45 pawns versus the material on the board, so keeping per-node evaluation differences around for futility pruning is not interesting. My last experiment there was around the end of the 90s :)

It was Bart Weststrate who pointed me to this trick back then - long before it was public knowledge...

On your number 6: the history table doesn't work for Diep. I know it works for a lot of engines, yet let's be realistic: if you have 40 cores or so, or 64 cores in the case of Diep on a cluster here, how accurate is such a table?

In Fruit we see that the table is basically a reflection of the piece-square tables.

I tend to believe the history table only works for engines with a simple evaluation function, where the positional values from the piece-square tables play the major positional role.
#7 What are the possible improvements for this? Do you have separate criteria for Cut/All nodes, move generation phase (hash move, killers, good captures, etc.), etc.?
Always splitting after having a score for the first move - basically, that's what we call YBW.

It's the only parallel search algorithm so far that manages to keep a branching factor similar to the one you get when running on a single core.

Every other parallel search method used so far failed to keep the same branching factor, with a horrible speedup out of n cores as a result, once n is a tad larger than a handful.
#8 Is this good or bad? I think Vincent D. mentioned somewhere in the old threads that it is better to spend effort on the child nodes of the second root move than to search the other root moves in parallel, but I have read from the SF team that splitting at the root performs better in SF, and all the root moves will be searched anyway.
The root is a special case.

We can mathematically prove that splitting at the root is very dangerous, provided you have a bunch of cores.

We tend to order moves in computer chess. If you order moves well, the odds are higher that the 2nd move gives a cutoff than that the 30th move does.

In fact, Crafty some years ago used to split only after having searched 2 moves. So not 1 move like YBW; Crafty waited for 2 moves first.

That should give you some indication about the root as well.

Suppose you have 16 cores. The first move goes down in score a tad.
1 core searches move 2 and the other 15 cores search the other moves.

Your second move happens to be the best move, of course. Then the clock on the wall makes a noise. Timeout.

Your engine plays the move - the bad move that just went down in score a tad.

Maybe you would've had a fail high for move 2 if you had focused all attention there?

With the huge 30-ply searches of today there is plenty of room to split deeper in the tree for 16 cores and divide that tree...

So from the reasoning used at the root you can see that parallelizing a search is in itself not ideal from a search viewpoint, as your branching factor WILL lose something compared to running on a single core. Yet we have to split *somewhere*, so we make a choice, and we do not include the root in it, as that can have a direct impact on the time at which we get a fail high.

#9 Should the node count be kept per thread and just summed, or is a shared counter sufficient?
I do everything locally. Only just before printing statistics do I collect them all in Diep. Note that I do this without locking.
Other questions:

What other tricks are used to minimize the idle time spent by threads?
Very good question.
I have read about "active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.

I think the idle time spent by threads can be minimized by lowering the minimum split depth - the lower, the better. I guess this will boost NPS, but I'm not sure it is efficient scalability-wise; I haven't tested it yet, but will soon. Any explanation of why there is a minimum split depth?
They invent fantastic new things to solve a design flaw in the program.

In Diep I have no 'stack owner', so when a CPU is finished searching a position that has been split, it just aborts itself (doing an AbortMe()) and goes back to the idle list, from where other CPUs can pick it up if they want to add CPUs at a split point.

Very simple. Very effective. The best working concept to date.
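In a sketch (names illustrative, and with a plain mutex where the real code may use something cheaper):

```cpp
#include <mutex>
#include <vector>

// Global idle list: a CPU with nothing left to do parks itself here,
// and any master that wants to split grabs helpers from it.
std::mutex idleLock;
std::vector<int> idleCpus;

void AbortMe(int myCpu)           // my subtree is done; no "stack owner" to wait for
{
    std::lock_guard<std::mutex> g(idleLock);
    idleCpus.push_back(myCpu);
}

int grabHelper()                  // called by a master wanting to split
{
    std::lock_guard<std::mutex> g(idleLock);
    if (idleCpus.empty()) return -1;
    int cpu = idleCpus.back();
    idleCpus.pop_back();
    return cpu;
}
```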
Another idea would be to keep track of the depths at which there are split points, and when the maximum split-point depth at any given time is not greater than the minimum split depth, allow splitting below the minimum split depth.

For example, suppose all threads are searching a split point at depth = minimum split depth and there are no other split points at higher depths.
Let's say we have 30 moves and 4 threads, and the threads are now searching the last 4 moves.
When a thread returns there will be no more moves to search, so it will go back into the idle loop.
What if in this case we allowed splitting below the minimum split depth, to help the other threads?
Has anybody tried this?
Don't make it more complicated than it is. Inventing whole new lemmas, theories and whatnot to solve a design flaw is not a good idea. Just don't design your SMP search with the flaw in the first place :)
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP and questions

Post by diep »

Edsel Apostol wrote:
Edsel Apostol wrote:Any explanation of why there is a minimum split depth?
I'll answer my own question. Thinking more about it, it is simply the cost of doing the split: one has to balance the number of splits against splitting at higher depths. The optimal minimum split depth is the one that gives the fastest speed-up; depths higher or lower than the optimum will be slower. If the cost of splitting is minimal, one can use a lower minimum split depth.
Keeping down the cost of a split is VERY IMPORTANT.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SMP and questions

Post by mcostalba »

Edsel Apostol wrote: What other tricks are used to minimize the idle time spent by threads? I have read about "active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.
The idea is very simple. The implementation, as usual with SMP, is simple only after you've made it work :-).

Once a slave has finished searching, instead of returning to the idle loop and waiting to be picked up by a new master, it looks for an active split point, registers itself as one of that split point's slaves, and starts searching on it. The idea is that the average idle time is greatly reduced.
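In sketch form, the reparenting step looks something like this; SplitPoint, tryRegisterSlave and the rest are illustrative names, not Stockfish's actual code:

```cpp
#include <vector>

struct SplitPoint {
    bool active;                 // master still has moves to hand out
    int  slaveCount, maxSlaves;
    bool tryRegisterSlave();     // stub: lock- or CAS-protected join
};

std::vector<SplitPoint*> activeSplitPoints;  // published by the masters

void searchFromSplitPoint(SplitPoint* sp);   // stub: pick up moves there
void idleLoop();                             // stub: classic "wait to be picked up"

void onSlaveFinished()
{
    // Instead of going idle, try to attach to a still-active split point.
    for (SplitPoint* sp : activeSplitPoints)
        if (sp->active && sp->slaveCount < sp->maxSlaves && sp->tryRegisterSlave())
        {
            searchFromSplitPoint(sp);        // start helping immediately
            return;
        }
    idleLoop();                              // nothing to join after all
}
```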

Unfortunately the patch failed to score an increase and so has been retired:

https://github.com/mcostalba/Stockfish/ ... 95083a3ab9

My guess is that the idle time is already small on average anyhow, so the patch does not change things a lot. One possibility that I have not tested is to enable active reparenting together with an increased minimum split depth: increasing the split depth reduces the SMP overhead but at the same time increases the average slave's idle time, so at first glance it could make sense to increase the first and enable the second.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: SMP and questions

Post by zamar »

diep wrote: Suppose you have 16 cores. The first move goes down in score a tad.
1 core searches move 2 and the other 15 cores search the other moves.

Your second move happens to be the best move, of course. Then the clock on the wall makes a noise. Timeout.

Your engine plays the move - the bad move that just went down in score a tad.

Maybe you would've had a fail high for move 2 if you had focused all attention there?

With the huge 30-ply searches of today there is plenty of room to split deeper in the tree for 16 cores and divide that tree...
I'll explain shortly why this is usually not a problem for SF:

1) SF is a modern beancounter (I love this descriptive term!). It will spend most of its time searching the first move and can usually quickly refute the others.

2) Once SF has finished searching the first move, it will try very hard to complete the iteration (it's prepared to use extra time to do this).

3) When splitting at the root and the second move is taking a long time, the other threads can usually quickly refute the remaining moves and then start helping to resolve the problematic move. Of course, if we have two moves which both take a long time to complete we have a problem, but my instinct says this is extremely rare.
Joona Kiiski
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP and questions

Post by diep »

zamar wrote:
diep wrote: Suppose you have 16 cores. The first move goes down in score a tad.
1 core searches move 2 and the other 15 cores search the other moves.

Your second move happens to be the best move, of course. Then the clock on the wall makes a noise. Timeout.

Your engine plays the move - the bad move that just went down in score a tad.

Maybe you would've had a fail high for move 2 if you had focused all attention there?

With the huge 30-ply searches of today there is plenty of room to split deeper in the tree for 16 cores and divide that tree...
I'll explain shortly why this is usually not a problem for SF:

1) SF is a modern beancounter (I love this descriptive term!). It will spend most of its time searching the first move and can usually quickly refute the others.

2) Once SF has finished searching the first move, it will try very hard to complete the iteration (it's prepared to use extra time to do this).

3) When splitting at the root and the second move is taking a long time, the other threads can usually quickly refute the remaining moves and then start helping to resolve the problematic move. Of course, if we have two moves which both take a long time to complete we have a problem, but my instinct says this is extremely rare.
If the score drops a tad, it's very usual that a bunch of other moves take a long time to search and cannot be refuted quickly.

So your instincts are a tad off here - and in the case of SF you've got more than enough search depth to find split points for the idle CPUs :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and questions

Post by bob »

Edsel Apostol wrote:Over the past two weeks I've been working on SMP, which I promised Sam I'd implement years ago when we started the Hannibal project.
Since we work on the project on an on/off basis and the main focus before was to improve strength, we were often sidetracked from implementing SMP.
From the last release until 3 to 4 weeks ago we hadn't touched the Hannibal project, due to family/kids, travel/work, relocation, etc.

Lately I've finally mustered the motivation to work on it, which is long overdue; it has been on my ToDo list for years (along with Chess960 support).
I just recently got a working YBWC, which would have taken me months to bring from design to working code without open source engines like Viper as a reference (thanks, Tord, for that didactic open source engine).
I did have a bit of a hard time making it work (chasing bugs, crashes, deadlocks, etc.), but it was fun.

I implemented it this way in Hannibal:
1. Main hash is shared without locking (legality of the hash move is checked)
2. Pawn hash is per thread (2 MB per thread)
3. Eval cache is per thread (1 MB per thread)
4. Killer moves are per thread (copied from master to slave at the split point, from root to current ply)
5. Eval values per node, used in futility pruning, are per thread (copied from master to slave from root to current ply)
6. History tables are shared
7. Always split after having a score for the first move
8. Split at the root
9. Node count is shared (declared volatile)
Don't do #9. Create an array of node counters, one per thread, and make certain that two consecutive threads have their node counters in different 64-byte blocks of memory. Otherwise you will generate an unbelievable amount of cache-to-cache coherency traffic, since every thread modifies its node counter once per node searched.
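In C++ that amounts to something like this sketch (names illustrative):

```cpp
#include <cstdint>

constexpr int MAX_THREADS = 64;   // illustrative limit

struct alignas(64) PaddedCounter  // alignas(64): one full cache line per counter
{
    uint64_t nodes = 0;
};

PaddedCounter nodeCount[MAX_THREADS];

// Each thread, once per node searched (a plain increment, nothing shared):
//     nodeCount[myThreadId].nodes++;

// Only when reporting, sum the counters (no lock needed for a statistic):
uint64_t totalNodes(int numThreads)
{
    uint64_t sum = 0;
    for (int i = 0; i < numThreads; i++)
        sum += nodeCount[i].nodes;
    return sum;
}
```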

NPS speed-up at the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively, on an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that has only just started working?
Any suggestions on where the potential bottleneck is? What is the speed-up in your engine at the initial position?
I'd bet your node counter will help there a lot.

Here is another pointer: look at all your shared data. Does it REALLY need to be shared? Shared data generates a lot of cache-to-cache traffic, and if you have data for several threads stored in the same 64-byte cache line, you get killed...


Here's mine, on a dual quad-core (5-year-old box):

1cpu:
time=30.04 mat=0 n=63770613 fh=91% nps=2.1M
4cpu:
time=30.23 mat=0 n=230170703 fh=91% nps=7.6M
8cpu:
time=30.29 mat=0 n=427123141 fh=91% nps=14.1M

I know this has been discussed before, but I wanted to know whether there are new trends or whether things have changed. Questions pertaining to the corresponding items above:

#1 Do you discard the whole result from the hash table if the hash move didn't pass legality checking? Why or why not?
I use lockless hashing to avoid problems, but if the move is illegal, I do not trust the score either.
#2 Which would be faster, shared or per thread?
I don't see why you would not make it shared. The data can be useful to any thread.
#3 Same question as #2.
#4 Should this be per split point instead of per thread? Why or why not?
I doubt it matters at all.
#5 Same question as #4.
#7 What are the possible improvements for this? Do you have separate criteria for Cut/All nodes, move generation phase (hash move, killers, good captures, etc.), etc.?
Your idea is classic young brothers wait. There's not much better unless you want to implement the full DTS algorithm I wrote about years ago...


#8 Is this good or bad? I think Vincent D. mentioned somewhere in the old threads that it is better to spend effort on the child nodes of the second root move than to search the other root moves in parallel, but I have read from the SF team that splitting at the root performs better in SF, and all the root moves will be searched anyway.
#9 Should the node count be kept per thread and just summed, or is a shared counter sufficient?
There is an "in-between" approach that I use. I always split at the root after I get the first score back, unless I have reason to suspect that the second move in the move list might well replace the current best move. How can I tell? By looking at the individual node counts for each root move at the end of an iteration. A move that is threatening to replace the current best will have a much higher node count than moves that are easily dismissed as inferior.

In Crafty, after I finish an iteration, I look at the node counts, since I sort the root move list after each iteration based on this counter (but keep the PV move first regardless of its counter). If the first three moves after the best move have unusually large node counts, I flag them as "search separately": each gets searched on its own, before starting a parallel search at the root, so that I can get all processors busy on it and quickly determine whether it is better, before I run out of time...
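A sketch of that ordering step; RootMove and its fields are illustrative names, not Crafty's actual code:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RootMove {
    int      move;
    uint64_t nodes;  // subtree size this move consumed in the last iteration
};

void sortRootMoves(std::vector<RootMove>& rm)
{
    // rm[0] is the PV move and stays first regardless of its counter;
    // the rest sort by node count: a big subtree means "hard to refute".
    std::sort(rm.begin() + 1, rm.end(),
              [](const RootMove& a, const RootMove& b) { return a.nodes > b.nodes; });
}
```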

Other questions:

What other tricks are used to minimize the idle time spent by threads? I have read about "active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.
I don't think that with a YBW implementation you are going to see a lot of "idle time", so this is pretty much irrelevant.

I think the idle time spent by threads can be minimized by lowering the minimum split depth - the lower, the better. I guess this will boost NPS, but I'm not sure it is efficient scalability-wise; I haven't tested it yet, but will soon. Any explanation of why there is a minimum split depth?
It takes time and effort. The closer to the tips you split, the more splits you do, and the more overhead you have to tolerate as a result.

Another idea would be to keep track of the depths at which there are split points, and when the maximum split-point depth at any given time is not greater than the minimum split depth, allow splitting below the minimum split depth.

For example, suppose all threads are searching a split point at depth = minimum split depth and there are no other split points at higher depths.
Let's say we have 30 moves and 4 threads, and the threads are now searching the last 4 moves.
When a thread returns there will be no more moves to search, so it will go back into the idle loop.
What if in this case we allowed splitting below the minimum split depth, to help the other threads?
Has anybody tried this?
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and questions

Post by syzygy »

Edsel Apostol wrote:NPS speed-up at the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively, on an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that has only just started working?
Any suggestions on where the potential bottleneck is? What is the speed-up in your engine at the initial position?
I just did a 75-80 second run on the initial position. The NPS speed-up I get with 6 cores is 5.84 compared to my SMP engine running 1 thread, and 5.81 compared to my engine with SMP not compiled in.
#9 Should the node count be kept per thread and just summed, or is a shared counter sufficient?
It is surely much better to do this per thread, especially if you also want the node count to be accurate. With a single counter you would have to use an atomic increment (either through inline assembly or a compiler builtin), which you can avoid at no cost with a counter per thread. (Atomic instructions aren't terribly expensive on modern CPUs when uncontested, but a single node counter would be one of the most contested memory locations - or rather, cache lines - in your engine.)
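The two variants side by side, as a sketch (names illustrative):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> sharedNodes{0};   // single counter: one hot, contested cache line

void countNodeShared()    { sharedNodes.fetch_add(1, std::memory_order_relaxed); }

thread_local uint64_t myNodes = 0;      // per thread: a plain, uncontended increment

void countNodePerThread() { myNodes++; }  // summed only when statistics are printed
```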
What other tricks are used to minimize the idle time spent by threads? I have read about "active reparenting" from the SF team, but I have not yet studied how this is implemented in SF, or whether it was implemented at all.
As far as I understand, SF and many other engines let a thread that is searching a node check for idle threads when it is ready to split. So an idle thread has to wait until it is picked up by an active thread. This way you have idle time, and you have no control over how close to the root the idle thread joins the search (except by increasing the minimum split depth).

If I'm not mistaken, the active reparenting idea is to let a thread that has become idle look at the nodes at which the active threads have joined the search, and presumably join the one that is closest to the root. I guess this should in theory help somewhat if you have a large number of threads. With just two threads it does nothing, because there are no "active split points" when one thread is idle (so only one thread is active).

In my engine, threads do not try to pick up idle threads. Instead, when a node is ready to be split (after captures and killers have been searched), the searching thread sets a flag for that node to mark it as a potential split point (and copies alpha, beta and depth to a place where other threads can find them). When a thread becomes idle, it basically loops over all other threads to find the potential split point with the highest depth. (If a thread is idle because it must wait for slave threads to finish, it only looks at potential split points in the subtree below the node at which it is waiting.) The flags are bits in one 64-bit word per thread (for plies 0-63, although I don't currently split at the root / ply 0), so looking for a split point is just a matter of taking the LSB of this word for each thread and looking up and comparing the corresponding depths.
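A sketch of that scan; the names are illustrative, and synchronization around the published data is omitted:

```cpp
#include <cstdint>

constexpr int MAX_THREADS = 8;

struct ThreadInfo {
    uint64_t splitFlags;      // bit p set = potential split point at ply p
    int      depthAtPly[64];  // remaining depth of the node at each ply
};

ThreadInfo threads[MAX_THREADS];

// Returns the thread owning the deepest candidate (or -1), and its ply.
int findBestSplitPoint(int numThreads, int& bestPly)
{
    int best = -1, bestDepth = -1;
    for (int t = 0; t < numThreads; t++) {
        uint64_t f = threads[t].splitFlags;
        if (!f) continue;
        int ply = __builtin_ctzll(f);  // lowest set bit = nearest the root (GCC/Clang)
        if (threads[t].depthAtPly[ply] > bestDepth) {
            bestDepth = threads[t].depthAtPly[ply];
            bestPly   = ply;
            best      = t;
        }
    }
    return best;
}
```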

I don't know if my approach is measurably better than the one taken by SF, but it seems to work quite well. Initially I feared that the overhead for "potential split points" that never become split points would be high (because those need to take a node-specific spinlock when selecting each next move), but it turns out to work fine with a minimum split depth of 2 (i.e. I only avoid splitting in the last ply of the full-width search).

To keep the cost of splitting low, I think it is (always) best to just copy the moves leading from the root to the split node.