SMP and questions
Posted: Fri Nov 23, 2012 8:20 am
Over the past 2 weeks I've been working on SMP, which I promised Sam I would implement years ago when we started the Hannibal project.
Since we work on the project on an on/off basis, and the main focus before was improving strength, we were often sidetracked from implementing SMP.
From the last release until 3 to 4 weeks ago we hadn't touched the Hannibal project due to family/kids, travel/work, relocation, etc.
Lately I've finally mustered the motivation to work on it, which is long overdue; it has been on my ToDo list for years (along with Chess960 support).
I just recently got YBWC working; it would have taken me months to go from design to working code without open source engines like Viper as a reference (thanks, Tord, for that didactic open source engine).
I had a bit of a hard time making it work (chasing bugs, crashes, deadlocks, etc.), but it was fun.
Here is how I implemented it in Hannibal:
1. Main hash is shared without locking (legality of the hash move is checked; see the probe sketch after this list)
2. Pawn hash is per thread (2 MB per thread)
3. Eval cache is per thread (1 MB per thread)
4. Killer moves are per thread (copied from master to slave at a split point, from root to current ply)
5. Eval values per node used in futility pruning are per thread (copied from master to slave, from root to current ply)
6. History tables are shared
7. Always split only after having a score for the first move
8. Split at the root
9. Node count is shared (declared volatile)
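To make item 1 concrete, this is roughly the pattern I mean: probe the shared table without any lock, then refuse to trust the entry unless its stored move passes a legality check. This is only a minimal sketch with illustrative names (TransEntry, isPseudoLegal, etc.), not Hannibal's actual code:

[code]
#include <cstdint>

struct Position;                                  // engine's position type
bool isPseudoLegal(const Position& pos, uint16_t move);

struct TransEntry {
    uint64_t key;      // full Zobrist key of the stored position
    uint16_t move;     // hash move stored at this entry
    int16_t  score;
    int8_t   depth;
    uint8_t  bound;    // EXACT / LOWER / UPPER
};

// Runs without a lock: another thread may be writing this entry while
// we read it, so key, move, and score can be mutually inconsistent.
bool probe(const TransEntry* table, uint64_t mask,
           uint64_t key, const Position& pos, TransEntry& out)
{
    const TransEntry& e = table[key & mask];
    if (e.key != key)
        return false;              // empty slot or different position
    // Guard against a torn (half-written) entry or a key collision:
    // trust the entry only if its move is at least pseudo-legal here.
    if (e.move != 0 && !isPseudoLegal(pos, e.move))
        return false;
    out = e;
    return true;
}
[/code]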
NPS speed-up at the initial position is around 1.8, 2.3, and 2.6 for 2, 3, and 4 threads respectively, on an old quad.
I know the speedup sucks, but what can you expect from a two-week-old implementation that only recently started working?
Any suggestions as to where the potential bottleneck is? What is the speed-up in your engine at the initial position?
I know this has been discussed before, but I'd like to know whether there are new trends or things have changed. Questions pertaining to the corresponding items above:
#1 If the hash move doesn't pass the legality check, do you discard the whole probe result as presumably wrong? Why or why not?
#2 Which would be faster, shared or per thread?
#3 Same question as #2.
#4 Should this be per split point instead of per thread? Why or why not?
#5 Same question as #4.
#7 What are the possible improvements here? Do you use separate criteria for Cut/All nodes, for the move generation phase (hash move, killers, good captures, etc.), and so on?
#8 Is this good or bad? I think Vincent D. mentioned somewhere in the old threads that it is better to spend effort on the child nodes of the second root move than to search the other root moves in parallel, but I have read from the SF team that splitting at the root performs better in SF, and all the root moves will be searched anyway.
#9 Should the node count be kept per thread and just summed, or is the single shared counter sufficient?
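For #9, what I have in mind for the per-thread alternative is something like the sketch below: counters padded to separate cache lines and summed on demand, which would avoid the cache-line ping-pong (false sharing) of a single shared volatile counter. Names and layout are just my own illustration:

[code]
#include <cstdint>

constexpr int MAX_THREADS = 64;
constexpr int CACHE_LINE  = 64;

// One counter per thread, each padded to its own cache line so that
// increments from different cores never contend on the same line.
struct alignas(CACHE_LINE) NodeCounter {
    uint64_t nodes = 0;
};

NodeCounter nodeCount[MAX_THREADS];

// In the search: each thread bumps only its own slot (no sharing).
inline void countNode(int threadId) { ++nodeCount[threadId].nodes; }

// When reporting NPS, sum all slots; a slightly stale total is fine.
uint64_t totalNodes(int numThreads) {
    uint64_t sum = 0;
    for (int i = 0; i < numThreads; ++i)
        sum += nodeCount[i].nodes;
    return sum;
}
[/code]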
Other questions:
What other tricks are used to minimize the idle time spent by threads? I have read about "active reparenting" from the SF team, but I have not yet studied SF to see how it is implemented, or whether it was implemented at all.
I think the idle time spent by threads can be minimized by lowering the minimum split depth: the lower the better, and this should boost NPS, but I'm not sure it is efficient scalability-wise. I haven't tested it yet but will soon. Is there an explanation for why there is a minimum split depth at all?
Another idea would be to keep track of the depths at which split points exist, and when the maximum such depth at any given time is not greater than the minimum split depth, allow splitting below the minimum split depth.
For example, suppose all threads are searching a split point at depth = minimum split depth, and there are no other split points at higher depths.
Let's say we have 30 moves and 4 threads, and the threads are now searching the last 4 moves.
When those threads return there will be no more moves to search, so they will go back into the idle loop.
What if, in this case, we allow splitting below the minimum split depth to help the other threads?
Has anybody tried this?
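For concreteness, the relaxed split condition I have in mind would look roughly like this (hypothetical names throughout; splitPointDepths[] would be maintained by the split/unsplit code):

[code]
#include <algorithm>

extern int splitPointDepths[];   // depth of each active split point
extern int numSplitPoints;       // how many split points are active
extern int MIN_SPLIT_DEPTH;      // the usual minimum split depth

bool canSplit(int depth, int idleThreads)
{
    if (idleThreads == 0)
        return false;            // nobody to help, no point splitting
    if (depth >= MIN_SPLIT_DEPTH)
        return true;             // the normal YBWC condition

    // Relaxation: if no active split point is deeper than the minimum
    // split depth, idle threads have nowhere useful to join, so allow
    // splitting below the minimum to keep them busy.
    int maxActive = 0;
    for (int i = 0; i < numSplitPoints; ++i)
        maxActive = std::max(maxActive, splitPointDepths[i]);
    return maxActive <= MIN_SPLIT_DEPTH;
}
[/code]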