Questions on SMP search

bob · Post by **bob** » Sun Apr 24, 2011 5:19 pm

bhlangonijr wrote:
Houdini wrote:
bob wrote:The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).

bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB, nowadays that's become "quantité négligeable".
The main issue in systems with higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
Robert, concerning SMP code, what is the technique you have implemented in Houdini?
How does it scale (considering the number of threads/cores)?
How do you test SMP code changes?
What about the questions in the first post of this topic (I would like to know how you do it in Houdini)?

Regards,

Just like it is done in whatever version of Robo* that went "SMP", which is probably just like it is done in Rybka, which is probably just like it is done in Crafty.

michiguel · Post by **michiguel** » Sun Apr 24, 2011 6:01 pm

bob wrote:
bhlangonijr wrote:
Houdini wrote:
bob wrote:The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).

bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB, nowadays that's become "quantité négligeable".
The main issue in systems with higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
Robert, concerning SMP code, what is the technique you have implemented in Houdini?
How does it scale (considering the number of threads/cores)?
How do you test SMP code changes?
What about the questions in the first post of this topic (I would like to know how you do it in Houdini)?

Regards,
Just like it is done in whatever version of Robo* that went "SMP", which is probably just like it is done in Rybka, which is probably just like it is done in Crafty.

IIRC what it was mentioned here, Ryka uses processes not threads. But you already know that.

Miguel

Houdini · Post by **Houdini** » Sun Apr 24, 2011 9:04 pm

bob wrote:if you use only 2mb, you are faced with this:

log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M

Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want. I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to a shared cache, this is an idea that makes no sense to me.

You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.

bob wrote:If this is a good idea, why not local transposition tables as well?

There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.

bob wrote:You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.

My NUMA development and tests are on a 16 core box with 4 NUMA nodes.
Like Marco explained, your pawn hash trick is just a manual caching method that can be dealt with very efficiently by the normal memory caching mechanism including prefetching.

To each his own...

Robert

Houdini · Post by **Houdini** » Sun Apr 24, 2011 9:30 pm

bhlangonijr wrote:Robert, concerning SMP code, what is the technique you have implemented in Houdini?
How does it scale (considering the number of threads/cores)?
How do you test SMP code changes?
What about the questions in the first post of this topic (I would like to know how you do it in Houdini)?

Regards,

Houdini only shares the main hash table, all other structures (pawn hash, killer tables, history tables) are per thread. For the main hash no locking is used, illegal hash moves are simply detected.

The goal in Houdini was to have all threads work as independently as possible, except for the main hash table. This becomes particularly important when you're using multiple CPU sockets on several NUMA nodes.

Houdini's SMP code uses a special technique in which each thread runs its version of the code prewired to its own internal board representation. In other words, each thread uses its own code operating on its own memory segment.
In a rather strange way the end result might be fairly similar to what happens in Rybka with processes, but in Houdini it's achieved with threads.

I test SMP code changes by measuring the node speed (using the "autotune" command), and by playing engine matches with 2 threads or 4 threads.
Scaling of the node speed is quite good up to 8 cores if you allow for increased move times - it takes some time, say 2 to 5 seconds, before 8 threads have fully kicked in. Beyond 8 cores, and with multiple NUMA nodes, some special memory organizing and separation is required to avoid significant drops of the NPS due to memory bandwidth issues.

Robert

Houdini · Post by **Houdini** » Sun Apr 24, 2011 9:39 pm

bob wrote:Just like it is done in whatever version of Robo* that went "SMP", which is probably just like it is done in Rybka, which is probably just like it is done in Crafty.

Please read my previous reply to Ben-Hur Carlos to obtain additional information about how Houdini actually works. Houdini's SMP philisophy is most definitely closer to Stockfish than to any other engine you mention above.

While I very much respect your person and your past work with Cray Blitz and Crafty, as far as Houdini concerns you just seem to be following the "it's a clone" song invented by people much less talented than yourself.

Robert

bhlangonijr · Post by **bhlangonijr** » Sun Apr 24, 2011 11:35 pm

Houdini wrote: Houdini only shares the main hash table, all other structures (pawn hash, killer tables, history tables) are per thread. For the main hash no locking is used, illegal hash moves are simply detected.

Robert

For me it still doesn't make much sense to have a thread-local killers and history. The reason is that any thread might grab work to do on many different parts of a given search tree, so the information stored when doing work in a certain sub-tree would not be so relevant when the thread is scheduled to work on a different sub-tree. Am I missing something here?

marcelk · Post by **marcelk** » Mon Apr 25, 2011 12:36 am

Houdini wrote: While I very much respect your person and your past work with Cray Blitz and Crafty, as far as Houdini concerns you just seem to be following the "it's a clone" song invented by people much less talented than yourself.

Are you claiming with this statement that Houdini is your original work?

Houdini · Post by **Houdini** » Mon Apr 25, 2011 12:48 am

bhlangonijr wrote:
Houdini wrote: Houdini only shares the main hash table, all other structures (pawn hash, killer tables, history tables) are per thread. For the main hash no locking is used, illegal hash moves are simply detected.

Robert
For me it still doesn't make much sense to have a thread-local killers and history. The reason is that any thread might grab work to do on many different parts of a given search tree, so the information stored when doing work in a certain sub-tree would not be so relevant when the thread is scheduled to work on a different sub-tree. Am I missing something here?

No, you make a good point, threads can perform work at different locations in different sub-trees.

The main reason that thread-local killer tables work well (at least in Houdini), is that the killer and history tables are very volatile, they change rapidly and different sub-trees will often have very different killer tables.

Again, I can only say what works well in Houdini (and is compatible with its philosophy of separating as much as possible the threads), other engines may require a different approach.

Robert

michiguel · Post by **michiguel** » Mon Apr 25, 2011 1:23 am

bhlangonijr wrote:Concerning the following structures of your chess engine:

- Main Hash Table
- Pawn Hash Table
- Killers array
- History array

How you deal with these structures, when implementing a SMP search? To be more specific:
- For what structures, you keep a separate instance for each thread and what structures share a single instance with all threads;
- How you handle the concurrent access - if applicable - for each of these structures;
- How to maintain the killers/history arrays.

Thanks,

Hi BH,

I am the least expert here regarding this topic, so I stayed quiet. However, my experience may end up being more useful for you because I am closer to you in developing an SMP engine (I am only a couple of steps ahead). People with a lot of experience forgot what it means to tame this beast

Whatever you do, keep it simple and robust and keep the bugs out of the implementation. Design something that is not very complicated. Later, you can move forward.

You need a way to keep track what the search is doing at the beginning, dumping to a log all the decisions of a shallow search.

So, I will give you an advice that goes AGAINST everything said here. Lock all the tables with mutexes and/or spinlocks. Yes, that is slower, but you will know that you **won't** be chasing ghosts when you debug this. Once the bugs are out, replace those mutexes with all the advices you got here, but not before! I see replacing mutexes with nothing or lockless systems as an optimization technique. Optimization should never precede a bug-free implementation of a straightforward design. The point is, you have to get to know the search. You can't optimize this before learning how to do it.

In Gaviota I removed all the mutexes and I use a lockless procedure. Not because it is needed, but because it will help me to keep the code with fewer bugs possible. One day, I may even remove that.

For now, In Gaviota, only the killers are local to each thread. I found they really need to be that way.The main hashtable needs to be shared. I think that the rest is a matter of taste. Do whatever it will give you less headaches. Only you know what that would be

Miguel

bhlangonijr · Post by **bhlangonijr** » Mon Apr 25, 2011 2:18 am

michiguel wrote: Hi BH,

I am the least expert here regarding this topic, so I stayed quiet. However, my experience may end up being more useful for you because I am closer to you in developing an SMP engine (I am only a couple of steps ahead). People with a lot of experience forgot what it means to tame this beast

Whatever you do, keep it simple and robust and keep the bugs out of the implementation. Design something that is not very complicated. Later, you can move forward.

You need a way to keep track what the search is doing at the beginning, dumping to a log all the decisions of a shallow search.

So, I will give you an advice that goes AGAINST everything said here. Lock all the tables with mutexes and/or spinlocks. Yes, that is slower, but you will know that you **won't** be chasing ghosts when you debug this. Once the bugs are out, replace those mutexes with all the advices you got here, but not before! I see replacing mutexes with nothing or lockless systems as an optimization technique. Optimization should never precede a bug-free implementation of a straightforward design. The point is, you have to get to know the search. You can't optimize this before learning how to do it.

In Gaviota I removed all the mutexes and I use a lockless procedure. Not because it is needed, but because it will help me to keep the code with fewer bugs possible. One day, I may even remove that.

For now, In Gaviota, only the killers are local to each thread. I found they really need to be that way.The main hashtable needs to be shared. I think that the rest is a matter of taste. Do whatever it will give you less headaches. Only you know what that would be

Miguel

Hey Miguel,

First of all, your opinions and advises are always welcome and very much appreciated.
My first attempt to implement SMP was a total disaster. I tried to make it as good and complete as possible and ended up with an unmaintainable mess with a lot of crashes and non-functional code. At the end I discarded all the SMP code and started over. In my second attempt to SMP I finally got it to work and scaling reasonably up to 8 processors.
Well, everything that you said is true and I will try to follow your advises to avoid more hassle.

Everything was really fine, until I tried to make a PGO compile of my "new" SMP engine, using GCC. It doesn't work. It appears that when you try the PGO with a multi-threaded application it screw up with the profile data.

Code: Select all

Building file&#58; ../src/board.cpp
Invoking&#58; GCC C++ Compiler
g++ -O3 -Wall -c -fmessage-length=0 -fprofile-use -MMD -MP -MF"src/board.d" -MT"src/board.d" -o"src/board.o" "../src/board.cpp"
../src/board.cpp&#58; In member function ‘void Board&#58;&#58;doMove&#40;const MoveIterator&#58;&#58;Move&, BoardTypes&#58;&#58;MoveBackup&)’&#58;
../src/board.cpp&#58;601&#58; error&#58; corrupted profile info&#58; profile data is not flow-consistent
../src/board.cpp&#58;601&#58; error&#58; corrupted profile info&#58; number of executions for edge 18-19 thought to be 4637221
../src/board.cpp&#58;601&#58; error&#58; corrupted profile info&#58; number of executions for edge 18-27 thought to be -2912
../src/board.cpp&#58;601&#58; error&#58; corrupted profile info&#58; number of executions for edge 29-30 thought to be 114539
../src/board.cpp&#58;601&#58; error&#58; corrupted profile info&#58; number of executions for edge 29-38 thought to be -5
../src/bitboard.h&#58; At global scope&#58;
../src/bitboard.h&#58;315&#58; warning&#58; ‘midBoardNoFileA’ defined but not used
../src/bitboard.h&#58;316&#58; warning&#58; ‘midBoardNoFileH’ defined but not used
../src/bitboard.h&#58;384&#58; warning&#58; ‘upperMaskBitboard’ defined but not used
../src/bitboard.h&#58;395&#58; warning&#58; ‘lowerMaskBitboard’ defined but not used
../src/bitboard.h&#58;406&#58; warning&#58; ‘neighborFiles’ defined but not used
../src/bitboard.h&#58;482&#58; warning&#58; ‘eighthRank’ defined but not used
make&#58; *** &#91;src/board.o&#93; Error 1

Any idea how to solve it?
Regards,

Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search