Questions on SMP search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

michiguel wrote:
bob wrote:
bhlangonijr wrote:
Houdini wrote:
bob wrote:The XORs are essentially free; they provide some pretty easy-to-use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB; nowadays that's become a "quantité négligeable".
The main issue in systems with a higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
Robert, concerning SMP code, what is the technique you have implemented in Houdini?
How does it scale (considering the number of threads/cores)?
How do you test SMP code changes?
What about the questions in the first post of this topic (I would like to know how you do it in Houdini)? :)

Regards,
Just like it is done in whatever version of Robo* that went "SMP", which is probably just like it is done in Rybka, which is probably just like it is done in Crafty. :)
IIRC from what was mentioned here, Rybka uses processes, not threads. But you already know that.

Miguel
That is a 15-minute change. Crafty went from threads to processes and back to threads. Threads are the "right way" for lots of reasons. I went to processes because of early quirks in the posix thread library on Linux...
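To make the XOR trick discussed above concrete: the usual scheme stores each entry as two 64-bit words, with the key XORed against the data, so a torn entry (half-written by another thread) fails the probe check and is simply treated as a miss. A minimal sketch, with an illustrative entry layout not taken from any particular engine:

Code: Select all

// Lockless hashing via XOR: a torn entry cannot validate.
#include <cstdint>

struct HashEntry {
    uint64_t key;   // position key XORed with the data word
    uint64_t data;  // packed move/score/depth/bound
};

// Store: XOR the key with the data so a half-written entry
// can never match a probe key.
inline void tt_store(HashEntry &e, uint64_t key, uint64_t data) {
    e.key  = key ^ data;
    e.data = data;
}

// Probe: if another thread's store interleaved with ours, the
// recomputed key won't match and we simply report a miss.
inline bool tt_probe(const HashEntry &e, uint64_t key, uint64_t &data) {
    data = e.data;
    return (e.key ^ data) == key;
}

Engines that skip even this (as Houdini describes later in the thread) instead verify that any move pulled from the table is legal in the current position, which catches the same corruption after the fact.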
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

bhlangonijr wrote:
Houdini wrote: Houdini only shares the main hash table; all other structures (pawn hash, killer tables, history tables) are per thread. For the main hash no locking is used; illegal hash moves are simply detected.

Robert
For me it still doesn't make much sense to have thread-local killers and history. The reason is that any thread might grab work on many different parts of a given search tree, so the information stored when doing work in a certain sub-tree would not be so relevant when the thread is scheduled to work on a different sub-tree. Am I missing something here?
Thread-local killers are OK, if done correctly. For example, when you split, you can copy your killer list to the helper threads so that they will have help with move ordering. But killers are actually quite "local" in scope, and having them local is better. I ran this test a few years back (a cluster test using the SMP search) and found that local killers were better. It is not a huge win, as in tens or hundreds of Elo, but it is certainly better.

I don't use history at all and don't believe, based on my testing, that it is particularly useful whether it is done locally or globally. Many use it simply because the program they copied used it.
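A sketch of the split-point killer copying bob describes above, under the assumption of a simple two-killers-per-ply table (names and layout are illustrative, not Crafty's actual code):

Code: Select all

// Each thread owns its killer table; at a split point the parent
// seeds the helpers so they start with sensible move ordering.
#include <cstring>

constexpr int MAX_PLY = 128;

struct ThreadData {
    int killer1[MAX_PLY];  // first killer move per ply (encoded moves)
    int killer2[MAX_PLY];  // second killer move per ply
};

// Copy the parent's killers into each helper's thread-local table;
// afterwards every thread updates only its own copy.
void seed_helpers(const ThreadData &parent, ThreadData *helpers, int n) {
    for (int i = 0; i < n; ++i) {
        std::memcpy(helpers[i].killer1, parent.killer1, sizeof(parent.killer1));
        std::memcpy(helpers[i].killer2, parent.killer2, sizeof(parent.killer2));
    }
}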
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
bob wrote:if you use only 2mb, you are faced with this:

log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M

Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want. I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to a shared cache, this is an idea that makes no sense to me.
You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
So you don't search as many different pawn positions as I do? How is that possible since we are playing the _same_ game??? In Crafty, a pawn hash entry is 32 bytes. 2mb = 64K entries. For that 3 minute search, going from 2mb to 128mb improved speed by almost 10%. That speed improvement is produced by avoiding pawn evaluation more frequently due to pawn hash hits.

I don't see how your numbers would be any different unless the pawn evaluation is _extremely_ simple. Looking at Robo* and friends, it is not so simple that its cost can be ignored.

bob wrote:If this is a good idea, why not local transposition tables as well?
There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.

If you alter the program's speed, you are altering the search space, because you have extra time to search things the slower version couldn't get to.
bob wrote:You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.
My NUMA development and tests are on a 16 core box with 4 NUMA nodes.
Like Marco explained, your pawn hash trick is just a manual caching method that can be dealt with very efficiently by the normal memory caching mechanism including prefetching.

To each his own...

Robert
There is more to it than that, but if you don't see the issues, there's not much to say. But suffice it to say, running in cache is different when replacement is an issue, as is the TLB. And other things.....
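One plausible reading of the "pawn hash trick" bob refers to (the thread itself doesn't show the code) is a thread-private copy of the most recently used entry sitting in front of the shared table: since the pawn structure is unchanged across most moves, the private copy answers many probes without touching shared memory at all. A hedged sketch, with illustrative names:

Code: Select all

#include <cstdint>

struct PawnEntry {
    uint64_t key;    // pawn-structure hash key (0 means "empty" here)
    int      score;  // cached pawn-structure evaluation
};

struct PawnHashTable {
    PawnEntry *table;
    uint64_t   mask;  // table size (a power of two) minus one
    PawnEntry *lookup(uint64_t key) { return &table[key & mask]; }
};

// Per-thread wrapper: 'local' is private, so no shared access (and no
// cross-NUMA traffic) happens while the pawn key is unchanged.
struct ThreadPawnCache {
    PawnEntry local{};

    const PawnEntry *probe(PawnHashTable &shared, uint64_t key) {
        if (local.key == key)
            return &local;       // pawn structure unchanged: free hit
        const PawnEntry *e = shared.lookup(key);
        if (e->key != key)
            return nullptr;      // miss: caller evaluates and stores
        local = *e;              // refresh the private copy
        return &local;
    }
};

On NUMA boxes, the effect Houdini aims for with per-thread tables can also be approximated by letting each thread allocate and first-touch its own memory, since Linux by default places pages on the node of the thread that first writes them.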
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

michiguel wrote:
bhlangonijr wrote:Concerning the following structures of your chess engine:

- Main Hash Table
- Pawn Hash Table
- Killers array
- History array

How do you deal with these structures when implementing an SMP search? To be more specific:
- For which structures do you keep a separate instance for each thread, and which structures share a single instance across all threads;
- How do you handle concurrent access, if applicable, for each of these structures;
- How do you maintain the killers/history arrays.

Thanks,
Hi BH,

I am the least expert here regarding this topic, so I stayed quiet. However, my experience may end up being more useful for you because I am closer to you in developing an SMP engine (I am only a couple of steps ahead). People with a lot of experience have forgotten what it means to tame this beast :-)

Whatever you do, keep it simple and robust and keep the bugs out of the implementation. Design something that is not very complicated. Later, you can move forward.

You need a way to keep track of what the search is doing; at the beginning, dump to a log all the decisions of a shallow search.

So, I will give you advice that goes AGAINST everything said here. Lock all the tables with mutexes and/or spinlocks. Yes, that is slower, but you will know that you **won't** be chasing ghosts when you debug this. Once the bugs are out, replace those mutexes following all the advice you got here, but not before! I see replacing mutexes with nothing, or with lockless systems, as an optimization technique. Optimization should never precede a bug-free implementation of a straightforward design. The point is, you have to get to know the search. You can't optimize this before learning how to do it.

In Gaviota I removed all the mutexes and I use a lockless procedure. Not because it is needed, but because it helps me keep the code with as few bugs as possible. One day, I may even remove that.

For now, in Gaviota, only the killers are local to each thread. I found they really need to be that way. The main hash table needs to be shared. I think the rest is a matter of taste. Do whatever gives you fewer headaches. Only you know what that would be :-)

Miguel
I would suggest reading main.c in Crafty, version 15.0... it was a much more simplistic algorithm (called PVS in the early 1980s, not to be confused with the sequential PVS (null-window search) of today). It is much easier to get it up and running and debugged, and it will provide real speedup with 2 and 4 cores. Maybe 2.5x with 4 cores with some luck, but that is not a bad start. With 8 or more it is not nearly as good, but it works. Then you can add cleverness once you know that the basic framework works reliably.
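To make Miguel's "lock everything first" advice concrete, here is a minimal sketch of a mutex-guarded table (illustrative names; any of the shared structures can be wrapped the same way). The single coarse lock is slow, but it guarantees that any remaining bug is a search bug and not a data race:

Code: Select all

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

struct TTEntry {
    uint64_t key;
    uint64_t data;  // packed move/score/depth/bound
};

class LockedHashTable {
    std::vector<TTEntry> table;
    std::mutex           lock;  // one big lock: slow but race-free
public:
    explicit LockedHashTable(std::size_t entries) : table(entries) {}

    void store(uint64_t key, uint64_t data) {
        std::lock_guard<std::mutex> guard(lock);
        TTEntry &e = table[key % table.size()];
        e.key  = key;
        e.data = data;
    }

    bool probe(uint64_t key, uint64_t &data) {
        std::lock_guard<std::mutex> guard(lock);
        const TTEntry &e = table[key % table.size()];
        if (e.key != key) return false;
        data = e.data;
        return true;
    }
};

Once the parallel search is known to behave, the coarse lock can be replaced by finer-grained locks or by the XOR scheme sketched earlier, exactly in the spirit of "optimize only after it's correct".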
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

bhlangonijr wrote:
michiguel wrote: Hi BH,

I am the least expert here regarding this topic, so I stayed quiet. However, my experience may end up being more useful for you because I am closer to you in developing an SMP engine (I am only a couple of steps ahead). People with a lot of experience have forgotten what it means to tame this beast :-)

Whatever you do, keep it simple and robust and keep the bugs out of the implementation. Design something that is not very complicated. Later, you can move forward.

You need a way to keep track of what the search is doing; at the beginning, dump to a log all the decisions of a shallow search.

So, I will give you advice that goes AGAINST everything said here. Lock all the tables with mutexes and/or spinlocks. Yes, that is slower, but you will know that you **won't** be chasing ghosts when you debug this. Once the bugs are out, replace those mutexes following all the advice you got here, but not before! I see replacing mutexes with nothing, or with lockless systems, as an optimization technique. Optimization should never precede a bug-free implementation of a straightforward design. The point is, you have to get to know the search. You can't optimize this before learning how to do it.

In Gaviota I removed all the mutexes and I use a lockless procedure. Not because it is needed, but because it helps me keep the code with as few bugs as possible. One day, I may even remove that.

For now, in Gaviota, only the killers are local to each thread. I found they really need to be that way. The main hash table needs to be shared. I think the rest is a matter of taste. Do whatever gives you fewer headaches. Only you know what that would be :-)

Miguel
Hey Miguel,

First of all, your opinions and advice are always welcome and very much appreciated.
My first attempt to implement SMP was a total disaster. I tried to make it as good and complete as possible and ended up with an unmaintainable mess, with a lot of crashes and non-functional code. In the end I discarded all the SMP code and started over. In my second attempt at SMP I finally got it to work, scaling reasonably up to 8 processors.
Well, everything you said is true and I will try to follow your advice to avoid more hassle. :)

Everything was really fine until I tried to make a PGO compile of my "new" SMP engine using GCC. It doesn't work. It appears that when you try PGO with a multi-threaded application, it screws up the profile data. :(

Code: Select all

Building file: ../src/board.cpp
Invoking: GCC C++ Compiler
g++ -O3 -Wall -c -fmessage-length=0 -fprofile-use -MMD -MP -MF"src/board.d" -MT"src/board.d" -o"src/board.o" "../src/board.cpp"
../src/board.cpp: In member function ‘void Board::doMove(const MoveIterator::Move&, BoardTypes::MoveBackup&)’:
../src/board.cpp:601: error: corrupted profile info: profile data is not flow-consistent
../src/board.cpp:601: error: corrupted profile info: number of executions for edge 18-19 thought to be 4637221
../src/board.cpp:601: error: corrupted profile info: number of executions for edge 18-27 thought to be -2912
../src/board.cpp:601: error: corrupted profile info: number of executions for edge 29-30 thought to be 114539
../src/board.cpp:601: error: corrupted profile info: number of executions for edge 29-38 thought to be -5
../src/bitboard.h: At global scope:
../src/bitboard.h:315: warning: ‘midBoardNoFileA’ defined but not used
../src/bitboard.h:316: warning: ‘midBoardNoFileH’ defined but not used
../src/bitboard.h:384: warning: ‘upperMaskBitboard’ defined but not used
../src/bitboard.h:395: warning: ‘lowerMaskBitboard’ defined but not used
../src/bitboard.h:406: warning: ‘neighborFiles’ defined but not used
../src/bitboard.h:482: warning: ‘eighthRank’ defined but not used
make: *** [src/board.o] Error 1
Any idea how to solve it?
Regards,
Yes, use the Intel C compiler. Of course, you can get by without profiling the SMP code since it is not used that often, and continue to use gcc. Or switch to ICC which works perfectly. I've not found any version of gcc that works reliably with threaded code, all the way to 4.6...
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: Questions on SMP search

Post by bhlangonijr »

bob wrote: Yes, use the Intel C compiler. Of course, you can get by without profiling the SMP code since it is not used that often, and continue to use gcc. Or switch to ICC which works perfectly. I've not found any version of gcc that works reliably with threaded code, all the way to 4.6...
Intel C compiler is not an option because:

- I usually make some builds for the Windows platform, and don't want to pay for the Intel compiler there;
- I use some computers with AMD processors to make the compiles, and it is well known that the Intel compiler cripples code when compiling on it;

Moreover, I have a special feeling for GCC. I don't want to get rid of it. It is like asking me to use Chessbase instead of Winboard. I won't do this.
:lol:
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

bob wrote:
Houdini wrote:You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
So you don't search as many different pawn positions as I do? How is that possible since we are playing the _same_ game??? In Crafty, a pawn hash entry is 32 bytes. 2mb = 64K entries. For that 3 minute search, going from 2mb to 128mb improved speed by almost 10%. That speed improvement is produced by avoiding pawn evaluation more frequently due to pawn hash hits.

I don't see how your numbers would be any different unless the pawn evaluation is _extremely_ simple. Looking at Robo* and friends, it is not so simple that its cost can be ignored.
My numbers are valid for Houdini; your numbers are valid for Crafty. Houdini and Crafty are two very different engines.
The difference in our results serves as a good reminder that it's dangerous to blindly extrapolate experience with one engine to other engines.
bob wrote:
Houdini wrote:There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.
If you alter the program's speed, you are altering the search space, because you have extra time to search things the slower version couldn't get to.
I find it difficult to believe that you wouldn't see the fundamental difference between the pawn hash and the main transposition table, so I can only assume that you're arguing for the sake of arguing.

Robert
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
bob wrote:
Houdini wrote:You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
So you don't search as many different pawn positions as I do? How is that possible since we are playing the _same_ game??? In Crafty, a pawn hash entry is 32 bytes. 2mb = 64K entries. For that 3 minute search, going from 2mb to 128mb improved speed by almost 10%. That speed improvement is produced by avoiding pawn evaluation more frequently due to pawn hash hits.

I don't see how your numbers would be any different unless the pawn evaluation is _extremely_ simple. Looking at Robo* and friends, it is not so simple that its cost can be ignored.
My numbers are valid for Houdini; your numbers are valid for Crafty. Houdini and Crafty are two very different engines.
The difference in our results serves as a good reminder that it's dangerous to blindly extrapolate experience with one engine to other engines.
bob wrote:
Houdini wrote:There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly, not sharing it would be counterproductive.
If you alter the program's speed, you are altering the search space, because you have extra time to search things the slower version couldn't get to.
I find it difficult to believe that you wouldn't see the fundamental difference between the pawn hash and the main transposition table, so I can only assume that you're arguing for the sake of arguing.

Robert
I certainly understand how either works. My point is that sharing the information is a win. A _clear_ win. If the accesses slow things down, then the same should be true of the TT in the late opening where hits are not as common as later in the endgame.

Programs can be different, but the percentage of pawn positions to overall positions searched is pretty constant for a specific position. And if you can save time by avoiding the evaluation (pawn evaluation), that is a win. If you only see 64K unique pawn positions (or whatever 2mb of pawn hash gives you), something is _most_ unusual. Because the game of chess itself doesn't change based on which program is doing the search...

If you collect pawn hash hits, why not run with 2mb, 20mb, 200mb and see where you have to get to before you have < 50% of the entries written to during a search? Or run the same position with two different hash sizes and compare the speed?
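The measurement bob proposes is easy to instrument. A sketch, with illustrative names, that counts first-time writes (table fill) and hit rate so the two can be compared across 2 MB, 20 MB and 200 MB tables:

Code: Select all

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct PawnEntry { uint64_t key; int score; };  // key 0 = "empty" here

struct PawnHashStats {
    std::vector<PawnEntry> table;
    uint64_t first_writes = 0, probes = 0, hits = 0;

    explicit PawnHashStats(std::size_t entries) : table(entries) {}

    void store(uint64_t key, int score) {
        PawnEntry &e = table[key % table.size()];
        if (e.key == 0) ++first_writes;  // entry touched for the first time
        e.key = key;
        e.score = score;
    }

    bool probe(uint64_t key, int &score) {
        ++probes;
        const PawnEntry &e = table[key % table.size()];
        if (e.key != key) return false;
        ++hits;
        score = e.score;
        return true;
    }

    // After a fixed-time search: if fill is well under 50%, a bigger
    // table would not have avoided many overwrites.
    void report() const {
        std::printf("fill: %.1f%%  hit rate: %.1f%%\n",
                    100.0 * first_writes / table.size(),
                    probes ? 100.0 * hits / probes : 0.0);
    }
};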
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

bhlangonijr wrote:
bob wrote: Yes, use the Intel C compiler. Of course, you can get by without profiling the SMP code since it is not used that often, and continue to use gcc. Or switch to ICC which works perfectly. I've not found any version of gcc that works reliably with threaded code, all the way to 4.6...
Intel C compiler is not an option because:

- I usually make some builds for the Windows platform, and don't want to pay for the Intel compiler there;
- I use some computers with AMD processors to make the compiles, and it is well known that the Intel compiler cripples code when compiling on it;

Moreover, I have a special feeling for GCC. I don't want to get rid of it. It is like asking me to use Chessbase instead of Winboard. I won't do this.
:lol:
Then (a) you are not going to PGO a threaded program and (b) you are going to give up 10%+ in speed on Intel processors when comparing ICC to GCC. Don't know why they can't fix the PGO problem, but it has been present for years.
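For readers hitting the same "profile data is not flow-consistent" error: depending on the GCC version, there is a documented switch, -fprofile-correction, that tells GCC to smooth out the inconsistent counters a multi-threaded training run produces instead of aborting. Whether it fully rescues a given build isn't guaranteed, but it is worth trying before giving up on PGO with gcc, e.g.:

Code: Select all

g++ -O3 -Wall -fprofile-use -fprofile-correction -c ../src/board.cpp -o src/board.o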
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

bob wrote:Programs can be different, but the percentage of pawn positions to overall positions searched is pretty constant for a specific position. And if you can save time by avoiding the evaluation (pawn evaluation), that is a win. If you only see 64K unique pawn positions (or whatever 2mb of pawn hash gives you), something is _most_ unusual. Because the game of chess itself doesn't change based on which program is doing the search...

If you collect pawn hash hits, why not run with 2mb, 20mb, 200mb and see where you have to get to before you have < 50% of the entries written to during a search? Or run the same position with two different hash sizes and compare the speed?
For Houdini, above 2 MB of pawn hash the NPS gain becomes negligible (less than 1%). In other words, keeping the most recent 64K pawn positions in memory appears to be sufficient to achieve most of the speed gain produced by the pawn hash.
The relative cache/TLB-friendliness of the small pawn hash probably plays a role in this success.

Robert