SMP, first shot at implementation

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

SMP, first shot at implementation

Post by chrisw »

I'd like to share my experiences of programming a chess engine with multiple threads for the first time.

1. Stress and anxiety factors. Well, I'm emotionally and psychologically very comfortable with doing small incremental changes and testing. If something doesn't work, the problem is usually isolated in some small area, and one can usually fix it without too much hassle. Each increment is a limited goal, with satisfaction each time. But some additions are far from incremental: they involve stuff one hasn't done before, and often quite major restructuring and design change. So, Lazy SMP. I'd been researching it, experimenting with threads, getting annoyed and frustrated, and wandering off into displacement activity. I decided on using pthreads, only to discover after much hassle that pthreads, Microsoft Visual Studio and x64 are more or less mutually exclusive. That left plowing into C++ threads (which I had really wanted to avoid), only to discover that, actually, this was not so difficult, and I got what I guess would be called a very lazy SMP functional, at least from a standing start, within two days.

2. My idea of what Lazy SMP is (which may be wrong, which is why I'm expounding it here). Basically, N threads, each one a 'normal' search, each with its own variables (some globals remain, of course; it's a bit of work to clean all that up) and each with its own tree, the only thing in common being the transposition table, which acts as the channel for proposed best moves and evals across all the threaded searches.

3. First shot at it does absolutely no move sorting beyond what is already in place in a one-thread search. I assumed that, although each thread deterministically starts off the same, the transfer of info through the tt would begin to create divergence, and cross-information from the tt would shorten the effective search time for the main thread.

4. Basically I have a main thread plus N helper threads. The PV gets reported from the main thread only, and time controls are considered only within that main thread. When it times out, all the helpers get a stop signal, program flow waits for N x t.join() to complete, and done.

5. Preliminary results (benchmark 970 positions, each at depth=13) are slightly weird.

For one thread only:
2.2 Mnps
total nodes 112 M
time per move 54 ms
'solutions' 389
legal, validated tt-moves proposed 3.19 M
tt-moves searched, with v>alpha 1.84 M
'efficiency' of tt-move proposal 57%

For four threads:
4.1 Mnps
total nodes 262 M
time per move 66 ms
'solutions' 391
legal, validated tt-moves proposed 10.1 M
tt-moves searched, with v>alpha 5.6 M
'efficiency' of tt-move proposal 55%

Obviously the major weirdness is that it spends slightly more time to reach depth, contrary to expectations.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

Mnps should scale almost linearly with the number of threads you use (apart from the effect of the processor boosting somewhat lower when all cores are fully loaded). So it could be that you have a problem with 'false sharing'.

I use almost the same scheme as you: 1 master thread and some number of slave threads that are all identical. I have some restrictions though: I try to let all threads work in parallel on the PV move, and for all the remaining moves I allow only 1 thread for a given combination of move and depth. If all moves at a certain depth are already being searched, the thread automatically iterates to the next depth, etc.

Of course you have to synchronize transposition-table access. Since locking is way too slow, you can use Bob's lockless hashing trick of XORing the hash-key with the data-field (easy to do if both are 64 bits). Maybe this is not perfect, but in practice it works rather well.
Kieren Pearson
Posts: 70
Joined: Tue Dec 31, 2019 2:52 am
Full name: Kieren Pearson

Re: SMP, first shot at implementation

Post by Kieren Pearson »

A good first test is writing a multi-threaded perft routine. From this you should observe (if you've done it correctly) almost linear scaling with the number of threads, and you should be able to verify that no threads are sharing data they are not meant to and producing an incorrect result.

One easy way to do this is, at the root, to launch say 4 threads with 1/4 of the root moves each. Depending on your language and chess program design this might be a little finicky, so maybe a better idea is just to launch as many threads as there are legal moves at the root. It shouldn't matter if this is way higher than the number of cores your computer has; they will just run in series and use as much CPU as possible.

Doing this was a great exercise for me to get familiar with how threads work in C++, as well as to make my position data structures thread safe and not rely on any global variables.

Once this is done, you then need to basically remove all global variables. Anything not shared between threads (so killers, history tables, however you store the PV, etc.) should be wrapped into an object, and each thread should have its own copy. This will fix potential bugs. At this point, the only thing left that is shared should be the transposition table.

At this point the only thing you need to do is make the transposition table thread safe. By using mutexes in C++ I was able to do this with minimal changes to the code.

Lazy SMP is great because you can gain 100+ elo easily just by launching the threads at the root position, without needing to do anything special. Because each thread has different killer and history tables, they will naturally diverge in the moves they search, and by sharing a transposition table they are able to skip work already done by other threads.

Another test you can do is to disable the transposition table and launch threads from the root position. Have all threads output to the console and check that they all give exactly the same results, to verify there's no shared data you've missed.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

Kieren Pearson wrote: Sat Sep 12, 2020 3:08 pm At this point the only thing you need to do is make the transposition table thread safe. By using mutexes in c++ I was able to do this with minimal changes to code.
Maybe you can get away with this when you don't use the transposition table in quiescence; in my experience, locking the transposition table with a mutex is way too slow to be useful.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SMP, first shot at implementation

Post by chrisw »

Joost Buijs wrote: Sat Sep 12, 2020 2:39 pm Mnps should scale almost linear with the number of threads you use (besides the effect of the processor boosting somewhat lower when all cores are fully loaded). So it could be that you have a problem with 'false sharing'.
Yes, could be, although I was using only four of the six available cores and the CPU load was reporting at 60% or so.
I've figured for some time now that nps is quite search-efficiency dependent. It also depends on where the node counter gets upticked; in my case, when a MakeMove gets done. If there's some 'smarts' move-ordering and so on, then an efficient search with more cuts tends to spend more time on smarts and less doing MakeMoves (they get pruned or thrown out), so a more efficient search (from thread cross-pollination) ought not (maybe) to scale nps with thread count. Dunno, maybe.

I use almost the same scheme as you: 1 master thread and some number of slave threads that are all identical. I have some restrictions though: I try to let all threads work in parallel on the PV move, and for all the remaining moves I allow only 1 thread for a given combination of move and depth. If all moves at a certain depth are already being searched, the thread automatically iterates to the next depth, etc.
Yes, useful. I am beginning to think about controlling what gets searched. Right now the scheme is so 'lazy' it just launches all threads and each thread returns right back to the root; there's no mechanism implemented (as yet) to control what goes on after launch, not even at ply one.

Of course you have to synchronize transposition-table access. Since locking is way too slow, you can use Bob's lockless hashing trick of XORing the hash-key with the data-field (easy to do if both are 64 bits). Maybe this is not perfect, but in practice it works rather well.
Yes! Right now it takes no care at all with hash entries, neither read nor write. No locks, no mutex. I do have some quite rigorous tests to check hash data: the tt move has to validate, else the entire entry is abandoned. I also double check (because I overflowed the normal 64-bit read/write) that what just got read (or written) is still there by the time of the next instruction. The hash code is marked 'go back and clean this up'.

Curiously, although the benchmark test showed no quicker time to depth, a quick run at 40/8 against something (not naming names) that the prior version was scoring about 36% against now gives 57% (336W-211L-267D), so it must be doing something right. Obviously needs much more testing.
Kieren Pearson
Posts: 70
Joined: Tue Dec 31, 2019 2:52 am
Full name: Kieren Pearson

Re: SMP, first shot at implementation

Post by Kieren Pearson »

Joost Buijs wrote: Sat Sep 12, 2020 3:27 pm
Kieren Pearson wrote: Sat Sep 12, 2020 3:08 pm At this point the only thing you need to do is make the transposition table thread safe. By using mutexes in c++ I was able to do this with minimal changes to code.
Maybe you can get away with this when you don't use the transposition table in quiescence, in my experience locking the transposition table with a mutex is way to slow to be useful.
Yeah, this is what I also discovered. First I made a version that was thread safe, but it actually adds elo to just be unsafe, because it's quicker.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: SMP, first shot at implementation

Post by chrisw »

Kieren Pearson wrote: Sat Sep 12, 2020 3:08 pm A good first test is writing a multi-threaded perft routine.
Eek! I decided I was done with PERFT after building up the move/unmove/genmove/validatemove/genspecial moves etc etc etc. I think my old PERFT will probably break if I try running it right now.

From this you should observe (if you've done it correctly) almost linear scaling with the number of threads, and you should be able to verify that no threads are sharing data they are not meant to and producing an incorrect result.

One easy way to do this is, at the root, to launch say 4 threads with 1/4 of the root moves each. Depending on your language and chess program design this might be a little finicky, so maybe a better idea is just to launch as many threads as there are legal moves at the root. It shouldn't matter if this is way higher than the number of cores your computer has; they will just run in series and use as much CPU as possible.
Good idea, I'll probably try that. I haven't bothered storing all the thread PVs, but checking those off against each other with the hash nulled out looks like a good sanity tester. Neat.

Doing this was a great exercise for me to get familiar with how threads work in C++, as well as to make my position data structures thread safe and not rely on any global variables.

Once this is done, you then need to basically remove all global variables. Anything not shared between threads (so killers, history tables, however you store the PV, etc.) should be wrapped into an object, and each thread should have its own copy. This will fix potential bugs. At this point, the only thing left that is shared should be the transposition table.
Yup, I designed from the ground up on the assumption that there would one day be multiple threads, so the structures are taken care of already (I hope). Globals are few, and I think any stray ones are dealt with. STOP_SEARCH is of course global, and I actually didn't bother to put the node counters into any structure. Not entirely sure I'm getting away with that, but I think it's okay to uptick the same int64 from different threads? Not actually tested that yet.

At this point the only thing you need to do is make the transposition table thread safe. By using mutexes in C++ I was able to do this with minimal changes to the code.
Yup, I read around this topic a lot and figured I'd try to get away with rigorous checking of the hash info and a large key, rather than locks. We'll see, but no problems yet, as far as I can tell.

Lazy SMP is great because you can gain 100+ elo easily just by launching the threads at the root position, without needing to do anything special. Because each thread has different killer and history tables, they will naturally diverge in the moves they search, and by sharing a transposition table they are able to skip work already done by other threads.
Yes, agreed! I'm very pleased that mine at least functions (and might even be functionally effective) with a lot less work and hassle than I expected. Famous last words, no doubt. Anyway, thanks for the comments.

Another test you can do is to disable the transposition table and launch threads from the root position. Have all threads output to the console and check that they all give exactly the same results, to verify there's no shared data you've missed.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

chrisw wrote: Sat Sep 12, 2020 6:57 pm Yes, could be, although I was using only four of the six available cores and the CPU load was reporting at 60% or so.
I've figured for some time now that nps is quite search-efficiency dependent. It also depends on where the node counter gets upticked; in my case, when a MakeMove gets done. If there's some 'smarts' move-ordering and so on, then an efficient search with more cuts tends to spend more time on smarts and less doing MakeMoves (they get pruned or thrown out), so a more efficient search (from thread cross-pollination) ought not (maybe) to scale nps with thread count. Dunno, maybe.
The best approach is to keep all global variables that belong to a search thread (including the node counter) in a struct that is aligned to 64 bytes (the typical width of a cache line), to prevent different threads from writing into the same cache line. At the end of the search you can collect the node counters from the different threads and add them together to get the final node count. When you let different threads write into the same cache line (even for a single counter), it completely kills performance.
Yes! Right now it takes no care at all with hash entries, neither read nor write. No locks, no mutex. I do have some quite rigorous tests to check hash data: the tt move has to validate, else the entire entry is abandoned. I also double check (because I overflowed the normal 64-bit read/write) that what just got read (or written) is still there by the time of the next instruction. The hash code is marked 'go back and clean this up'.
Checking hash-data validity should work; there is no need to lock the transposition table at all if you are sure that your hash data is valid.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: SMP, first shot at implementation

Post by Joost Buijs »

chrisw wrote: Sat Sep 12, 2020 6:57 pm Curiously, although the benchmark test showed no quicker time to depth, a quick run at 40/8 against something (not naming names) that the prior version was scoring about 36% against now gives 57% (336W-211L-267D), so it must be doing something right. Obviously needs much more testing.
Making your search wider while keeping approximately the same depth also increases playing strength; this is what happens.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: SMP, first shot at implementation

Post by mvanthoor »

Joost Buijs wrote: Sat Sep 12, 2020 3:27 pm
Kieren Pearson wrote: Sat Sep 12, 2020 3:08 pm At this point the only thing you need to do is make the transposition table thread safe. By using mutexes in c++ I was able to do this with minimal changes to code.
Maybe you can get away with this when you don't use the transposition table in quiescence; in my experience, locking the transposition table with a mutex is way too slow to be useful.
Do you have a link to that non-locking mechanism for the hash table? In my engine, it's (almost) impossible to write the thing single-threaded because of the way Rust works. I'm going to need a thread for the UCI loop and a thread for the search, so I'll be starting out with two threads already. Going from there to using 2-4 threads for the search instead of one will be a relatively easy step, as platform-independent concurrency is built right into the language. A non-locking hash-table mechanism would be really useful.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL