Help with LazySMP...

Steve Maughan · Post by **Steve Maughan** » Wed Jun 04, 2025 11:27 pm

I'm trying to get LazySMP working on my new engine. So far I don't get any speedup at all — zilch, zippo, nada, nothing!!

My transposition table is working well. A single thread search get's a eight fold speed increase searching from the initial start position. This seems quite decent. It seems to be a decent lock free transposition table. When I fire up multiple threads, all grinding away at the same TT, I can see the threads are running but there's no speed-up. I should say, all threads (including the main search thread), are doing exactly the same thing. I'm not varying the aspiration window or depth.

Any tips on implementing LazySMP? What's the secret sauce to get LazySMP working?

— Steve

mar · Post by **mar** » Thu Jun 05, 2025 6:29 pm

what do you mean by single thread gets 8-fold speed increase from initial position?
with lazysmp you should get almost linear increase in nps (not time to depth) - of course you have to sum the total nodes searched from all running threads
with 4 vs 1 threads you should get about 100 elo self-play even with the plain dumbest shared TT

Steve Maughan · Post by **Steve Maughan** » Thu Jun 05, 2025 8:21 pm

Hi!

The speed-up refers to a one-thread search from the initial position to depth 11. With transposition tables enabled, the search is completed in 13 seconds, with them disabled the search takes 100 seconds — hence the x8 speed-up.

When I enable 10 threads, the nps goes x10 higher, but the time to depth 11 remains at 13 seconds. Something must be wrong.

— Steve

mar · Post by **mar** » Thu Jun 05, 2025 8:53 pm

ah ok, then it seems it works well, congratulations. lazySMP isn't supposed to cut time to depth.
I suggest you run a self-play test, 2 threads vs 1 (or 4 vs 1, ....) and you should already see a nice improvement in strength

hgm · Post by **hgm** » Fri Jun 06, 2025 6:47 am

This can of course only work if your single-threaded search is so heavily pruned/reduced that it is totally inadequate for finding the best move in a typical game position that a fixed-depth search would return. If the large number of threads doesn't add depth in a given time, the extra nodes can only have been used to fill each other's holes.

petero2 · Post by **petero2** » Fri Jun 06, 2025 9:09 am

hgm wrote: ↑Fri Jun 06, 2025 6:47 am This can of course only work if your single-threaded search is so heavily pruned/reduced that it is totally inadequate for finding the best move in a typical game position that a fixed-depth search would return. If the large number of threads doesn't add depth in a given time, the extra nodes can only have been used to fill each other's holes.

Indeed, and this is the reason why lazy SMP (or shared transposition tables as it was originally called) historically did not work well in chess engines. Eventually it was realized that a full-width search, where >99.9% of the searched nodes were "garbage nodes", was very weak compared to a more pruned search (history/late move reductions is probably the turning point). After that "lazy SMP" was re-discovered, as with a small enough effective branching factor for the single-threaded search, the lazy SMP approach is more efficient than other parallel algorithms, especially for large number of cores, because of the low thread synchronization overhead. Also because the full-width tree grows much faster than the pruned tree, there will "never" be enough search threads to "fill all the holes", so scaling stays good also for large number of cores.

So, if your lazy SMP implementation does not improve the engine strength, and you have debugged it to make sure there are no implementation bugs, check the following:

Ensure your engine has a low enough branching factor in single threaded search. If not, focus on improving single-threaded strength first.
Measure Elo, not secondary metrics like time to depth. Once the branching factor is low enough, time to depth is no longer a reliable measure of strength.
Make sure the threads are not doing exactly, or nearly, the same thing. Having thread-local killer/history/etc tables generally helps. Some implementations vary the search depth for different threads, though this may be less important if the single-threaded branching factor is already low.

Steve Maughan · Post by **Steve Maughan** » Fri Jun 06, 2025 12:25 pm

petero2 wrote: ↑Fri Jun 06, 2025 9:09 am<snip>
Indeed, and this is the reason why lazy SMP (or shared transposition tables as it was originally called) historically did not work well in chess engines. Eventually it was realized that a full-width search, where >99.9% of the searched nodes were "garbage nodes", was very weak compared to a more pruned search (history/late move reductions is probably the turning point). After that "lazy SMP" was re-discovered, as with a small enough effective branching factor for the single-threaded search, the lazy SMP approach is more efficient than other parallel algorithms, especially for large number of cores, because of the low thread synchronization overhead. Also because the full-width tree grows much faster than the pruned tree, there will "never" be enough search threads to "fill all the holes", so scaling stays good also for large number of cores.

So, if your lazy SMP implementation does not improve the engine strength, and you have debugged it to make sure there are no implementation bugs, check the following:
<snip>

This is fascinating. Currently my engine is extremely basic. The search has zero extensions, zero killers, no null move, no internal iterative deepening, no late move reductions etc. Basically it's a full width search with a move ordering of hash move, captures (MVV/LVA), then history ordered quiet moves; all followed by a quiescent search. I have a shared hash across all threads, and a shared history table (I'll test having a history table per thread). This is all intentional — I wanted to start by creating a bug free basic search and then start adding features. As you can imagine, the branching factor is nothing special.

Maybe the simplicity of the search is the reason I'm not seeing any advantage or speed-up with multiple cores.

— Steve

Help with LazySMP...

Help with LazySMP...

Re: Help with LazySMP...

Re: Help with LazySMP...

Re: Help with LazySMP...

Re: Help with LazySMP...

Re: Help with LazySMP...

Re: Help with LazySMP...