I'm trying to get LazySMP working on my new engine. So far I don't get any speedup at all — zilch, zippo, nada, nothing!!
My transposition table is working well. A single thread search get's a eight fold speed increase searching from the initial start position. This seems quite decent. It seems to be a decent lock free transposition table. When I fire up multiple threads, all grinding away at the same TT, I can see the threads are running but there's no speed-up. I should say, all threads (including the main search thread), are doing exactly the same thing. I'm not varying the aspiration window or depth.
Any tips on implementing LazySMP? What's the secret sauce to get LazySMP working?
— Steve
Help with LazySMP...
Moderator: Ras
-
- Posts: 1273
- Joined: Wed Mar 08, 2006 8:28 pm
- Location: Florida, USA
Help with LazySMP...
http://www.chessprogramming.net - Juggernaut & Maverick Chess Engine
-
- Posts: 2654
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Help with LazySMP...
what do you mean by single thread gets 8-fold speed increase from initial position?
with lazysmp you should get almost linear increase in nps (not time to depth) - of course you have to sum the total nodes searched from all running threads
with 4 vs 1 threads you should get about 100 elo self-play even with the plain dumbest shared TT
with lazysmp you should get almost linear increase in nps (not time to depth) - of course you have to sum the total nodes searched from all running threads
with 4 vs 1 threads you should get about 100 elo self-play even with the plain dumbest shared TT
-
- Posts: 1273
- Joined: Wed Mar 08, 2006 8:28 pm
- Location: Florida, USA
Re: Help with LazySMP...
Hi!
The speed-up refers to a one-thread search from the initial position to depth 11. With transposition tables enabled, the search is completed in 13 seconds, with them disabled the search takes 100 seconds — hence the x8 speed-up.
When I enable 10 threads, the nps goes x10 higher, but the time to depth 11 remains at 13 seconds. Something must be wrong.
— Steve
The speed-up refers to a one-thread search from the initial position to depth 11. With transposition tables enabled, the search is completed in 13 seconds, with them disabled the search takes 100 seconds — hence the x8 speed-up.
When I enable 10 threads, the nps goes x10 higher, but the time to depth 11 remains at 13 seconds. Something must be wrong.
— Steve
http://www.chessprogramming.net - Juggernaut & Maverick Chess Engine
-
- Posts: 2654
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: Help with LazySMP...
ah ok, then it seems it works well, congratulations. lazySMP isn't supposed to cut time to depth.
I suggest you run a self-play test, 2 threads vs 1 (or 4 vs 1, ....) and you should already see a nice improvement in strength
I suggest you run a self-play test, 2 threads vs 1 (or 4 vs 1, ....) and you should already see a nice improvement in strength
-
- Posts: 28353
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Help with LazySMP...
This can of course only work if your single-threaded search is so heavily pruned/reduced that it is totally inadequate for finding the best move in a typical game position that a fixed-depth search would return. If the large number of threads doesn't add depth in a given time, the extra nodes can only have been used to fill each other's holes.
-
- Posts: 723
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Help with LazySMP...
Indeed, and this is the reason why lazy SMP (or shared transposition tables as it was originally called) historically did not work well in chess engines. Eventually it was realized that a full-width search, where >99.9% of the searched nodes were "garbage nodes", was very weak compared to a more pruned search (history/late move reductions is probably the turning point). After that "lazy SMP" was re-discovered, as with a small enough effective branching factor for the single-threaded search, the lazy SMP approach is more efficient than other parallel algorithms, especially for large number of cores, because of the low thread synchronization overhead. Also because the full-width tree grows much faster than the pruned tree, there will "never" be enough search threads to "fill all the holes", so scaling stays good also for large number of cores.hgm wrote: ↑Fri Jun 06, 2025 6:47 am This can of course only work if your single-threaded search is so heavily pruned/reduced that it is totally inadequate for finding the best move in a typical game position that a fixed-depth search would return. If the large number of threads doesn't add depth in a given time, the extra nodes can only have been used to fill each other's holes.
So, if your lazy SMP implementation does not improve the engine strength, and you have debugged it to make sure there are no implementation bugs, check the following:
- Ensure your engine has a low enough branching factor in single threaded search. If not, focus on improving single-threaded strength first.
- Measure Elo, not secondary metrics like time to depth. Once the branching factor is low enough, time to depth is no longer a reliable measure of strength.
- Make sure the threads are not doing exactly, or nearly, the same thing. Having thread-local killer/history/etc tables generally helps. Some implementations vary the search depth for different threads, though this may be less important if the single-threaded branching factor is already low.
-
- Posts: 1273
- Joined: Wed Mar 08, 2006 8:28 pm
- Location: Florida, USA
Re: Help with LazySMP...
This is fascinating. Currently my engine is extremely basic. The search has zero extensions, zero killers, no null move, no internal iterative deepening, no late move reductions etc. Basically it's a full width search with a move ordering of hash move, captures (MVV/LVA), then history ordered quiet moves; all followed by a quiescent search. I have a shared hash across all threads, and a shared history table (I'll test having a history table per thread). This is all intentional — I wanted to start by creating a bug free basic search and then start adding features. As you can imagine, the branching factor is nothing special.petero2 wrote: ↑Fri Jun 06, 2025 9:09 am<snip>
Indeed, and this is the reason why lazy SMP (or shared transposition tables as it was originally called) historically did not work well in chess engines. Eventually it was realized that a full-width search, where >99.9% of the searched nodes were "garbage nodes", was very weak compared to a more pruned search (history/late move reductions is probably the turning point). After that "lazy SMP" was re-discovered, as with a small enough effective branching factor for the single-threaded search, the lazy SMP approach is more efficient than other parallel algorithms, especially for large number of cores, because of the low thread synchronization overhead. Also because the full-width tree grows much faster than the pruned tree, there will "never" be enough search threads to "fill all the holes", so scaling stays good also for large number of cores.
So, if your lazy SMP implementation does not improve the engine strength, and you have debugged it to make sure there are no implementation bugs, check the following:
<snip>
Maybe the simplicity of the search is the reason I'm not seeing any advantage or speed-up with multiple cores.
— Steve
http://www.chessprogramming.net - Juggernaut & Maverick Chess Engine