Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

mar wrote:
Michel wrote:
mjlef wrote: We owe this to Don trying something people dismissed a bit too quickly.
This is distorting history in a big way. Lazy SMP was used by Toga from the start (long before Komodo had SMP) and one could easily verify on the rating lists that it had the same scaling 1->4 cores as engines that used YBW. This was pointed out regularly here.
Let's do some history then:

- The term Lazy SMP was coined by Julien Marcel (originally it was about parallelizing evaluation).
- Dan Homan (ExChess) liked the name and a while later reported his success with shared TT (this is what was used in Toga) plus varying depth for each other helper.
- Of course people like Bob kept saying that it "doesn't work" even though there was evidence to the contrary
The old term for "Lazy SMP" was "shared hash," and it is documented on the Chess Programming Wiki. It is roughly 30 years old. Among others, Vincent Diepeveen was an advocate of it many years ago. Papers and postings show it in use as early as the 1980s.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mar »

mjlef wrote:I never claimed all programmers dismissed Lazy SMP, but certainly a lot of them did. There are many threads here on the topic. Clang was another program that used it. If you look at the various threads, Kai noted Komodo's nps scaling was like Clang's, and Clang was known to use Lazy SMP.
Clang is a compiler. If you meant Cheng, that would be my engine, and I don't like its name being crippled. Thank you.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

mar wrote:
mjlef wrote:I never claimed all programmers dismissed Lazy SMP, but certainly a lot of them did. There are many threads here on the topic. Clang was another program that used it. If you look at the various threads, Kai noted Komodo's nps scaling was like Clang's, and Clang was known to use Lazy SMP.
Clang is a compiler. If you meant Cheng, that would be my engine, and I don't like its name being crippled. Thank you.
I am very sorry. I apologize for mangling the name of your program so much. I will try to be more careful in the future. I think you deserve credit for making shared hash (Lazy SMP) popular. Thanks.

Mark
jstanback
Posts: 130
Joined: Fri Jun 17, 2016 4:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by jstanback »

In Wasp, which uses lazy SMP, I tried abandoning the search for all threads searching at depths below that of the thread with the current best move, but it didn't seem to help (though I don't have a lot of data). I also didn't have much luck with incrementing depth by 2 for alternate threads.

What seems to help in Wasp (although still with limited testing) is to keep track of the current search depth for each thread and before starting a search at depth N, check how many threads are already searching at depth N. If there are 3 or more other threads already searching at depth N, bump the depth for that thread to N+1. With a lot of threads, it might be better to also test whether there are 5 or more threads searching at depth N+1 and if so, bump the depth to N+2...

John
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

I asked a sysop to fix my blunder. Again, I am sorry about that.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mar »

mjlef wrote:I am very sorry. I apologize for mangling the name of your program so much. I will try to be more careful in the future. I think you deserve credit for making shared hash (Lazy SMP) popular. Thanks.

Mark
Thanks, but the one who deserves credit is Dan. I did nothing special; I simply tried the idea he shared and it worked for me.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by lucasart »

mar wrote:
lucasart wrote:Here's another trick for you, which worked well in testing for my engine. I don't claim to have invented it, and I'm sure others have done it before, but here it is. When a thread completes a depth, it signals all other threads working on that depth or lower to stop immediately and report back to base (to find a useful depth to work on, which is where your depth-skipping strategy comes into play). Stockfish doesn't do that. Code is here: https://github.com/lucasart/demolito. It's in the search.cc iterate function (where signals are raised), and the recursive search is in recurse.h (where signals are listened to).
Yes, this is almost exactly what I'm doing in Cheng since the early days => whenever a helper finishes iteration I notify the master to abort, grab the result and continue next iteration.
(I've noticed during analysis that sometimes master can take much longer to finish than one of the helpers).

The only difference is that I don't continue at d+1 where the helper finished (I simply continue at the master's depth+1; in my case that's the lowest depth), so it's something worth trying; however, I don't work on it anymore, so it's up to others to try.
I don't have a master thread. All threads are equal. Any thread working on an obsolete depth can observe a selective stop signal and abort immediately (by throwing an exception).

Instead of a master thread, I have a centralized data structure for UCI info. When a thread finishes a depth, it puts the result there (score, depth, nodes, pv). The object itself is lock-protected, and it checks that the completed depth is indeed larger than the last completed one. It's the timer thread that probes this object and updates the GUI (if necessary) at regular time intervals, and that also checks for time-up conditions. Meanwhile, the main thread is just doing I/O, waiting on a getline(stdin). By "main" thread, I mean the one that started the main() function.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by PK »

The current Lazy SMP implementation in Rodent allows starting a new search iteration at (best_depth_reached - 1), but not at (best_depth_reached - 2); it skips an iteration in the latter case. This seemed consistent with varying the initial search depth. Otherwise I was observing threads that were "eternally late" by 3 or 4 plies, and therefore unlikely to write anything useful into the hash table.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by lucasart »

PK wrote:The current Lazy SMP implementation in Rodent allows starting a new search iteration at (best_depth_reached - 1), but not at (best_depth_reached - 2); it skips an iteration in the latter case. This seemed consistent with varying the initial search depth. Otherwise I was observing threads that were "eternally late" by 3 or 4 plies, and therefore unlikely to write anything useful into the hash table.
I think you have room for improvement here. The low-hanging fruit is simply to never redo a completed depth. Do not allow threads to start at best_depth_completed-1; instead make them search best_depth_completed+1 (ideally half at this depth, and half one depth above).

There are many variations, and the real bottleneck for me is testing time and hardware. I only have a 4 core i7 (8 ht threads, but that's not the same as 8 real cores). To improve SMP scaling, I would need access to a big machine with 8+ real cores. For now, I just do what works best for my SMP use-case, which is 8 threads on 4 physical cores with HT.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.