Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

mar wrote:
Michel wrote:
mjlef wrote: We owe this to Don trying something people dismissed a bit too quickly.
This is distorting history in a big way. Lazy SMP was used by Toga from the start (long before Komodo had SMP) and one could easily verify on the rating lists that it had the same scaling 1->4 cores as engines that used YBW. This was pointed out regularly here.
Let's do some history then:

- The term Lazy SMP was coined by Julien Marcel (originally it was about parallelizing evaluation).
- Dan Homan (ExChess) liked the name and a while later reported his success with shared TT (this is what was used in Toga) plus varying depth for each other helper.
- Of course people like Bob kept saying that it "doesn't work" even though there was evidence to the contrary
The old term for "Lazy SMP" was "shared hash," and it is documented on the Chess Programming Wiki. It is roughly 30 years old. Among others, Vincent Diepeveen was an advocate of it many years ago. Papers and postings show it in use as early as the 1980s.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mar »

mjlef wrote:I never claimed all programmers dismissed Lazy SMP, but certainly a lot of them did. There are many threads here on the topic. Clang was another program that used it. If you look at the various threads, Kai noted Komodo's nps scaling was like Clang's, and Clang was known to use Lazy SMP.
Clang is a compiler. If you meant Cheng, that would be my engine, and I don't like its name being crippled. Thank you.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

mar wrote:
mjlef wrote:I never claimed all programmers dismissed Lazy SMP, but certainly a lot of them did. There are many threads here on the topic. Clang was another program that used it. If you look at the various threads, Kai noted Komodo's nps scaling was like Clang's, and Clang was known to use Lazy SMP.
Clang is a compiler. If you meant Cheng, that would be my engine, and I don't like its name being crippled. Thank you.
I am very sorry. I apologize for mangling the name of your program so much. I will try to be more careful in the future. I think you deserve credit for making shared hash (Lazy SMP) popular. Thanks.

Mark
jstanback
Posts: 130
Joined: Fri Jun 17, 2016 4:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by jstanback »

In Wasp, which uses lazy SMP, I tried abandoning the search for all threads searching at depths below that of the thread with the current best move, but it didn't seem to help (though I don't have a lot of data). I also didn't have much luck with incrementing depth by 2 for alternate threads.

What seems to help in Wasp (although still with limited testing) is to keep track of the current search depth for each thread and before starting a search at depth N, check how many threads are already searching at depth N. If there are 3 or more other threads already searching at depth N, bump the depth for that thread to N+1. With a lot of threads, it might be better to also test whether there are 5 or more threads searching at depth N+1 and if so, bump the depth to N+2...

John
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mjlef »

I asked a sysop to fix my blunder. Again, I am sorry about that.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by mar »

mjlef wrote:I am very sorry. I apologize for mangling the name of your program so much. I will try to be more careful in the future. I think you deserve credit for making shared hash (Lazy SMP) popular. Thanks.

Mark
Thanks, but the one who deserves credit is Dan. I did nothing special; I simply tried the idea he shared and it worked for me.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by lucasart »

mar wrote:
lucasart wrote:Here's another trick for you, which worked well in testing for my engine. I don't claim to have invented it, and I'm sure others have done it before, but here it is. When a thread completes a depth, it signals all other threads working on that depth or lower to stop immediately and report back to base (to find a useful depth to work on, which is where your depth-skipping strategy comes into play). Stockfish doesn't do that. Code is here: https://github.com/lucasart/demolito. It's in the search.cc iterate function (where signals are raised), and the recursive search is in recurse.h (where signals are listened to).
Yes, this is almost exactly what I'm doing in Cheng since the early days => whenever a helper finishes iteration I notify the master to abort, grab the result and continue next iteration.
(I've noticed during analysis that sometimes master can take much longer to finish than one of the helpers).

The only difference is that I don't continue at d+1 where the helper finished (I simply continue at the master's depth+1; in my case that's the lowest depth), so it's something worth trying; however, I don't work on it anymore, so it's up to others to try.
I don't have a master thread. All threads are equal. Any thread working on an obsolete depth can observe a selective stop signal and abort immediately (by throwing an exception).

Instead of a master thread, I have a centralized data structure for UCI info. When a thread finishes a depth, it puts the result there (score, depth, nodes, pv). The object itself is lock-protected, and it checks that the completed depth is indeed larger than the last completed one. It's the timer thread that probes this object and updates the GUI (if necessary) at regular time intervals, and that also checks for time-up conditions. Meanwhile, the main thread is just doing I/O, waiting on a getline(stdin). By "main" thread, I mean the one that started the main() function.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
PK
Posts: 893
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by PK »

The current Lazy SMP implementation in Rodent allows starting a new search iteration at (best_depth_reached - 1), but not at (best_depth_reached - 2); it skips an iteration in the latter case. This seemed consistent with varying the initial search depth. Otherwise I was observing threads that were "eternally late" by 3 or 4 plies, and therefore unlikely to write anything useful into the hash table.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Symmetric multiprocessing (SMP) scaling - SF8 and K10.4

Post by lucasart »

PK wrote:The current Lazy SMP implementation in Rodent allows starting a new search iteration at (best_depth_reached - 1), but not at (best_depth_reached - 2); it skips an iteration in the latter case. This seemed consistent with varying the initial search depth. Otherwise I was observing threads that were "eternally late" by 3 or 4 plies, and therefore unlikely to write anything useful into the hash table.
I think you have room for improvement here. The low-hanging fruit is simply to never redo a completed depth. Do not allow threads to start at best_depth_completed-1; instead make them search best_depth_completed+1 (ideally half at this depth, and half one depth above).

There are many variations, and the real bottleneck for me is testing time and hardware. I only have a 4 core i7 (8 ht threads, but that's not the same as 8 real cores). To improve SMP scaling, I would need access to a big machine with 8+ real cores. For now, I just do what works best for my SMP use-case, which is 8 threads on 4 physical cores with HT.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.