Lazy SMP and 44 cores

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

jstanback
Posts: 59
Joined: Fri Jun 17, 2016 2:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Lazy SMP and 44 cores

Post by jstanback » Wed Aug 08, 2018 2:33 pm

In Wasp, I start all threads at depth=1 and let them run their own ID loop. At the start of each iteration beyond 6 (I think), each thread counts the number of threads already searching that depth or higher, and if the count is more than 50-67% of the total (I'm not sure what is best), the thread skips to the next iteration. For a previous TCEC season I skipped 2 iterations if 67% of the threads were already searching a given depth, but this did not work well. Before Wasp 3.0 I let any thread update the global score and pv; now only a "master" thread does this. Both methods seem about the same in strength.
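The skip rule John describes might look roughly like this (a sketch only; `Thread`, `curDepth`, and the fixed 50% threshold are illustrative names and choices, not Wasp's actual code):

```cpp
#include <atomic>
#include <vector>

struct Thread {
    std::atomic<int> curDepth{0};  // iteration this thread is currently searching
};

// Skip this iteration if more than half of all threads are already
// searching it (or deeper). The rule only kicks in beyond iteration 6.
bool shouldSkip(const std::vector<Thread>& threads, int depth) {
    if (depth <= 6)
        return false;
    int busy = 0;
    for (const Thread& t : threads)
        if (t.curDepth.load(std::memory_order_relaxed) >= depth)
            ++busy;
    return 2 * busy > (int)threads.size();  // more than 50% already there
}
```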

BTW, I had a "bug" in the Wasp I sent for TCEC 12 which reduced the search speed by almost 50%. I have a structure called "EvalData" used in many places which I previously allocated on the stack at the start of eval(). But in a brain-dead moment, I changed this to an array of structures indexed by thread. I think this causes the memory for all threads to be associated with the NUMA node that first writes to this array. Threads not on that NUMA node will have slow access to this data. I fixed this in Wasp 3.0 by including the EvalData structure in the ThreadData structure which is allocated and written to by each thread when it is first started. I have some other global variables which are initialized by the master thread and used, but not changed, by all the threads during search. I think I'll try duplicating these in the ThreadData structure to see if I can get more speedup....
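The fix described here relies on first-touch page placement: on typical NUMA systems, a page is bound to the node of the thread that first writes it. A minimal sketch of the pattern (the structure fields and function names are hypothetical; only the allocation pattern matters):

```cpp
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical shapes of the structures mentioned above.
struct EvalData  { int pieceScores[64]; };
struct ThreadData {
    EvalData evalData;  // embedded, so it shares the thread's own allocation
};

// Each search thread allocates and zeroes its own ThreadData. Under the
// usual first-touch NUMA policy, the backing pages then land on the node
// of the thread that first wrote them.
void threadStart(ThreadData*& slot) {
    slot = new ThreadData;
    std::memset(slot, 0, sizeof(ThreadData));  // first touch by the owner
}
```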

John

PK
Posts: 832
Joined: Mon Jan 15, 2007 10:23 am
Location: Warsza

Re: Lazy SMP and 44 cores

Post by PK » Thu Aug 09, 2018 5:29 am

Code: Select all

jstanback: At the start of each iteration beyond 6 (I think) I count the number of threads already searching this depth or higher and if the count is more than 50-67% of the total (I'm not sure what is best) then the thread will skip to the next iteration.
Just implemented this and tested overnight. Really helpful! Until now Rodent had shown no gain beyond 16 threads.

It seems TCEC has made forty-something threads the new standard.

sandermvdb
Posts: 157
Joined: Sat Jan 28, 2017 12:29 pm
Location: The Netherlands

Re: Lazy SMP and 44 cores

Post by sandermvdb » Wed Nov 21, 2018 1:20 pm

I am still struggling to improve my SMP implementation:
I used to spawn new threads using modulo 2 to assign the depth. If a certain thread finished, I would abort ALL threads and start ALL threads again. I've changed this so that I let all threads 'loose' and use the formula described above (depth + 1 if half of the threads are searching at the current depth or above). I also updated my transposition table from a two-entry replacement scheme (always-replace and depth-replace) to buckets, as most engines do. But my new implementation still performs worse with 4 threads than my old implementation did.

Some other info:
- I use 2 arrays in my hashtable (keys[] and values[]) which are XORed to prevent race conditions
- killers, countermoves and history heuristics are per thread
- evalcache, materialcache and pawnevalcache are not per thread
- the PV is only updated by the master thread
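The two-array XOR trick mentioned above is the Hyatt & Mann "lockless hashing" idea: store key ^ value alongside the value, so that an entry torn by a concurrent writer simply fails verification on probe instead of needing a lock. A minimal sketch (table size and field layout are illustrative):

```cpp
#include <cstdint>
#include <cstddef>

struct HashTable {
    static const size_t SIZE = 1u << 16;  // must be a power of two
    uint64_t keys[SIZE];                  // holds key ^ value, not the raw key
    uint64_t values[SIZE];

    void store(uint64_t key, uint64_t value) {
        size_t idx = key & (SIZE - 1);
        keys[idx]   = key ^ value;        // a torn write breaks this pairing...
        values[idx] = value;
    }

    bool probe(uint64_t key, uint64_t& value) const {
        size_t idx = key & (SIZE - 1);
        value = values[idx];
        return (keys[idx] ^ value) == key;  // ...so torn entries are rejected here
    }
};
```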

The trace below shows when each thread starts searching a particular depth and when it finishes (the aspiration window has been disabled).
As you can see, the threads sometimes finish a particular depth at around the same time (20, 22, ...).
Sometimes, however, a slave thread finishes much earlier than the master thread (21, 23, ...).

Code: Select all

23:15:11,474 - start 1 1
23:15:11,475 - start 3 1
23:15:11,476 - start 0 1
23:15:11,477 - start 2 1
23:15:11,481 - done  2 1
23:15:11,482 - done  1 1
23:15:11,482 - done  0 1
23:15:11,483 - done  3 1
23:15:11,484 - start 1 2
23:15:11,486 - start 2 2
23:15:11,487 - start 3 3
23:15:11,490 - done  1 2
23:15:11,491 - start 1 3
23:15:11,492 - done  2 2
23:15:11,492 - start 2 4
23:15:11,501 - done  1 3
23:15:11,502 - start 1 4
23:15:11,503 - done  3 3
23:15:11,507 - start 3 5
23:15:11,515 - done  2 4
23:15:11,516 - start 2 5
23:15:11,516 - done  1 4
23:15:11,552 - start 1 6
23:15:11,553 - done  3 5
23:15:11,554 - done  2 5
23:15:11,555 - start 2 7
23:15:11,556 - start 3 6
23:15:11,557 - info depth 1 seldepth 4 time 151 score cp 188 nps 4099 nodes 620 hashfull 0 pv e1g1 
23:15:11,560 - start 0 2
23:15:11,561 - done  0 2
23:15:11,562 - info depth 2 seldepth 10 time 193 score cp 35 nps 39958 nodes 7712 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,564 - start 0 3
23:15:11,565 - done  0 3
23:15:11,566 - info depth 3 seldepth 10 time 193 score cp 35 nps 40424 nodes 7803 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,568 - start 0 4
23:15:11,569 - done  0 4
23:15:11,569 - info depth 4 seldepth 10 time 193 score cp 35 nps 40886 nodes 7892 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,571 - start 0 5
23:15:11,572 - done  0 5
23:15:11,573 - info depth 5 seldepth 10 time 194 score cp 35 nps 41170 nodes 7987 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,575 - start 0 6
23:15:11,576 - done  0 6
23:15:11,577 - done  3 6
23:15:11,580 - done  1 6
23:15:11,582 - info depth 6 seldepth 10 time 205 score cp 17 nps 79941 nodes 16389 hashfull 0 pv b1c3 b8c6 c1e3 e8g8 e1g1 f8e8 
23:15:11,587 - start 0 7
23:15:11,588 - start 3 7
23:15:11,589 - start 1 8
23:15:11,590 - done  2 7
23:15:11,590 - start 2 8
23:15:11,593 - done  3 7
23:15:11,594 - start 3 9
23:15:11,595 - done  0 7
23:15:11,595 - info depth 7 seldepth 10 time 226 score cp 2 nps 130367 nodes 29464 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,602 - start 0 8
23:15:11,603 - done  0 8
23:15:11,603 - done  2 8
23:15:11,612 - start 2 9
23:15:11,613 - info depth 8 seldepth 12 time 257 score cp 18 nps 167474 nodes 43041 hashfull 0 pv e1g1 e8g8 e2e6 f7e6 b1c3 b8c6 c1e3 a7a6 
23:15:11,620 - start 0 9
23:15:11,621 - done  2 9
23:15:11,622 - start 2 10
23:15:11,623 - done  3 9
23:15:11,624 - start 3 10
23:15:11,629 - done  1 8
23:15:11,630 - start 1 11
23:15:11,631 - done  0 9
23:15:11,632 - info depth 9 seldepth 13 time 289 score cp 5 nps 205249 nodes 59318 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 c7c6 c2c3 
23:15:11,634 - start 0 10
23:15:11,846 - done  0 10
23:15:11,847 - done  3 10
23:15:11,849 - done  2 10
23:15:11,850 - info depth 10 seldepth 14 time 510 score cp 7 nps 402264 nodes 205155 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:11,857 - start 3 11
23:15:11,858 - start 0 11
23:15:11,858 - start 2 12
23:15:11,953 - done  3 11
23:15:11,954 - done  0 11
23:15:11,954 - info depth 11 seldepth 15 time 617 score cp 7 nps 477685 nodes 294733 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:11,957 - start 0 12
23:15:11,958 - start 3 12
23:15:12,219 - done  1 11
23:15:12,222 - start 1 13
23:15:12,272 - done  0 12
23:15:12,274 - done  2 12
23:15:12,275 - done  3 12
23:15:12,276 - start 3 14
23:15:12,277 - start 2 13
23:15:12,278 - info depth 12 seldepth 17 time 936 score cp 13 nps 547231 nodes 512209 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:12,281 - start 0 13
23:15:12,579 - done  0 13
23:15:12,580 - done  2 13
23:15:12,580 - done  1 13
23:15:12,581 - start 2 14
23:15:12,582 - start 1 15
23:15:12,583 - info depth 13 seldepth 19 time 1243 score cp 1 nps 655860 nodes 815236 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:12,586 - start 0 14
23:15:13,245 - done  0 14
23:15:13,246 - done  2 14
23:15:13,247 - done  3 14
23:15:13,248 - start 2 15
23:15:13,248 - info depth 14 seldepth 19 time 1910 score cp 1 nps 770541 nodes 1471736 hashfull 1 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:13,251 - start 3 16
23:15:13,252 - start 0 15
23:15:13,955 - done  0 15
23:15:13,956 - done  2 15
23:15:13,956 - start 2 16
23:15:13,957 - info depth 15 seldepth 21 time 2619 score cp 0 nps 964234 nodes 2525330 hashfull 2 pv b1c3 b8c6 c3b5 e8g8 c2c3 f8e8 e2e6 e8e6 c1e3 f6g4 
23:15:13,960 - start 0 16
23:15:13,962 - done  1 15
23:15:13,963 - start 1 17
23:15:14,759 - done  2 16
23:15:14,760 - done  0 16
23:15:14,761 - start 2 17
23:15:14,762 - info depth 16 seldepth 26 time 3424 score cp -1 nps 1377666 nodes 4717134 hashfull 5 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:14,764 - start 0 17
23:15:14,843 - done  3 16
23:15:14,844 - start 3 18
23:15:14,932 - done  1 17
23:15:14,933 - done  0 17
23:15:14,934 - done  2 17
23:15:14,935 - start 1 18
23:15:14,936 - start 2 19
23:15:14,936 - info depth 17 seldepth 27 time 3596 score cp 0 nps 1442885 nodes 5188618 hashfull 5 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:14,939 - start 0 18
23:15:16,764 - done  0 18
23:15:16,765 - done  3 18
23:15:16,766 - start 3 19
23:15:16,766 - info depth 18 seldepth 29 time 5428 score cp 7 nps 1844129 nodes 10009937 hashfull 11 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:16,769 - start 0 19
23:15:16,792 - done  1 18
23:15:16,792 - start 1 20
23:15:18,299 - done  2 19
23:15:18,300 - start 2 20
23:15:18,301 - done  3 19
23:15:18,301 - start 3 21
23:15:18,302 - done  0 19
23:15:18,302 - info depth 19 seldepth 29 time 6963 score cp 16 nps 2065770 nodes 14383963 hashfull 17 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 e1g1 a8e8 
23:15:18,305 - start 0 20
23:15:26,033 - done  3 21
23:15:26,034 - start 3 22
23:15:28,618 - done  1 20
23:15:28,619 - done  2 20
23:15:28,619 - done  0 20
23:15:28,620 - start 2 22
23:15:28,621 - start 1 21
23:15:28,622 - info depth 20 seldepth 33 time 17282 score cp 18 nps 2580668 nodes 44599118 hashfull 54 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:28,625 - start 0 21
23:15:29,344 - done  1 21
23:15:29,345 - done  0 21
23:15:29,346 - start 1 23
23:15:29,346 - info depth 21 seldepth 33 time 18008 score cp 18 nps 2594858 nodes 46728213 hashfull 57 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:29,349 - start 0 22
23:15:31,842 - done  2 22
23:15:31,843 - done  0 22
23:15:31,844 - start 2 23
23:15:31,844 - info depth 22 seldepth 33 time 20506 score cp 18 nps 2625330 nodes 53835034 hashfull 65 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:31,847 - start 0 23
23:15:33,258 - done  3 22
23:15:33,259 - start 3 24
23:15:51,006 - done  1 23
23:15:51,007 - done  2 23
23:15:51,008 - start 1 24
23:15:51,009 - start 2 25
23:16:00,974 - done  0 23
23:16:00,976 - info depth 23 seldepth 40 time 49639 score cp 19 nps 2783156 nodes 138153116 hashfull 168 pv b1c3 e6e2 c3e2 e8g8 e1g1 b8d7 c1e3 f8e8 h2h3 c7c6 
The only causes I can think of are my hashtable implementation or the way I spawn threads, which is now about the same as in other engines. Any ideas/suggestions?

jdart
Posts: 3834
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Lazy SMP and 44 cores

Post by jdart » Wed Nov 21, 2018 2:18 pm

I still have a pretty simple-minded approach to assigning depths to threads.

But one thing I have looked at recently is the selection logic when the search is done and there are results from multiple threads. It is possible that some of these are failing low. I extend the time if the main thread is failing low, but there is a limit on the extension, so it might still be failing low at the end of the search.

What to do in this case? You really only have an upper bound on the score, so there is no telling how bad the score might become if the iteration were allowed to complete. It could therefore be risky to take that search result (also, it may have no PV, so you won't get a ponder hit). You could simply refuse to accept a fail-low score, but due to the time limit it is even possible that all threads are failing low at search end (although that's not likely at all with 44 cores).

--Jon

Ronald
Posts: 90
Joined: Tue Jan 23, 2018 9:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: Lazy SMP and 44 cores

Post by Ronald » Wed Nov 21, 2018 3:47 pm

Hi Sander,

I took a quick look at your code, and it looks like you have one global iterative deepening loop in which you start new threads for your calculateBestmove search on every iteration. Most lazy SMP implementations (I think :D ) have their own iterative deepening loop in every thread; in other words, all threads execute exactly the same iterative deepening search you would normally run for a single thread. You can assign one of the threads to be the boss, which updates the score and pv and makes the decision to stop searching.
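In outline, the structure Ronald describes could be sketched like this (the search call itself is elided and all names here are illustrative, not from any particular engine):

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> stopSearch{false};

// Every thread runs the same iterative deepening loop it would run
// single-threaded; only the boss (id == 0) publishes results and
// decides when the whole search stops.
void idLoop(int id, int maxDepth) {
    for (int depth = 1; depth <= maxDepth && !stopSearch.load(); ++depth) {
        // int score = search(rootPos, depth, -INF, +INF);
        if (id == 0) {
            // updateGlobalPvAndScore(depth, score);
            if (depth == maxDepth)       // or: out of time
                stopSearch.store(true);  // boss ends the search for everyone
        }
    }
}
```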

sandermvdb
Posts: 157
Joined: Sat Jan 28, 2017 12:29 pm
Location: The Netherlands

Re: Lazy SMP and 44 cores

Post by sandermvdb » Wed Nov 21, 2018 3:51 pm

Ronald wrote:
Wed Nov 21, 2018 3:47 pm
Hi Sander,

I took a quick look at your code, and it looks like you have one global iterative deepening loop in which you start new threads for your calculateBestmove search on every iteration. Most lazy SMP implementations (i think :D ) have their own iterative deepening loop for every thread, or in other words, all threads execute exactly the same iterative deepening search as you would normally do for a single thread. You can assign one of the threads to be the boss, which updates the score and pv, and makes the decision to stop searching.
1.11 is indeed using my 'old' approach, which I am trying to update in the way you suggest. Unfortunately no success so far, hence I re-opened this topic ;)

Ratosh
Posts: 71
Joined: Mon Apr 16, 2018 4:56 pm

Re: Lazy SMP and 44 cores

Post by Ratosh » Wed Nov 21, 2018 4:25 pm

Hi Sander,

On Pirarucu I have a similar search setup:
- Only the pawn cache and transposition table are shared;
- No eval cache or material cache;
- PV and score are updated by the master thread and taken from the TT.

I have 2 different thread types (main search and helper search):
- The main thread controls the search stop conditions and the pv and score of the search. There is no difference here between single-threaded and multi-threaded use.
- The helper threads basically populate the TT with deeper search information. A helper stops searching when it reaches max depth (127) or when the search stop condition is set. Each helper iterates on its own and uses depth-skip logic (based on Laser), so the helpers go deeper at different rates and keep searching deeper than the main search.

It is pretty simple and seems to scale well.
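One way to realize the helper behavior described above, as a rough sketch: each helper iterates on its own up to depth 127, with a start offset and step derived from its id so that different helpers cover different depths. The actual skip tables in Pirarucu/Laser differ; this only shows the shape.

```cpp
#include <vector>

// Depths a given helper would visit (illustrative: odd helpers begin
// one deeper and skip every other depth). In the real engine, a call
// like search(rootPos, d, ...) would run at each visited depth.
std::vector<int> helperDepths(int helperId, int maxDepth = 127) {
    std::vector<int> visited;
    int start = 1 + helperId % 2;
    int step  = 1 + helperId % 2;
    for (int d = start; d <= maxDepth; d += step)
        visited.push_back(d);
    return visited;
}
```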

cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra

Re: Lazy SMP and 44 cores

Post by cdani » Wed Nov 21, 2018 10:56 pm

In Andscacs there is only one iterative deepening loop in the main thread, which signals all the threads to start searching.
The depth increments for the threads are set up like this:

Code: Select all

	// Default: alternate depth increments of 0 and 1 over the main depth.
	for (int i = 1; i < NumThreads; i++) {
		incrementthread[i] = i & 1;
	}

	// With up to 4 threads, give every thread its own increment.
	if (NumThreads <= 4) {
		incrementthread[0] = 0;
		incrementthread[1] = 1;
		incrementthread[2] = 2;
		incrementthread[3] = 3;
	}
	// Otherwise spread the first 8 threads over increments 0-4
	// (threads 8-15 keep the alternating 0/1 default).
	else {
		incrementthread[0] = 0;
		incrementthread[1] = 1;
		incrementthread[2] = 0;
		incrementthread[3] = 2;
		incrementthread[4] = 1;
		incrementthread[5] = 3;
		incrementthread[6] = 2;
		incrementthread[7] = 4;
	}

	// Beyond 16 threads, cycle through increments 0-7.
	if (NumThreads > 16) {
		for (int i = 16; i < NumThreads; i++) {
			incrementthread[i] = i & 7;
		}
	}

D Sceviour
Posts: 463
Joined: Mon Jul 20, 2015 3:06 pm

Re: Lazy SMP and 44 cores

Post by D Sceviour » Wed Nov 21, 2018 11:15 pm

sandermvdb wrote:
Tue Aug 07, 2018 6:51 pm
At the moment my chessengine participates in the TCEC tournament which uses a machine that has 44(!) cores. I was wondering what would be the best approach on how to use these cores effectively using lazy SMP. At the moment half of the threads search the actual depth and the other half searches at depth + 1. I thought this is what most engines do but I can imagine it isn't very useful when there are 44 threads.

Any advice?
Each independent thread will find its own maximum internal depth based on reductions, so what is the point in varying the depth at the root? More interesting is the idea of an aspiration spread for large numbers of threads. Has anyone tested a threading method based on the aspiration window rather than the iteration depth? Maybe something like this is worth trying for large numbers of threads:

t = thread number
1.414 = square root of 2
MINIMUM_ASPIRATION_WINDOW = 5
MATE_SCORE = 32768

aspiration(t) = (int) (1.414 * t) ^ 2

with the limits of,
MINIMUM_ASPIRATION_WINDOW < aspiration(t) < MATE_SCORE

I have a 4-core machine, so testing 43 threads is only academic. Maybe someone with bigger hardware can report findings on this.
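The proposed formula, read as "square (int)(sqrt(2) * t)", works out to roughly 2*t^2. A sketch of it in code, clamped to the stated limits (whether strict or inclusive clamping was meant is open; this version clamps inclusively):

```cpp
#include <algorithm>

const int MINIMUM_ASPIRATION_WINDOW = 5;
const int MATE_SCORE = 32768;

// Per-thread aspiration half-width as proposed above. Thread t would
// then search the window [score - aspiration(t), score + aspiration(t)].
int aspiration(int t) {
    int w = (int)(1.414 * t);  // 1.414 = square root of 2
    w = w * w;                 // so w is roughly 2 * t^2
    return std::max(MINIMUM_ASPIRATION_WINDOW, std::min(w, MATE_SCORE));
}
```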

jorose
Posts: 269
Joined: Thu Jan 22, 2015 2:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: Lazy SMP and 44 cores

Post by jorose » Thu Nov 22, 2018 11:52 am

There are a few things that I could imagine make Lazy SMP less effective in Winter. E.g. there are no reductions at the root node, and Winter does not use the history heuristic in any way at the moment, so the search is less stochastic.

The scheme I ended up with is the one described in this thread (skip a depth iff more than half the other threads are searching that depth or higher), with an additional modification that was necessary in Winter's case: at every depth where a search actually occurs in a helper thread, there is a 25% chance to perturb the root move list for that thread in the following way:

Code: Select all

  // One backwards bubble pass: each adjacent pair of root moves is
  // swapped with probability 1/2.
  for (int i = moves.size() - 1; i > 0; i--) {
    if (rng() % 2) {
      std::swap(moves[i - 1], moves[i]);
    }
  }
This perturbation has some nice properties imo. A move can only be knocked back by at most one rank, so the best move gets back to the front of the list quickly if it is incorrectly demoted. On the other hand, any move can end up at an arbitrarily high rank, though with exponentially lower probability the further back in the list it starts. Since it is stochastic, you also don't need to come up with a scheme to make sure different threads aren't doing exactly the same thing.

Originally I did this perturbation with probability 1/depth, since one expects the move lists to become more correct over time, but I found the search scaled worse with more time than with a fixed 25% probability.
-Jonathan
