I have been trying to implement Lazy SMP multithreading for the past week, but my results have been underwhelming: at most I get a 2x speedup on 8 threads compared to 1 thread. I did all the usual things, such as keeping a separate history and killer table for each thread, and increasing the depth by one for every odd-numbered thread to create divergence early on, so each thread is not just repeating the same work. But still to no avail. For reference, I attached my iterative deepening method, which implements the multithreading. Perhaps you guys could find mistakes?
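Since the attached code isn't shown here, a minimal sketch of how such a Lazy SMP launch is commonly structured, for comparison (all names, the placeholder tables, and the odd-thread depth offset rule are illustrative, not your actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread state: each helper owns its own history/killer tables so the
// threads diverge instead of repeating identical work.
struct ThreadData {
    int id = 0;
    int history[64][64] = {};   // placeholder history table
    int killers[128][2] = {};   // placeholder killer table
    uint64_t nodes = 0;         // per-thread node counter
};

// Common Lazy SMP trick: odd-numbered helpers start one ply deeper.
int rootDepth(int threadId, int baseDepth) {
    return baseDepth + (threadId & 1);
}

// Launch sketch: the main thread is id 0 and searches too; helpers run the
// same iterative deepening loop on their own ThreadData. In a real engine the
// stop flag is set by time management, not just when the main thread returns.
void lazySmpSearch(int numThreads, int maxDepth, std::atomic<bool>& stop,
                   void (*iterDeepen)(ThreadData&, int, std::atomic<bool>&)) {
    std::vector<ThreadData> data(numThreads);
    std::vector<std::thread> helpers;
    for (int i = 1; i < numThreads; ++i) {
        data[i].id = i;
        helpers.emplace_back([&, i] {
            iterDeepen(data[i], rootDepth(i, maxDepth), stop);
        });
    }
    iterDeepen(data[0], rootDepth(0, maxDepth), stop); // main thread searches
    stop = true;                                       // then stops the helpers
    for (auto& t : helpers) t.join();
}
```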
Did not look at your code, but maybe try a plain Shared Hash Table first, to ensure NPS speedup across real cores, and then add the further Lazy SMP methods.
Yes, I do have a shared hash table with lockless hashing. Thankfully I am certain it works fine, as I have tested it on a single core to verify that I am packing the data and keys correctly. I read from the link you sent that this can be due to limitations on the number of data lines in hardware; perhaps YBW would be better for this purpose?
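For other readers: the usual XOR-based lockless scheme (store key ^ data next to data, so a torn write from another thread fails the key check on probe instead of returning corrupt data) looks roughly like this; the entry layout and names are illustrative:

```cpp
#include <cstdint>

// Lockless hashing: no locks needed because a half-written entry
// (key from one store, data from another) will not pass the XOR check.
struct TTEntry {
    uint64_t keyXorData;  // zobrist key XOR packed data
    uint64_t data;        // packed move/score/depth/bound
};

void ttStore(TTEntry* table, uint64_t mask, uint64_t key, uint64_t data) {
    TTEntry& e = table[key & mask];
    e.keyXorData = key ^ data;
    e.data = data;
}

bool ttProbe(const TTEntry* table, uint64_t mask, uint64_t key,
             uint64_t& dataOut) {
    const TTEntry& e = table[key & mask];
    if ((e.keyXorData ^ e.data) == key) {  // torn writes fail this check
        dataOut = e.data;
        return true;
    }
    return false;
}
```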
YBWC was used in Stockfish and has meanwhile been superseded by Lazy SMP. Algorithms like DTS and YBW have some overhead which prevents near-linear NPS speedup; I myself like ABDADA. But in your case, with a plain SHT approach, as was mentioned in the other thread, you should get near-linear NPS speedup across real CPU cores. Maybe tell us the CPU type you use, how many cores, hyperthreading or not, and which OS? And post a speedup table as was done in the other thread. As Joost mentioned, TT lookups during QSearch could be a matter of memory bandwidth.
I have a MacBook Pro with the 16 real core Intel chip. I usually test on 8 threads, but here is the output of my engine for this position [fen]8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - -[/fen]. The moves are in algebraic notation, to show as an engine line in a GUI I made, but it should be understandable (there is a rook sacrifice it must find at depth 17). As you can see below, there is no real speedup in time to depth when running the iterative deepening method for 10 seconds.
Indeed, this looks wrong. Not the time to depth (Lazy SMP is measured in Elo gain, not in time-to-depth decrease), but your NPS (nodes per second) scaling across cores: you stick at ~1M NPS for 1, 2, and 4 threads, then decline at 8 and 16. That looks simply wrong.
Do you have an independent node counter for each thread, summed up at the end?
--
Srdja
Last edited by smatovic on Mon Apr 24, 2023 12:38 pm, edited 2 times in total.
I have a node counter only for the main thread; that is why the node count stays roughly the same. I would expect it to decline as threads increase, since the main thread does not have to do as much work thanks to the hash table being filled up. Yes, I think I will have to set up some self-play matches. But I did expect a speedup in time to depth.
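For what it's worth, the per-thread counting Srdja asked about can be sketched like this (names are mine, not from the attached code): each thread bumps only its own counter, and the reporter sums them when printing NPS, so there is no shared-counter contention in the hot loop.

```cpp
#include <cstdint>
#include <vector>

// Pad each counter to its own cache line to avoid false sharing between
// threads that increment their counters millions of times per second.
struct alignas(64) NodeCounter {
    uint64_t nodes = 0;  // incremented only by its owning thread
};

// Called by the reporting thread when printing NPS / node totals.
uint64_t totalNodes(const std::vector<NodeCounter>& counters) {
    uint64_t sum = 0;
    for (const auto& c : counters) sum += c.nodes;
    return sum;
}
```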
1. NPS scaling across real cores with plain Shared Hash Table should be almost linear (considering UMA).
2. I assume you have 4 or 8 real cores with 8 resp. 16 hyperthreads, because your NPS value declines beyond a thread count of 4.
3. I did not get Lazy SMP working myself, neither in better time to depth nor in Elo gain, but people say Elo is what matters.
4. As written in another thread, Lazy SMP is considered by some an "unsound" algorithm, because its gain depends on the implementation; there are alternatives like ABDADA or YBWC, which can be measured in time to depth and worked out with pen and paper.
--
Srdja
Last edited by smatovic on Mon Apr 24, 2023 12:53 pm, edited 2 times in total.
So for Lazy SMP I should rather measure Elo than time to depth to get an idea of how good my implementation is? And with YBW, time to depth would suffice.
Last edited by adityachandra on Mon Apr 24, 2023 12:55 pm, edited 1 time in total.