I have been trying to implement Lazy SMP multithreading for the past week, but my results have been underwhelming: at most I get a 2x speedup on 8 threads compared to 1 thread. I did all the usual things, such as keeping a separate history and killer table for each thread, and increasing the depth by one for every odd-numbered thread to create divergence early on, so each thread is not just repeating the same work. But still to no avail. For reference, I attached my iterative deepening method, which implements the multithreading. Perhaps you guys could find mistakes?
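Since the attached code isn't shown here, a minimal sketch of how such a Lazy SMP launch is commonly structured, for comparison (all names, the placeholder tables, and the odd-thread depth offset rule are illustrative, not your actual code):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Per-thread state: each helper owns its own history/killer tables so the
// threads diverge instead of repeating identical work.
struct ThreadData {
    int id = 0;
    int history[64][64] = {};   // placeholder history table
    int killers[128][2] = {};   // placeholder killer table
    uint64_t nodes = 0;         // per-thread node counter
};

// Common Lazy SMP trick: odd-numbered helpers start one ply deeper.
int rootDepth(int threadId, int baseDepth) {
    return baseDepth + (threadId & 1);
}

// Launch sketch: the main thread is id 0 and searches too; helpers run the
// same iterative deepening loop on their own ThreadData. In a real engine the
// stop flag is set by time management, not just when the main thread returns.
void lazySmpSearch(int numThreads, int maxDepth, std::atomic<bool>& stop,
                   void (*iterDeepen)(ThreadData&, int, std::atomic<bool>&)) {
    std::vector<ThreadData> data(numThreads);
    std::vector<std::thread> helpers;
    for (int i = 1; i < numThreads; ++i) {
        data[i].id = i;
        helpers.emplace_back([&, i] {
            iterDeepen(data[i], rootDepth(i, maxDepth), stop);
        });
    }
    iterDeepen(data[0], rootDepth(0, maxDepth), stop); // main thread searches
    stop = true;                                       // then stops the helpers
    for (auto& t : helpers) t.join();
}
```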
Did not look at your code, but maybe try a plain Shared Hash Table first, to ensure NPS speedup across real cores, and then add the further Lazy SMP methods.
Yes, I do have a shared hash table with lockless hashing. Thankfully I am certain it works fine, as I have tested it on a single core to verify that I am packing the data and keys correctly. I read from the link you sent that this can be due to limitations on the number of data lines in hardware; perhaps YBW would be better for this purpose?
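For other readers: the usual XOR-based lockless scheme (store key ^ data next to data, so a torn write from another thread fails the key check on probe instead of returning corrupt data) looks roughly like this; the entry layout and names are illustrative:

```cpp
#include <cstdint>

// Lockless hashing: no locks needed because a half-written entry
// (key from one store, data from another) will not pass the XOR check.
struct TTEntry {
    uint64_t keyXorData;  // zobrist key XOR packed data
    uint64_t data;        // packed move/score/depth/bound
};

void ttStore(TTEntry* table, uint64_t mask, uint64_t key, uint64_t data) {
    TTEntry& e = table[key & mask];
    e.keyXorData = key ^ data;
    e.data = data;
}

bool ttProbe(const TTEntry* table, uint64_t mask, uint64_t key,
             uint64_t& dataOut) {
    const TTEntry& e = table[key & mask];
    if ((e.keyXorData ^ e.data) == key) {  // torn writes fail this check
        dataOut = e.data;
        return true;
    }
    return false;
}
```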
YBWC was used in Stockfish and has meanwhile been superseded by Lazy SMP. Algorithms like DTS and YBW have some overhead which prevents near-linear NPS speedup; I myself like ABDADA. But in your case, with a plain SHT approach, as was mentioned in the other thread, you should get near-linear NPS speedup across real CPU cores. Maybe tell us the CPU type you use, how many cores, hyperthreading or not, and which OS? And post a speedup table as was done in the other thread. As Joost mentioned, TT lookups during QSearch could be a matter of memory bandwidth.
I have a MacBook Pro with the 16 real core Intel chip. I usually test on 8 threads, but here is the output of my engine for this position [fen]8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - -[/fen]. The moves are in algebraic notation, to show as an engine line in a GUI I made, but it should be understandable (there is a rook sacrifice it must find at depth 17). As you can see below, there is no real speedup in time to depth when running the iterative deepening method for 10 seconds.
Indeed, this looks wrong. Not the time to depth (Lazy SMP is measured in Elo gain, not in time-to-depth decrease), but your NPS (nodes per second) scaling across cores: you stick at ~1M NPS for 1, 2, and 4 threads, then decline at 8 and 16. That looks simply wrong.
Do you have an independent node counter for each thread, summed up at the end?
--
Srdja
Last edited by smatovic on Mon Apr 24, 2023 12:38 pm, edited 2 times in total.
I have a node counter only for the main thread; that is why the node count stays roughly the same. I would expect it to decline as threads increase, since the main thread does not have to do as much work thanks to the hash table being filled up. Yes, I think I will have to set up some self-play matches. But I did expect a speedup in time to depth.
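For what it's worth, the per-thread counting Srdja asked about can be sketched like this (names are mine, not from the attached code): each thread bumps only its own counter, and the reporter sums them when printing NPS, so there is no shared-counter contention in the hot loop.

```cpp
#include <cstdint>
#include <vector>

// Pad each counter to its own cache line to avoid false sharing between
// threads that increment their counters millions of times per second.
struct alignas(64) NodeCounter {
    uint64_t nodes = 0;  // incremented only by its owning thread
};

// Called by the reporting thread when printing NPS / node totals.
uint64_t totalNodes(const std::vector<NodeCounter>& counters) {
    uint64_t sum = 0;
    for (const auto& c : counters) sum += c.nodes;
    return sum;
}
```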
1. NPS scaling across real cores with plain Shared Hash Table should be almost linear (considering UMA).
2. I assume you have 4 or 8 real cores with 8 resp. 16 hyperthreads, because your NPS value declines beyond a thread count of 4.
3. I did not get Lazy SMP working myself, neither in better time to depth nor in Elo gain, but people say Elo is what matters.
4. As written in another thread, Lazy SMP is considered by some an "unsound" algorithm, because its gain depends on the implementation; there are alternatives like ABDADA or YBWC, which can be measured in time to depth and worked out with pen and paper.
--
Srdja
Last edited by smatovic on Mon Apr 24, 2023 12:53 pm, edited 2 times in total.
So for Lazy SMP I should rather measure Elo than time to depth to get an idea of how good my implementation is? And with YBW, time to depth would suffice.
Last edited by adityachandra on Mon Apr 24, 2023 12:55 pm, edited 1 time in total.