Lazy SMP ideas

Edsel Apostol · Post by **Edsel Apostol** » Wed Aug 22, 2018 3:53 pm

So the new engine rewrite is going well. Only material and PST for now with a basic search (just aspiration window PVS, LMR and Null move). Around Sunguros level. I managed to implement a basic Lazy SMP as well. Hannibal SMP was a modified YBWC but it is not that efficient with large number of threads due to non support of NUMA (I don't have a test machine for this), and the inherent limitation of YBWC especially in the endgame where there is not much number of moves to split work with. So right now I'm playing with Lazy SMP.

I first implemented a basic shared hash table, but I noticed that the threads seems to finish the iteration almost at the same time, returning the same score and principal variation. This is probably due to the fact that the single thread search is deterministic still and they are sharing the transposition table and the principal variation hash table.

I then implemented having every other thread start at depth 2 (odd thread ids start at depth 1), but they still converge after some time.

I also implemented an improvement over the previous one where each thread when started at depth 1 will search odd depths (1,3,5,...) and threads that started depth 2 will search even depths(2,4,6,..) but this seems too extreme as threads are not helping each other finish the iterations faster.

I don't quite get what SF is doing. It seems to take into account game ply, thread ids, and iteration depth to skip one depth. I also don't quite like the idea of lazy SMP and can't believe that it works at all. It seems too primitive. I'll probably implement a modified ABDADA.

So what's the best SMP algorithm nowadays?

Here's the link to the new engine by the way, if someone wants to play with a basic PST and mat only eval engine:

https://drive.google.com/file/d/1R_zvU3 ... sp=sharing

Let me know how it scales in large number of threads.

I'll probably share the source code as well some day when I release the first official version.

F. Bluemers · Post by **F. Bluemers** » Wed Aug 22, 2018 4:52 pm

I'll probably implement a modified ABDADA.

Some time ago Tom Kernighan posted his abdada modification in here .His webpage on it: http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/

PK · Post by PK » Thu Aug 23, 2018 10:26 am

Recently I noticed a post in which John Stanback, author of Wasp, wrote about the following algorithm: when a thread starts searching a new depth, check if more than some percentage of threads already search to this depth or higher. If so, skip that depth. In Rodent, I started using 50% as a skip threshold, and this modification scored 54% at 6 threads.

Also, there is a possibility that a threas starts lagging behind. I skip depth if a thread wants to search at depth lower than max depth reached - 1.

Edsel Apostol · Post by **Edsel Apostol** » Thu Aug 23, 2018 9:23 pm

PK wrote: ↑Thu Aug 23, 2018 10:26 am Recently I noticed a post in which John Stanback, author of Wasp, wrote about the following algorithm: when a thread starts searching a new depth, check if more than some percentage of threads already search to this depth or higher. If so, skip that depth. In Rodent, I started using 50% as a skip threshold, and this modification scored 54% at 6 threads.

Also, there is a possibility that a threas starts lagging behind. I skip depth if a thread wants to search at depth lower than max depth reached - 1.

I implemented this and it seems to work well. I have a counter per iteration depth that increments each time a thread searches that depth. Succeeding threads will check if the count for that depth is >= 50% and will skip that depth. Lagging threads is not an issue with my implementation as I don't decrement the iteration depth count table. So lagging threads will catch up to the highest depth being searched and will probably skip to the next depth if that iteration has enough threads already.

I'll probably mix this approach with ABDADA and we will see if it works.

Edsel Apostol · Post by **Edsel Apostol** » Thu Aug 23, 2018 9:26 pm

F. Bluemers wrote: ↑Wed Aug 22, 2018 4:52 pm
I'll probably implement a modified ABDADA.
Some time ago Tom Kernighan posted his abdada modification in here .His webpage on it: http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/

Thanks. I'll definitely try that.

Graham Banks · Post by **Graham Banks** » Fri Aug 24, 2018 12:27 am

Hi Edsel,

is Hannibal still being developed?

All the best with your new engine.

Graham.

Edsel Apostol · Post by **Edsel Apostol** » Fri Aug 24, 2018 9:24 am

Graham Banks wrote: ↑Fri Aug 24, 2018 12:27 am Hi Edsel,

is Hannibal still being developed?

All the best with your new engine.

Graham.

Hi Graham, thanks. It is not actively being developed anymore but I'll probably release one more version for the last time with a new SMP algo, just so it wouldn't get clobbered in tournaments with large number of threads.

-Edsel

Edsel Apostol · Post by **Edsel Apostol** » Sat Sep 01, 2018 6:57 am

So here are the test results from my implementation of ABDADA and Lazy SMP from the new engine and the modified YBWC from Hannibal.

I'm using the following set of positions (from Hannibal games and from some positions posted here in CCC):

Code: Select all

"r3k2r/pbpnqp2/1p1ppn1p/6p1/2PP4/2PBPNB1/P4PPP/R2Q1RK1 w kq - 2 12",
"2kr3r/pbpn1pq1/1p3n2/3p1R2/3P3p/2P2Q2/P1BN2PP/R3B2K w - - 4 22",
"r2n1rk1/1pq2ppp/p2pbn2/8/P3Pp2/2PBB2P/2PNQ1P1/1R3RK1 w - - 0 17",
"1r2r2k/1p4qp/p3bp2/4p2R/n3P3/2PB4/2PB1QPK/1R6 w - - 1 32",
"1b3r1k/rb1q3p/pp2pppP/3n1n2/1P2N3/P2B1NPQ/1B3P2/2R1R1K1 b - - 1 32",
"1r1r1qk1/pn1p2p1/1pp1npBp/8/2PB2QP/4R1P1/P4PK1/3R4 w - - 0 1",
"3rr1k1/1b2nnpp/1p1q1p2/pP1p1P2/P1pP2P1/2N1P1QP/3N1RB1/2R3K1 w - - 0 1",
"rn3rq1/p5k1/2p2bp1/1p4p1/8/2P1B1PQ/5PK1/3R3R w - - 0 1",
"1r3rk1/3bb1pp/1qn1p3/3pP3/3P1N2/2Q2N2/2P3PP/R1BR3K w - - 0 1",
"rn1q1rk1/2pbb3/pn2p3/1p1pPpp1/3P4/1PNBBN2/P1P1Q1PP/R4R1K w - - 0 1"

The positions are chosen based on their complexity where the engine would need at least a minute or two to complete the depth 20 iteration on single core.

Results are from a dual Intel Xeon E5-2698V3 https://ark.intel.com/products/81060/In ... e-2_30-GHz
The machine is 32 cores with 64 threads.

Hannibal modified YBWC searched with hash size of 512MB and fixed depth of 25 for thread values, 1,2,4,8,16,32,64. Values are summed and divided by the number of positions and then divided by the result of the single thread search. This doesn't take into account the turbo boost from the single core run, so the result is probably a bit lower than the correct value. Every start of the test ucinewgame is being issued, so the search is being started from scratch as hashes are cleared. The result for the Threads: 1 has average time spent in seconds and nodes in kNPS. The succeeding values for higher thread counts are multipliers for nodes, and the inverse for time.

Code: Select all

Threads: 1 time: 74.990600 nodes: 937.200000
Threads: 2 time: 2.149378 nodes: 1.942089
Threads: 4 time: 3.346066 nodes: 3.753011
Threads: 8 time: 5.226519 nodes: 6.973204
Threads: 16 time: 6.626765 nodes: 12.046264
Threads: 32 time: 8.139148 nodes: 16.006453
Threads: 64 time: 4.880303 nodes: 13.368858

As can be seen Hannibal NPS scaling is only good up to 16 cores. Maybe due to the machine being dual CPUs the engine struggled with NUMA due to Hannibal internal design of handling split points and repetition detection. Hannibal was only tested in an 8 core machine before this.

This is the result for the new chess engine with the modified ABDADA as implemented by Tom Kerrigan. http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
Searched with 512MB hash and fixed depth of 20.

Code: Select all

Threads: 1 time: 108.299200 nodes: 3127.100000
Threads: 2 time: 2.168862 nodes: 1.787663
Threads: 4 time: 3.603405 nodes: 3.492227
Threads: 8 time: 6.197661 nodes: 6.876950
Threads: 16 time: 8.440115 nodes: 13.613154
Threads: 32 time: 10.724592 nodes: 26.575884
Threads: 64 time: 11.759754 nodes: 31.082229

NPS scaling is not perfect due probably to turbo boost in single core and the signalling to the threads to quit current iteration upon one thread completing that iteration. This is to synchronize and focus the effort into the next iteration. This is done without waiting for any threads.

This is the result for the LazySMP:

Code: Select all

Threads: 1 time: 109.826200 nodes: 3084.100000
Threads: 2 time: 3.847632 nodes: 1.844580
Threads: 4 time: 6.209981 nodes: 3.616316
Threads: 8 time: 6.069562 nodes: 7.212488
Threads: 16 time: 7.001232 nodes: 14.255216
Threads: 32 time: 13.694476 nodes: 27.848897
Threads: 64 time: 10.592256 nodes: 31.428827

LazySMP is implemented with 50% of the threads searching depth and another 50% on depth+1. As can be seen there is some kind of super linear speedup in Threads 2 and 4. This is probably why Lazy is so strong in 4 cores which is currently the standard with the rating lists.

Invictus chess engine source code can be found here:
ABDADA https://github.com/ed-apostol/InvictusChess
LazySMP https://github.com/ed-apostol/InvictusC ... ee/LazySMP

Michael Sherwin · Post by **Michael Sherwin** » Sun Sep 16, 2018 5:07 am

Hi Edsel, I have an experiment you can try. That is if your new engine is still primarily a material searcher with pst. I promise it will be interesting!

Anyway, it is very simple. Disable the pst so it only has a material only search. Then count the number of beta cutoffs for each root move for both sides. You can subtract the amounts or divide to create a ratio. Then all that is done to select a move is to play the best material score using the beta cutoff values as a tiebreak. There is a lot of chess 'knowledge' in that simple beta cutoff count. It's play will surprise you. If that simple experiment is interesting then mate counts can be added and capture moves at the root can have their count modified so they are not automatically made. Anyway good to see you still around! And thanks.

Edsel Apostol · Post by **Edsel Apostol** » Sat Sep 29, 2018 9:56 pm

Michael Sherwin wrote: ↑Sun Sep 16, 2018 5:07 am Hi Edsel, I have an experiment you can try. That is if your new engine is still primarily a material searcher with pst. I promise it will be interesting!

Anyway, it is very simple. Disable the pst so it only has a material only search. Then count the number of beta cutoffs for each root move for both sides. You can subtract the amounts or divide to create a ratio. Then all that is done to select a move is to play the best material score using the beta cutoff values as a tiebreak. There is a lot of chess 'knowledge' in that simple beta cutoff count. It's play will surprise you. If that simple experiment is interesting then mate counts can be added and capture moves at the root can have their count modified so they are not automatically made. Anyway good to see you still around! And thanks.

Good to see you Mike. It's hard to leave this chess programming hobby. I keep coming back.

I've just released my first version (see General Topics). It should be around Romi strength. It should be a good sparring partner.

As for your idea, it seems like very similar to some search algos used by NN engines.

Lazy SMP ideas

Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas

Re: Lazy SMP ideas