Lazy SMP ideas

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

PK
Posts: 834
Joined: Mon Jan 15, 2007 10:23 am
Location: Warsza
Contact:

Re: Lazy SMP ideas

Post by PK » Thu Aug 23, 2018 8:26 am

Recently I noticed a post in which John Stanback, author of Wasp, described the following algorithm: when a thread starts searching a new depth, check whether more than some percentage of the threads are already searching that depth or higher. If so, skip that depth. In Rodent, I started using 50% as the skip threshold, and this modification scored 54% at 6 threads.

Also, there is the possibility that a thread starts lagging behind, so I also skip a depth if a thread wants to search at a depth lower than the maximum depth reached minus 1.
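As a rough illustration, the two skip rules above might look like this (a minimal C++ sketch with illustrative names, not Rodent's or Wasp's actual code):

```cpp
// Sketch of the depth-skipping rules described above. Each thread publishes
// the depth it is working on; before starting iteration `depth`, a thread
// skips it when more than `skip_percent` of the other threads are already
// searching that depth or higher, or when it has fallen too far behind.
#include <atomic>
#include <vector>

struct SharedSearchState {
    std::vector<std::atomic<int>> thread_depth;   // depth each thread is searching
    std::atomic<int> max_depth_reached{0};

    explicit SharedSearchState(int num_threads) : thread_depth(num_threads) {
        for (auto& d : thread_depth) d.store(0);
    }
};

// Returns true if `thread_id` should skip `depth` and move on to depth + 1.
bool should_skip_depth(const SharedSearchState& s, int thread_id, int depth,
                       int skip_percent = 50) {
    int n = static_cast<int>(s.thread_depth.size());
    if (n <= 1) return false;                 // a single thread never skips

    // Rule 2: never lag more than one iteration behind the best thread.
    if (depth < s.max_depth_reached.load() - 1) return true;

    // Rule 1: skip if too many other threads already search this depth or higher.
    int busy = 0;
    for (int t = 0; t < n; ++t)
        if (t != thread_id && s.thread_depth[t].load() >= depth) ++busy;
    return busy * 100 > skip_percent * (n - 1);
}
```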

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Thu Aug 23, 2018 7:23 pm

PK wrote:
Thu Aug 23, 2018 8:26 am
Recently I noticed a post in which John Stanback, author of Wasp, described the following algorithm: when a thread starts searching a new depth, check whether more than some percentage of the threads are already searching that depth or higher. If so, skip that depth. In Rodent, I started using 50% as the skip threshold, and this modification scored 54% at 6 threads.

Also, there is the possibility that a thread starts lagging behind, so I also skip a depth if a thread wants to search at a depth lower than the maximum depth reached minus 1.
I implemented this and it seems to work well. I have a counter per iteration depth that is incremented each time a thread starts searching that depth. Succeeding threads check whether the count for that depth is >= 50% of the thread count and, if so, skip that depth. Lagging threads are not an issue with my implementation, as I never decrement the iteration-depth counters: lagging threads will catch up to the highest depth being searched and will probably skip to the next depth if that iteration already has enough threads.
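A minimal sketch of this counter scheme (illustrative names, not the actual implementation): each depth has a monotonic counter, and a thread skips ahead while the depth it wants is already half-subscribed:

```cpp
// Sketch of the per-depth counter variant described above. The counters are
// monotonic: incremented when a thread starts an iteration and never
// decremented, so lagging threads naturally skip past depths that have
// already received enough workers.
#include <array>
#include <atomic>

constexpr int MAX_PLY = 128;

struct IterationCounters {
    std::array<std::atomic<int>, MAX_PLY> started{};  // threads that began each depth
    int num_threads = 1;
};

// Called when a thread is about to begin `depth`; returns the depth it should
// actually search, skipping ahead while a depth is >= 50% subscribed.
int pick_depth(IterationCounters& c, int depth) {
    while (depth + 1 < MAX_PLY &&
           c.started[depth].load() * 2 >= c.num_threads)  // >= 50% already there
        ++depth;
    c.started[depth].fetch_add(1);
    return depth;
}
```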

I'll probably mix this approach with ABDADA and see if it works.

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Thu Aug 23, 2018 7:26 pm

F. Bluemers wrote:
Wed Aug 22, 2018 2:52 pm
I'll probably implement a modified ABDADA.
Some time ago Tom Kerrigan posted his ABDADA modification here. His web page on it: http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
Thanks. I'll definitely try that.

Graham Banks
Posts: 33236
Joined: Sun Feb 26, 2006 9:52 am
Location: Auckland, NZ

Re: Lazy SMP ideas

Post by Graham Banks » Thu Aug 23, 2018 10:27 pm

Hi Edsel,

Is Hannibal still being developed?

All the best with your new engine.

Graham.
My email addresses:
gbanksnz at gmail.com
gbanksnz at yahoo.co.nz

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Fri Aug 24, 2018 7:24 am

Graham Banks wrote:
Thu Aug 23, 2018 10:27 pm
Hi Edsel,

Is Hannibal still being developed?

All the best with your new engine.

Graham.
Hi Graham, thanks. It is not actively being developed anymore, but I'll probably release one last version with a new SMP algorithm, just so it won't get clobbered in tournaments with large numbers of threads.

-Edsel

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Sat Sep 01, 2018 4:57 am

So here are the test results for my implementations of ABDADA and Lazy SMP in the new engine and of the modified YBWC in Hannibal.

I'm using the following set of positions (from Hannibal games and from some positions posted here in CCC):

Code: Select all

"r3k2r/pbpnqp2/1p1ppn1p/6p1/2PP4/2PBPNB1/P4PPP/R2Q1RK1 w kq - 2 12",
"2kr3r/pbpn1pq1/1p3n2/3p1R2/3P3p/2P2Q2/P1BN2PP/R3B2K w - - 4 22",
"r2n1rk1/1pq2ppp/p2pbn2/8/P3Pp2/2PBB2P/2PNQ1P1/1R3RK1 w - - 0 17",
"1r2r2k/1p4qp/p3bp2/4p2R/n3P3/2PB4/2PB1QPK/1R6 w - - 1 32",
"1b3r1k/rb1q3p/pp2pppP/3n1n2/1P2N3/P2B1NPQ/1B3P2/2R1R1K1 b - - 1 32",
"1r1r1qk1/pn1p2p1/1pp1npBp/8/2PB2QP/4R1P1/P4PK1/3R4 w - - 0 1",
"3rr1k1/1b2nnpp/1p1q1p2/pP1p1P2/P1pP2P1/2N1P1QP/3N1RB1/2R3K1 w - - 0 1",
"rn3rq1/p5k1/2p2bp1/1p4p1/8/2P1B1PQ/5PK1/3R3R w - - 0 1",
"1r3rk1/3bb1pp/1qn1p3/3pP3/3P1N2/2Q2N2/2P3PP/R1BR3K w - - 0 1",
"rn1q1rk1/2pbb3/pn2p3/1p1pPpp1/3P4/1PNBBN2/P1P1Q1PP/R4R1K w - - 0 1"
The positions were chosen for their complexity: the engine needs at least a minute or two to complete the depth-20 iteration on a single core.

Results are from a dual Intel Xeon E5-2698V3 https://ark.intel.com/products/81060/In ... e-2_30-GHz
The machine has 32 cores and 64 threads.

Hannibal's modified YBWC searched with a hash size of 512 MB and a fixed depth of 25, for thread counts 1, 2, 4, 8, 16, 32, and 64. Values are summed, divided by the number of positions, and then divided by the result of the single-thread search. This doesn't take into account the turbo boost during the single-core run, so the results are probably a bit lower than the correct values. At every start of the test a ucinewgame is issued, so the search starts from scratch with cleared hashes. The Threads: 1 row shows the average time spent in seconds and the speed in kNPS; the values for higher thread counts are multipliers for nodes, and inverse multipliers for time.

Code: Select all

Threads: 1 time: 74.990600 nodes: 937.200000
Threads: 2 time: 2.149378 nodes: 1.942089
Threads: 4 time: 3.346066 nodes: 3.753011
Threads: 8 time: 5.226519 nodes: 6.973204
Threads: 16 time: 6.626765 nodes: 12.046264
Threads: 32 time: 8.139148 nodes: 16.006453
Threads: 64 time: 4.880303 nodes: 13.368858
As can be seen, Hannibal's NPS scaling is only good up to 16 cores. Since the machine has dual CPUs, the engine probably struggled with NUMA because of Hannibal's internal design for handling split points and repetition detection. Hannibal had only been tested on an 8-core machine before this.
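For clarity, the normalization described above amounts to something like the following (a sketch under the assumption that per-position values are averaged first; the actual measurement scripts may differ):

```cpp
// Sketch of how the scaling multipliers in the tables are derived: per-
// position results are averaged, then normalized against the single-thread
// run. The time ratio is inverted so that bigger means faster.
#include <numeric>
#include <vector>

double average(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Time-to-depth speedup relative to 1 thread (e.g. 2.149 at 2 threads).
double time_multiplier(const std::vector<double>& times_1t,
                       const std::vector<double>& times_nt) {
    return average(times_1t) / average(times_nt);
}

// NPS scaling relative to 1 thread (e.g. 1.942 at 2 threads).
double nps_multiplier(const std::vector<double>& nps_1t,
                      const std::vector<double>& nps_nt) {
    return average(nps_nt) / average(nps_1t);
}
```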

This is the result for the new chess engine with the modified ABDADA as implemented by Tom Kerrigan: http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
It searched with a 512 MB hash and a fixed depth of 20.

Code: Select all

Threads: 1 time: 108.299200 nodes: 3127.100000
Threads: 2 time: 2.168862 nodes: 1.787663
Threads: 4 time: 3.603405 nodes: 3.492227
Threads: 8 time: 6.197661 nodes: 6.876950
Threads: 16 time: 8.440115 nodes: 13.613154
Threads: 32 time: 10.724592 nodes: 26.575884
Threads: 64 time: 11.759754 nodes: 31.082229
NPS scaling is not perfect, probably due to turbo boost during the single-core run and to the signalling that tells threads to quit the current iteration as soon as one thread completes it. This synchronizes the threads and focuses the effort on the next iteration, and it is done without waiting for any thread.
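The early-quit signalling might be sketched like this (assumed names, not Invictus's actual code): the first thread to finish an iteration raises a shared completed-depth marker, and the others poll it during search:

```cpp
// Sketch of iteration-abort signalling. The winning thread publishes the
// depth it completed; other threads poll the marker and abandon any
// iteration that is no longer useful, with no waiting or joining involved.
#include <atomic>

std::atomic<int> completed_depth{0};

// Called by a thread when it finishes iteration `depth`.
void on_iteration_done(int depth) {
    int cur = completed_depth.load();
    // Only raise the marker; a stale (lower) completion must not lower it.
    while (depth > cur &&
           !completed_depth.compare_exchange_weak(cur, depth)) {}
}

// Polled periodically inside the search; true means "abort this iteration".
bool should_abort(int my_depth) {
    return my_depth <= completed_depth.load();
}
```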

This is the result for the LazySMP:

Code: Select all

Threads: 1 time: 109.826200 nodes: 3084.100000
Threads: 2 time: 3.847632 nodes: 1.844580
Threads: 4 time: 6.209981 nodes: 3.616316
Threads: 8 time: 6.069562 nodes: 7.212488
Threads: 16 time: 7.001232 nodes: 14.255216
Threads: 32 time: 13.694476 nodes: 27.848897
Threads: 64 time: 10.592256 nodes: 31.428827
LazySMP is implemented with 50% of the threads searching the current depth and the other 50% searching depth+1. As can be seen, there is some kind of super-linear speedup at 2 and 4 threads. This is probably why Lazy is so strong at 4 cores, which is currently the standard for the rating lists.
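A minimal sketch of that half-and-half split, assuming helpers are simply assigned by thread index (illustrative, not the engine's exact scheme):

```cpp
// Sketch of the 50/50 lazy SMP depth split: even-numbered helper threads
// search the main iteration depth, odd-numbered ones search one ply deeper.
// Thread 0 is assumed to be the main thread.
int helper_depth(int thread_id, int base_depth) {
    if (thread_id == 0) return base_depth;   // main thread drives the iteration
    return base_depth + (thread_id & 1);     // odd helpers go one ply deeper
}
```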

Invictus chess engine source code can be found here:
ABDADA https://github.com/ed-apostol/InvictusChess
LazySMP https://github.com/ed-apostol/InvictusC ... ee/LazySMP

Michael Sherwin
Posts: 3046
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Lazy SMP ideas

Post by Michael Sherwin » Sun Sep 16, 2018 3:07 am

Hi Edsel, I have an experiment you can try. That is, if your new engine is still primarily a material searcher with PSTs. I promise it will be interesting!

Anyway, it is very simple. Disable the PSTs so the engine does a material-only search. Then count the number of beta cutoffs for each root move, for both sides. You can subtract the counts or divide them to create a ratio. Then all that is done to select a move is to play the best material score, using the beta cutoff values as a tiebreak. There is a lot of chess 'knowledge' in that simple beta cutoff count. Its play will surprise you. If that simple experiment is interesting, then mate counts can be added, and capture moves at the root can have their counts modified so they are not automatically made. Anyway, good to see you're still around! And thanks. :D
I hate if statements. Pawns demand if statements. Therefore I hate pawns.
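The experiment described above could be sketched roughly as follows (all names hypothetical, and the cutoff bookkeeping during search is left out): choose the best material score, breaking ties by each root move's cutoff balance:

```cpp
// Sketch of the beta-cutoff tiebreak: each root move accumulates the number
// of beta cutoffs its subtree produced for each side during a material-only
// search; move choice is best material score first, cutoff balance second.
#include <cstdint>
#include <vector>

struct RootMove {
    int move;                  // encoded move, opaque here
    int material_score;        // score from the material-only search
    int64_t cutoffs_us = 0;    // beta cutoffs for the side to move
    int64_t cutoffs_them = 0;  // beta cutoffs for the opponent
};

// Returns the index of the chosen root move.
int pick_root_move(const std::vector<RootMove>& moves) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(moves.size()); ++i) {
        const RootMove &a = moves[i], &b = moves[best];
        int64_t ta = a.cutoffs_us - a.cutoffs_them;   // cutoff balance (subtract)
        int64_t tb = b.cutoffs_us - b.cutoffs_them;
        if (a.material_score > b.material_score ||
            (a.material_score == b.material_score && ta > tb))
            best = i;
    }
    return best;
}
```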

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Sat Sep 29, 2018 7:56 pm

Michael Sherwin wrote:
Sun Sep 16, 2018 3:07 am
Hi Edsel, I have an experiment you can try. That is, if your new engine is still primarily a material searcher with PSTs. I promise it will be interesting!

Anyway, it is very simple. Disable the PSTs so the engine does a material-only search. Then count the number of beta cutoffs for each root move, for both sides. You can subtract the counts or divide them to create a ratio. Then all that is done to select a move is to play the best material score, using the beta cutoff values as a tiebreak. There is a lot of chess 'knowledge' in that simple beta cutoff count. Its play will surprise you. If that simple experiment is interesting, then mate counts can be added, and capture moves at the root can have their counts modified so they are not automatically made. Anyway, good to see you're still around! And thanks. :D
Good to see you, Mike. It's hard to leave this chess programming hobby. I keep coming back. :)

I've just released my first version (see General Topics). It should be around Romi strength. It should be a good sparring partner.

As for your idea, it seems very similar to some search algorithms used by NN engines.

lucasart
Posts: 3044
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart
Contact:

Re: Lazy SMP ideas

Post by lucasart » Sun Sep 30, 2018 9:33 am

Edsel Apostol wrote:
Sat Sep 01, 2018 4:57 am
So here are the test results from my implementation of ABDADA and Lazy SMP from the new engine and the modified YBWC from Hannibal.

[...]
Good luck with your new engine. Looks like you've done a lot of work on the subject.

In your experience, does ABDADA gain any Elo? (Not time to depth; I'm not interested in that metric.)

For me, ABDADA is a regression compared to standard lazy SMP, including Tom K's form of it. So messing up the code base to lose Elo, no thanks...

What worked best for me is:
* work balancing: 1/2 workers at depth d, 1/2 workers at depth d+1.
* stop useless iterations: when any worker completes a given depth d, signal all other workers running depth <= d to stop immediately and report back to base, where they can be assigned useful work (i.e. >= d+1).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Edsel Apostol
Posts: 765
Joined: Mon Jul 17, 2006 3:53 am
Full name: Edsel Apostol
Contact:

Re: Lazy SMP ideas

Post by Edsel Apostol » Mon Oct 01, 2018 1:10 am

lucasart wrote:
Sun Sep 30, 2018 9:33 am
Edsel Apostol wrote:
Sat Sep 01, 2018 4:57 am
[...]
Good luck with your new engine. Looks like you've done a lot of work on the subject.

In your experience, does ABDADA gain any Elo? (Not time to depth; I'm not interested in that metric.)

For me, ABDADA is a regression compared to standard lazy SMP, including Tom K's form of it. So messing up the code base to lose Elo, no thanks...

What worked best for me is:
* work balancing: 1/2 workers at depth d, 1/2 workers at depth d+1.
* stop useless iterations: when any worker completes a given depth d, signal all other workers running depth <= d to stop immediately and report back to base, where they can be assigned useful work (i.e. >= d+1).
I've implemented Lazy SMP the same way you've described, and it is about the same strength as my ABDADA. When deferring moves in ABDADA you should also take note of the move order for correct LMR/P, but you probably already know that. Texel, I think, is also ABDADA, but it doesn't create new hash entries for a new position being tagged as busy, so it doesn't take advantage of the full capability of the algorithm. I have a separate hash table for just that purpose, with a replacement scheme based on depth, unlike Tom's, where he always replaces the first entry. I will stick with ABDADA for now. I think I can still improve it further along the way, as I did in Hannibal with a DTS-like YBWC.
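A rough sketch of such a dedicated busy table with depth-based replacement (illustrative layout, not Invictus's actual code; clearing the marker when a search of the position completes is omitted for brevity):

```cpp
// Sketch of a separate "busy" table for ABDADA: entries mark positions
// currently being searched so other threads can defer them. Within a bucket,
// the shallowest entry is evicted first, rather than always replacing the
// first slot. Markers are hints, so losing one to eviction is harmless.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

struct BusyEntry {
    std::atomic<uint64_t> key{0};
    std::atomic<int> depth{0};
};

struct BusyTable {
    static constexpr size_t BUCKET = 4;      // slots probed per position
    std::vector<BusyEntry> slots;
    explicit BusyTable(size_t size) : slots(size) {}

    // Mark `key` as being searched at `depth`, evicting the shallowest slot.
    void mark(uint64_t key, int depth) {
        size_t base = (key % (slots.size() / BUCKET)) * BUCKET;
        size_t victim = base;
        for (size_t i = base; i < base + BUCKET; ++i) {
            if (slots[i].key.load() == key) { victim = i; break; }
            if (slots[i].depth.load() < slots[victim].depth.load()) victim = i;
        }
        slots[victim].key.store(key);
        slots[victim].depth.store(depth);
    }

    // True if another thread has marked this position as being searched.
    bool is_busy(uint64_t key) const {
        size_t base = (key % (slots.size() / BUCKET)) * BUCKET;
        for (size_t i = base; i < base + BUCKET; ++i)
            if (slots[i].key.load() == key) return true;
        return false;
    }
};
```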
