Here are the test results for my implementations of ABDADA and Lazy SMP in the new engine, and for the modified YBWC in Hannibal.
I'm using the following set of positions (from Hannibal games and from some positions posted here in CCC):
"r3k2r/pbpnqp2/1p1ppn1p/6p1/2PP4/2PBPNB1/P4PPP/R2Q1RK1 w kq - 2 12",
"2kr3r/pbpn1pq1/1p3n2/3p1R2/3P3p/2P2Q2/P1BN2PP/R3B2K w - - 4 22",
"r2n1rk1/1pq2ppp/p2pbn2/8/P3Pp2/2PBB2P/2PNQ1P1/1R3RK1 w - - 0 17",
"1r2r2k/1p4qp/p3bp2/4p2R/n3P3/2PB4/2PB1QPK/1R6 w - - 1 32",
"1b3r1k/rb1q3p/pp2pppP/3n1n2/1P2N3/P2B1NPQ/1B3P2/2R1R1K1 b - - 1 32",
"1r1r1qk1/pn1p2p1/1pp1npBp/8/2PB2QP/4R1P1/P4PK1/3R4 w - - 0 1",
"3rr1k1/1b2nnpp/1p1q1p2/pP1p1P2/P1pP2P1/2N1P1QP/3N1RB1/2R3K1 w - - 0 1",
"rn3rq1/p5k1/2p2bp1/1p4p1/8/2P1B1PQ/5PK1/3R3R w - - 0 1",
"1r3rk1/3bb1pp/1qn1p3/3pP3/3P1N2/2Q2N2/2P3PP/R1BR3K w - - 0 1",
"rn1q1rk1/2pbb3/pn2p3/1p1pPpp1/3P4/1PNBBN2/P1P1Q1PP/R4R1K w - - 0 1"
The positions were chosen for their complexity: the engine needs at least a minute or two to complete the depth 20 iteration on a single core.
Results are from a dual Intel Xeon E5-2698V3 machine:
https://ark.intel.com/products/81060/In ... e-2_30-GHz
The machine has 32 cores and 64 threads.
Hannibal's modified YBWC was run with a hash size of 512MB and a fixed depth of 25 for 1, 2, 4, 8, 16, 32 and 64 threads. For each thread count, the values are summed over all positions, divided by the number of positions, and then normalized against the single-thread result. This doesn't take into account the turbo boost in the single-core run, so the reported scaling is probably a bit lower than the correct value. ucinewgame is issued at the start of every test, so the hashes are cleared and the search starts from scratch.
The Threads: 1 line shows the average time spent in seconds and the average speed in kNPS. The lines for higher thread counts are multipliers relative to that line: for nodes, the NPS ratio; for time, the inverse ratio (single-thread time divided by n-thread time), so larger is better in both columns.
Threads: 1 time: 74.990600 nodes: 937.200000
Threads: 2 time: 2.149378 nodes: 1.942089
Threads: 4 time: 3.346066 nodes: 3.753011
Threads: 8 time: 5.226519 nodes: 6.973204
Threads: 16 time: 6.626765 nodes: 12.046264
Threads: 32 time: 8.139148 nodes: 16.006453
Threads: 64 time: 4.880303 nodes: 13.368858
As can be seen, Hannibal's NPS scaling is only good up to 16 cores. Since the machine has two CPUs, the engine may be struggling with NUMA because of Hannibal's internal design for handling split points and repetition detection. Before this, Hannibal had only been tested on an 8-core machine.
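For anyone who wants to reproduce this kind of table from their own runs, here is a minimal Python sketch of the normalization as I described it above: average over the positions, then take the ratio against the single-thread averages. The function name and the sample numbers are made up for illustration only; they are not taken from the actual test run.

```python
from statistics import mean


def scaling_row(base_times, base_knps, times, knps):
    """Return (time speedup, NPS multiplier) for one thread count.

    base_* are per-position results with 1 thread,
    times/knps are per-position results with n threads.
    """
    # Time is reported as the inverse ratio, so bigger means faster.
    time_speedup = mean(base_times) / mean(times)
    # Nodes are reported as a plain ratio of average kNPS.
    nps_multiplier = mean(knps) / mean(base_knps)
    return time_speedup, nps_multiplier


# Hypothetical measurements for three positions (illustration only).
one_thread_times = [60.0, 90.0, 75.0]
one_thread_knps = [900.0, 950.0, 1000.0]
two_thread_times = [30.0, 45.0, 37.5]
two_thread_knps = [1800.0, 1900.0, 2000.0]

speedup, nps_mult = scaling_row(
    one_thread_times, one_thread_knps, two_thread_times, two_thread_knps
)
print(f"Threads: 2 time: {speedup:.6f} nodes: {nps_mult:.6f}")
```

Note that averaging first and dividing afterwards weights long-running positions more heavily than averaging the per-position ratios would.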
This is the result for the new chess engine with my implementation of the modified ABDADA described by Tom Kerrigan.
http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
Searched with 512MB hash and fixed depth of 20.
Threads: 1 time: 108.299200 nodes: 3127.100000
Threads: 2 time: 2.168862 nodes: 1.787663
Threads: 4 time: 3.603405 nodes: 3.492227
Threads: 8 time: 6.197661 nodes: 6.876950
Threads: 16 time: 8.440115 nodes: 13.613154
Threads: 32 time: 10.724592 nodes: 26.575884
Threads: 64 time: 11.759754 nodes: 31.082229
NPS scaling is not perfect, probably because of turbo boost in the single-core run and because all threads are signalled to quit the current iteration as soon as one thread completes it. This synchronizes the threads and focuses their effort on the next iteration. The signalling is done without waiting for any thread.
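The "quit the current iteration without waiting" idea can be sketched roughly as follows. This is a simplified Python illustration, not the engine's actual code: the class and function names are hypothetical, and the inner loop is only a stand-in for searching nodes. Each worker polls a shared "deepest completed iteration" value and drops its current iteration once another thread has finished it; there is no barrier between iterations.

```python
import threading


class IterationSync:
    """Tracks the deepest iteration that any thread has completed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._completed = 0

    def is_stale(self, depth):
        # Unlocked read: a slightly late answer only costs a few extra nodes.
        return self._completed >= depth

    def report_done(self, depth):
        """Record a finished iteration; True only for the first finisher."""
        with self._lock:
            if depth > self._completed:
                self._completed = depth
                return True
            return False

    @property
    def completed(self):
        return self._completed


def worker(sync, max_depth, work_per_depth):
    for depth in range(1, max_depth + 1):
        for _ in range(work_per_depth):  # stand-in for searching nodes
            if sync.is_stale(depth):
                break  # another thread finished this depth: abandon it
        else:
            sync.report_done(depth)  # this thread finished the depth itself


def run(num_threads=4, max_depth=6):
    sync = IterationSync()
    threads = [
        threading.Thread(target=worker, args=(sync, max_depth, 1000 * (i + 1)))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # joining is only for this demo, not part of the scheme
    return sync.completed
```

The work thrown away when an iteration is abandoned still counts as time spent, which is one way this scheme can cost some NPS scaling.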
This is the result for the LazySMP:
Threads: 1 time: 109.826200 nodes: 3084.100000
Threads: 2 time: 3.847632 nodes: 1.844580
Threads: 4 time: 6.209981 nodes: 3.616316
Threads: 8 time: 6.069562 nodes: 7.212488
Threads: 16 time: 7.001232 nodes: 14.255216
Threads: 32 time: 13.694476 nodes: 27.848897
Threads: 64 time: 10.592256 nodes: 31.428827
LazySMP is implemented with 50% of the threads searching at depth and the other 50% at depth+1. As can be seen, there is some kind of super-linear speedup at 2 and 4 threads (3.85x and 6.21x in time). This is probably why Lazy is so strong on 4 cores, which is currently the standard for the rating lists.
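The 50/50 depth split can be sketched in a few lines of Python. Which threads get the extra ply is an assumption here (odd-numbered threads), and `helper_depth` is a hypothetical name, so treat this as an illustration of the idea rather than the engine's code.

```python
def helper_depth(thread_id: int, root_depth: int) -> int:
    """Depth a given thread searches in the current iteration.

    Assumption: even-numbered threads (including the main thread, id 0)
    search root_depth, odd-numbered threads search root_depth + 1.
    """
    return root_depth + (thread_id % 2)


# Depths for 4 threads when the current iteration is depth 10.
depths = [helper_depth(t, 10) for t in range(4)]
print(depths)
```

The threads on depth+1 fill the shared hash table with deeper entries, which the depth threads can then reuse.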
Invictus chess engine source code can be found here:
ABDADA
https://github.com/ed-apostol/InvictusChess
LazySMP
https://github.com/ed-apostol/InvictusC ... ee/LazySMP