Empirical results with Lazy SMP, YBWC, DTS

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Laskos
Posts: 8365
Joined: Wed Jul 26, 2006 8:21 pm

Empirical results with Lazy SMP, YBWC, DTS

Post by Laskos » Thu Apr 16, 2015 12:46 pm

Playing a bit with Andscacs's implementation of Lazy SMP, I decided to compare the most common SMP schemes used in strong engines. All engines here are the latest versions available to me.
On 150 positions, with depths chosen for each engine so that the time used on 1 core is about 1-2 minutes per position, I measured the average time-to-depth and the average NPS.

Time-to-Depth Speedup:

Code: Select all

Name                 1 thread  2 threads  4 threads  8 threads

Zappa (DTS)            1.00      1.99       3.65       5.62
Stockfish (YBWC)       1.00      1.93       3.04       4.76
Houdini (YBWC)         1.00      1.81       2.78       4.13
Cheng (Lazy SMP)       1.00      1.57       2.13       2.42
Andscacs (Lazy SMP)    1.00      1.29       1.74       1.54
Komodo (Lazy SMP ?)    1.00      1.41       1.54       1.83

First thing observed: the Lazy SMP engines seem to behave differently from the two more conventional parallelization implementations. Komodo seems to fall in with Lazy SMP on this characteristic.
For Zappa, Stockfish, and Houdini, the time-to-depth speedup is equal to the effective speedup, as they don't widen their search. But the Lazy SMP engines (and Komodo) do widen a lot during parallel search, going from 1 to 8 threads, so their effective speedup is larger than the time-to-depth speedup shown here. The widening is shown here at fixed depth in cutechess-cli:
H1=40 Elo points, H0=0 points, alpha=beta=0.05.

Score of Cheng 8 threads vs Cheng 1 thread: 91 - 51 - 71 [0.594] 213
ELO difference: 66
SPRT: H1 was accepted
Finished match

Score of Andscacs 8 threads vs Andscacs 1 thread: 16 - 1 - 13 [0.750] 30
ELO difference: 191
SPRT: H1 was accepted
Finished match

Score of Komodo 8 threads vs Komodo 1 thread: 29 - 13 - 63 [0.576] 105
ELO difference: 53
SPRT: H1 was accepted
Finished match

Komodo again seems similar to Lazy SMP engines in widening.
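The Elo differences reported above follow from the match scores through the standard logistic model used by cutechess-cli and most rating tools; a quick sanity check (a sketch of the formula, not cutechess-cli's actual code):

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference from a match score via the logistic model:
    diff = -400 * log10(1/p - 1), where p is the scoring rate."""
    games = wins + losses + draws
    p = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / p - 1.0)

print(round(elo_diff(91, 51, 71)))   # Cheng 8 vs 1 threads    -> 66
print(round(elo_diff(16, 1, 13)))    # Andscacs 8 vs 1 threads -> 191
print(round(elo_diff(29, 13, 63)))   # Komodo 8 vs 1 threads   -> 53
```

The three rounded values reproduce the "ELO difference" lines in the match output above.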

Then, the NPS speedup came as a collateral result, and as a useful limiting value for the effective speedup.

NPS Speedup:

Code: Select all

Name                 1 thread  2 threads  4 threads  8 threads

Zappa (DTS)            1.00      1.96       3.59       5.60
Stockfish (YBWC)       1.00      1.97       3.75       6.85
Houdini (YBWC)         1.00      1.97       3.79       6.33
Cheng (Lazy SMP)       1.00      1.98       3.92       7.22
Andscacs (Lazy SMP)    1.00      1.93       3.71       7.23
Komodo (Lazy SMP ?)    1.00      1.99       3.93       7.09

The Lazy SMP engines seem to scale best in NPS. For Zappa (DTS), the NPS speedup matches, within error margins, the total effective speedup given by time-to-depth. I don't understand why Lazy SMP engines with a shared hash/transposition table widen so copiously, and go deeper only partially.
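The gap between the two tables can be made explicit: the ratio of NPS speedup to time-to-depth speedup is roughly the factor by which each engine inflates its tree to reach the same depth, i.e. its widening. A back-of-the-envelope calculation from the 8-thread columns above:

```python
# (nps_speedup, ttd_speedup) at 8 threads, taken from the two tables above
speedups = {
    "Zappa (DTS)":          (5.60, 5.62),
    "Stockfish (YBWC)":     (6.85, 4.76),
    "Houdini (YBWC)":       (6.33, 4.13),
    "Cheng (Lazy SMP)":     (7.22, 2.42),
    "Andscacs (Lazy SMP)":  (7.23, 1.54),
    "Komodo (Lazy SMP ?)":  (7.09, 1.83),
}

for name, (nps, ttd) in speedups.items():
    # extra nodes searched to reach the same depth, relative to 1 thread
    widening = nps / ttd
    print(f"{name:22s} ~{widening:.1f}x nodes to the same depth")
```

Zappa's ratio comes out at ~1.0 (essentially no widening), while the Lazy SMP engines search roughly 3-5x the nodes to reach the same depth.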

bob
Posts: 20357
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by bob » Thu Apr 16, 2015 9:11 pm

Lazy SMP is just that: "lazy". It makes little attempt to reduce search overhead, which is exactly the opposite of what DTS and YBW try to do.

BTW, as of my last off-line discussion with Don, Komodo was doing a traditional parallel search based on YBW. He simply let some things "go", which tends to widen the search significantly at split points.

cdani
Posts: 2065
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by cdani » Thu Apr 16, 2015 10:44 pm

Thanks for the analysis!
Laskos wrote:I don't understand why Lazy SMP engines with shared Hash and TT do widen copiously, and go deeper only partially.
I tried to address this partially by introducing increased depth in some of the threads. That way I hope to obtain some of the benefits of the other algorithms, on top of the widening benefit of Lazy SMP, which seems to give more possibilities to find moves that a one-thread search overlooks.
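A minimal sketch of that idea, with a purely hypothetical depth-assignment rule (not Andscacs's actual code): helper threads search the same root, some of them one or two plies deeper, all sharing the transposition table so the main thread can pick up their deeper entries:

```python
def helper_depths(base_depth, num_threads):
    """Hypothetical Lazy SMP depth assignment: thread 0 is the main
    search; odd-numbered helpers search one ply deeper, and every
    fourth helper two plies deeper, so the threads desynchronize and
    seed the shared hash table with deeper results."""
    depths = []
    for i in range(num_threads):
        extra = 0
        if i % 2 == 1:
            extra = 1
        if i % 4 == 3:
            extra = 2
        depths.append(base_depth + extra)
    return depths

print(helper_depths(20, 8))  # [20, 21, 20, 22, 20, 21, 20, 22]
```

The exact offset pattern is an illustration only; the point is that staggered depths turn the "lazy" duplication into useful deeper probing via the shared hash.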

Adam Hair
Posts: 3193
Joined: Wed May 06, 2009 8:31 pm
Location: Fuquay-Varina, North Carolina

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by Adam Hair » Fri Apr 17, 2015 1:33 am

Here is some data that I have to go along with Kai's:

average nps scaling from 1 core to 8 cores (50 positions, *3 minute searches)

Code: Select all

cheng4 0.38        7.89
Crafty 24.1        7.32
Exchess 7.51b      7.52
Gaviota 1.0        3.97
Houdini 4          6.44
Komodo 8           7.58
Senpai 1.0         2.49
Texel 1.05         7.64
Zappa Mexico II    6.91
Stockfish 6        6.57
Stockfish 230315   6.81
**median ttd speedup from 1 core to 8 cores (500 positions, depth chosen so that 1 core version searches for at least 5 minutes)

Code: Select all

cheng4 0.38        3.38
Crafty 24.1        5.36
Exchess 7.51b      4.43
Gaviota 1.0        3.38
Houdini 4          4.91
Komodo 8           2.37
Senpai 1.0         0.99
Texel 1.05         5.29
Zappa Mexico II    ***na
Stockfish 6        5.21
Stockfish 230315   5.58
*Zappa does not obey "go movetime" correctly, and Gaviota does not give the total nodes searched when sending the best move. In both cases, I used the last update of time and nodes for each position to determine the NPS.

**I am using the median (the geometric mean could also be used) because ttd(1 core)/ttd(8 cores) is a right-skewed distribution (long right-hand tail). Due to this, the median is a better indicator than the average of ttd(1 core)/ttd(8 cores) for a random position.

***I am using a modified version of the similarity tester to generate the nps and ttd data. Zappa does not obey "go depth" correctly and so I do not have ttd data for that engine.
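The median-versus-average point is easy to illustrate: in a right-skewed sample the long tail pulls the arithmetic mean above both the median and the geometric mean. A small example with made-up TTD ratios (not the measured data):

```python
import math
import statistics

# made-up ttd(1 core)/ttd(8 cores) ratios with a long right-hand tail
ratios = [2.1, 2.8, 3.0, 3.2, 3.5, 3.7, 4.0, 4.4, 6.5, 11.0]

mean = statistics.mean(ratios)      # ~4.42, dragged up by the tail
median = statistics.median(ratios)  # ~3.6, the "typical" position
geo_mean = math.exp(statistics.mean(math.log(r) for r in ratios))

print(mean, median, geo_mean)  # median < geometric mean < mean
```

So summarizing the skewed sample with the arithmetic mean overstates the speedup at a typical position, exactly as the footnote argues.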


Lastly, I am trying to generate the effective speedup data when going from 1 core to 8 cores. By effective, I mean the ratio timecontrol(1 core)/timecontrol(8 cores) such that the 1-core version is approximately equal in strength to the 8-core version, where the 8-core version uses 60"+0.05" for its time control (or a time control such that the 8-core version is approximately equal in strength to SF 230315 (8 cores) @ 60"+0.05"). So far, I have the effective speedup for one engine. In a 2000-game match (1000 positions selected from the pool of TCEC 8-move openings prepared by Nelson Hernandez and myself, with reversed colors), SF 230315 (1 core) @ 240"+0.2" scored +231-231=1538 against SF 230315 (8 cores) @ 60"+0.05". Thus, SF 230315 has an effective speedup of ~4 under my conditions (my computer is 33% faster than the CCRL reference computer).
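The arithmetic behind that result, as a sketch: once the time controls are tuned so the match score is (close to) 50%, the effective speedup is just the ratio of the base time controls:

```python
def effective_speedup(tc_1core, tc_8core, score_1core):
    """Effective speedup = ratio of base time controls at which the
    1-core and 8-core versions score evenly. Only meaningful when
    the match score is close to 50%."""
    assert abs(score_1core - 0.5) < 0.02, "time controls not balanced yet"
    return tc_1core / tc_8core

# SF 230315: +231 -231 =1538 over 2000 games at 240"+0.2" vs 60"+0.05"
wins, losses, draws = 231, 231, 1538
score = (wins + 0.5 * draws) / (wins + losses + draws)  # exactly 0.5
print(effective_speedup(240, 60, score))  # 4.0
```

In practice one has to search over the 1-core time control until the score lands near 50%, which is why this measurement takes so many games.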

Laskos
Posts: 8365
Joined: Wed Jul 26, 2006 8:21 pm

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by Laskos » Fri Apr 17, 2015 9:24 am

Very good Adam, a lot of data!
Adam Hair wrote:Here is some data that I have to go along with Kai's:

average nps scaling from 1 core to 8 cores (50 positions, *3 minute searches)

Code: Select all

cheng4 0.38        7.89
Crafty 24.1        7.32
Exchess 7.51b      7.52
Gaviota 1.0        3.97
Houdini 4          6.44
Komodo 8           7.58
Senpai 1.0         2.49
Texel 1.05         7.64
Zappa Mexico II    6.91
Stockfish 6        6.57
Stockfish 230315   6.81
**median ttd speedup from 1 core to 8 cores (500 positions, depth chosen so that 1 core version searches for at least 5 minutes)

Code: Select all

cheng4 0.38        3.38
Crafty 24.1        5.36
Exchess 7.51b      4.43
Gaviota 1.0        3.38
Houdini 4          4.91
Komodo 8           2.37
Senpai 1.0         0.99
Texel 1.05         5.29
Zappa Mexico II    ***na
Stockfish 6        5.21
Stockfish 230315   5.58
*Zappa does not obey "go movetime" correctly, and Gaviota does not give the total nodes searched when sending the best move. In both cases, I used the last update of time and nodes for each position to determine the NPS.

**I am using the median (the geometric mean could also be used) because ttd(1 core)/ttd(8 cores) is a right-skewed distribution (long right-hand tail). Due to this, the median is a better indicator than the average of ttd(1 core)/ttd(8 cores) for a random position.
I took the average in the Shredder GUI, the only option available there and to me. The median is surely better, and it should be very close to the geometric mean. I estimate that on 8 cores the average TTD is some 10% larger than the median, and on 1 core maybe 5% larger, so my TTD speedup might be undervalued by 5% due to taking the arithmetic mean. The TTD frequencies are well modeled by a beta distribution and its probability density function. To exemplify, I took a typical spread of TTD to the same depth over "many positions" (in fact the distribution is continuous) as a beta distribution with certain shape parameters, with probability density function:
[image: plot of the beta distribution's probability density function]

The average here is 4/3 = 1.333..., the median is 1.181.

***I am using a modified version of the similarity tester to generate the nps and ttd data. Zappa does not obey "go depth" correctly and so I do not have ttd data for that engine.
Zappa's UCI mode didn't work for me either, but WinBoard communication worked in the Shredder GUI.

Lastly, I am trying to generate the effective speedup data when going from 1 core to 8 cores. By effective, I mean the ratio timecontrol(1 core)/timecontrol(8 cores) such that the 1 core version is approximately equal in strength to the 8 core version, where the 8 core version uses 60"+0.05" for its time control (or a time control such that the 8 core version is approximately equal in strength to SF 230315(8 cores) @ 60"+0.05"). So far, I have the effective speedup for 1 engine. In a 2000 game match (1000 positions selected from the pool of TCEC 8 move openings prepared by Nelson Hernandez and myself, reversed colors), SF 230315(1 core) @ 240"+0.2" scored +231-231=1538 against SF 230315(8 cores) @ 60"+0.05". Thus, SF 230315 has an effective speedup of ~4 under my conditions (my computer is 33% faster than the CCRL reference computer).
The complete effective speedup is a royal pain to measure with reasonable error margins; it needs many multicore games at slower-than-ultra-short time controls. A shortcut for non-widening engines would be TTD. I usually look at the CCRL and CEGT multicore results to get some picture. It would be very useful if you manage to measure these values cleanly.

Laskos
Posts: 8365
Joined: Wed Jul 26, 2006 8:21 pm

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by Laskos » Fri Apr 17, 2015 10:23 am

bob wrote:Lazy smp is just that "lazy". It makes little attempt to reduce search overhead, which is exactly the opposite of what DTS and YBW tries to do.
If this SMP scheme is so primitive compared to DTS and YBWC, then I don't understand how it performs so well. Cheng on 4 cores gains ~100 Elo points in CCRL 40/40, on a par with engines using much more involved SMP implementations (70-110 Elo points). The results of Dani with Andscacs show a healthy increase, again comparable to the conventional SMP benefit. It seems that the copious overhead is not wasted, and the engines widen "usefully", compensating for the lack of depth.

Adam Hair
Posts: 3193
Joined: Wed May 06, 2009 8:31 pm
Location: Fuquay-Varina, North Carolina

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by Adam Hair » Fri Apr 17, 2015 2:56 pm

I forgot that Zappa understands CECP. I will try again with wb2uci.


It is definitely a royal pain measuring the effective speedup when you can only play one game at a time. But my new job does not leave me much free time, so my computer is free for this measurement. SF 6 is being measured now, and Komodo is next. Maybe Gaviota, Crafty, and Zappa also, but it will take a long time to measure their speedup.

bob
Posts: 20357
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by bob » Sat Apr 18, 2015 12:34 am

Laskos wrote:
bob wrote:Lazy smp is just that "lazy". It makes little attempt to reduce search overhead, which is exactly the opposite of what DTS and YBW tries to do.
If this SMP scheme is so primitive compared to DTS and YBWC, then I don't understand how it performs so well. Cheng on 4 cores gains ~100 Elo points in CCRL 40/40, on a par with engines using much more involved SMP implementations (70-110 Elo points). The results of Dani with Andscacs show a healthy increase, again comparable to the conventional SMP benefit. It seems that the copious overhead is not wasted, and the engines widen "usefully", compensating for the lack of depth.
Many things work well with 2 cores. Fewer with 4. Even fewer with 8 and beyond, which is where the interesting stuff is.

EvgeniyZh
Posts: 42
Joined: Fri Sep 19, 2014 2:54 pm
Location: Israel

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by EvgeniyZh » Thu Jul 09, 2015 2:41 pm

It would be interesting to see how they behave on a really high number of cores, like a 32- or 64-core server. Does DTS still scale well?

bob
Posts: 20357
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Empirical results with Lazy SMP, YBWC, DTS

Post by bob » Thu Jul 09, 2015 5:23 pm

EvgeniyZh wrote:Would be interesting to see how they behave on really high number of cores, like 32 or 64 cored server. Does DTS still scales well?
That's a poorly phrased question. DTS +can+ scale pretty well. But like every computer algorithm, it can be coded in ways that are less efficient. I might have actually beaten DTS with the new algorithm. I am running big tests right now to see if this holds up.
