Threads factor: Komodo, Houdini, Stockfish and Zappa

fastgm · Post by **fastgm** » Sat May 17, 2014 9:56 am

Conditions:
Hardware: Dual AMD Opteron 6376, 32 x 2.3 GHz (Turbo Core off)
OS: Windows 7 Pro 64-Bit
GUI: no
Settings: all engines default settings
Large Tables: no
Position: starting position
Time: 20 seconds

UCI commands:
setoption name threads value 1 (to 32)
go movetime 20000

The tests were run in console mode.
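For anyone who wants to repeat this kind of measurement, here is a minimal sketch of how one fixed-time search could be driven from a script instead of by hand. It is not the script used for the test. The function names `uci_commands` and `measure_nps` and the engine path argument are hypothetical, and the `uci`/`isready` handshake and the explicit `position startpos` are assumptions added on top of the two commands listed above.

```python
import subprocess


def uci_commands(threads, movetime_ms=20000):
    """Return the two UCI commands from the post for one measurement."""
    return [
        f"setoption name threads value {threads}",
        f"go movetime {movetime_ms}",
    ]


def measure_nps(engine_path, threads, movetime_ms=20000):
    """Run one fixed-time search and return the last nps value reported.

    engine_path is a placeholder for the path to any UCI engine binary.
    """
    proc = subprocess.Popen(
        [engine_path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,
    )

    def send(command):
        proc.stdin.write(command + "\n")
        proc.stdin.flush()

    def read_until(token):
        # Collect engine output up to and including the line starting with token.
        lines = []
        for line in proc.stdout:
            lines.append(line.strip())
            if line.startswith(token):
                break
        return lines

    send("uci")
    read_until("uciok")
    set_threads, go = uci_commands(threads, movetime_ms)
    send(set_threads)
    send("isready")
    read_until("readyok")
    send("position startpos")  # assumption: make the starting position explicit
    send(go)

    nps = None
    for line in read_until("bestmove"):
        tokens = line.split()
        if "nps" in tokens and tokens.index("nps") + 1 < len(tokens):
            nps = int(tokens[tokens.index("nps") + 1])

    send("quit")
    proc.wait()
    return nps
```

Calling `measure_nps` once per threads value from 1 to 32, with the path of the engine under test, would give one nps sample per setting.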

Here are the values for 1 to 32 threads from the starting position, with 20 seconds of computing time per search.
nps = nodes per second

Komodo, Zappa and Houdini are almost equal up to 16 threads (factor 11.94 - 11.37 - 11.34).
Stockfish DD and the latest Stockfish version lie somewhat behind (factor 8.01 - 9.79).

Komodo still scales excellently beyond 16 threads. Zappa also shows a very good SMP implementation.
Beyond 16 threads, Houdini and Stockfish DD benefit much less than the other tested engines.

Increase from 16 to 32 threads:

Komodo TCECr (11.94 - 20.60 = 73%)
Zappa Mexico II (11.37 - 16.46 = 45%)
Stockfish 140513 (9.79 - 14.21 = 45%)
Stockfish DD (8.01 - 10.18 = 27%)
Houdini 4 Pro (11.34 - 13.48 = 19%)
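The percentages in this list are the relative gain in the threads factor when going from 16 to 32 threads. A short sketch of that arithmetic, using only the factors quoted above (the names `factors` and `percent_gain` are made up for illustration):

```python
# Threads factor at 16 and at 32 threads, as quoted in the list above.
factors = {
    "Komodo TCECr": (11.94, 20.60),
    "Zappa Mexico II": (11.37, 16.46),
    "Stockfish 140513": (9.79, 14.21),
    "Stockfish DD": (8.01, 10.18),
    "Houdini 4 Pro": (11.34, 13.48),
}


def percent_gain(factor_16, factor_32):
    """Relative increase from 16 to 32 threads, rounded to a whole percent."""
    return round((factor_32 / factor_16 - 1) * 100)


gains = {name: percent_gain(f16, f32) for name, (f16, f32) in factors.items()}

for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: {gain}%")
```

Perfect scaling from 16 to 32 threads would show up here as 100%.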

zullil · Post by **zullil** » Sat May 17, 2014 11:34 am

Your Stockfish data seem remarkably monotone as a function of the number of threads. The factor increases each time you increment the number of threads (except once, when the number of threads goes from 21 to 22). Are you recording an average of multiple runs for each number of threads? If so, how many runs with each threads setting?

I ask because these are fixed-time searches, so essentially you are recording the total number of nodes searched in the twenty seconds. I'd imagine that this would vary quite a lot from run to run, even with no change in the threads setting. The trees searched might differ significantly, with even the best move changing from search to search. For example, here are two consecutive Stockfish searches, each done with 16 threads. Look at how much they differ:

info depth 24 seldepth 36 score cp 22 nodes 151240061 nps 7561246 time 20002 multipv 1 pv e2e4 c7c5 b1c3 d7d6 g1f3 e7e5 f1c4 f8e7 a2a3 g8f6 e1g1 e8g8 b2b4 b8d7 d2d3 a7a6 c1d2 b7b5 c4d5 f6d5 c3d5 c8b7 b4c5 d7c5 d2a5 d8d7 d5b6 d7d8
info nodes 151240061 time 20002
bestmove e2e4 ponder c7c5

info depth 25 seldepth 33 score cp 29 nodes 182162118 nps 9106739 time 20003 multipv 1 pv d2d4 d7d5 c2c4 e7e6 g1f3 d5c4 b1c3 b8c6 e2e4
info nodes 182162118 time 20003
bestmove d2d4 ponder d7d5
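To put a number on the difference between these two searches, the `info` lines can be parsed and compared. This is only a sketch: `parse_info` is a hypothetical helper, and the two strings are the lines above with the `multipv` and `pv` parts cut off for brevity.

```python
def parse_info(line):
    """Extract a few integer fields from a UCI info line."""
    tokens = line.split()
    fields = {}
    for key in ("depth", "nodes", "nps", "time"):
        if key in tokens:
            fields[key] = int(tokens[tokens.index(key) + 1])
    return fields


# The two 16-thread searches from this post, without the pv.
run_a = parse_info(
    "info depth 24 seldepth 36 score cp 22 nodes 151240061 nps 7561246 time 20002"
)
run_b = parse_info(
    "info depth 25 seldepth 33 score cp 29 nodes 182162118 nps 9106739 time 20003"
)

node_ratio = run_b["nodes"] / run_a["nodes"]
print(f"second search visited {100 * (node_ratio - 1):.1f}% more nodes")
```

With the same threads setting and nearly the same time, the second search visited about 20% more nodes, which is why single runs are a weak basis for a scaling factor.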

fastgm · Post by **fastgm** » Sat May 17, 2014 11:47 am

zullil wrote: Are you recording an average of multiple runs for each number of threads?

Yes.

zullil wrote: If so, how many runs with each threads setting?

5 runs per thread setting, overall 800 runs!
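The total of 800 runs is consistent with five engine versions, 32 threads settings and 5 runs each. Below is a sketch of that count, together with one plausible way to turn averaged runs into a threads factor. Defining the factor as mean nps at N threads divided by mean nps at 1 thread is an assumption, and the sample nps values are made-up placeholders, not measurements from this thread.

```python
from statistics import mean

engines = 5                    # assumption: the five engine versions listed above
thread_settings = range(1, 33)  # 1 to 32 threads
runs_per_setting = 5

total_runs = engines * len(thread_settings) * runs_per_setting
print(total_runs)


def thread_factor(nps_runs_n, nps_runs_1):
    """Mean nps at N threads divided by mean nps at 1 thread (assumed definition)."""
    return mean(nps_runs_n) / mean(nps_runs_1)


# Placeholder values only, to show the call.
example_factor = thread_factor([200, 220], [100, 110])
```

Averaging before dividing keeps a single unusually fast or slow run from dominating the factor.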

lucasart · Post by **lucasart** » Sat May 17, 2014 12:46 pm

Thanks Andreas. Your posts are always interesting, especially this one!
* Impressive scaling by Komodo
* Great improvement thanks to Joona's "late join" patch

Uri Blass · Post by **Uri Blass** » Sat May 17, 2014 1:02 pm

Interesting information, but the goal of chess programs is not to search more nodes but to gain playing strength.

Nodes are not proportional to playing strength, and I guess that for the same engine, the same number of nodes with 1 thread is better than the same number of nodes with many threads.

Laskos · Post by **Laskos** » Sat May 17, 2014 1:06 pm

These are NPS figures. It is hard to tell what they mean strength-wise or in terms of effective speed-up. Time to depth (TTD) won't help much either, as even SF with Joona's patch widens a bit, not to mention Komodo.

lucasart · Post by **lucasart** » Sat May 17, 2014 2:07 pm

Uri Blass wrote: Interesting information, but the goal of chess programs is not to search more nodes but to gain playing strength.

Nodes are not proportional to playing strength, and I guess that for the same engine, the same number of nodes with 1 thread is better than the same number of nodes with many threads.

Good point. TTD would be a better measure than NPS. The ideal measure is Elo, but it's extremely costly to calculate with good enough precision.

syzygy · Post by **syzygy** » Sat May 17, 2014 4:32 pm

Uri Blass wrote: Interesting information, but the goal of chess programs is not to search more nodes but to gain playing strength.

Nodes are not proportional to playing strength, and I guess that for the same engine, the same number of nodes with 1 thread is better than the same number of nodes with many threads.

This is of course true, but it does show that SF and H4 quite likely still have room for improvement here.

An interesting question is whether Komodo's SMP implementation is comparable at all with that of Zappa, SF and H4 (which are all YBWC tree splitters with some further refinements). As Richard Vida mentioned on the fishcooking forum, it might be that Komodo uses a "lazy smp"-like approach:
http://talkchess.com/forum/viewtopic.php?t=46858
http://talkchess.com/forum/viewtopic.ph ... 350#504350

Isaac · Post by **Isaac** » Sat May 17, 2014 5:21 pm

I think it would be interesting to repeat the exact same test with a different FEN, particularly an endgame FEN.

I expect Komodo to gain a lot (in the TCEC I have seen it reach 56 Mnps in the endgame, compared to 16 Mnps in the early game, so surely more cores = better), while SF DD should show a totally different graph (it drops from 16 Mnps in the early game to 7 Mnps in the endgame; a quad core is faster than 16 cores, so more cores = worse performance).
It would be interesting to see how the newer SF dev versions are doing compared to SF DD.
As a side note, Critter had Pentium 4-level performance in the endgame running on 16 cores: 750 kN/s.

michiguel · Post by **michiguel** » Sat May 17, 2014 5:33 pm

Uri Blass wrote: Interesting information, but the goal of chess programs is not to search more nodes but to gain playing strength.

Nodes are not proportional to playing strength, and I guess that for the same engine, the same number of nodes with 1 thread is better than the same number of nodes with many threads.

But that is not the point of the experiment. This tells us about the upper limit of scalability, which is useful to know. In addition, it tells us how that upper limit suffers as cores are added. For instance, Houdini starts to have problems beyond 16 cores. Before that, it is among the best.

Miguel
