Threads test incl. Stockfish 5 and Komodo 8

fastgm · Post by **fastgm** » Thu Oct 09, 2014 5:25 pm

The thread test was supplemented by the latest versions of Stockfish and Komodo.

In Stockfish 5 we see an improvement in the SMP implementation at 8 threads compared to the previous version.
The patch responsible for this improvement, added after Stockfish DD, is called "late-join".
Reference: https://github.com/mcostalba/Stockfish/ ... d80168c705
Under these test conditions, doubling the threads from 8 to 16 shows no improvement, not even for Stockfish 5.

This is different with Komodo. Here a continuous increase up to 16 threads is measurable.
Moreover, the SMP implementation in Komodo 8 has been improved again.

The test data, the graphs and the test conditions are available at http://www.fastgm.de/threads2.html.

Jouni · Post by **Jouni** » Thu Oct 09, 2014 6:31 pm

Very interesting. So Komodo is the clear favorite in TCEC with 16 threads, if there are no improvements to the SMP implementation in the latest SF!

zullil · Post by **zullil** » Thu Oct 09, 2014 7:18 pm

fastgm wrote: Under these test conditions, doubling the threads from 8 to 16 shows no improvement, not even for Stockfish 5.

I wonder if increasing Min Split Depth from its default value (of 7) when using 16 threads would result in some improvement.

bob · Post by **bob** » Thu Oct 09, 2014 7:30 pm

zullil wrote:
fastgm wrote: Under these test conditions, doubling the threads from 8 to 16 shows no improvement, not even for Stockfish 5.
I wonder if increasing Min Split Depth from its default value (of 7) when using 16 threads would result in some improvement.

Do they use the crafty-like "minimum thread group" or something similar to limit how many threads work at one split point? That will make a difference on machines with more cores.

zullil · Post by **zullil** » Fri Oct 10, 2014 12:41 am

bob wrote:
zullil wrote:
fastgm wrote: Under these test conditions, doubling the threads from 8 to 16 shows no improvement, not even for Stockfish 5.
I wonder if increasing Min Split Depth from its default value (of 7) when using 16 threads would result in some improvement.
Do they use the crafty-like "minimum thread group" or something similar to limit how many threads work at one split point? That will make a difference on machines with more cores.

Once there was a UCI option

option name Max Threads per Split Point type spin default 5 min 4 max 8

but it is no longer present in Stockfish. Perhaps someone more knowledgeable than I can discuss how the number of threads per split point is currently limited/allocated.
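
For readers unfamiliar with UCI option declarations, here is a minimal sketch of how a line like the one above can be read and turned into a `setoption` command. The helper names `parse_spin_option` and `setoption_command` are made up for illustration and are not part of Stockfish or any GUI; the sketch only handles `spin` options that declare `default`, `min` and `max`.

```python
def parse_spin_option(line: str) -> dict:
    """Parse a UCI 'option ... type spin default D min A max B' line (hypothetical helper)."""
    tokens = line.split()
    if tokens[:2] != ["option", "name"] or "type" not in tokens:
        raise ValueError(f"not a UCI option declaration: {line!r}")
    type_idx = tokens.index("type")
    name = " ".join(tokens[2:type_idx])      # option names may contain spaces
    opt_type = tokens[type_idx + 1]
    if opt_type != "spin":
        raise ValueError(f"only spin options are handled, got {opt_type!r}")
    rest = tokens[type_idx + 2:]
    fields = dict(zip(rest[0::2], rest[1::2]))  # pairs such as ("default", "5")
    return {
        "name": name,
        "type": opt_type,
        "default": int(fields["default"]),
        "min": int(fields["min"]),
        "max": int(fields["max"]),
    }


def setoption_command(option: dict, value: int) -> str:
    """Build a UCI 'setoption' command, clamping the value to the declared range."""
    clamped = max(option["min"], min(option["max"], value))
    return f"setoption name {option['name']} value {clamped}"


if __name__ == "__main__":
    opt = parse_spin_option(
        "option name Max Threads per Split Point type spin default 5 min 4 max 8"
    )
    print(opt)
    print(setoption_command(opt, 16))
```

With the declaration quoted above, a request for 16 threads per split point would be clamped to the declared maximum of 8, so the old option could never have let all 16 threads work at one split point.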

Laskos · Post by **Laskos** » Fri Oct 10, 2014 7:48 pm

Thank you, Andreas, very important tests. I imagine that testing engines on 16 threads is time consuming as hell.

You bust to pieces two of Bob Hyatt's loud claims:

1) That Komodo's implementation of SMP is "buggy", "quick and dirty", and so on. It seems to be one of the best up to 16 threads.

2) That the formula for SMP improvement with the number of cores is linear. He gave "his" mastermind formula (N-1)*0.7 + 1, IIRC. Nothing is linear here, not for a single engine.

When I reply to Bob, I will quote your results.

Thanks again.

bob · Post by **bob** » Fri Oct 10, 2014 11:42 pm

Laskos wrote: Thank you, Andreas, very important tests. I imagine that testing engines on 16 threads is time consuming as hell.

You bust to pieces two of Bob Hyatt's loud claims:

1) That Komodo's implementation of SMP is "buggy", "quick and dirty", and so on. It seems to be one of the best up to 16 threads.

2) That the formula for SMP improvement with the number of cores is linear. He gave "his" mastermind formula (N-1)*0.7 + 1, IIRC. Nothing is linear here, not for a single engine.

When I reply to Bob, I will quote your results.

Thanks again.

Didn't bust a THING I said. I said "if a search "widens the tree", which is YOUR term, that it has a bug that can be fixed." I said nothing more or nothing less. If a program plays stronger using N cpus to a fixed depth than it plays using one CPU, it has a bug that can be fixed to improve the performance of the single-thread version. That YOU don't understand that simple statement doesn't mean you can "shoot it to pieces". It just shows YOU do not understand the issues of parallel search. This ranks right up there with the super-linear speedup nonsense that comes up on occasion. It does NOT happen unless the sequential program has a problem that can be fixed. period.

And will you please stop misquoting what I said about that speedup formula. I did NOT say it was a highly accurate fit to the observed data. I said it was a fairly accurate estimate that is quite easy for anyone to compute. Nothing more, nothing less. And it is pretty accurate through 16 cores for sure, and even beyond but with less testing data to support it. When you grow up and learn to read, you might understand the term "linear approximation" or "simple approximation" etc.

If you look back through old CCC archives, you can see ANOTHER discussion about this formula. Martin Fierz took a bunch of 1/2/4/8 core test data I ran for him and compared it to my formula. His discovery was that my formula was too pessimistic. But I didn't develop it to be optimistic or pessimistic. Just something that approximates the speedup for a rough estimate.

please...

And for the record, my approximation had NOTHING to do with predicting Elo. Just raw SMP speedup measured time to depth.

Laskos · Post by **Laskos** » Fri Oct 10, 2014 11:53 pm

bob wrote:
Laskos wrote: Thank you, Andreas, very important tests. I imagine that testing engines on 16 threads is time consuming as hell.

You bust to pieces two of Bob Hyatt's loud claims:

1) That Komodo's implementation of SMP is "buggy", "quick and dirty", and so on. It seems to be one of the best up to 16 threads.

2) That the formula for SMP improvement with the number of cores is linear. He gave "his" mastermind formula (N-1)*0.7 + 1, IIRC. Nothing is linear here, not for a single engine.

When I reply to Bob, I will quote your results.

Thanks again.
Didn't bust a THING I said. I said "if a search "widens the tree", which is YOUR term, that it has a bug that can be fixed." I said nothing more or nothing less. If a program plays stronger using N cpus to a fixed depth than it plays using one CPU, it has a bug that can be fixed to improve the performance of the single-thread version. That YOU don't understand that simple statement doesn't mean you can "shoot it to pieces". It just shows YOU do not understand the issues of parallel search. This ranks right up there with the super-linear speedup nonsense that comes up on occasion. It does NOT happen unless the sequential program has a problem that can be fixed. period.

And will you please stop misquoting what I said about that speedup formula. I did NOT say it was a highly accurate fit to the observed data. I said it was a fairly accurate estimate that is quite easy for anyone to compute. Nothing more, nothing less. And it is pretty accurate through 16 cores for sure, and even beyond but with less testing data to support it. When you grow up and learn to read, you might understand the term "linear approximation" or "simple approximation" etc.

If you look back through old CCC archives, you can see ANOTHER discussion about this formula. Martin Fierz took a bunch of 1/2/4/8 core test data I ran for him and compared it to my formula. His discovery was that my formula was too pessimistic. But I didn't develop it to be optimistic or pessimistic. Just something that approximates the speedup for a rough estimate.

please...

And for the record, my approximation had NOTHING to do with predicting Elo. Just raw SMP speedup measured time to depth.

Sure. Leaving your ad hominem aside, then:

Imagine how strong single-core Komodo 8 would be after "fixing the bug".
That you fitted your linear approximation of effective speedup to 3 data points shows that overfitting is all you learnt during your PhD.

In the future, please don't bring the (N-1)*0.7 + 1 crap to teach others how SMP behaves.

bob · Post by **bob** » Sat Oct 11, 2014 1:58 am

Laskos wrote:
bob wrote:
Laskos wrote: Thank you, Andreas, very important tests. I imagine that testing engines on 16 threads is time consuming as hell.

You bust to pieces two of Bob Hyatt's loud claims:

1) That Komodo's implementation of SMP is "buggy", "quick and dirty", and so on. It seems to be one of the best up to 16 threads.

2) That the formula for SMP improvement with the number of cores is linear. He gave "his" mastermind formula (N-1)*0.7 + 1, IIRC. Nothing is linear here, not for a single engine.

When I reply to Bob, I will quote your results.

Thanks again.
Didn't bust a THING I said. I said "if a search "widens the tree", which is YOUR term, that it has a bug that can be fixed." I said nothing more or nothing less. If a program plays stronger using N cpus to a fixed depth than it plays using one CPU, it has a bug that can be fixed to improve the performance of the single-thread version. That YOU don't understand that simple statement doesn't mean you can "shoot it to pieces". It just shows YOU do not understand the issues of parallel search. This ranks right up there with the super-linear speedup nonsense that comes up on occasion. It does NOT happen unless the sequential program has a problem that can be fixed. period.

And will you please stop misquoting what I said about that speedup formula. I did NOT say it was a highly accurate fit to the observed data. I said it was a fairly accurate estimate that is quite easy for anyone to compute. Nothing more, nothing less. And it is pretty accurate through 16 cores for sure, and even beyond but with less testing data to support it. When you grow up and learn to read, you might understand the term "linear approximation" or "simple approximation" etc.

If you look back through old CCC archives, you can see ANOTHER discussion about this formula. Martin Fierz took a bunch of 1/2/4/8 core test data I ran for him and compared it to my formula. His discovery was that my formula was too pessimistic. But I didn't develop it to be optimistic or pessimistic. Just something that approximates the speedup for a rough estimate.

please...

And for the record, my approximation had NOTHING to do with predicting Elo. Just raw SMP speedup measured time to depth.
Sure. Leaving your ad hominem aside, then:

Imagine how strong single-core Komodo 8 would be after "fixing the bug".
That you fitted your linear approximation of effective speedup to 3 data points shows that overfitting is all you learnt during your PhD.

In the future, please don't bring the (N-1)*0.7 + 1 crap to teach others how SMP behaves.

Sorry, I fitted my simple linear approximation to MUCH more than just three data points: 1, 2, 3, 4, 5, 6, 7 and 8 CPUs and, since then, 1-12 as well.

For 1-8 that linear approximation is somewhat LOW. I gave you a hint on finding the data that someone ELSE used to compute actual speedup numbers. My formula suggests 3.1 for 4 cpus. The actual data showed 3.4, for example.

Not crap at all. Just a linear approximation to something that is not quite linear.
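
To make the numbers in this exchange easy to check, here is a small sketch that evaluates the (N-1)*0.7 + 1 approximation for a few CPU counts. The function names are made up for illustration, and `time_to_depth_speedup` is only the usual definition of speedup as single-thread time divided by N-thread time to the same depth; the 3.4 figure for 4 CPUs is the one quoted in the post above, not a new measurement.

```python
def estimated_speedup(n_cpus: int, per_cpu_gain: float = 0.7) -> float:
    """Linear approximation quoted in the thread: (N - 1) * 0.7 + 1."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return (n_cpus - 1) * per_cpu_gain + 1


def time_to_depth_speedup(seconds_one_cpu: float, seconds_n_cpus: float) -> float:
    """Measured speedup: time to reach a fixed depth on 1 CPU divided by time on N CPUs."""
    if seconds_one_cpu <= 0 or seconds_n_cpus <= 0:
        raise ValueError("times must be positive")
    return seconds_one_cpu / seconds_n_cpus


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} CPUs: estimated speedup {estimated_speedup(n):.1f}")

    # Figure quoted in the post for 4 CPUs (actual data), compared with the estimate.
    quoted_actual_4 = 3.4
    gap = quoted_actual_4 - estimated_speedup(4)
    print(f"4 CPUs: estimate is {gap:.1f} below the quoted actual value")
```

For 4 CPUs the estimate comes out at 3.1, matching the figure given in the post, and for 16 CPUs the same formula gives 11.5. Note that this is a time-to-depth estimate and says nothing directly about Elo.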

bob · Post by **bob** » Sat Oct 11, 2014 9:17 pm

BTW if you don't want me to reply to YOUR posts, politely stop mentioning my name. You invited me to comment because of your idiotic comment.

While you are looking around, look up the definition of "approximation" and "exact" and discover the differences...
