Threads test incl. Stockfish 5 and Komodo 8

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

bob wrote:
Adam Hair wrote:I think that the differences in the draw rates make Zappa look like it performs better at 2, 4, and 8 cores than Stockfish and Komodo. Draw rates are positively correlated with the average strength of the opponents. Higher draw rates compress Elo differences. So, the higher strength of Stockfish and Komodo causes their measured Elo differences to be smaller than Zappa's.

So, I recomputed the rating differences as Wilo differences (wins and losses only) and plotted them:

[Image: plot of Wilo differences vs. number of cores]
Is it REALLY true that Stockfish gets nothing going from 8 to 16? That defies any logic. I could see that going from 256 to 512, but 8 to 16? I wonder if anyone bothered tuning Stockfish for that test? I saw the Crafty-ish minimum thread depth and max threads per split point idea mentioned regarding Stockfish. Those become more important as the number of cores increases.

I also have a hard time interpreting the results. +200 Elo for Komodo going from 8 to 16? That is super-linear territory and doesn't make any sense at all.
If you read the post, you will find it is *NOT* Elo. It is a different way of calculating a rating, as Adam mentioned (it takes into account only wins and losses). It is proportional, but not on the same scale.

Miguel
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

Adam Hair wrote:I think that the differences in the draw rates make Zappa look like it performs better at 2, 4, and 8 cores than Stockfish and Komodo. Draw rates are positively correlated with the average strength of the opponents. Higher draw rates compress Elo differences. So, the higher strength of Stockfish and Komodo causes their measured Elo differences to be smaller than Zappa's.

So, I recomputed the rating differences as Wilo differences (wins and losses only) and plotted them:

[Image: plot of Wilo differences vs. number of cores]
Fantastic observation, Adam! Great catch.

Miguel
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, doesn't it have any overhead?

Miguel

That is NOT a "messy" approach. When a program can be "fixed" by multiplexing, it can be fixed without multiplexing. The most common reason multiplexing will help is poor move ordering, where multiplexing lets you get to the better move while the first (inferior) move is still in progress. Of course, there are solutions that address that without multiplexing. But to date, NO algorithm has been found that improves the search above and beyond the basic idea of splitting the tree into parallel searches. There are lots of choices for how you split, but splitting to gain additional depth is the goal of parallel searching.

It is possible to write a program that gets stronger as it goes deeper, sometimes in a non-linear way. Measuring only the Elo between two versions doesn't say why or where the improvement comes from, just that it is there. One can measure absolutely anything, but quite often the measurement doesn't reveal anything useful...
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by mjlef »

For some limited publicly available data go here:

http://www.husvankempen.de/nunn/40_4_Ra ... ion/3.html

Keeping in mind that these are fast games, the 8-core vs 8-core result of SF5 vs K8 is 49.7% for SF5.

Now looking at the 12-core results for SF5 here:

http://www.husvankempen.de/nunn/40_4_Ra ... ion/1.html

they show 46.5% for 12-core SF5 vs 12-core K8. These are fast games and just a few hundred of them, but it does suggest little or no gain when going above 8 cores.
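
For a rough sense of scale (my own back-of-the-envelope conversion, assuming the standard logistic Elo model rather than anything reported on that site): elo_diff = 400 * log10(s / (1 - s)), so a 49.7% score is roughly -2 Elo and a 46.5% score is roughly -24 Elo for SF5.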

More testing at longer time control and more cores would be very interesting.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', from Andreas' results I plotted the effective speed-up for Crafty (your linear 1 + (Ncpus - 1)*0.7), Komodo 8 and Stockfish 5.

[Image: effective speed-up vs. number of cores]
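
The conversion itself is just "how many time doublings is this Elo gain worth". A minimal sketch follows; the 90 Elo per doubling figure is the assumption stated above, and the example gain of 150 Elo is made up, not taken from Andreas' results:

Code: Select all

// Sketch of the Elo -> effective speed-up conversion described above.
// Assumes ~90 Elo per doubling of time at this TC; the example gain is a placeholder.
#include <cmath>
#include <cstdio>

double effective_speedup(double elo_gain, double elo_per_doubling = 90.0) {
    // If one doubling of time is worth elo_per_doubling points, then an SMP
    // gain of elo_gain points is "worth" this many time doublings:
    double doublings = elo_gain / elo_per_doubling;
    return std::pow(2.0, doublings);
}

int main() {
    // Hypothetical example: an engine gaining 150 Elo from 1 -> 4 cores
    std::printf("effective speed-up: %.2f\n", effective_speedup(150.0));  // ~3.17
    return 0;
}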

It seems all the top engines are very weak in their SMP implementations. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable (linear) SMP performance, will be the strongest engine around. Good job, Bob.

Congratulations on improving Komodo by 70 Elo points (single-core, SMP, or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.

Notice that my formula is right for Komodo through 8. "Highly inaccurate." Is the fact that it diverges a bit at 16 cause for a major panic? When I have specifically stated that it is really well tested only through 16? When I have specifically stated that it is also architecture-dependent, since multi-core chips have a bottleneck that single-core multi-chip machines do not? NO formula will predict SMP performance with two-decimal-place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect one to do so.

Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware of needing to do:

1. Measure NPS with one thread, nothing else running.

2. If you have N cores, run N instances of a single-thread search and measure the NPS. If you add the N NPS values together, the sum might or might not be equal to N times the NPS from step 1.

Now you know the hardware scaling limit. Assume a REALLY good hardware implementation, so that in step 2 you really do get N times the value from step 1, which means raw speed scales perfectly.

3. Run a set of positions to a fixed depth using 1 and N threads. For the N-thread version, run them several times and average. Divide the 1-thread total time by the N-thread total time. That is an actual SMP speedup. The JICCA DTS article gives this data for Cray Blitz with 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, in which he complained that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1, 2, 4 and 8 processors (no multi-core), and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.
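
To make step 3 concrete, here is a minimal sketch of the bookkeeping. The times below are made-up placeholders, not measured data; the 1 + (N-1)*0.7 line is only there to compare against the estimate discussed in this thread:

Code: Select all

// Step-3 bookkeeping: fixed-depth times per position, N-thread runs averaged,
// speedup = total 1-thread time / total N-thread time.
#include <cstdio>
#include <vector>

int main() {
    // Seconds to reach the fixed depth on each test position (placeholder numbers).
    std::vector<double> t1  = { 42.0, 37.5, 55.1, 29.8 };  // 1 thread
    std::vector<double> t16 = {  3.9,  3.1,  4.8,  2.6 };  // 16 threads, averaged over several runs

    double sum1 = 0.0, sum16 = 0.0;
    for (double t : t1)  sum1  += t;
    for (double t : t16) sum16 += t;

    double speedup  = sum1 / sum16;           // measured SMP speedup
    double estimate = 1.0 + (16 - 1) * 0.7;   // the linear estimate quoted in this thread
    std::printf("measured %.2f, linear estimate %.2f\n", speedup, estimate);
    return 0;
}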

I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and so on. One day I will take the time to test again and see whether the formula needs an update. Until then, it is an ESTIMATE.

However, I consider your plot bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method for computing parallel speedup. You are not using it.

Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick estimate. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
On 1) and 3) there is Andreas' data:
http://www.talkchess.com/forum/viewtopic.php?p=570955

As for my data being bogus because I am using Elo: it is completely non-bogus if you see that a doubling of time at that TC means 90-100 Elo points; with 80 it would go superlinear. And it is consistent with other empirical data. It doesn't matter too much anyway; Crafty will still be on top by far at 16 threads. The question is: is the SMP implementation in Crafty so much superior?


Do you believe that 90-100 Elo per doubling of time holds for ALL programs? In fact, it seems to be quite a bit higher than the number most have been using for years, namely 50-70.
At the very least, it is approximately true for these engines at the OP's time control and on the OP's computer.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

bob wrote:
Adam Hair wrote:I think that the differences in the draw rates make Zappa look like it performs better at 2, 4, and 8 cores than Stockfish and Komodo. Draw rates are positively correlated with the average strength of the opponents. Higher draw rates compress Elo differences. So, the higher strength of Stockfish and Komodo causes their measured Elo differences to be smaller than Zappa's.

So, I recomputed the rating differences as Wilo differences (wins and losses only) and plotted them:

[Image: plot of Wilo differences vs. number of cores]
Is it REALLY true that Stockfish gets nothing going from 8 to 16? That defies any logic. I could see that going from 256 to 512, but 8 to 16? I wonder if anyone bothered tuning Stockfish for that test? I saw the Crafty-ish minimum thread depth and max threads per split point idea mentioned regarding Stockfish. Those become more important as the number of cores increases.
It could be that the TC (60" + 0.05") is too fast for Stockfish with 16 cores to work well. And perhaps it does need to be tuned for 16 cores. Andreas does not mention anything about adjusting the threads per split point.
bob wrote:
I also have a hard time interpreting the results. +200 Elo for Komodo going from 8 to 16? That is super-linear territory and doesn't make any sense at all.
The ratings I produced ignore the draws; therefore, they are on a different scale than Elo. Miguel and I call them Wilo ratings, since they are calculated from wins and losses only.
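
To illustrate the idea, here is a minimal sketch: the usual logistic formula applied to decisive games only. The match counts are made up, and this is not the exact procedure used to produce the plot above:

Code: Select all

// Illustration of a win/loss-only ("Wilo") rating difference: the standard
// logistic Elo formula applied to decisive games only, with draws dropped.
#include <cmath>
#include <cstdio>

// Standard logistic rating difference from a score fraction s in (0,1).
double rating_from_score(double s) { return 400.0 * std::log10(s / (1.0 - s)); }

int main() {
    double wins = 300, losses = 220, draws = 480;   // made-up match result

    double elo  = rating_from_score((wins + 0.5 * draws) / (wins + losses + draws));
    double wilo = rating_from_score(wins / (wins + losses));   // draws ignored

    std::printf("Elo difference:  %+.1f\n", elo);   // draws compress the difference
    std::printf("Wilo difference: %+.1f\n", wilo);
    return 0;
}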
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, doesn't it have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, doesn't it have any overhead?

Miguel

That is NOT a "messy" approach. When a program can be "fixed" by multiplexing, it can be fixed without multiplexing. The most common reason multiplexing will help is poor move ordering, where multiplexing lets you get to the better move while the first (inferior) move is still in progress. Of course, there are solutions that address that without multiplexing. But to date, NO algorithm has been found that improves the search above and beyond the basic idea of splitting the tree into parallel searches. There are lots of choices for how you split, but splitting to gain additional depth is the goal of parallel searching.

It is possible to write a program that gets stronger as it goes deeper, sometimes in a non-linear way. Measuring only the Elo between two versions doesn't say why or where the improvement comes from, just that it is there. One can measure absolutely anything, but quite often the measurement doesn't reveal anything useful...
First, whether you believe it or not doesn't matter to me. I simply reported what I saw. If you want more details, look at search.cpp; I didn't examine it very carefully, but here is a comment:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;


As far as overhead goes, of course there is some. Not a lot, but the code to flip back and forth between threads is not completely free. This was the way I debugged my first parallel chess program back in 1978 or so, since getting access to a real parallel machine back then was not easy.
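
For what it's worth, here is a toy, self-contained sketch of the general idea being discussed: multiplexing several logical workers on one thread versus running them on real threads. Nothing in it comes from Crafty or Stockfish; the Worker structure and the work it does are invented purely for illustration:

Code: Select all

// Toy illustration of "multiplex on one thread for debugging":
// Fake == true  -> one thread round-robins over all workers (deterministic),
// Fake == false -> each worker runs on its own std::thread (real parallelism).
#include <cstdio>
#include <thread>
#include <vector>

struct Worker {
    int id = 0;
    std::vector<int> work;   // stand-in for "moves to search"
    long long result = 0;

    bool step() {            // do one small slice of work
        if (work.empty()) return false;
        result += work.back();
        work.pop_back();
        return true;
    }
};

template<bool Fake>
void run(std::vector<Worker>& workers) {
    if (Fake) {
        bool busy = true;
        while (busy) {                        // multiplex: one slice per worker per pass
            busy = false;
            for (Worker& w : workers)
                busy |= w.step();
        }
    } else {
        std::vector<std::thread> threads;
        for (Worker& w : workers)
            threads.emplace_back([&w] { while (w.step()) {} });
        for (std::thread& t : threads)
            t.join();
    }
}

int main() {
    std::vector<Worker> workers(4);
    for (int i = 0; i < 4; ++i) {
        workers[i].id = i;
        workers[i].work = { i + 1, i + 2, i + 3 };
    }
    run<true>(workers);                       // flip to run<false>(workers) for real threads
    for (const Worker& w : workers)
        std::printf("worker %d result %lld\n", w.id, w.result);
    return 0;
}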
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, doesn't it have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;

I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening", they could get the same benefit from multiplexing the search.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by Adam Hair »

bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could be equal.

The fact is that whatever Komodo is doing, it is scaling substantially better than the others. Whether widening the search is a cause or a correlation is unknown, but clearly, the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, doesn't it have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening", they could get the same benefit from multiplexing the search.