Threads test incl. Stockfish 5 and Komodo 8

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

lkaufman
Posts: 6260
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by lkaufman »

x3 wrote:SF5 8-16 cores +37 elo 3200 games
K8 8-16 cores +45 elo 1600 games
SF5 16-24 cores +7 elo 1600 games
K8 16-24 cores +18 elo 1480 games

3m+1s
8 move book repeating

x3
It's interesting that Andreas reported only 4 Elo more for SF5 on 16 cores than on 8 (each running against 1 core), while you got 37 in a direct match, both with at least 3,000 games, which is way beyond the error margin. I wonder whether the difference is due to the direct match versus matches against 1 core, to the different time limit, or to something else. Any idea?
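For a rough sense of scale, here is a back-of-the-envelope error-bar calculation (a sketch with assumed numbers: a per-game score standard deviation of about 0.4, typical at high draw rates, and the linearized Elo slope near a 50% score):

Code: Select all

// Rough 2-sigma Elo error bars for n-game matches (illustrative only).
// Assumes a per-game score SD of ~0.4 (typical with many draws) and the
// linearized Elo slope near 50%: d(Elo)/d(score) = 4 * 400 / ln(10), about 695.
#include <cmath>
#include <cstdio>

int main() {
    const double sd_per_game   = 0.4;
    const double elo_per_score = 4.0 * 400.0 / std::log(10.0);   // ~695
    const int games[] = { 1600, 3200 };
    for (int n : games) {
        double se = elo_per_score * sd_per_game / std::sqrt(double(n));
        std::printf("%d games: 1 sigma ~ %.1f Elo, 2 sigma ~ %.1f Elo\n",
                    n, se, 2.0 * se);
    }
    return 0;
}

With ~3,000 games per result the 2-sigma bar comes out near +/-10 Elo, so a discrepancy of roughly 33 points really is far outside it.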
Komodo rules!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

michiguel wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could come out equal.

The fact is that whatever Komodo is doing, it is scaling consistently better than the others. Whether widening the search is the cause or just a correlation is unknown, but clearly the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate it for testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb handles threads quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down claims of consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple, easy-to-understand way to debunk the occasional bogus claims that surface here and there.
So, the only sensible reason for it is debugging, not production code. Thank you.

Miguel

bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
Yes. If it improves your program, you have something to work on. It is simply a way of explaining why some things do not happen as claimed. Super-linear parallel speedups have a specific cause - poor move ordering. That can be solved by multiplexing, but there are much better solutions (actually improving the ordering, rather than searching the moves out of order to work around the issue).

If a "wider" tree helps a parallel searcher, that same wider tree would also help the non-parallel searcher. multiplexing would show the same gain, but it is not the best solution. It just illustrates the problem. Which in this case would be excessive pruning that could be toned down.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Adam Hair wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could come out equal.

The fact is that whatever Komodo is doing, it is scaling consistently better than the others. Whether widening the search is the cause or just a correlation is unknown, but clearly the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate it for testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.
I asked the question because you stated that "more than one program already has this capability".
The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb handles threads quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down claims of consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple, easy-to-understand way to debunk the occasional bogus claims that surface here and there.


bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
I gave you the two I know of. Others have used it in the past for testing/debugging; Monty Newborn is one example, and there are others. Nothing current, since the code is no longer in Crafty and there are now direct ways to debug parallel programs (gdb, for example).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote: Agreeing on at least 90 points per doubling at this TC on this hardware (dual AMD Opteron 6376, 2x16 cores) for Komodo 8, it would be "easy" for Andreas to check whether Bob's linear formula, which gives ~85 Elo points gain from 16 to 32 threads, or my log formula, which gives ~38 Elo points gain, is correct.

Bob, let's have a bet, whoever in his prediction is closer, wins it.

Andreas, could you check 32 threads vs. 1 thread for Komodo 8 under the same conditions? I know it's not easy to run 3,000 games with an engine on 32 threads, but Bob made some claims here.

Bob stated that on a 64-core Itanium box he got an effective speed-up of 32 with Crafty, without even tuning the engine. That is a VERY high number for an effective speed-up, and I begin to wonder about the exceptionality of Crafty.

I hope Andreas sees this post.
You are so far behind on parallel search topics I don't know where to start. A 50% speedup (an effective speedup of half the core count) is not "very high". You might want to check some of Cozzie's numbers on large numbers of processors and then report back. I've never considered 50% acceptable. You can find my 1994-era 16-cpu results with Cray Blitz, and some 32-cpu results on the T932 we used in 1994 as well. And with 32 cpus we were NOT searching less than 16x faster.

Here are the 1990-ish DTS numbers from the C90-16 we used around that time frame:

Code: Select all

          +-------------+-----+-----+-----+-----+------+
          |# processors |  1  |  2  |  4  |  8  |  16  |
          +-------------+-----+-----+-----+-----+------+
          |speedup      | 1.0 | 2.0 | 3.7 | 6.6 | 11.1 |
          +-------------+-----+-----+-----+-----+------+
                 Table 3 DTS performance results
Those are Cray Blitz numbers that I happen to have handy (from the DTS paper on my web page, in fact). I can only report what I have actually measured. But apparently there are a LOT of parallel results floating around from quite a few programs, results which you have no idea about.

Certainly the graph I questioned was wrong, because it shows a super-linear property that is impossible. I.e., NOBODY is going to get a bigger gain going from 8 to 16 than when going from 4 to 8. That's the poster child for "absurd" and is exactly the kind of nonsense that needs debunking when it happens. And before you harp on my linear approximation again, I will continue to remind you that (a) it was always claimed to be a simple approximation that is pretty close, and (b) it was NEVER claimed to extrapolate beyond 32 processors. There's not a parallel search person on the planet who has ever claimed that there was no diminishing return with alpha/beta. I wrote a paper to explain exactly why there is such a problem, in fact. All caused by the alpha/beta algorithm itself. Nothing about the tree grows linearly, least of all the search overhead. It is all exponential, and always has been.
I was not talking of DTS and that sort of architecture. Just Komodo's gain from 16 to 32 cores on Andreas's machine, at that time control. Also, are you sure you are not mixing up NPS speed-up with effective speed-up? I have seen huge NPS speed-ups all around, but never a huge effective speed-up. You claim ~1.97 effective speed-up from 16 to 32 cores, which is about an 85-90 Elo gain; I am claiming ~1.34 effective speed-up, which is about a 38-40 Elo gain, on Andreas's machine at that time control. If only Andreas would bother to test Komodo 32 threads versus Komodo 1 thread in 3,000 games. Or directly Komodo 32 threads versus Komodo 16 threads, to minimize the Elo error.
No, I am not mixing up the two terms. Before you can compute the time-to-depth speedup, you need the NPS speedup. If your NPS speedup on 8 cores is only 6x, then you have just established an asymptote for your max time-to-depth speedup, which would be 6x. You have also established something that needs work. However, with today's boxes, linear NPS speedups are not common because of internal bottlenecks (16 cores sharing one L3 cache and one path to memory is a killer, for example). We used to have a Sequent 30-cpu machine, but for every "real" program we tested the max speedup was 16x, because the bus became the bottleneck in getting to memory and it could be saturated by 16 processors...

Where do "I claim 1.9x speedup from 16 cores to 32 cores?" Never have, never will. Unless you talk about my simple approximation, which has NEVER been a claim of anything, just a simple estimate. You misuse the word "claim" badly here. I claim that is just a rough estimate. and I've made the claim much weaker for 16-32 and beyond as well.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by mvk »

Adam Hair wrote:What other programs are capable of multiplexing?
The simplest way to multiplex on a single cpu is to launch more than one thread or process. The OS then does the multiplexing for you by means of preemptive task-switching.
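A minimal sketch of that coarse-grained approach (search_root() here is just a stand-in workload, not any engine's real API; on Linux the threads can additionally be pinned to one core with pthread_setaffinity_np so a single CPU is forced to time-slice them):

Code: Select all

// Let the OS multiplex several "search" threads on a single core.
// Linux-specific: pthread_setaffinity_np is a GNU extension; build with -pthread.
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Stand-in for an engine's per-thread search entry point; here it just burns CPU.
void search_root(int id) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000000UL; ++i)
        x += i ^ (unsigned long)id;
}

void pin_to_cpu0() {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                 // every thread on logical CPU 0
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i)
        workers.emplace_back([i] { pin_to_cpu0(); search_root(i); });
    for (auto& t : workers)
        t.join();                     // preemptive task-switching did the multiplexing
    return 0;
}

Of course this only interleaves at preemption granularity; it does not reproduce a parallel search node for node.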
[Account deleted]
syzygy
Posts: 5798
Joined: Tue Feb 28, 2012 11:56 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by syzygy »

mvk wrote:
Adam Hair wrote:What other programs are capable of multiplexing?
The simplest way to multiplex on a single cpu is to launch more than one thread or process. The OS then does the multiplexing for you by means of preemptive task-switching.
True, this would be an easy way to do very coarse-grained multiplexing. I was thinking of fine-grained multiplexing to simulate what is really happening in a multithreaded search. Note that Bob has in the past claimed that one could multiplex between two trees one node at a time, resulting in the identical tree the parallel search would grow.

The SF4 FakeSplit debugging option is not an example of multiplexing. It merely uses SF's SMP data structures to split a node, but then searches all moves serially and in sequence anyway. I guess it may be somewhat useful for debugging on a system that lacks a debugger capable of handling multiple threads.

Btw, note how Bob now keeps clinging to "super-linear speedup = bug" when nobody has made a claim of super-linear speedup for Komodo or any other "widening" SMP implementation.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
Adam Hair wrote:
bob wrote:
syzygy wrote:
michiguel wrote:
bob wrote:
michiguel wrote:
ernest wrote:
syzygy wrote:How come absolutely nobody here with some knowledge of parallel programming agrees with you...
Well, at least on the theoretical side, what Bob says here doesn't sound so stupid... 8-)
On paper. Only on paper.

But in practice, linearizing a parallel algorithm would be a mess of gigantic proportions, and I cannot even imagine the problems and the unreadability. So, saying that Komodo could be "fixed" on a single core is silly. Even on paper the "fix" may not give any improvement, as I demonstrated in another post. It could come out equal.

The fact is that whatever Komodo is doing, it is scaling consistently better than the others. Whether widening the search is the cause or just a correlation is unknown, but clearly the whole approach is good.

Miguel
Sorry, but more than one program already has this capability. I think I saw it in Stockfish, where they do a "pseudo-parallel search" to debug, by doing the usual multiplexing operation.
Never heard of this, and it does not make sense for debugging. You lose all the randomness. Maybe an SF developer can clarify, but I do not believe it. Even if this exists, does it not have any overhead?
SF certainly does not have anything remotely comparable to a single-threaded simulation of a multithreaded search.

But if it had, it would obviously have a major overhead. Bob surely understands this.

Even if it could be done without overhead, it would have nothing to do with the topic of this thread. Nobody is claiming that Komodo's 4-core search is as strong as Komodo's single-threaded search with 4x the time.
Go look at what this code triggers; this is directly from Stockfish's search.cpp:


// Set to true to force running with one thread. Used for debugging
const bool FakeSplit = false;
FakeSplit is no longer in Stockfish. The following link shows the code that contained FakeSplit and the changes made.

https://github.com/zamar/Stockfish/comm ... a50004f810

This is the code that contained FakeSplit in Stockfish 4:

Code: Select all

      // Step 19. Check for splitting the search
      if (   !SpNode
          &&  depth >= Threads.minimumSplitDepth
          &&  Threads.available_slave(thisThread)
          &&  thisThread->splitPointsSize < MAX_SPLITPOINTS_PER_THREAD)
      {
          assert(bestValue < beta);

          thisThread->split<FakeSplit>(pos, ss, alpha, beta, &bestValue, &bestMove,
                                       depth, threatMove, moveCount, &mp, NT, cutNode);
          if (bestValue >= beta)
              break;
      }
    }

    if (SpNode)
        return bestValue;

What other programs are capable of multiplexing?
I don't peruse other programs, so no idea. A couple of versions of Crafty did, when I first started on the parallel search code, although I don't think any were ever released. The 1983 version of Cray Blitz did, as we had to simulate it for testing while waiting on Cray to get the SMP libraries and Fortran compiler changes completed.

The only sensible reason for multiplexing is debugging when you don't have a decent debugger that can handle threads. gdb handles threads quite nicely today, making multiplexing unnecessary. The multiplexing concept, used to shoot down claims of consistent super-linear speedups and such, is a simple idea. It is NOT the way a rational person would "fix" the problem. It is mainly used as a simple, easy-to-understand way to debunk the occasional bogus claims that surface here and there.
So, the only sensible reason for it is debugging, not production code. Thank you.

Miguel

bob wrote: I didn't look at it at all, just noticed it in passing. If you don't understand the multiplexing issue and why it was brought up, then please stay out of the discussion. It is DIRECTLY related to the "widening" issue that keeps getting broached here. If someone gains from this "widening" they could get the same benefit from multiplexing the search.
Yes. If it improves your program, you have something to work on. It is simply a way of explaining why some things do not happen as claimed. Super-linear parallel speedups have a specific cause - poor move ordering. That can be solved by multiplexing, but there are much better solutions (actually improving the ordering, rather than searching the moves out of order to work around the issue).
Right, the only sensible reason for it is debugging or explaining things, not production code.

If a "wider" tree helps a parallel searcher, that same wider tree would also help the non-parallel searcher. multiplexing would show the same gain, but it is not the best solution. It just illustrates the problem. Which in this case would be excessive pruning that could be toned down.
Not if the widening takes advantage of the parallel inefficiencies of the "non-widening" approach. For instance, certain threads search extra nodes instead of sitting idle.

Miguel
PS: You keep mentioning super-linear speed-up as if it had anything to do with this.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by mvk »

syzygy wrote:Btw, note how Bob now keeps clinging to "super-linear speedup = bug" when nobody has made a claim of super-linear speedup for Komodo or any other "widening" SMP implementation.
These statements make sense if you view traversal and tree shaping as separate processes: first you define a tree, and then you search it as efficiently as possible. The search, in essence, doesn't affect the effective tree. This is "depth thinking". But that is not how modern programs work. Modern programs dynamically shape the tree through iteration-to-iteration interaction. Their parallel counterparts accept that splitting is going to be imperfect and inevitably generates excess nodes. But then they take an extra step and try to influence where these excess nodes will go, and change the effective tree shape, so at least they can get something in return. The tree shaping vs. traversal interaction is not new; it is what propelled programs such as Shredder and its followers ahead of the pack. Newton vs. quantum mechanics: both views are right, in their own domain. (But here we are less interested in the domain of 1980s-style programs, of course.)
Last edited by mvk on Wed Oct 15, 2014 8:34 pm, edited 3 times in total.
[Account deleted]
lkaufman
Posts: 6260
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by lkaufman »

It seems to me that you are saying that the optimum LMR reduction schedule for a single-core engine should also be the optimum for a 16-core (for example) SMP version. Do you believe this to be true, and if so, can you prove it? If the 16-core version got a 16-to-one speedup, that would seem to follow, but if it only gets a ten-to-one speedup (for example), I don't see why it would have to be true. Note that I am not saying that this is what Komodo does.
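Purely as an illustration of what a thread-count-dependent schedule could even look like (hypothetical code; no claim is made that Komodo or any other engine does anything like this):

Code: Select all

// Hypothetical LMR schedule that reduces slightly less as the thread count grows,
// i.e. spends extra cores on width rather than depth. Illustration only.
#include <algorithm>
#include <cmath>
#include <cstdio>

// Assumes depth >= 1 and move_number >= 1 (LMR is normally skipped for early moves).
int lmr_reduction(int depth, int move_number, int nthreads) {
    double r = 0.5 + std::log(double(depth)) * std::log(double(move_number)) / 2.0;
    r -= 0.25 * std::log2(double(std::max(1, nthreads)));   // hypothetical widening term
    return std::max(0, int(r));
}

int main() {
    std::printf("depth 12, move 10: 1 thread -> %d plies, 16 threads -> %d plies\n",
                lmr_reduction(12, 10, 1), lmr_reduction(12, 10, 16));
    return 0;
}

Whether the single-thread optimum and the 16-thread optimum really differ is exactly the empirical question; the sketch only shows what "a different schedule" would mean.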
Komodo rules!
x3
Posts: 13
Joined: Mon Sep 08, 2014 6:54 pm

Re: Threads test incl. Stockfish 5 and Komodo 8

Post by x3 »

Two notable differences:
1. About 4 times longer time control.
2. My 8 v 16 core test is with a Xeon 2687W, about 2 times faster than the AMD Opteron.

The results for Zappa on 16 cores are probably incorrect because it stops working every 4 or 5 games.

x3