back to the Komodo SMP issue

Don · Post by **Don** » Tue Jul 02, 2013 11:12 pm

bob wrote:
Laskos wrote:
Don wrote:
The textbook definition is idealized so that it's easy to reason about formally. Usually what one does is takes some serial problem such as sorting, n-body simulation or some other things and then speed it up on multiple processors. That is FAR easier to reason about because the only variable is speed which is trivially measured and it's also the easiest to deal with formally.

But often, even in the case of pure speed, you will choose a different algorithm if you intend to run it on multiple processors. The algorithm you choose may be just as good or perhaps not quite as good but have much more parallelism to exploit. I think even Bob should understand this but I'm not sure that he does.

The MIT Cilk team chose a different sort algorithm for example that was not quite as efficient as the unix sort routing but was more amenable to parallelism. They did this as a kind of demo project to show off Cilk a long time ago. Of course they didn't make any claims that they "sped" up shell sort with MP, but they did speed up the sort utility more than they could have if they had tried to parallelize the shell sort that was in the "sort" utility.
I don't think we even got to this point of the discussion yet, here we will be stuck forever with Bob. It was just a matter of what metric to use to measure SMP efficiency as defined by Bob. I can pass here, as it's not so important, and a matter of definition, the important thing is spelled by you in the above paragraph.
But chess has a metric that is far more appropriate here than speed and that is ELO. It's a lot more awkward to reason about formally and much harder to measure and it's also not easily treated in a formal way mathematically. I think Bob is taking the point of view that it is not relevant unless it has these characteristics or else he just wants to bicker about some formal definition that only he cares about.
Yes, I will abandon the argument on this, I just wanted to show that the manual or his definition is flawed, for example 1->4 core speedup of 0.8 having larger "efficiency" than 1->8 core 1.2 speedup. Then, for chess, all these scalings need logarithms to transform into Elos or plies.
Strawman argument, however. NOBODY is trying to scale to depth or Elo, directly. Step one is "how much faster can I go?" I am 100% certain that if you increase the speed of a processor 1%, my program gets stronger. How much stronger? That's a point where one might measure via test games. But faster = stronger, without exception, until we reach the end of the game. Can you gain Elo by going wider? Maybe or maybe not. If going wider slows you down, you have a gain and a loss that have to be measured. And combined.

My question at the start of this new thread was quite specific. Please explain to me how going wider can be better, when a parallel search is the thing doing the searching...

Given a small number of cores where a parallel speedup should be reasonable in the first place. If you have a poorly scaling algorithm, then going wider would seem to be worse since the thing will slow down. Significantly.

I think Komodo refutes this argument as we appear to be getting the same ELO improvement everyone else gets on a small number of cores. How do you reconcile that?

I asked nothing more, nothing less. Yet it turns into insults rather than technical answers???

Typical....

bob · Post by **bob** » Tue Jul 02, 2013 11:14 pm

Laskos wrote:
bob wrote:Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)

Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, whcih means it is not going to gain a thing.
14.2 minutes as the single cored 20 minutes to the same _depth_. But it's twice wider at the same depth. I see no theoretical problem in going from 1->4 cores being always 2 times wider and time-to-depth being always 1.4 instead of 2.8. The gain IS there, although theoretically sub-optimal, but only theoretically, because practically EBF could vary a little without changing the strength.

What was the LOSS in elo from going to 10 minutes per move to 14.2? Extra width gains Elo if time is ignored. But for real Elo calculations, time is critical since the games are timed. You are now getting to the crux of my question. If you go wider, the search slows down, because the SMP efficiency is not changing one scintilla...

Don · Post by **Don** » Tue Jul 02, 2013 11:28 pm

bob wrote:
michiguel wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.
Then you have lousy definitions there. From 1.4 to 1.96 the efficiency increases by a factor of 2=log(1.96)/log(1.4). For speedup 4 you still have 100%=log(4)/log(4). Anyway if that definition quoted by you is the standard, let's use it, although it's crappy.
"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

As I said before, if I am allowed to be pedantic, this definitions may not apply to chess engines. The formula assume the same task was done for the time "Ts" and "Tp". We know that is not that we have, or not necessarily the same. Time to depth could be an approximation, but it may break (certainly the trees are not the same). In fact, that was the point of the original thread, that it breaks in Komodo's case.

Miguel
Back to my original question. In a parallel model, HOW can you search wider, given the basic assumption that the parallel search only produces a 1.4x speedup? If you increase the EBF, you increase the size of the tree, which hurts the parallel search even more.

There is a quite a bit of wiggle room here. Look at the EBF of Stockfish and compare it to Houdini or Komodo - it is less than both and yet all 3 programs are on top. So it's obvious to me that there is more than one way to skin a cat.

Komodo does in fact lose a lot of ELO due to pruning and LMR although it seems to be a favorable trade-off in single processor performance due to the speed gain and extra depth. But a lot of this is a close call - so fattening up the tree is not going to be a big loss even in SP mode as you will pretty much get back in ELO what you lose in speed. If I can burn a little CPU time doing JUST the fattening in fact it would be a big win.

I have a practical analogy to illustrate this. Sometimes in your travels you will have a choice of two ways to get to your destination, one by superhighway but it is much longer, the other is by side roads which is much shorter. Maybe in the end it comes out the same, do I go long and fast or short and slow? With Komodo we are simply choosing to not use the superhighway but instead taking the side roads. I'm not going as fast as you are but I'll bet I get there in the same amount of time.

That is the question I posed, and which absolutely no one has addressed. If the parallel speedup on this "wider tree" is 1.4x, one can't possibly claim it would be worse on a narrower tree. So how does one actually use the extra processing power doing something OTHER than searching in parallel???

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...

I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"
Lower EBF is not necessarily better than higher EBF. Maybe Komodo is performing close to optimally with EBF 1.8 on one core and EBF 1.9 on 4 cores. Even on one core, maybe both EBF's are close to optimal. Don mentioned that he could make Komodo 1 ply higher or lower (lower or higher EBF) without changing the strength.

SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).

If you use one cpu, the time taken for a serial search = time taken for the one-cpu parallel search, the speedup is 1.0 which is no speedup at all. #processors / speedup = 100% which is expected.
The problem with a larger EBF is it is a double-whammy. It FURTHER slows the smp search, since its efficiency is fixed by the algorithm used. Make it wider, you make it slower, period...

I don't see how that can be beneficial, particularly down at 4 cores...

Laskos · Post by **Laskos** » Tue Jul 02, 2013 11:45 pm

bob wrote:
Laskos wrote:
bob wrote:Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)

Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, whcih means it is not going to gain a thing.
14.2 minutes as the single cored 20 minutes to the same _depth_. But it's twice wider at the same depth. I see no theoretical problem in going from 1->4 cores being always 2 times wider and time-to-depth being always 1.4 instead of 2.8. The gain IS there, although theoretically sub-optimal, but only theoretically, because practically EBF could vary a little without changing the strength.
What was the LOSS in elo from going to 10 minutes per move to 14.2? Extra width gains Elo if time is ignored. But for real Elo calculations, time is critical since the games are timed. You are now getting to the crux of my question. If you go wider, the search slows down, because the SMP efficiency is not changing one scintilla...

1->4 cores Komodo also gains nps, say a factor 3-3.5, therefore to the same time, it searches 3-3.5 more nodes. I am not saying that with equal speed or with 4x time on one core, the parallel search is a gain, but with 3-3.5 speedup in _nps_ it's a gain in _strength_ compared to to 1 core.

Laskos · Post by **Laskos** » Wed Jul 03, 2013 1:16 am

I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable Elo gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.

Code: Select all

Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 &#40;computed by R.Hyatt&#41;

Komodo seems to be the most extreme in deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.

Don · Post by **Don** » Wed Jul 03, 2013 1:21 am

Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 &#40;computed by R.Hyatt&#41;
Komodo seems to be the most extreme is deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.

Larry doesn't know anything about the Rybka implementation of MP but he has told me many times that at fixed depths the MP implementation of Rybka plays stronger, unlike most program and very much like Komodo but not as extreme.

syzygy · Post by **syzygy** » Wed Jul 03, 2013 1:23 am

bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...

The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?

You can say "same work is same depth". For Komodo, this is very clearly not true. For other engines, it is iffy.

I can say "same work is same quality of the search". Now Komodo's "SMP efficiency" is suddenly comparable to that of other top engines.

For chess, it is only Elo increase per added core that really matters.

bob wrote:"How can this possibly work."

Your explanation for why it cannot work has gaping holes. I am not going to point them out once again, because you're just being blind to reason here.

Let crafty play Komodo on 8 cores or so and find out yourself whether it works or not.

Laskos · Post by **Laskos** » Wed Jul 03, 2013 2:33 am

Don wrote:
Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 &#40;computed by R.Hyatt&#41;
Komodo seems to be the most extreme is deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.
Larry doesn't know anything about the Rybka implementation of MP but he has told me many times that at fixed depths the MP implementation of Rybka plays stronger, unlike most program and very much like Komodo but not as extreme.

Right, I played 1,000 games fixed depth 8 (equivalent to 11) Rybka 4.1

Code: Select all

    Program                                         Score     %      Elo

  1 Rybka 4.1  4 threads                       &#58; 534.5/1000  53.4     12    
  2 Rybka 4.1  1 thread                        &#58; 465.5/1000  46.6    -12

24 +/- 15 points (95% confidence) 4 threaded is stronger than 1 threaded to fixed depth 8. Not as extreme as 70-80 points of Komodo, but still, it's beyond doubt that MP Rybka plays stronger than single cored to the same depth. That explains its lower time-to-depth ratio.

Don · Post by **Don** » Wed Jul 03, 2013 4:16 am

bob wrote: Nothing new here either. In fact, there are two factions in performance analysis. I am in the group that says you compare the best parallel algorithm on N cores to the best serial algorithm on 1 core, and that is your speedup. Some prefer to compare the best parallel algorithm on N cores and compare that to the same (inferior to best serial algorithm) algorithm on 1 core to compute the speedup. I (and many) believe this is artificial.

If you are willing to bend this definition just a little then we can agree. You say, "time to depth" but really I think you mean "time to accomplish a given task." If I were sorting something I would want to accomplish that task in a short time. So the metric is time after all if you make the transformation to cast the problem in the right way.

So what is this "task" that we want to accomplish in the least amount of time? It's a level of play, not some arbitrary depth. Depth only works if they are actually ELO equivalent but who says they have to be? Do you know any two unrelated program that are? So really you are correct if you simply take a more flexible view of what it is you are optimizing the time for.

lkaufman · Post by **lkaufman** » Wed Jul 03, 2013 5:13 am

Laskos wrote:
Don wrote:
Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 &#40;computed by R.Hyatt&#41;
Komodo seems to be the most extreme is deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.
Larry doesn't know anything about the Rybka implementation of MP but he has told me many times that at fixed depths the MP implementation of Rybka plays stronger, unlike most program and very much like Komodo but not as extreme.
Right, I played 1,000 games fixed depth 8 (equivalent to 11) Rybka 4.1
Code: Select all
    Program                                         Score     %      Elo

  1 Rybka 4.1  4 threads                       &#58; 534.5/1000  53.4     12    
  2 Rybka 4.1  1 thread                        &#58; 465.5/1000  46.6    -12   
24 +/- 15 points (95% confidence) 4 threaded is stronger than 1 threaded to fixed depth 8. Not as extreme as 70-80 points of Komodo, but still, it's beyond doubt that MP Rybka plays stronger than single cored to the same depth. That explains its lower time-to-depth ratio.

"

That sounds about like what I recall from testing older Rybka versions. I think it had a time-to-depth ratio on 4 cores vs. 1 core about two tenths lower than Fritz, which (like most programs) did not test better at fixed depth on 4 cores vs 1.Vas even told me once some along the lines of "I can't think of any sensible SMP implementation that would not play stronger on MP than on SP at fixed depth". Well, I guess either he was wrong, or the implementations on almost all engines other than Komodo and Rybka are not sensible!

back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue