back to the Komodo SMP issue

M ANSARI · Post by **M ANSARI** » Wed Jul 03, 2013 7:46 am

I remember testing Rybka at fixed depth (which is actual depth +3) during Houdini 2 era, and it beat every engine at fixed depth on 8 cores. At the time I had thought that depth engine testing was the best way to test evaluation strength of an engine, but really I don't think that means anything as that only tests one aspect of an engines performance and depth is calculated differently between engines. What I also found peculiar with Rybka testing was that it took way too long to initialize before a move and would suffer at super fast time controls. At the time I felt that was due to using processes rather than threads and I remember Vas saying something like the advantage of a process based engine would change into favorable once engines started using a lot of cores (whatever that means). I have a feeling that many of Rybka's bugs (especially the stall bug) were due to child processes misbehaving under the Windows OS.

Laskos · Post by **Laskos** » Wed Jul 03, 2013 2:02 pm

lkaufman wrote:
That sounds about like what I recall from testing older Rybka versions. I think it had a time-to-depth ratio on 4 cores vs. 1 core about two tenths lower than Fritz, which (like most programs) did not test better at fixed depth on 4 cores vs 1.Vas even told me once some along the lines of "I can't think of any sensible SMP implementation that would not play stronger on MP than on SP at fixed depth". Well, I guess either he was wrong, or the implementations on almost all engines other than Komodo and Rybka are not sensible!

A bit larger test to illustrate the difference.
Fixed depth typical engines:

Code: Select all

 
    Program                            Score      %       Elo 

  1 Houdini 3  4 threads            : 492.0/1000  49.2     -3 
  2 Houdini 3  1 thread             : 508.0/1000  50.8      3 

-6 +/- 11 Elo points 
time-to-depth factor: 3.05 


    Program                            Score      %       Elo 

  1 Stockfish 3  4 threads          : 510.0/1000  51.0      3 
  2 Stockfish 3  1 thread           : 490.0/1000  49.0     -3 

+6 +/- 13 Elo points 
time-to-depth factor: 2.91

Fixed depth atypical engines:

Code: Select all

    Program                            Score      %      Elo 

  1 Komodo 5.1  4 threads           : 576.0/1000  57.6     27  
  2 Komodo 5.1  1 thread            : 424.0/1000  42.4    -27 

+54 +/- 12 Elo points 
time-to-depth factor: 1.68 

  
    Program                            Score      %      Elo 

  1 Rybka 4.1  4 threads            : 534.5/1000  53.4     12    
  2 Rybka 4.1  1 thread             : 465.5/1000  46.6    -12  

+24 +/- 15 Elo points 
time-to-depth factor: 2.32

Komodo is the most extreme, and generally time-to-depth is not showing well scalability with more cores Elo-wise.

bob · Post by **bob** » Thu Jul 04, 2013 8:38 am

Don wrote:
bob wrote:
Laskos wrote:
Don wrote:
The textbook definition is idealized so that it's easy to reason about formally. Usually what one does is takes some serial problem such as sorting, n-body simulation or some other things and then speed it up on multiple processors. That is FAR easier to reason about because the only variable is speed which is trivially measured and it's also the easiest to deal with formally.

But often, even in the case of pure speed, you will choose a different algorithm if you intend to run it on multiple processors. The algorithm you choose may be just as good or perhaps not quite as good but have much more parallelism to exploit. I think even Bob should understand this but I'm not sure that he does.

The MIT Cilk team chose a different sort algorithm for example that was not quite as efficient as the unix sort routing but was more amenable to parallelism. They did this as a kind of demo project to show off Cilk a long time ago. Of course they didn't make any claims that they "sped" up shell sort with MP, but they did speed up the sort utility more than they could have if they had tried to parallelize the shell sort that was in the "sort" utility.
I don't think we even got to this point of the discussion yet, here we will be stuck forever with Bob. It was just a matter of what metric to use to measure SMP efficiency as defined by Bob. I can pass here, as it's not so important, and a matter of definition, the important thing is spelled by you in the above paragraph.
But chess has a metric that is far more appropriate here than speed and that is ELO. It's a lot more awkward to reason about formally and much harder to measure and it's also not easily treated in a formal way mathematically. I think Bob is taking the point of view that it is not relevant unless it has these characteristics or else he just wants to bicker about some formal definition that only he cares about.
Yes, I will abandon the argument on this, I just wanted to show that the manual or his definition is flawed, for example 1->4 core speedup of 0.8 having larger "efficiency" than 1->8 core 1.2 speedup. Then, for chess, all these scalings need logarithms to transform into Elos or plies.
Strawman argument, however. NOBODY is trying to scale to depth or Elo, directly. Step one is "how much faster can I go?" I am 100% certain that if you increase the speed of a processor 1%, my program gets stronger. How much stronger? That's a point where one might measure via test games. But faster = stronger, without exception, until we reach the end of the game. Can you gain Elo by going wider? Maybe or maybe not. If going wider slows you down, you have a gain and a loss that have to be measured. And combined.

My question at the start of this new thread was quite specific. Please explain to me how going wider can be better, when a parallel search is the thing doing the searching...

Given a small number of cores where a parallel speedup should be reasonable in the first place. If you have a poorly scaling algorithm, then going wider would seem to be worse since the thing will slow down. Significantly.
I think Komodo refutes this argument as we appear to be getting the same ELO improvement everyone else gets on a small number of cores. How do you reconcile that?

I asked nothing more, nothing less. Yet it turns into insults rather than technical answers???

Typical....

NO answer as of yet. I can only speculate, which is pointless. If it works (wider) then there is a reason for it. At the moment, I don't see it looking at it from a purely mathematical perspective. For example, if the serial EBF is 2.0 and you reach depth=24 in 3 minutes, what happens to either the depth (or time) if you increase EBF to 2.1??? That's easy to mathematically solve. And the bottom line is that it either reduces the depth, or increases the time. Either of which would seem to weaken the program...

Hence my question about discussing exactly how this might work. If the SMP search gets a 1.4x speedup with 4 cores, why? But beyond that, it is going to get a 1.4x speedup regardless of what the EBF is.

bob · Post by **bob** » Thu Jul 04, 2013 8:47 am

Don wrote:
bob wrote:
michiguel wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.
Then you have lousy definitions there. From 1.4 to 1.96 the efficiency increases by a factor of 2=log(1.96)/log(1.4). For speedup 4 you still have 100%=log(4)/log(4). Anyway if that definition quoted by you is the standard, let's use it, although it's crappy.
"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

As I said before, if I am allowed to be pedantic, this definitions may not apply to chess engines. The formula assume the same task was done for the time "Ts" and "Tp". We know that is not that we have, or not necessarily the same. Time to depth could be an approximation, but it may break (certainly the trees are not the same). In fact, that was the point of the original thread, that it breaks in Komodo's case.

Miguel
Back to my original question. In a parallel model, HOW can you search wider, given the basic assumption that the parallel search only produces a 1.4x speedup? If you increase the EBF, you increase the size of the tree, which hurts the parallel search even more.
There is a quite a bit of wiggle room here. Look at the EBF of Stockfish and compare it to Houdini or Komodo - it is less than both and yet all 3 programs are on top. So it's obvious to me that there is more than one way to skin a cat.

Nothing new there. The more you prune, the more depth you gain. And vice-versa. There MUST be lots of "even trades" in that decision, that doesn't help nor hurt, just changes the shape of the tree.

Komodo does in fact lose a lot of ELO due to pruning and LMR although it seems to be a favorable trade-off in single processor performance due to the speed gain and extra depth. But a lot of this is a close call - so fattening up the tree is not going to be a big loss even in SP mode as you will pretty much get back in ELO what you lose in speed. If I can burn a little CPU time doing JUST the fattening in fact it would be a big win.

I have a practical analogy to illustrate this. Sometimes in your travels you will have a choice of two ways to get to your destination, one by superhighway but it is much longer, the other is by side roads which is much shorter. Maybe in the end it comes out the same, do I go long and fast or short and slow? With Komodo we are simply choosing to not use the superhighway but instead taking the side roads. I'm not going as fast as you are but I'll bet I get there in the same amount of time.

Still leaves the same question as before. WIth a 1.4x speedup, assuming this is the actual number as quoted by someone else, you are still forced to live in a box that is just so big. There's no direct way to use the extra 260% of the processing power that is currently lost in the SMP search and do something else. If you widen the search, you still use the SAME smp search to search that tree. If you make it much bigger at all, you lose rather than gain. In the example someone else mentioned, suppose you have an EBF of 2.0, and can reach depth=24 in 3 minutes. A simple measure is 2.0 ^ 24 = 32M "units of search work". If you increase the EBF to 2.1, you now have 2.1 ^ 24 = 54 "units of search work." The EBF 2.1 tree size is 1.7x larger. You only search 1.4x faster. That is a net LOSS in search space, or a net INCREASE in search time, compared to the serial search.

I don't see a way to make those numbers mesh with this "I search wider with 4 than with 1..."

Feel free to explain the flaw, mathematically...

BTW, if you don't want to increase the size of the tree, you can reduce the depth to 23.3 plies (lose 2/3 of a ply to keep the tree at about 32M nodes). Is that worth it? If so why isn't it worth it to do the same for the serial search?

If I am missing something, I don't see what at the moment...

That is the question I posed, and which absolutely no one has addressed. If the parallel speedup on this "wider tree" is 1.4x, one can't possibly claim it would be worse on a narrower tree. So how does one actually use the extra processing power doing something OTHER than searching in parallel???

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...

I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"
Lower EBF is not necessarily better than higher EBF. Maybe Komodo is performing close to optimally with EBF 1.8 on one core and EBF 1.9 on 4 cores. Even on one core, maybe both EBF's are close to optimal. Don mentioned that he could make Komodo 1 ply higher or lower (lower or higher EBF) without changing the strength.

SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).

If you use one cpu, the time taken for a serial search = time taken for the one-cpu parallel search, the speedup is 1.0 which is no speedup at all. #processors / speedup = 100% which is expected.
The problem with a larger EBF is it is a double-whammy. It FURTHER slows the smp search, since its efficiency is fixed by the algorithm used. Make it wider, you make it slower, period...

I don't see how that can be beneficial, particularly down at 4 cores...

bob · Post by **bob** » Thu Jul 04, 2013 8:57 am

syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?

The answer is obvious. It searches a position to a fixed depth. The search has overhead (extra nodes) that reduces the speedup and therefore the SMP efficiency.

That seems completely obvious to me. Most ANY parallel algorithm does work that is unnecessary in a serial search, from simple load-balancing issues to synchronization that a serial search doesn't need.

That's why one has to choose some unit of work and then measure the time to complete that with 1 cpu and then with N. It is pretty obvious that all parallel search programs search extra stuff. It is also trivially obvious that they gain no Elo from doing so, since, by definition, this overhead is work that doesn't need to be done by the serial search.

You can say "same work is same depth". For Komodo, this is very clearly not true. For other engines, it is iffy.

"iffy"? How? Others seem to gain zero by using more cores on a fixed depth search. Which seems perfectly explainable mathematically.

I can say "same work is same quality of the search". Now Komodo's "SMP efficiency" is suddenly comparable to that of other top engines.

With an ambiguous term "quality" which no one really wants to deal with in precise discussions.

For chess, it is only Elo increase per added core that really matters.

bob wrote:"How can this possibly work."
Your explanation for why it cannot work has gaping holes. I am not going to point them out once again, because you're just being blind to reason here.

Please point them out. Otherwise, there's no reason to make such a silly claim with nothing to support it... I can provide about 35 years or so of publications by MANY different people that follow what I have been saying precisely...

Let crafty play Komodo on 8 cores or so and find out yourself whether it works or not.

bob · Post by **bob** » Thu Jul 04, 2013 9:00 am

Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable Elo gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 (computed by R.Hyatt)
Komodo seems to be the most extreme in deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.

I would say that for Houdini, stockfish, and Crafty, time to depth correlates almost perfectly with Elo gain. What do you expect to gain from a search that is 3x faster? Whether it is one core or 3, the Elo gain is the same. Rybka is worse. Why? Who knows. But perhaps due to a less efficient search. Any good data on Rybka's Elo gain from 1-4 compared to stockfish or houdini???

Henk · Post by **Henk** » Thu Jul 04, 2013 9:11 am

bob wrote:
Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable Elo gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 (computed by R.Hyatt)
Komodo seems to be the most extreme in deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.
I would say that for Houdini, stockfish, and Crafty, time to depth correlates almost perfectly with Elo gain. What do you expect to gain from a search that is 3x faster? Whether it is one core or 3, the Elo gain is the same. Rybka is worse. Why? Who knows. But perhaps due to a less efficient search. Any good data on Rybka's Elo gain from 1-4 compared to stockfish or houdini???

If you limit the candidates to one searching N times faster gives the same result. So Rybka is probably limiting the number of candidates too much.
Other possibilities: A waiting on B. B on C. C on A etc. Or A checking all the time "are you ready".

bob · Post by **bob** » Thu Jul 04, 2013 9:13 am

Don wrote:
bob wrote: Nothing new here either. In fact, there are two factions in performance analysis. I am in the group that says you compare the best parallel algorithm on N cores to the best serial algorithm on 1 core, and that is your speedup. Some prefer to compare the best parallel algorithm on N cores and compare that to the same (inferior to best serial algorithm) algorithm on 1 core to compute the speedup. I (and many) believe this is artificial.
If you are willing to bend this definition just a little then we can agree. You say, "time to depth" but really I think you mean "time to accomplish a given task." If I were sorting something I would want to accomplish that task in a short time. So the metric is time after all if you make the transformation to cast the problem in the right way.

So what is this "task" that we want to accomplish in the least amount of time? It's a level of play, not some arbitrary depth. Depth only works if they are actually ELO equivalent but who says they have to be? Do you know any two unrelated program that are? So really you are correct if you simply take a more flexible view of what it is you are optimizing the time for.

Here's the problem. "Task" is normally a measure of a specific job, in tree searching, searching a tree to some fixed depth. I simply have not seen ANYTHING in 35 years of reading about and working on parallel search, where this unusual effect has been addressed by ANYONE. Not that it can't happen. But to date, it has NOT happened and been published. Ergo, the usual thing to measure is time to depth. Because that is an easier thing to measure than Elo improvement. Much easier to test a couple of hundred or couple of thousand positions, as opposed to playing many thousands of full games. There's certainly nothing wrong with measuring Elo gain due to SMP search. I've run this test many times on my program, just to double-check after a time-to-depth test shows that a new idea improved the speed, but I want to confirm that it didn't break something non-obvious.

Anyone can measure time to depth and compute a speedup that makes sense. Computing the Elo is a daunting task requiring a large number of games to get accurate results. Large number of games -> REALLY large levels of processor time.

I do parallel search tweaks fairly often. Time-to-depth clearly shows (and is confirmed by Elo) whether the change is good or bad, requiring a minimal amount of time to test.

As far as depth, notice that I compute MY speedup by searching a set of positions to a fixed depth X. For (say) Stockfish, I would use the SAME positions, but not necessarily the same depth X. Just some X that requires reasonable time. So I am not comparing depths between programs, just how efficiently each program can get to some depth when using N cpus rather than just one... It is not comparing apples and oranges at all, until we reach your case.

My original post was to get someone to try to explain HOW this can happen, with a reasonable explanation. So far, I have not seen any sort of data that clarifies anything...

bob · Post by **bob** » Thu Jul 04, 2013 9:14 am

Henk wrote:
bob wrote:
Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable Elo gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 (computed by R.Hyatt)
Komodo seems to be the most extreme in deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.
I would say that for Houdini, stockfish, and Crafty, time to depth correlates almost perfectly with Elo gain. What do you expect to gain from a search that is 3x faster? Whether it is one core or 3, the Elo gain is the same. Rybka is worse. Why? Who knows. But perhaps due to a less efficient search. Any good data on Rybka's Elo gain from 1-4 compared to stockfish or houdini???
If you limit the candidates to one searching N times faster gives the same result. So Rybka is probably limiting the number of candidates too much.

What do you mean by "candidates"??? Moves at the root or what?

Henk · Post by **Henk** » Thu Jul 04, 2013 9:19 am

bob wrote:
Henk wrote:
bob wrote:
Laskos wrote:I computed 1->4 cores times-to-depth for several top engines, which on rating lists have comparable Elo gain from 1 to 4 cores.
150 positions, 30 positions repeated 5 times. Each run lasts one-two hours.
Code: Select all
Houdini 3     3.05
Stockfish 3   2.91
Rybka 4.1     2.32
Komodo 5.1    1.68

Crafty 23.5?  3.20 (computed by R.Hyatt)
Komodo seems to be the most extreme in deviating from the rule, but Rybka 4.1 seems a bit off too. Time-to-depth correlates very weakly with the Elo gain, and is not a very useful indicator of it.
I would say that for Houdini, stockfish, and Crafty, time to depth correlates almost perfectly with Elo gain. What do you expect to gain from a search that is 3x faster? Whether it is one core or 3, the Elo gain is the same. Rybka is worse. Why? Who knows. But perhaps due to a less efficient search. Any good data on Rybka's Elo gain from 1-4 compared to stockfish or houdini???
If you limit the candidates to one searching N times faster gives the same result. So Rybka is probably limiting the number of candidates too much.
What do you mean by "candidates"??? Moves at the root or what?

Moves near the root.

back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue