Peculiarity of Komodo 5.1MP

bob · Post by **bob** » Wed Jun 19, 2013 8:17 pm

Uri Blass wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
"poor parallel implementation?"

The target of parallel implementation is not to get bigger depth but to play better.

If the implementation is good in helping komodo to play better than I think that it is wrong to call it poor.

Maybe it is the opposite and komodo compensates for poor pruning of lines that it should not prune by parallel implementation that prevent it to prune good moves in the relevant lines.

You didn't read carefully. Making a program search far more nodes for a given depth does NOT improve it. It just makes it look better at fixed depth since the extra nodes do not penalize the depth, and since time is not being measured, so what if it takes 5x longer than the opponent to reach the same depth? This is nothing more than a time-handicap match, which doesn't say a thing about actual Elo. I doubt Larry/Don would claim that parallel komodo is stronger at a fixed TIME limit... ignoring the extra nodes searched...

Comparing T(1cpu) to depth X to T(4cpu) to depth X will give a good estimate for the Elo gain for the SMP search. To my eye, it will be lower than expected because the tree grows too much.

Laskos · Post by **Laskos** » Wed Jun 19, 2013 8:39 pm

bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?

Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.

syzygy · Post by **syzygy** » Wed Jun 19, 2013 8:41 pm

bob wrote:Time-to-depth IS the correct measure.

Give one reason why decreased time-to-depth is more important that elo gain from using more cores.

If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

No, Kai only wanted to compare SMP behaviour of Komodo to SMP behaviour of other programs.

As Uri stated, there are already good indications that Komodo's SMP implementation is doing pretty well: more cores lead to a respectable gain in elo. It is therefore silly to say that Komodo's SMP implementation is bad.

Now the interesting thing is that Kai's tests show that Komodo achieves this partially by lowering the time-to-depth (thereby increasing depth in the same time) and partially by somehow getting more out of each ply. For H3 and Stockfish on the other hand, it appears that gains from SMP must come in the form of lower time-to-depth (since performance per ply does not increase, at least not significantly).

So here we just have some hard facts that point to something interesting going on.

I remember Don writing about trying out a variation of one of the "lazy smp" ideas that were being discussed here. So far I did not think too much of these ideas as they seemed to be not scalable. Now, I really have no idea whether Komodo is using some of these ideas, but it appears to be a fact that Komodo's SMP is not doing badly at all even on 12 cores AND is not using a 'standard' SMP implementation.

Maybe it is not the "lazy smp", maybe it is something else. But it is interesting.

Don · Post by **Don** » Wed Jun 19, 2013 8:54 pm

michiguel wrote:
Daniel Shawul wrote:
michiguel wrote:
Laskos wrote:
yanquis1972 wrote:interesting...is this a unique (in the literal sense) implementation of MP in a chess engine?
First time I encounter, but I only tested very few engines. Some folks like Bob Hyatt even stated bluntly that time-to-depth is the only way to measure MP efficiency. Not for Komodo.
I remember the thread and I did not agree with that. What matters, for any engine, it is the elo gained. Time to depth for a selected number of representative (are they?) positions may or may not be an accurate way to predict how much strength is gained, for several reasons. There are too many assumptions in the process.

Miguel
This is making a mountain out of a molehill. When you say 'parallel efficiency' it is assumed you are comparing the efficiency of the 'parallel implementation' alone and not 'algorithmic efficiency'. Speed up is rightly defined as the time needed to complete the same work by the parallel search compared to the sequential search so time-to-depth is more or less a correct measure. If i suddenly decide my parallel uses alpha-beta while sequential uses lmr+nullmove, then you might as well have done the comparison on sequential search because that has nothing to do with parallel search.
What matters is the elo gained. The question is if the apparent speed up measured with tests (as we know them) predict it well. I do not think it is necessary true for every engine. They may be good approximations, but approximations nonetheless. Approximations could be more precise but less accurate.

Miguel

Indeed, we did a comparison test with Critter and found that on 4 cores Komodo gained about 20 ELO more - but Critter increased depth more than Komodo. We also did some fixed depth searches and found that Critter and Stockfish played exactly the same strength regardless of the number of cores. But Komodo actually gained ELO. I remember Larry once saying the Rybka was the same, it actually played stronger at the same depth when running on more cores.

But in the end, the only thing that matters is the ELO gain going from 1 to many cores. Komodo is doing very well in that respect, much better than I expected and we expect to get more in the future - but I doubt it is ideal and I'm certainly not claiming we have the best MP implementation. In fact I rather doubt it.

Don · Post by **Don** » Wed Jun 19, 2013 9:32 pm

Bob (and others),

How you approach a problem is strongly related to how you DEFINE a problem or task. The goal of MP is to make the program play as well as possible using more cores, to take advantage of the hardware as fully as possible and the only measure that really matters (at least to us) is playing strength. We ALL wish that our program played as if it they were 4x faster when playing on 4 cores - but nobody gets that kind of efficiency.

Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.

It is well known that programs do not scale well beyond a few processors. Going from 128 to 256 processors gives you very little. One of the things Larry and I want to do is to explore how to better exploit more and more cores and hopefully what we learn will translate to better performance on few cores. It is unlikely to be minor refinements to what we already do.

I'm not breaking new ground as this has been studied for decades and lots of ideas have been tried - I'm not pretending otherwise. But we will look at new original ideas we come up with and revisit old-fashioned ideas that have been tried in the past. A lot of the old ideas which have been rejected have later come back to life, so there is no idea that is off-limits.

As far as whether searching "wider" is a bad thing and isn't real parallelism, I disagree with that. If 4 cores were 4 times better than one I probably would agree but since it isn't, we have to determine where the effort is best spent. If you have 11 people in the room and 10 are doctors and the one who isn't has a heart attack, what is the best use of the doctors time? Not all 10 doctors can give CPR even though normally that would be very best use of their talent, but surely one of them will be calling for an ambulance while another will be doing CPR and the rest will be doing whatever they can to assist, even in little ways.

So if you are getting very little benefit out of additional processors you need to figure out if there is something else they can do to be more useful. Perhaps one thing is to deal with the lines that are severely reduced in the single core program?

lkaufman · Post by **lkaufman** » Wed Jun 19, 2013 10:04 pm

Uri Blass wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
"poor parallel implementation?"

The target of parallel implementation is not to get bigger depth but to play better.

If the implementation is good in helping komodo to play better than I think that it is wrong to call it poor.

Maybe it is the opposite and komodo compensates for poor pruning of lines that it should not prune by parallel implementation that prevent it to prune good moves in the relevant lines.

I think Uri got this exactly right. We believe (based on the results and the fact that our MP is really new and still not refined) that it is an efficient use of multi cores to patch up the pruning done on one core. The comments by some that this makes no sense are only valid for a traditional full-width engine, not for one like ours. So time to depth is not relevant for Komodo unless you add the elo equivalent of it to the elo gained at fixed depth. But it is simpler and more accurate just to play timed games, which so far indicate that Komodo's MP is not inferior to other top engines.
In response to another question, I always test Houdini, Stockfish, etc. with default settings for things like split point. Is this wrong? I know theoretically you can try to improve on that for your given machine by certain tests, but I'm not interested in spending time on this, and I doubt that most users or testers do this.

Daniel Shawul · Post by **Daniel Shawul** » Wed Jun 19, 2013 10:06 pm

Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.

It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.

Modern Times · Post by **Modern Times** » Wed Jun 19, 2013 10:10 pm

lkaufman wrote: In response to another question, I always test Houdini, Stockfish, etc. with default settings for things like split point. Is this wrong? I know theoretically you can try to improve on that for your given machine by certain tests, but I'm not interested in spending time on this, and I doubt that most users or testers do this.

Many testers are limited to 4 cores, and for that the defaults are probably optimal. But people who use more cores probably should experiment with different values. When running on 6 cores in HG's monthly tournaments, I use a different value now for Critter for example. Maybe it helps, maybe not. Real benefits from changing the defaults probably aren't seen until you hit 8 or more cores is my guess.

lkaufman · Post by **lkaufman** » Wed Jun 19, 2013 10:18 pm

Daniel Shawul wrote:
Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.
It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.

I think this is wrong, though it's difficult to prove either way. I think the point is that widening the search in parallel may allow for better use of time that would go to waste in a non-widened parallel search. In other words, maybe the speedup is greater for parallelizing a wider search. Why should you expect the parallel speedup to be independent of the degree of pruning?

lkaufman · Post by **lkaufman** » Wed Jun 19, 2013 10:23 pm

Modern Times wrote:
lkaufman wrote: In response to another question, I always test Houdini, Stockfish, etc. with default settings for things like split point. Is this wrong? I know theoretically you can try to improve on that for your given machine by certain tests, but I'm not interested in spending time on this, and I doubt that most users or testers do this.
Many testers are limited to 4 cores, and for that the defaults are probably optimal. But people who use more cores probably should experiment with different values. When running on 6 cores in HG's monthly tournaments, I use a different value now for Critter for example. Maybe it helps, maybe not. Real benefits from changing the defaults probably aren't seen until you hit 8 or more cores is my guess.

Do any engines have defaults that are automatically changed based on the number of cores specified? If not, are there any official recommendations by the authors to change the default based on a higher number of cores? It is possible that my 12 core tests weren't quite fair if the defaults were inappropriate for 12 cores, but unless the authors have so stated I think I did the right thing, as if I changed these values on my own I would be criticized for that.

Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP