Peculiarity of Komodo 5.1MP

lkaufman · Post by **lkaufman** » Wed Jun 19, 2013 11:39 pm

Daniel Shawul wrote:
lkaufman wrote:I am quite sure that neither our parallel nor singe-core search is even close to optimal, just as I am sure this is true for every other engine now in existence. But you haven't convinced me that given an optimal single-core engine, the parallel search must just be faster, not wider. Why can't a wider search get a bigger speedup from parallelism?
When I say optimal it doesn't mean searching the minimal tree but the best sequential search that you have now i.e. un-improvable with the current knowledge that you have (not absolutely un-improvable). Now if you get better search i.e. ELO by searching wider when doing parallel search, it means you can do the same for the sequential search as well. Parallel search is not much different from giving the engine more time. So that means for the sequential search, you can decide ,say to widen with time or something like that, for the same effect. I don't know how that can be implemented in a real engine but that is not my point. Better results with wide SMP=>Better results with wide SEQUENTIAL search given more time=> Sequential search improved if we know how to do it exactly like the parallel search does.
Don acknowledges that but he was arguing that 4 threads is not same as as 4x search, but i don't see how it would be different even if you use 3.6x time.

Don's point is that if you could get four to one speedup by a known parallel algorithm (for 4 cores), there is no way this could be improved upon by changing the sequential search. But what evidence is there that if you get 3 to 1 for 4 cores for your best sequential engine, you can't get 3.2 to 1 for 4 cores by a wider algorithm. This is what we believe, and I think this explains our results.

syzygy · Post by **syzygy** » Wed Jun 19, 2013 11:43 pm

Daniel Shawul wrote:I did you missed it. It assumes no implmentation difficulties.

I'm not talking about implementation difficulties. I'm talking about theoretical feasibility.

With your reasoning I can prove that P=NP.
Take any NP-hard problem. Write a program that solves the problem in polynomial time. "I assume no implementation difficulties." Done.

Anyway, I'm sure you already got the point many posts ago.

michiguel · Post by **michiguel** » Wed Jun 19, 2013 11:52 pm

Daniel Shawul wrote:
Btw, even if you could prove it, it would only show that Komodo's single-core search is suboptimal. It does not prove that Komodo's SMP implementation is suboptimal.
I missed this part of your post. Yes so Komodo should first improve its sequential sub-optimal search if that is the case. But I prefer the other explanation that its parallel search is less efficiently done than its sequential search , and that seems to be the case from Don's and other people's explanations.

Searching "wider" in parallel is not the same as searching "wider" sequentially. Rewriting the sequential search may require to have an algorithm that jumps from one branch of the search to the other asynchronously. Replicating it may require some code that could introduce a huge loss of efficiency. So yes, what you say is theoretically possible, but most likely practically impossible. Then, that would prove that there is a theoretically superior algorithm that works only in paper. I would not be surprised there is one.

This scenario is theoretically possible too: Search A (wide) and B (narrow). B is better than A, but scales worst in parallel. So, if cores are available, it would be better to use search A. This of course may look "too theoretical", but I always wonder whether some of the reductions are good for single core, since they are tested there, but may be suboptimal for SMP (where there is less tests and we do not know).

Miguel

michiguel · Post by **michiguel** » Wed Jun 19, 2013 11:59 pm

Daniel Shawul wrote:
lkaufman wrote:I am quite sure that neither our parallel nor singe-core search is even close to optimal, just as I am sure this is true for every other engine now in existence. But you haven't convinced me that given an optimal single-core engine, the parallel search must just be faster, not wider. Why can't a wider search get a bigger speedup from parallelism?
When I say optimal it doesn't mean searching the minimal tree but the best sequential search that you have now i.e. un-improvable with the current knowledge that you have (not absolutely un-improvable). Now if you get better search i.e. ELO by searching wider when doing parallel search, it means you can do the same for the sequential search as well. Parallel search is not much different from giving the engine more time.

I disagree with this. The fact that we see some positions in which there are superlinear speed ups and some others in which the speed up is terrible makes me think that the complexity of the issue preclude us from doing simple extrapolations.

In a previous you are assuming that strength is governed by the average case. I am not so sure about that.

Miguel

So that means for the sequential search, you can decide ,say to widen with time or something like that, for the same effect. I don't know how that can be implemented in a real engine but that is not my point. Better results with wide SMP=>Better results with wide SEQUENTIAL search given more time=> Sequential search improved if we know how to do it exactly like the parallel search does.
Don acknowledges that but he was arguing that 4 threads is not same as as 4x search, but i don't see how it would be different even if you use 3.6x time.

Daniel Shawul · Post by **Daniel Shawul** » Thu Jun 20, 2013 12:00 am

syzygy wrote:
Daniel Shawul wrote:I did you missed it. It assumes no implmentation difficulties.
I'm not talking about implementation difficulties. I'm talking about theoretical feasibility.

With your reasoning I can prove that P=NP.
Take any NP-hard problem. Write a program that solves the problem in polynomial time. "I assume no implementation difficulties." Done.

Anyway, I'm sure you already got the point many posts ago.

I am not sure Ronald. From my side it looks obvious that parallel search is not more than giving more time to a sequential search. So any statement you can make about improvement in wider parallel search performance, you can make the same by giving the sequential search more time. Parallel algorithms are non-deterministic so clearly you will have implementation difficulties which i discarded.

syzygy · Post by **syzygy** » Thu Jun 20, 2013 12:40 am

Daniel Shawul wrote:
syzygy wrote:
Daniel Shawul wrote:I did you missed it. It assumes no implmentation difficulties.
I'm not talking about implementation difficulties. I'm talking about theoretical feasibility.

With your reasoning I can prove that P=NP.
Take any NP-hard problem. Write a program that solves the problem in polynomial time. "I assume no implementation difficulties." Done.

Anyway, I'm sure you already got the point many posts ago.
I am not sure Ronald. From my side it looks obvious that parallel search is not more than giving more time to a sequential search. So any statement you can make about improvement in wider parallel search performance, you can make the same by giving the sequential search more time. Parallel algorithms are non-deterministic so clearly you will have implementation difficulties which i discarded.

Behind "discarding implementation difficulties" you hide the problem that there is no general transformation from parallel to sequential that can be implemented without loss in speed. The loss in speed is "only" a constant factor, but if you don't care about constant factors then there is no difference between 1 core and 4 cores to begin with.

In a specific case there might be a smart sequential implementation that mimics the parallel behaviour without loss in speed, but your reasoning assumes this is possible in general.

The other point I made was that suboptimality of the single-core search does not imply suboptimality of the SMP implementation. Now you can say that Don should first "fix" the single-core search to make it optimal, but that is not reasonable. If everybody had waited with implementing SMP until the single-core search was "optimal", we would not have had any SMP implementations today.

Of course there is little doubt that both Komodo's single-core search and Komodo's SMP implementation can be improved, and that the same applies to any currently existing program.

Daniel Shawul · Post by **Daniel Shawul** » Thu Jun 20, 2013 1:12 am

Ah you just reminded me of a good example that is similar but not the same ,that may justify ignoring implementation difficulties when talking about algorithms.
SSS* vs alpha-beta!
SSS* gives consistently lower branching factor than alpha-beta, because it is a non-directional algorithm that searches pretty much like a parallel searcher but deterministic ally. The downside for it is it has huge computational requirements which hold it back from replacing alpha-beta practically. But no one holds that against it when discussing algorithms because implementation details are considered secondary. It is always stated that alpha-beta is optimal only within the set of directional algorithms, and that there exists non-directional algorithms that can beat it without specifying implementation difficulties. I will reply to your points later.

bob · Post by **bob** » Thu Jun 20, 2013 2:18 am

Laskos wrote:
bob wrote:
Laskos wrote:
Daniel Shawul wrote:This is misleading because komodo probably compensates for poor parallel implementation by searching wider, otherwise it shouldn't be getting any elos for fixed depth test. You should do the test with time and see how much each gain.
I already wrote that it probably searches wider with the number of threads. But it's different from other top engines, which to given depth are pretty much the same strength on different number of threads. And the rule (as stated by some) that time-to-depth is determining MP efficiency does not apply to Komodo.
I will leave to test groups and individuals to test with time, there are plenty of volunteers. I was just curious about this aspect.
Time-to-depth IS the correct measure. If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.

I can not imagine wanting to search wider on a parallel search. If it is a good idea, why not search wider on the sequential implementation as well?
Time-to-depth won't work for Komodo. It benefits from a wider tree in the parallel search to the same depth, and time-to-depth would understate its SMP performance Elo-wise. Comparing depths of the 1 threaded Komodo with the depths of 4-threaded Komodo is meaningless given the different tree shapes. Elo is all that matters in SMP efficiency, and I don't know why it has a wider tree in the parallel search, the fact is that 4 vs 1 thread Komodo benefits from two sides: deeper search and wider tree, both contributing substantially. The total contribution is probably comparable to other engines going from 1 to 4 threads, only that Komodo has this peculiar behavior.

Your statement is simply wrong. It doesn't "benefit" from a wider tree. It "suffers from" a wider tree.

Otherwise a wider tree would be better in a non-parallel search as well.

Note that when you see an elo improvement going from 1 to 4 in fixed depth, it is imaginary. Just compare the times. The 4 cpu test is taking longer than it should, getting a time handicap that is imaginary in real chess.

Better to do a few searches with st=10, and use both 1 and 4 cpus and see what happens there. I'd suspect, based on your results, that the 4 cpu is not going to perform as well as one would expect, because part of the extra processing power is wasted searching extra nodes...

bob · Post by **bob** » Thu Jun 20, 2013 2:24 am

syzygy wrote:
bob wrote:Time-to-depth IS the correct measure.
Give one reason why decreased time-to-depth is more important that elo gain from using more cores.

Absolutely trivial to do. When you change the selectivity of a program, but continue to search to the SAME depth, you take more time. That program, at that depth, searches a larger tree, and will play better. A full-width 12 ply search will whip the snot out of a current 12 ply search with LMR, pruning and such enabled. It will also take FAR longer to complete that 12 ply search.

This really is simple to understand I would think. If you make the tree wider, you gain less depth with 4 cores. If the wider search is more important than the extra depth 4x speed should give, then why not gain that extra width in the one-core search.

TIme to get serious here and not start an argument that is totally pointless and founded on flawed reasoning.

If Komodo searches wider, then it searches less efficiently. One wants to measure the COMPLETE SMP implementation, not just how the tree is split.
No, Kai only wanted to compare SMP behaviour of Komodo to SMP behaviour of other programs.

As Uri stated, there are already good indications that Komodo's SMP implementation is doing pretty well: more cores lead to a respectable gain in elo. It is therefore silly to say that Komodo's SMP implementation is bad.

He did NOT compare SMP at all. Notice that for normal programs, a 12 ply search should see no elo change whether you use 1 or 4 cores. You are searching to the SAME depth. If you see a significant Elo improvement, the SMP search is doing extra work. But it is not getting penalized since it is a fixed depth search. In a timed search it will not gain as many plies as expected.

Now the interesting thing is that Kai's tests show that Komodo achieves this partially by lowering the time-to-depth (thereby increasing depth in the same time) and partially by somehow getting more out of each ply. For H3 and Stockfish on the other hand, it appears that gains from SMP must come in the form of lower time-to-depth (since performance per ply does not increase, at least not significantly).

So here we just have some hard facts that point to something interesting going on.

I remember Don writing about trying out a variation of one of the "lazy smp" ideas that were being discussed here. So far I did not think too much of these ideas as they seemed to be not scalable. Now, I really have no idea whether Komodo is using some of these ideas, but it appears to be a fact that Komodo's SMP is not doing badly at all even on 12 cores AND is not using a 'standard' SMP implementation.

Maybe it is not the "lazy smp", maybe it is something else. But it is interesting.

You are simply not thinking about what is going on. This is NOT showing a "good thing". It is showing a "bad thing". Namely that the search is intentionally searching more nodes to reach the same depth, when using more than one thread. NOBODY wants to see that happen. It is built-in inefficiency. Zero gain. Net loss...

bob · Post by **bob** » Thu Jun 20, 2013 2:28 am

lkaufman wrote:
Daniel Shawul wrote:
Some have implied that we search wider because we don't get quite the same increase in depth and that this is a horrible thing. The intuition is that if that if it works you could do it on 1 thread too. However I do not think that follows. The problem is that 4 threads is just not 4x better than 1 thread no matter how good your MP implementation, so it always comes down to how to best utilize what you have. Anything goes in my opinion if it gets more out of extra cores in terms of ELO.
It doesn't matter if 4 threads is not as efficient as as 4x time search. The following statement is provable.
"If your sequential search is as good as it can be, then your parallel implementation that searches wider than sequential search is poorly done".
I will give it a shot as follows:
Assume sequential search is at its best. Give it 3.6x more time (say same as 4 threads search) but change the algorithm so that it can widen its search exactly the same way as the parallel search does. If indeed searching wider is better for parallel search as claimed, then the wide sequential search should show similar improvement over its deeper search counterpart. So we improved the sequential search that we assumed un-improvable, contradiction! The question of difficulty of implementation is irrelevant.
I think this is wrong, though it's difficult to prove either way. I think the point is that widening the search in parallel may allow for better use of time that would go to waste in a non-widened parallel search. In other words, maybe the speedup is greater for parallelizing a wider search. Why should you expect the parallel speedup to be independent of the degree of pruning?

I don't believe there is any connection between the two. You prune as you see fit to search an optimal tree and go as deeply as possible. Any variance from that is a step backward.

Again, a simple test.

Take Komodo and run it at depth=12.

Now take gnuchessx, run it at depth=12, but disable null-move, reductions, pruning and everything else. Gnu will completely crush Komodo. But the wise man would look and say "wait a minute, komodo at depth=12 is searching at < 1 second per move while Gnu is taking > 6 minutes a move. That's not fair. And of course it isn't fair. Neither is comparing a 12 ply search between a SMP and non-SMP search. They should be very close since the depth is the same.

Do you REALLY think that gnu chess version will beat komodo in a single game, played under normal timed conditions? It would likely lose on time within the first move or two. Or else only reach 2-3-4 plies in a blitz game to Komodo's 20.

Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP

Re: Peculiarity of Komodo 5.1MP