Adam Hair wrote:
bob wrote:
Why would you need to post ANY links? I clearly wrote, ONCE AGAIN, that if a program is stronger searching to the same depth but using SMP, it has a bug that can be fixed to IMPROVE the non-SMP search. This is really not worth discussing. You either understand it or you don't. In this case, you don't.
Is that true for every type of SMP implementation? It seems to me that you are saying that SMP should search the same nodes that the single-thread search does when both go to the same depth. In other words, if the single-thread search examines N unique nodes when searching to depth D, then the SMP search would also examine those N unique nodes when going to the same depth. I assume that is true for YBW and DTS, since you have used both. But is it necessarily true for a shared hash table implementation? Is it necessarily true that an engine using a shared hash table implementation must have a bug if it examines N' > N unique nodes when searching to depth D? If so, then it may be a good bug to have. Kai has shown that several engines with this sort of bug have effective (as in gaining Elo as the number of cores increases) SMP implementations.
bob wrote:
My approximation is NOT asymptotic to .7N, because I have CLEARLY pointed out that this fits Crafty pretty accurately through 16 cores. Which it does. I've said no more and no less. I have not claimed three-decimal-place accuracy for every data point between 1 and 16.
My speedup approximation is a pretty accurate approximation for 1-16 cores, computing the SMP speedup. Not Elo. Not wish-list. Just raw time-to-depth improvement. It is what it is... It is more or less accurate depending on the architecture being used, but that is by definition what "an approximation" gives you. If you want, I can take the data and produce a more accurate formula, perhaps using log(CPUS) in some function. BTW, when casually chatting, what is log(8) again? I can deal with 1 + .7 * (N-1) much more easily, and anyone in the conversation can calculate that in their head.
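For example, 4 cores gives 1 + 0.7 * 3 = 3.1, 8 cores gives 1 + 0.7 * 7 = 5.9, and 16 cores gives 1 + 0.7 * 15 = 11.5.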
I find it amusing that you demand high accuracy from an approximation. Pretty irrational thinking, in fact.
If you want an exact number, the only way to produce it is to actually run the tests. An approximation is used to avoid that work.
Does 1 + .7 * (N-1) come from the DTS data? I have forgotten what I had read.
Let's talk about what happens. There are two choices:
(1) You search deeper. The traditional reason for using SMP. The deeper you go, the better you play, period. Even if there are diminishing returns, there are returns...
(2) Anything else is nonsense. For example, suppose you claim that you can play better at the same depth using SMP. If so, then why not just do a multiplexed search and pretend to use two threads? Each pseudo-thread searches one node, then you flip to the other. Now you should get that SAME benefit of playing better with two threads.
Parallel search is all about speed. Nothing more. No mystical things about searching nodes you would not search in the 1-thread search. If those nodes are good, search 'em in the 1-thread search and get the SAME benefit. So it boils down to going faster and deeper. There is a TON of literature on parallel search, and NONE of it buys into this sort of stuff. Having done it for so long, neither do I.
The 1 + 0.7 * (N-1) came from Crafty measurements. I was asked for an estimate back in 1997 or so, and that was a fair representation. DTS was actually better than that for 1-8 CPUs, but I never chose to do an iterated search when I started Crafty. Martin Fierz pointed out that 1 + 0.8 * (N-1) was a better fit for the 1-2-4-6-8 CPU test data I provided to him quite a while back. I've simply stuck with the 0.7 multiplier as a reasonable number for achievable numbers of processors. Today what is achievable has changed quite a bit, and on quite a few machines I notice that even the 0.7 might not be so good. A 20-core single-chip machine has a huge memory and cache bottleneck, whereas the original AMD machine I used to produce the Fierz data had 8 single-core CPUs without such a big bottleneck, just a few NUMA issues that were not hard to address.
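Purely as illustration (this is not engine code, just the arithmetic of the two linear fits mentioned above), a few lines of C that tabulate the 0.7 and 0.8 approximations for common core counts. The output is only the approximation, not a measured speedup:

```c
/* Illustrative only: tabulate the two linear time-to-depth fits discussed
 * above (the 0.7 multiplier and Martin Fierz's 0.8 multiplier).  Real
 * speedups have to be measured on the actual hardware. */
#include <stdio.h>

int main(void) {
    const int cores[] = {1, 2, 4, 8, 16};
    const int n = sizeof(cores) / sizeof(cores[0]);

    printf("cores   1+0.7*(N-1)   1+0.8*(N-1)\n");
    for (int i = 0; i < n; i++) {
        int N = cores[i];
        printf("%5d   %11.1f   %11.1f\n",
               N, 1.0 + 0.7 * (N - 1), 1.0 + 0.8 * (N - 1));
    }
    return 0;
}
```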
For your other question, YES. The goal is to search a minimal tree, which should not change whether you are doing a parallel search or a sequential search. If you can keep all processors busy and add no extra nodes, you get a clean, perfect linear speedup. But if you look at the paper I wrote for the Journal of Parallel Computing somewhere back in the middle 80's, you simply can't search that minimal tree, since perfect move ordering is not doable. As for the shared hash parallel search, you should simply measure one. It is a very easy parallel search to implement, but it does NOT perform well compared to a traditional parallel search.
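For anyone who wants to run that measurement, here is a minimal sketch of the shared-hash style of parallel search, assuming POSIX threads. It is not Crafty's code; search_root() is a dummy stand-in for a real alpha-beta root search, there only so the structure of the sharing is visible:

```c
/* Minimal sketch of a shared-hash parallel search: every thread runs the
 * same iterative-deepening loop, and the ONLY coordination between them
 * is a hash table they all read and write. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define TT_SIZE   (1u << 20)          /* entries, power of two */
#define NTHREADS  4
#define MAX_DEPTH 12

typedef struct { uint64_t key; int depth, score; } tt_entry;

static tt_entry shared_tt[TT_SIZE];   /* one table, shared by every thread */

/* Dummy "root search": probes and stores the shared table, nothing more.
 * A real engine would do alpha-beta here.  The unsynchronized store races
 * on purpose; real engines either lock entries or use a lockless scheme. */
static int search_root(int thread_id, int depth, tt_entry *tt) {
    uint64_t key = 0x9e3779b97f4a7c15ULL * (uint64_t)(depth + 1);
    tt_entry *e = &tt[key & (TT_SIZE - 1)];
    if (e->key == key && e->depth >= depth)
        return e->score;               /* hash hit saves the "work" */
    int score = depth * 7 + thread_id; /* fake result of the fake search */
    e->key = key; e->depth = depth; e->score = score;
    return score;
}

static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    for (int d = 1; d <= MAX_DEPTH; d++) {
        int score = search_root(id, d, shared_tt);
        if (id == 0)                   /* only the "main" thread reports */
            printf("depth %2d  score %d\n", d, score);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

Build with something like "gcc -pthread" and compare time-to-depth with NTHREADS set to 1 versus 4 in a real engine following this structure; that is the measurement being suggested above.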
And I will remind you: if searching extra nodes is good in the parallel search, you can get the SAME improvement in the non-parallel search. Searching extra nodes is actually easy to do. It is eliminating nodes that is hard, and that is the goal of every search on the planet.