Adam Hair wrote:
bob wrote:
Why would you need to post ANY links? I clearly wrote, ONCE AGAIN, that if a program is stronger searching to the same depth but using SMP, it has a bug that can be fixed to IMPROVE the non-SMP search. This is really not worth discussing. You either understand it or you don't. In this case, you don't.
Is that true for every type of SMP implementation? It seems to me that you are saying that SMP should search the same nodes that the single-thread search does when both go to the same depth. In other words, if the single-thread search examines N unique nodes when searching to depth D, then the SMP search would also examine those N unique nodes when going to the same depth. I assume that is true for YBW and DTS, since you have used both. But is it necessarily true for a shared hash table implementation? Is it necessarily true that an engine using a shared hash table implementation must have a bug if it examines N' > N unique nodes when searching to depth D? If so, then it may be a good bug to have. Kai has shown that several engines with this sort of bug have effective (as in gaining Elo as the number of cores increases) SMP implementations.
bob wrote:
My approximation is NOT asymptotic to .7N, because I have CLEARLY pointed out that this fits Crafty pretty accurately through 16 cores. Which it does. I've said no more and no less. I have not claimed three-decimal-place accuracy for every data point between 1 and 16.
My speedup approximation is a pretty accurate approximation for 1-16 cores, computing the SMP speedup. Not Elo. Not wish-list. Just raw time-to-depth improvement. It is what it is... It is more or less accurate depending on the architecture being used, but that is by definition what "an approximation" gives you. If you want, I can take the data and produce a more accurate formula, perhaps using log(CPUS) in some function. BTW, when casually chatting, what is log(8) again? I can deal with 1 + .7 * (N-1) much more easily, and anyone in the conversation can calculate that in their head.
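For example, 4 cores gives 1 + 0.7 * 3 = 3.1, 8 cores gives 1 + 0.7 * 7 = 5.9, and 16 cores gives 1 + 0.7 * 15 = 11.5.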
I find it amusing that you demand high accuracy from an approximation. Pretty irrational thinking, in fact.
If you want an exact number, the only way to produce it is to actually run the tests. An approximation is used to avoid that work.
Does 1 + .7 * (N-1) come from the DTS data? I have forgotten what I had read.
Let's talk about what happens. There are two choices:
(1) You search deeper. The traditional reason for using SMP. The deeper you go, the better you play, period. Even if there are diminishing returns, there are returns...
(2) Anything else is nonsense. For example, suppose you claim that you can play better at the same depth using SMP. If so, then why not just do a multiplexed search and pretend to use two threads? Each pseudo-thread searches one node, then you flip to the other. Now you should get that SAME benefit of playing better with two threads.
Parallel search is all about speed. Nothing more. No mystical things about searching nodes you would not search in the 1-thread search. If those nodes are good, search 'em in the 1-thread search and get the SAME benefit. So it boils down to going faster and deeper. There is a TON of literature on parallel search, and NONE of it buys into this sort of stuff. Having done it for so long, neither do I.
The 1 + 0.7 * (N-1) came from Crafty measurements. I was asked for an estimate back in 1997 or so, and that was a fair representation. DTS was actually better than that for 1-8 CPUs, but I never chose to do an iterated search when I started Crafty. Martin Fierz pointed out that 1 + 0.8 * (N-1) was a better fit for the 1-2-4-6-8 CPU test data I provided to him quite a while back. I've simply stuck with the 0.7 multiplier as a reasonable number for achievable numbers of processors. Today what is achievable has changed quite a bit, and on quite a few machines I notice that even the 0.7 might not be so good. A 20-core single-chip machine has a huge memory and cache bottleneck, whereas the original AMD machine I used to produce the Fierz data had 8 single-core CPUs without such a big bottleneck, just a few NUMA issues that were not hard to address.
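Purely as illustration (this is not engine code, just the arithmetic of the two linear fits mentioned above), a few lines of C that tabulate the 0.7 and 0.8 approximations for common core counts. The output is only the approximation, not a measured speedup:

```c
/* Illustrative only: tabulate the two linear time-to-depth fits discussed
 * above (the 0.7 multiplier and Martin Fierz's 0.8 multiplier).  Real
 * speedups have to be measured on the actual hardware. */
#include <stdio.h>

int main(void) {
    const int cores[] = {1, 2, 4, 8, 16};
    const int n = sizeof(cores) / sizeof(cores[0]);

    printf("cores   1+0.7*(N-1)   1+0.8*(N-1)\n");
    for (int i = 0; i < n; i++) {
        int N = cores[i];
        printf("%5d   %11.1f   %11.1f\n",
               N, 1.0 + 0.7 * (N - 1), 1.0 + 0.8 * (N - 1));
    }
    return 0;
}
```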
For your other question, YES. The goal is to search a minimal tree, which should not change whether you are doing a parallel search or a sequential search. If you can keep all processors busy and add no extra nodes, you get a clean, perfect linear speedup. But if you look at the paper I wrote for the Journal of Parallel Computing somewhere back in the middle 80's, you simply can't search that minimal tree, since perfect move ordering is not doable. As for the shared hash parallel search, you should simply measure one. It is a very easy parallel search to implement, but it does NOT perform well compared to a traditional parallel search.
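For anyone who wants to run that measurement, here is a minimal sketch of the shared-hash style of parallel search, assuming POSIX threads. It is not Crafty's code; search_root() is a dummy stand-in for a real alpha-beta root search, there only so the structure of the sharing is visible:

```c
/* Minimal sketch of a shared-hash parallel search: every thread runs the
 * same iterative-deepening loop, and the ONLY coordination between them
 * is a hash table they all read and write. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define TT_SIZE   (1u << 20)          /* entries, power of two */
#define NTHREADS  4
#define MAX_DEPTH 12

typedef struct { uint64_t key; int depth, score; } tt_entry;

static tt_entry shared_tt[TT_SIZE];   /* one table, shared by every thread */

/* Dummy "root search": probes and stores the shared table, nothing more.
 * A real engine would do alpha-beta here.  The unsynchronized store races
 * on purpose; real engines either lock entries or use a lockless scheme. */
static int search_root(int thread_id, int depth, tt_entry *tt) {
    uint64_t key = 0x9e3779b97f4a7c15ULL * (uint64_t)(depth + 1);
    tt_entry *e = &tt[key & (TT_SIZE - 1)];
    if (e->key == key && e->depth >= depth)
        return e->score;               /* hash hit saves the "work" */
    int score = depth * 7 + thread_id; /* fake result of the fake search */
    e->key = key; e->depth = depth; e->score = score;
    return score;
}

static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    for (int d = 1; d <= MAX_DEPTH; d++) {
        int score = search_root(id, d, shared_tt);
        if (id == 0)                   /* only the "main" thread reports */
            printf("depth %2d  score %d\n", d, score);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

Build with something like "gcc -pthread" and compare time-to-depth with NTHREADS set to 1 versus 4 in a real engine following this structure; that is the measurement being suggested above.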
And I will remind you: if searching extra nodes is good in the parallel search, you can get the SAME improvement in the non-parallel search. Searching extra nodes is actually easy to do. It is eliminating nodes that is hard, and that is the goal of every search on the planet.